Hi spark-user,
I am using Spark 1.6 to build a reverse index for one month of Twitter data
(~50GB). The HDFS split size is 1GB, so by default sc.textFile creates
50 partitions. I'd like to increase the parallelism by increasing the number
of input partitions. Thus, I use textFile(..., 200) to yield 200 partitions.
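For reference, the relationship between input size, split size, and the minPartitions hint can be sketched in plain Scala. This is a first-order approximation of Hadoop's split computation, not its exact logic; the sizes mirror the numbers above, and the sc.textFile call is shown only in a comment:

```scala
// Approximate number of partitions sc.textFile produces for a given
// input size, HDFS split size, and optional minPartitions hint.
def expectedPartitions(totalBytes: Long, splitBytes: Long, minPartitions: Int = 1): Long = {
  val bySplit = (totalBytes + splitBytes - 1) / splitBytes // ceiling division
  math.max(bySplit, minPartitions.toLong)
}

val gb = 1024L * 1024 * 1024

// ~50GB input with 1GB splits -> 50 partitions by default
val defaultParts = expectedPartitions(50 * gb, 1 * gb)

// Passing minPartitions = 200 raises the parallelism
val moreParts = expectedPartitions(50 * gb, 1 * gb, 200)

// In the job itself this corresponds to something like:
//   val rdd = sc.textFile("hdfs:///path/to/twitter", 200)
```

Note that minPartitions is only a hint to the underlying InputFormat; Hadoop may produce slightly more splits than requested.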
Hi Yuming - I was running into the same issue with larger worker nodes a few
weeks ago.
The way I got around the high GC time, following the suggestion of
some others, was to break each worker node up into several smaller workers of
around 10G each, dividing the cores accordingly.
The other
The official guide may help:
http://spark.apache.org/docs/latest/tuning.html#garbage-collection-tuning
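As a concrete starting point, the smaller-workers layout above can be expressed in spark-defaults.conf. The 10g figure follows the suggestion in this thread; the core count and the GC-logging flags are illustrative assumptions to adapt to your hardware:

```
# Several ~10g executors per node instead of one large JVM
spark.executor.memory            10g
spark.executor.cores             4

# Optional while tuning: print GC activity in the executor logs
spark.executor.extraJavaOptions  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
```

The GC-logging options are the same ones the tuning guide linked above recommends for diagnosing collection frequency and pause times.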
-Xiangrui
On Tue, Mar 17, 2015 at 8:27 AM, jatinpreet wrote:
> Hi,
>
> I am getting very high GC time in my jobs. For smaller/real-time loads, this
> becomes a real problem.
>
Hi,
I am getting very high GC time in my jobs. For smaller/real-time loads, this
becomes a real problem.
Below are the details of a task I just ran. What could be the cause of such
skewed GC times?
36  26010  SUCCESS  PROCESS_LOCAL  2 / Slave1  2015/03/17 11:18:44  20 s
I used Spark 1.1.
On Wed, Jan 14, 2015 at 2:24 PM, Aaron Davidson wrote:
> What version are you running? I think "spark.shuffle.use.netty" was a
> valid option only in Spark 1.1, where the Netty stuff was strictly
> experimental. Spark 1.2 contains an officially supported and much more
> thoroughly tested version.
What version are you running? I think "spark.shuffle.use.netty" was a valid
option only in Spark 1.1, where the Netty stuff was strictly experimental.
Spark 1.2 contains an officially supported and much more thoroughly tested
version under the property "spark.shuffle.blockTransferService", which is
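For anyone hitting this later, the Spark 1.2 property mentioned above can be set in spark-defaults.conf (or via SparkConf); the value shown is the 1.2 default:

```
# Spark 1.2+: "netty" (the default) or "nio" (the legacy path)
spark.shuffle.blockTransferService  netty
```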
To confirm, lihu, are you using Spark version 1.2.0?
On Tue, Jan 13, 2015 at 9:26 PM, lihu wrote:
> Hi,
> I just tested the groupByKey method on 100GB of data; the cluster has 20
> machines, each with 125GB RAM.
>
> At first I set conf.set("spark.shuffle.use.netty", "false") and ran
> the experiment,
Hi,
I just tested the groupByKey method on 100GB of data; the cluster has 20
machines, each with 125GB RAM.
At first I set conf.set("spark.shuffle.use.netty", "false") and ran
the experiment, and then I set conf.set("spark.shuffle.use.netty", "true")
again to re-run the experiment, but at the lat
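Tangentially: if the end goal is an aggregate per key, reduceByKey usually puts far less pressure on memory than groupByKey, because values are combined map-side instead of every value for a key being materialized at once. The semantic difference can be sketched with plain Scala collections standing in for an RDD (the data here is made up):

```scala
val pairs = Seq("a" -> 1, "b" -> 2, "a" -> 3, "b" -> 4)

// groupByKey-style: build the full list of values per key, then reduce it.
val grouped: Map[String, Int] =
  pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }

// reduceByKey-style: fold each value into a running sum per key, never
// holding all of a key's values at once (what Spark does map-side).
val reduced: Map[String, Int] =
  pairs.foldLeft(Map.empty[String, Int]) {
    case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0) + v)
  }
```

Both produce the same per-key sums, but on a real RDD only the second shape lets Spark pre-aggregate before the shuffle, which shrinks both network traffic and the heap footprint that drives GC.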