The more cores you have, the less memory each task will get.
512M is already quite small, and if you have 4 cores it means
roughly 128M per task.
Sometimes it is worth having fewer cores and more memory per task.
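For instance, keeping the same executor memory but running fewer concurrent
tasks gives each task a bigger share. The exact knob depends on your
deployment (e.g. spark.executor.cores on YARN, SPARK_WORKER_CORES in
standalone mode); the values below are only illustrative:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.memory", "512m") // heap shared by the tasks running in one executor
      .set("spark.executor.cores", "2")     // 2 concurrent tasks -> roughly 256M each instead of 128M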
How many cores do you have?
André
On 2014-08-05 16:43, Grzegorz Białek wrote:
Hi,
We never saw your exception when reading bzip2 files with Spark.
But when we mistakenly compiled Spark against an older version of Hadoop (the
default in Spark), we ended up reading the bzip2 file sequentially,
not taking advantage of block splits to work in parallel.
Once we recompiled Spark with SPARK_HADOOP_VERSION set to our Hadoop version, the bzip2 blocks were read in parallel.
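A quick way to check whether a bzip2 file really gets split (the path and
variable names here are just placeholders):

    val rdd = sc.textFile("hdfs:///path/to/big.bz2")
    // more than 1 partition means the bzip2 blocks are read in parallel
    println(rdd.partitions.length)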
Good question, I am also wondering how it is possible to add a line
number to distributed data.
I thought it was a job for mapPartitionsWithIndex, but it seems difficult.
Something similar here:
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-and-Partition-td991.html#a995
Maybe at the fil
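A rough sketch of one way to do it with mapPartitionsWithIndex (names are
made up, and it costs an extra pass over the data to count the records of
each partition first):

    // 1) count the records in each partition and collect the counts on the driver
    val counts = rdd.mapPartitionsWithIndex((i, it) => Iterator((i, it.size)))
      .collect().sortBy(_._1).map(_._2)
    // 2) turn the counts into the starting offset of each partition
    val offsets = counts.scanLeft(0L)(_ + _)
    // 3) number each record with its partition offset plus its local index
    val numbered = rdd.mapPartitionsWithIndex { (i, it) =>
      it.zipWithIndex.map { case (line, j) => (offsets(i) + j, line) }
    }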
The data partitioning is done by default *according to the number of
HDFS blocks* of the source.
You can change the partitioning with .repartition, either to increase or
decrease the level of parallelism:
val recordsRDD =
  sc.sequenceFile[NullWritable, BytesWritable](FilePath, 256) // sc: your SparkContext; 256 = minimum number of partitions
val repartitionedRDD = recordsRDD.repartition(512) // pick the number of partitions you need (value here is illustrative)
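If the goal is to decrease the number of partitions, coalesce is usually
cheaper because it avoids a full shuffle (sketch, value illustrative):

    val fewerPartitionsRDD = recordsRDD.coalesce(32) // merge down to 32 partitions without shuffling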
It could be a GC issue: the first time a full GC is triggered, it takes too
much time?
Make sure you have Xms and Xmx set to the same value, and try
-XX:+UseConcMarkSweepGC
And analyse GC logs.
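For example (where exactly these flags go depends on your deployment, e.g.
SPARK_JAVA_OPTS in spark-env.sh; the heap size is only illustrative):

    -Xms8g -Xmx8g -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps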
André Bois-Crettez
On 2014-04-16 16:44, Arpit Tak wrote:
I am loading some data (25GB) in Shark from HDFS: sp
It seems you do not have enough memory on the Spark driver. Hints below:
On 2014-04-15 12:10, Qin Wei wrote:
val resourcesRDD = jsonRDD.map(arg =>
arg.get("rid").toString.toLong).distinct
// the program crashes at this line of code
val bcResources = sc.broadcast(resourcesRDD.collect.to
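collect() pulls the whole RDD into the driver heap, so either give the driver
more memory, or avoid the collect + broadcast and filter with a join instead.
A rough sketch, where otherRDD and its rid field are made-up placeholders for
whatever you match against resourcesRDD:

    // keep the ids distributed instead of collecting them on the driver
    // (on older Spark versions, add: import org.apache.spark.SparkContext._ )
    val ridKeyed = resourcesRDD.map(rid => (rid, ()))
    val kept = otherRDD.map(rec => (rec.rid, rec))
      .join(ridKeyed)      // only records whose rid is in resourcesRDD survive
      .values.map(_._1)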