Re: master=local vs master=local[*]

2014-08-05 Thread Andre Bois-Crettez
The more cores you have, the less memory they will get. 512M is already quite small, and if you have 4 cores it will mean roughly 128M per task. Sometimes it is interesting to have less cores and more memory. how many cores do you have ? André On 2014-08-05 16:43, Grzegorz Białek wrote: Hi, I

Re: Reading from .bz2 files with Spark

2014-05-16 Thread Andre Bois-Crettez
We never saw your exception when reading bzip2 files with spark. But when we wrongly compiled spark against older version of hadoop (was default in spark), we ended up with sequential reading of bzip2 file, not taking advantage of block splits to work in parallel. Once we compiled spark with SPAR

Re: skip lines in spark

2014-04-23 Thread Andre Bois-Crettez
Good question, I am wondering too how it is possible to add a line number to distributed data. I thought it was a job for maptPartionsWithIndex, but it seems difficult. Something similar here : http://apache-spark-user-list.1001560.n3.nabble.com/RDD-and-Partition-td991.html#a995 Maybe at the fil

Re: Spark running slow for small hadoop files of 10 mb size

2014-04-22 Thread Andre Bois-Crettez
The data partitionning is done by default *according to the number of HDFS blocks* of the source. You can change the partitionning with .repartion, either to increase or decrease the level of parallelism : val recordsRDD = SparkContext.sequenceFile[NullWritable,BytesWritable](FilePath,256) val re

Re: Create cache fails on first time

2014-04-17 Thread Andre Bois-Crettez
It could be GC issue, first time it triggers a full GC that takes too much time ? Make sure you have Xms and Xms at the same values, and try -XX:+UseConcMarkSweepGC And analyse GC logs. André Bois-Crettez On 2014-04-16 16:44, Arpit Tak wrote: I am loading some data(25GB) in shark from hdfs : sp

Re: Spark program thows OutOfMemoryError

2014-04-16 Thread Andre Bois-Crettez
Seem you have not enough memory on the spark driver. Hints below : On 2014-04-15 12:10, Qin Wei wrote: val resourcesRDD = jsonRDD.map(arg => arg.get("rid").toString.toLong).distinct // the program crashes at this line of code val bcResources = sc.broadcast(resourcesRDD.collect.to