Re: Spark on Yarn probably trying to load all the data to RAM

2014-11-05 Thread jan.zikes
I have tried merging the files into one, and Spark is now working with RAM as I expected. Unfortunately, after doing this another problem appears: Spark running on YARN now schedules all the work to only one worker node, as one big job. Is there some way to force Spark and
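A hedged sketch (not from the thread) of one way to spread that single merged file across the cluster: repartition the RDD right after the read so downstream stages run on more than one executor. The path and partition multiplier below are placeholders.

    # Illustrative sketch: a single merged file typically arrives as one (or very
    # few) partitions, so all work lands on one executor. Repartitioning after the
    # read spreads the records across the cluster.
    myrdd = sc.textFile("s3n://mybucket/files/merged.json")
    myrdd = myrdd.repartition(sc.defaultParallelism * 3)  # a few partitions per core
    print(myrdd.getNumPartitions())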

Re: Spark on Yarn probably trying to load all the data to RAM

2014-11-05 Thread jan.zikes
Ok, so the problem is solved: it was that the file was gzipped, and it looks like Spark does not support distributing a single .gz file directly across workers. Thank you very much for the suggestion to merge the files. Best regards, Jan
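For context, gzip is not a splittable codec, so a single .gz file maps to exactly one partition. A hedged sketch of one workaround (paths and partition count are illustrative only): read the compressed file once and write it back out uncompressed with enough partitions for later stages.

    # Sketch assuming the bottleneck is gzip's lack of splittability: re-save the
    # data as plain text, split into many partitions, and run later jobs on that.
    compressed = sc.textFile("s3n://mybucket/files/merged.json.gz")
    compressed.repartition(128).saveAsTextFile("s3n://mybucket/files/merged_plain/")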

Re: Spark on Yarn probably trying to load all the data to RAM

2014-11-05 Thread jan.zikes
Could you please give me an example, or send me a link, of how to use Hadoop's CombineFileInputFormat? It sounds very interesting and would probably save several hours of my pipeline's computation. Merging the files is currently the bottleneck in my system.
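A hedged sketch of what such an example might look like from PySpark, using Hadoop 2.x's CombineTextInputFormat (a text-oriented CombineFileInputFormat) through newAPIHadoopFile; the path and split size are placeholders, and the property name is the standard Hadoop max-split-size key.

    # Illustrative only: pack many small files into ~64 MB combined splits so the
    # job gets a manageable number of partitions instead of one per file.
    conf = {"mapreduce.input.fileinputformat.split.maxsize": str(64 * 1024 * 1024)}
    pairs = sc.newAPIHadoopFile(
        "s3n://mybucket/files/*/*/*.json",
        "org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
        conf=conf)
    lines = pairs.map(lambda kv: kv[1])  # drop the byte-offset key, keep each text line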

Re: Spark on Yarn probably trying to load all the data to RAM

2014-11-03 Thread Davies Liu
On Sun, Nov 2, 2014 at 1:35 AM, jan.zi...@centrum.cz wrote:
> Hi, I am using Spark on Yarn, particularly Spark in Python. I am trying to run:
> myrdd = sc.textFile("s3n://mybucket/files/*/*/*.json")

How many files do you have? And the average size of each file?

> myrdd.getNumPartitions()

Re: Spark on Yarn probably trying to load all the data to RAM

2014-11-03 Thread jan.zikes
I have 3 datasets; in all of them the average file size is 10-12 KB. I am able to run my code on the dataset with 70K files, but I am not able to run it on the datasets with 1.1M and 3.8M files.

Spark on Yarn probably trying to load all the data to RAM

2014-11-02 Thread jan.zikes
Hi, I am using Spark on YARN, particularly Spark in Python. I am trying to run: myrdd = sc.textFile("s3n://mybucket/files/*/*/*.json") followed by myrdd.getNumPartitions(). Unfortunately it seems that Spark tries to load everything to RAM, or at least after a while of running this everything slows down and
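A hedged illustration of why this can crawl (not from the thread): textFile() creates at least one partition per input file, so with millions of tiny JSON files even getNumPartitions() forces a full file listing and a very large partition array on the driver. wholeTextFiles() is built on a combining input format and groups small files into larger splits; the glob and minPartitions value below are placeholders.

    # Sketch: read many small files as (path, content) pairs grouped into combined
    # splits, then break the contents back into lines for downstream processing.
    files = sc.wholeTextFiles("s3n://mybucket/files/*/*/*.json", minPartitions=256)
    lines = files.flatMap(lambda kv: kv[1].splitlines())
    print(lines.getNumPartitions())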