I have tried merging the files into one, and Spark is now working with RAM as
I expected.
Unfortunately, after doing this another problem appeared: Spark running on
YARN now schedules all the work to only one worker node, as one big job. Is
there some way to force Spark to distribute the work across the cluster?
OK, so the problem is solved. The file was gzipped, and it seems that Spark
cannot split a single .gz file across workers, since gzip is not a splittable
format.
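In case anyone hits the same thing, here is a minimal sketch of one workaround,
assuming the merged input is a single gzipped file (the path and partition
count are placeholders, not my real values):

myrdd = sc.textFile("s3n://mybucket/files-merged/all.json.gz")
print(myrdd.getNumPartitions())  # typically 1, since a .gz file cannot be split
myrdd = myrdd.repartition(200)   # spread the downstream work across the cluster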
Thank you very much for the suggestion to merge the files.
Best regards,
Jan
__
Could you please give me an example, or send me a link, of how to use Hadoop's
CombineFileInputFormat? It sounds very interesting to me and would probably
save several hours of my pipeline computation. Merging the files is currently
the bottleneck in my system.
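Something like the sketch below is what I imagine, assuming the text-file
variant (CombineTextInputFormat) can be used through sc.newAPIHadoopFile; this
is untested, and the path and split size are placeholders:

# Pack many small files into each input split instead of one split per file.
conf = {"mapreduce.input.fileinputformat.split.maxsize": str(128 * 1024 * 1024)}  # ~128 MB per split
rdd = sc.newAPIHadoopFile(
    "s3n://mybucket/files/*/*/*.json",
    "org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf=conf,
)
lines = rdd.map(lambda kv: kv[1])  # keep the line text, drop the byte-offset key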
On Sun, Nov 2, 2014 at 1:35 AM, jan.zi...@centrum.cz wrote:
Hi,
I am using Spark on YARN, specifically Spark in Python. I am trying to run:
myrdd = sc.textFile("s3n://mybucket/files/*/*/*.json")
How many files do you have, and what is the average size of each file?
myrdd.getNumPartitions()
I have 3 datasets; across all of them the average file size is 10-12 KB.
I am able to run my code on the dataset with 70K files, but I am not able to
run it on datasets with 1.1M and 3.8M files.
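One way to reduce the file count is to merge the small files into larger ones
with a one-off Spark job; a rough, untested sketch where the paths and
partition count are placeholders:

# Read the tiny JSON files and write them back to S3 as fewer, larger files.
raw = sc.textFile("s3n://mybucket/files/*/*/*.json")
raw.coalesce(256).saveAsTextFile("s3n://mybucket/files-merged/")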
__
On Sun, Nov 2, 2014 at 1:35 AM, jan.zi...@centrum.cz wrote:
Hi,
I am using Spark on YARN, specifically Spark in Python. I am trying to run:
myrdd = sc.textFile("s3n://mybucket/files/*/*/*.json")
myrdd.getNumPartitions()
Unfortunately, it seems that Spark tries to load everything into RAM, or at
least after running this for a while everything slows down and the job
eventually fails.