Re: Spark S3 Performance

2014-11-24 Thread Nitay Joffe
Andrei, Ashish, To be clear, I don't think it's *counting* the entire file. It just seems from the logging and the timing that it is doing a GET of the entire file, then figuring out it only needs certain blocks and doing another GET for just those blocks. Regarding # partitions - I think I
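One way to try to reproduce this outside Spark would be to open the same s3n stream through the raw Hadoop FileSystem API and seek into it while watching the S3 access logs or a proxy. A rough sketch, untested, with bucket, object, and offset as hypothetical placeholders and credentials set via the standard s3n properties:

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object S3nSeekProbe {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        // Standard Hadoop s3n credential properties (taken from env here)
        conf.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
        conf.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

        val fs = FileSystem.get(new URI("s3n://mybucket"), conf) // hypothetical bucket
        val in = fs.open(new Path("s3n://mybucket/myfile"))
        try {
          in.seek(64L * 1024 * 1024)   // jump past the first "block"
          val buf = new Array[Byte](4096)
          val n = in.read(buf)         // observe which GETs this triggers
          println(s"read $n bytes after seek")
        } finally {
          in.close()
        }
      }
    }

If the access logs show a full-object GET followed by a second GET for the sought range, that would match the behavior described above.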

Re: Spark S3 Performance

2014-11-24 Thread Daniil Osipov
Can you verify, using network monitoring stats, that it's reading the entire file on each worker? If it does, that would be a bug in my opinion. On Mon, Nov 24, 2014 at 2:06 PM, Nitay Joffe ni...@actioniq.co wrote: Andrei, Ashish, To be clear, I don't think it's *counting* the entire file. It
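Spark's own task metrics are one way to check this without external tooling. A sketch against the branch-1.2 listener API (where inputMetrics is an Option; later releases changed this): if the per-task bytesRead totals sum to roughly workers times the file size rather than one file size, each worker is pulling the whole object.

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Tallies bytes read per task from Spark's input metrics.
    class BytesReadListener extends SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        for (m <- Option(taskEnd.taskMetrics); in <- m.inputMetrics) {
          println(s"task ${taskEnd.taskInfo.taskId} read ${in.bytesRead} bytes")
        }
      }
    }

    // Register before running the count:
    // sc.addSparkListener(new BytesReadListener)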

Re: Spark S3 Performance

2014-11-22 Thread Nitay Joffe
Anyone have any thoughts on this? Trying to understand, especially for #2, whether it's a legit bug or something I'm doing wrong. - Nitay Founder CTO On Thu, Nov 20, 2014 at 11:54 AM, Nitay Joffe ni...@actioniq.co wrote: I have a simple S3 job to read a text file and do a line count. Specifically I'm

Re: Spark S3 Performance

2014-11-22 Thread Nitay Joffe
Err, I meant #1 :) - Nitay Founder CTO On Sat, Nov 22, 2014 at 10:20 AM, Nitay Joffe ni...@actioniq.co wrote: Anyone have any thoughts on this? Trying to understand, especially for #2, whether it's a legit bug or something I'm doing wrong. - Nitay Founder CTO On Thu, Nov 20, 2014 at 11:54 AM,

Re: Spark S3 Performance

2014-11-22 Thread Andrei
Not that I'm a professional user of Amazon services, but I have a guess about your performance issues. From [1], there are two different filesystems over S3: - native, which behaves just like regular files (scheme: s3n) - block-based, which looks more like HDFS (scheme: s3) Since you use s3n in your
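For reference, the two filesystems are selected purely by the URI scheme in the path handed to textFile. A small sketch, with a hypothetical bucket name:

    // Native filesystem: each S3 object is one file, readable by any client.
    val nativeRdd = sc.textFile("s3n://mybucket/myfile")

    // Block-based filesystem: stores data in its own HDFS-like block format,
    // so it can only read data that was written through the s3 scheme.
    val blockRdd = sc.textFile("s3://mybucket/myfile")

One caveat worth noting: an ordinary object uploaded to S3 is only readable through s3n, since the block filesystem expects its own block layout.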

Re: Spark S3 Performance

2014-11-22 Thread Andrei
Concerning your second question, I believe you're trying to set the number of partitions with something like this: rdd = sc.textFile(..., 8) but things like `textFile()` don't actually take a fixed number of partitions. Instead, they expect a *minimum* number of partitions. Since in your file you have 21
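Concretely, with the second argument treated as a floor, something like the following happens (a sketch, bucket hypothetical):

    val rdd = sc.textFile("s3n://mybucket/myfile", 8)  // 8 is a *minimum*
    println(rdd.partitions.length)  // still the ~21 splits mentioned above, not 8

    // To force exactly 8 partitions, reduce them after loading; coalesce
    // is a narrow transformation, so it avoids a full shuffle.
    val eight = rdd.coalesce(8)
    println(eight.partitions.length)  // 8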

Spark S3 Performance

2014-11-20 Thread Nitay Joffe
I have a simple S3 job to read a text file and do a line count. Specifically I'm doing *sc.textFile(s3n://mybucket/myfile).count*. The file is about 1.2GB. My setup is a standalone Spark cluster with 4 workers, each with 2 cores / 16GB RAM. I'm using branch-1.2 code built against Hadoop 2.4 (though
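For anyone wanting to try the same job, a self-contained sketch of the setup described, where the master URL, bucket, and credential handling are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    object S3LineCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("S3LineCount")
          .setMaster("spark://master:7077") // standalone cluster, 4 workers

        val sc = new SparkContext(conf)

        // s3n credentials; these can also live in core-site.xml
        sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
        sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

        val count = sc.textFile("s3n://mybucket/myfile").count()
        println(s"lines: $count")
        sc.stop()
      }
    }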