Andrei, Ashish,
To be clear, I don't think it's *counting* the entire file. It just seems
from the logging and the timing that it is doing a GET of the entire file,
then figuring out it only needs certain blocks, and doing another GET of
only those specific blocks.
Regarding # partitions - I think I
Can you verify that it's reading the entire file on each worker using
network monitoring stats? If it is, that would be a bug in my opinion.
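One low-tech way to do that check (a sketch, not Spark tooling): sample each worker's cumulative receive counters from `/proc/net/dev` before and after the job, and compare the per-worker delta to the 1.2GB file size. The helpers below parse that file's standard format; the interface names are whatever the workers actually have.

```python
def rx_bytes(proc_net_dev_text):
    """Parse /proc/net/dev content into {interface: received_bytes}.

    The file has two header lines, then one line per interface, e.g.:
      eth0: 1234 10 0 0 ...   (first field after the colon is RX bytes)
    """
    counters = {}
    for line in proc_net_dev_text.splitlines()[2:]:
        if ":" not in line:
            continue
        iface, stats = line.split(":", 1)
        counters[iface.strip()] = int(stats.split()[0])
    return counters


def rx_delta(before, after):
    """Bytes received per interface between two samples."""
    return {iface: after[iface] - before[iface]
            for iface in after if iface in before}
```

Run it with `rx_bytes(open("/proc/net/dev").read())` on each worker before and after the count; if every worker's delta is close to the full file size, each worker really is pulling the whole object over the network.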
On Mon, Nov 24, 2014 at 2:06 PM, Nitay Joffe ni...@actioniq.co wrote:
Andrei, Ashish,
To be clear, I don't think it's *counting* the entire file. It
Anyone have any thoughts on this? I'm trying to understand, especially for
#2, whether it's a legitimate bug or something I'm doing wrong.
- Nitay
Founder CTO
On Thu, Nov 20, 2014 at 11:54 AM, Nitay Joffe ni...@actioniq.co wrote:
I have a simple S3 job to read a text file and do a line count.
Specifically I'm
Err I meant #1 :)
- Nitay
Founder CTO
On Sat, Nov 22, 2014 at 10:20 AM, Nitay Joffe ni...@actioniq.co wrote:
Anyone have any thoughts on this? I'm trying to understand, especially for
#2, whether it's a legitimate bug or something I'm doing wrong.
- Nitay
Founder CTO
On Thu, Nov 20, 2014 at 11:54 AM,
Not that I'm a professional user of Amazon services, but I have a guess about
your performance issues. From [1], there are two different filesystems over
S3:
- native that behaves just like regular files (schema: s3n)
- block-based that looks more like HDFS (schema: s3)
Since you use s3n in your
Concerning your second question, I believe you are trying to set the number
of partitions with something like this:
rdd = sc.textFile(..., 8)
but things like `textFile()` don't actually take a fixed number of
partitions. Instead, they expect a *minimum* number of partitions. Since in
your file you have 21
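To illustrate why the second argument is only a lower bound: Hadoop's FileInputFormat, which `textFile()` uses under the hood, derives a split size from the requested minimum, the file size, and the block size, roughly as in the simplified sketch below (the formula is an approximation of the actual Hadoop logic, and the 64MB block size is an assumed default for s3n).

```python
import math


def num_splits(total_size, min_partitions, block_size, min_split_size=1):
    """Rough sketch of FileInputFormat.getSplits() partition counting.

    goal_size is the split size that would yield exactly min_partitions
    splits; the actual split size is capped at the block size, so a large
    file on small blocks produces more splits than requested.
    """
    goal_size = total_size // min_partitions
    split_size = max(min_split_size, min(goal_size, block_size))
    return math.ceil(total_size / split_size)
```

With a ~1.2GB file and 64MB blocks, asking for 8 partitions still yields about 20, because each split is capped at one block; raising the minimum above the block count does increase the split count.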
I have a simple S3 job to read a text file and do a line count.
Specifically I'm doing *sc.textFile(s3n://mybucket/myfile).count*. The
file is about 1.2GB. My setup is a standalone Spark cluster with 4 workers,
each with 2 cores / 16GB RAM. I'm using branch-1.2 code built against
Hadoop 2.4 (though