On 12 May 2016, at 18:35, Aaron Jackson <ajack...@pobox.com> wrote:
> I'm using Spark 1.6.1 (hadoop-2.6) and I'm trying to load a file that's in S3. I've done this previously with Spark 1.5 with no issue. I'm attempting to load and count a single file as follows:
>
>     dataFrame = sqlContext.read.text('s3a://bucket/path-to-file.csv')
>     dataFrame.count()
>
> But when it attempts to load, it creates 279K tasks. When I look at the tasks, the # of tasks is identical to the # of bytes in the file. Has anyone seen anything like this, or have any ideas why it's getting that granular?

Yeah, seen that. The block size being returned by the filesystem is coming back as 0, which then triggers a split on every byte. Which, as you have noticed, doesn't work.

You've hit https://issues.apache.org/jira/browse/HADOOP-11584, fixed in Hadoop 2.7.0.

You need to consider S3A not usable in production in the 2.6.0 release; things surfaced in the field which only got caught later. https://issues.apache.org/jira/browse/HADOOP-11571 covers the issues that surfaced.

Stay on S3N for a 2.6.x-based release; move to Hadoop 2.7.1+ for S3A to be ready to play.
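
If you want to confirm you've hit this before upgrading, here's a rough sketch that asks the JVM-side Hadoop FileSystem for the file's block size from PySpark. It goes through the private sc._jvm / sc._jsc handles, so treat it as a diagnostic hack rather than supported API; the bucket and path are placeholders:

    # sketch: confirm the zero-block-size symptom (uses private PySpark internals)
    jvm = sc._jvm
    conf = sc._jsc.hadoopConfiguration()
    path = jvm.org.apache.hadoop.fs.Path("s3a://bucket/path-to-file.csv")  # placeholder
    fs = path.getFileSystem(conf)
    status = fs.getFileStatus(path)
    # on the affected Hadoop 2.6.0 S3A client this prints 0, which is
    # what makes the input format generate one split per byte
    print(status.getBlockSize())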
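
And the workaround on a 2.6.x build, again as a sketch: do the same read through s3n:// instead of s3a://. The credential values below are placeholders; fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey are the s3n key names, and you can skip setting them if your cluster already provides credentials another way:

    # sketch: read via S3N on a Hadoop 2.6.x-based build
    hadoopConf = sc._jsc.hadoopConfiguration()
    hadoopConf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")      # placeholder
    hadoopConf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")  # placeholder

    dataFrame = sqlContext.read.text("s3n://bucket/path-to-file.csv")  # placeholder path
    print(dataFrame.count())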