On 12 May 2016, at 18:35, Aaron Jackson <ajack...@pobox.com> wrote:
> I'm using Spark 1.6.1 (hadoop-2.6) and I'm trying to load a file that's in S3. I've done this previously with Spark 1.5 with no issue. I'm attempting to load and count a single file as follows:
>
>     dataFrame = sqlContext.read.text('s3a://bucket/path-to-file.csv')
>     dataFrame.count()
>
> But when it attempts to load, it creates 279K tasks. When I look at the tasks, the # of tasks is identical to the # of bytes in the file. Has anyone seen anything like this, or have any ideas why it's getting that granular?

Yeah, seen that. The block size being returned by the filesystem is coming back as 0, which then triggers a split on every byte. Which, as you have noticed, doesn't work.

You've hit https://issues.apache.org/jira/browse/HADOOP-11584, fixed in Hadoop 2.7.0.

You need to consider S3A not usable in production in the 2.6.0 release; things surfaced in the field which only got caught later. https://issues.apache.org/jira/browse/HADOOP-11571 covers the issues that surfaced.

Stay on S3N for a 2.6.x-based release; move to Hadoop 2.7.1+ for S3A to be ready to play.
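
If you want to confirm you've hit this before upgrading, here's a rough sketch that asks the JVM-side Hadoop FileSystem for the file's block size from PySpark. It goes through the private sc._jvm / sc._jsc handles, so treat it as a diagnostic hack rather than supported API; the bucket and path are placeholders:

    # sketch: confirm the zero-block-size symptom (uses private PySpark internals)
    jvm = sc._jvm
    conf = sc._jsc.hadoopConfiguration()
    path = jvm.org.apache.hadoop.fs.Path("s3a://bucket/path-to-file.csv")  # placeholder
    fs = path.getFileSystem(conf)
    status = fs.getFileStatus(path)
    # on the affected Hadoop 2.6.0 S3A client this prints 0, which is
    # what makes the input format generate one split per byte
    print(status.getBlockSize())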
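
And the workaround on a 2.6.x build, again as a sketch: do the same read through s3n:// instead of s3a://. The credential values below are placeholders; fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey are the s3n key names, and you can skip setting them if your cluster already provides credentials another way:

    # sketch: read via S3N on a Hadoop 2.6.x-based build
    hadoopConf = sc._jsc.hadoopConfiguration()
    hadoopConf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")      # placeholder
    hadoopConf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")  # placeholder

    dataFrame = sqlContext.read.text("s3n://bucket/path-to-file.csv")  # placeholder path
    print(dataFrame.count())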