Github user steveloughran commented on the issue: https://github.com/apache/spark/pull/12004

The latest patch embraces the fact that 2.6 is the base Hadoop version, so the `hadoop-aws` JAR is always pulled in and its dependencies set up. One thing to bear in mind is that the [Phase I fixes](https://issues.apache.org/jira/browse/HADOOP-11571) aren't in there, and s3a absolutely must not be used in production. The big killers:

* [HADOOP-11570](https://issues.apache.org/jira/browse/HADOOP-11570): closing the stream reads to the EOF, which means every seek() can read in up to 2x the file size.
* [HADOOP-11584](https://issues.apache.org/jira/browse/HADOOP-11584): the block size returned in `getFileStatus()` == 0. That is bad because both Pig and Spark use that block size in partitioning, so they will split a file into single-byte partitions: a 20MB file becomes 2×10^7 tasks, each of which opens the file at byte 0, seeks to its offset, then calls close(). As a result: 2×10^7 tasks, each reading up to 2 × 2×10^7 bytes. This is generally considered "pathologically suboptimal". I've had to modify my downstream tests to recognise when the block size of a file == 0 and skip those tests; see the sketch below.

s3n will work; in 2.6 it moved to the aws JAR, so this reinstates the functionality that was in Spark builds against Hadoop 2.2-2.5.
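A minimal sketch of the kind of guard described above, assuming a Hadoop `FileSystem` handle for the object store under test; the helper name `blockSizeIsUsable` and the example path are illustrative, not taken from the actual patch or test suite:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Returns false when the store reports a zero block size in
// getFileStatus(), as pre-HADOOP-11584 s3a does. Spark and Pig would
// otherwise turn that 0 into single-byte partitions when planning splits.
def blockSizeIsUsable(fs: FileSystem, path: Path): Boolean = {
  val status = fs.getFileStatus(path)
  status.getBlockSize > 0
}

// Example guard at the top of a split-dependent test (e.g. with ScalaTest):
// assume(blockSizeIsUsable(fs, new Path("s3a://bucket/data.csv")),
//   "block size == 0; skipping split-dependent test")
```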