[ https://issues.apache.org/jira/browse/SPARK-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
DB Tsai updated SPARK-10340:
----------------------------
    Assignee: Cheolsoo Park

> Use S3 bulk listing for S3-backed Hive tables
> ---------------------------------------------
>
>                 Key: SPARK-10340
>                 URL: https://issues.apache.org/jira/browse/SPARK-10340
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.4.1, 1.5.0
>            Reporter: Cheolsoo Park
>            Assignee: Cheolsoo Park
>
> AWS S3 provides a bulk listing API. It takes the common prefix of all input
> paths as a parameter and returns all the objects whose keys start with
> the common prefix, in blocks of 1000.
> Since SPARK-9926 allows us to list multiple partitions together, we can
> significantly speed up input split calculation using S3 bulk listing. This
> optimization is particularly useful for queries like {{select * from
> partitioned_table limit 10}}.
> This is a common optimization for S3. For example, here is a [blog
> post|http://www.qubole.com/blog/product/optimizing-hadoop-for-s3-part-1/]
> from Qubole on this topic.
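The bulk listing idea described in the issue can be sketched roughly as follows. This is an illustration only, not the Spark patch: it simulates the object store with an in-memory sorted list of keys, and `list_page`, `bulk_list`, and the key layout are hypothetical names invented for the example. The only details taken from the issue are the common prefix of the input paths and the block size of 1000.

```python
import os
from bisect import bisect_right
from typing import Dict, List, Sequence, Tuple

PAGE_SIZE = 1000  # block size mentioned in the issue description


def list_page(sorted_keys: Sequence[str], prefix: str, marker: str = "",
              page_size: int = PAGE_SIZE) -> List[str]:
    """Hypothetical stand-in for ONE bulk-listing request.

    Returns up to page_size keys that start with `prefix` and sort
    strictly after `marker`, in lexicographic order.
    """
    start = bisect_right(sorted_keys, marker)
    page: List[str] = []
    for key in sorted_keys[start:]:
        if key.startswith(prefix):
            page.append(key)
            if len(page) == page_size:
                break
    return page


def bulk_list(sorted_keys: Sequence[str], partition_paths: Sequence[str],
              page_size: int = PAGE_SIZE) -> Tuple[Dict[str, List[str]], int]:
    """List several partitions with paged requests on their common prefix.

    Returns (keys grouped by partition path, number of requests issued).
    """
    # Character-wise longest common prefix of all input paths.
    prefix = os.path.commonprefix(list(partition_paths))
    wanted = [p.rstrip("/") + "/" for p in partition_paths]
    by_partition: Dict[str, List[str]] = {p: [] for p in wanted}

    requests = 0
    marker = ""
    while True:
        page = list_page(sorted_keys, prefix, marker, page_size)
        requests += 1
        for key in page:
            # The common prefix may match partitions that were not
            # requested, so keep only keys under a requested path.
            for p in wanted:
                if key.startswith(p):
                    by_partition[p].append(key)
                    break
        if len(page) < page_size:
            break  # a short page means the listing is exhausted
        marker = page[-1]
    return by_partition, requests
```

In this sketch the number of requests grows with the total number of objects under the common prefix rather than with the number of partitions. The trade-off is that a short common prefix can pull in keys from partitions the query does not need, which then have to be filtered out on the client side.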