[ https://issues.apache.org/jira/browse/SPARK-13876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust resolved SPARK-13876.
--------------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0

Resolved by https://github.com/apache/spark/pull/11646

> Strategy for planning scans of files
> ------------------------------------
>
>                 Key: SPARK-13876
>                 URL: https://issues.apache.org/jira/browse/SPARK-13876
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Michael Armbrust
>            Assignee: Michael Armbrust
>            Priority: Critical
>             Fix For: 2.0.0
>
>
> We should specialize the logic for planning scans over sets of files.
> Requirements:
>  - Remove the need to have RDD, broadcastedHadoopConf and other distributed
>    concerns in the public API of org.apache.spark.sql.sources.FileFormat
>  - Partition column appending should be delegated to the format to avoid an
>    extra copy / devectorization when appending partition columns
>  - Should minimize the amount of data that is shipped to each executor (i.e.
>    it does not send the whole list of files to every worker in the form of a
>    hadoop conf)
>  - Should natively support bucketing files into partitions, and thus does not
>    require coalescing / creating a UnionRDD with the correct partitioning.
>  - Small files should be automatically coalesced into fewer tasks