[jira] [Commented] (SPARK-5182) Partitioning support for tables created by the data source API
[ https://issues.apache.org/jira/browse/SPARK-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518305#comment-14518305 ] Apache Spark commented on SPARK-5182: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/5526 Partitioning support for tables created by the data source API -- Key: SPARK-5182 URL: https://issues.apache.org/jira/browse/SPARK-5182 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Cheng Lian Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5182) Partitioning support for tables created by the data source API
[ https://issues.apache.org/jira/browse/SPARK-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14313168#comment-14313168 ] Dan Osipov commented on SPARK-5182: --- [~marmbrus] The patch in PR 4308 appears to only add partitioning support for Parquet data sources. This task is more generic, and I'd like to take advantage of the features described for JSON sources. Can you reopen or break out new tasks for remaining features? Partitioning support for tables created by the data source API -- Key: SPARK-5182 URL: https://issues.apache.org/jira/browse/SPARK-5182 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Priority: Blocker Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5182) Partitioning support for tables created by the data source API
[ https://issues.apache.org/jira/browse/SPARK-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14301196#comment-14301196 ] Apache Spark commented on SPARK-5182: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/4308 Partitioning support for tables created by the data source API -- Key: SPARK-5182 URL: https://issues.apache.org/jira/browse/SPARK-5182 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5182) Partitioning support for tables created by the data source API
[ https://issues.apache.org/jira/browse/SPARK-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14282682#comment-14282682 ] Yin Huai commented on SPARK-5182: - [~btiernay] This feature is targeted for both. Partitioning support for tables created by the data source API -- Key: SPARK-5182 URL: https://issues.apache.org/jira/browse/SPARK-5182 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5182) Partitioning support for tables created by the data source API
[ https://issues.apache.org/jira/browse/SPARK-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14281841#comment-14281841 ] Bob Tiernay commented on SPARK-5182: Is this intended for {{SQLContext}}, {{HiveContext}} or both? Partitioning support for tables created by the data source API -- Key: SPARK-5182 URL: https://issues.apache.org/jira/browse/SPARK-5182 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5182) Partitioning support for tables created by the data source API
[ https://issues.apache.org/jira/browse/SPARK-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14272002#comment-14272002 ] Yin Huai commented on SPARK-5182: - Here is the doc from [~marmbrus]. Partitioning data by one or more columns is a very important optimization for many analytic workloads. Right now, the implementation of partitioning in the Data Sources API suffers from several shortcomings. First, each data source must implement the support on its own leading to code duplication. This duplication applies both to the code of discovering / cataloging partitions, but also to the code required to evaluate predicates against a given partitions. Second, only a limited set of predicates are pushed down and so partitioning misses opportunities to prune. While we can continue to expand the set of filters, however, this does not solve the problem that each data source would still need to implement its own version of expression evaluation for each new (Filter x DataType). Requirements for the new API: * Built in support for telling a data source which partitions it should read based on arbitrary predicates (including things like UDFS). * Support for multiple levels of nested directories that store data based on partitioning attributes (e.g, /table/col1=a/col2=b). * Rapid auto-discovery of large numbers of partitions. * Discovery of partition column types using schema inference similar to JSON. * Support for user defined partitioning schemes? (i.e. /table/2001/02/03) Proposed interface: {code} case class Partition(values: Row, path: String) case class PartitionSpec( partitionColumns: StructType, partitions: Array[Partition]) class PartitionedRelation { // Has default implementation def parsePartitions(paths: Array[String]): PartitionSpec def basePath: String def buildScan( partitions: Array[Partition], requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] } {code} Open Questions: * Is it okay to store all of the partition metadata in-memory initially? Or should we consider storing this data locally to something like BDB? * Should we be using metastore partitioning instead? Partitioning support for tables created by the data source API -- Key: SPARK-5182 URL: https://issues.apache.org/jira/browse/SPARK-5182 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org