[jira] [Commented] (SPARK-5182) Partitioning support for tables created by the data source API

2015-04-28 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518305#comment-14518305 ]

Apache Spark commented on SPARK-5182:
-------------------------------------

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/5526

 Partitioning support for tables created by the data source API
 --------------------------------------------------------------

         Key: SPARK-5182
         URL: https://issues.apache.org/jira/browse/SPARK-5182
     Project: Spark
  Issue Type: Sub-task
  Components: SQL
    Reporter: Yin Huai
    Assignee: Cheng Lian
    Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5182) Partitioning support for tables created by the data source API

2015-02-09 Thread Dan Osipov (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313168#comment-14313168 ]

Dan Osipov commented on SPARK-5182:
-----------------------------------

[~marmbrus] The patch in PR 4308 appears to add partitioning support only for 
Parquet data sources. This task is more generic, and I'd like to take advantage 
of the described features for JSON sources.
Can you reopen this issue or break out new tasks for the remaining features?

 Partitioning support for tables created by the data source API
 --------------------------------------------------------------

         Key: SPARK-5182
         URL: https://issues.apache.org/jira/browse/SPARK-5182
     Project: Spark
  Issue Type: Sub-task
  Components: SQL
    Reporter: Yin Huai
    Priority: Blocker
     Fix For: 1.3.0









[jira] [Commented] (SPARK-5182) Partitioning support for tables created by the data source API

2015-02-02 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14301196#comment-14301196 ]

Apache Spark commented on SPARK-5182:
-------------------------------------

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/4308




[jira] [Commented] (SPARK-5182) Partitioning support for tables created by the data source API

2015-01-19 Thread Yin Huai (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14282682#comment-14282682 ]

Yin Huai commented on SPARK-5182:
---------------------------------

[~btiernay] This feature is targeted at both {{SQLContext}} and {{HiveContext}}.




[jira] [Commented] (SPARK-5182) Partitioning support for tables created by the data source API

2015-01-18 Thread Bob Tiernay (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14281841#comment-14281841 ]

Bob Tiernay commented on SPARK-5182:
------------------------------------

Is this intended for {{SQLContext}}, {{HiveContext}} or both?




[jira] [Commented] (SPARK-5182) Partitioning support for tables created by the data source API

2015-01-09 Thread Yin Huai (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272002#comment-14272002 ]

Yin Huai commented on SPARK-5182:
---------------------------------

Here is the doc from [~marmbrus].

Partitioning data by one or more columns is a very important optimization for 
many analytic workloads.  Right now, the implementation of partitioning in the 
Data Sources API suffers from several shortcomings.
First, each data source must implement the support on its own, leading to code 
duplication.  This duplication applies both to the code for discovering and 
cataloging partitions and to the code required to evaluate predicates against 
a given partition.
Second, only a limited set of predicates is pushed down, so partitioning 
misses opportunities to prune.  While we can continue to expand the set of 
filters, this does not solve the problem that each data source would still 
need to implement its own version of expression evaluation for each new 
(Filter x DataType) combination.

Requirements for the new API:
* Built-in support for telling a data source which partitions it should read 
based on arbitrary predicates (including things like UDFs).
* Support for multiple levels of nested directories that store data based on 
partitioning attributes (e.g., /table/col1=a/col2=b).
* Rapid auto-discovery of large numbers of partitions.
* Discovery of partition column types using schema inference similar to JSON.
* Support for user-defined partitioning schemes? (e.g., /table/2001/02/03)
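
To make the directory-layout requirements concrete, here is a minimal sketch in plain Scala (no Spark dependency) of how partition values might be recovered from paths. The object and method names (PartitionPathSketch, inferValue, parsePartition, parsePositional) are hypothetical and are not part of Spark or of this proposal, and the Int/Double/String fallback is only an assumed stand-in for JSON-like schema inference.

```scala
import scala.util.Try

// Hypothetical helper for illustration; not part of Spark or the proposal.
object PartitionPathSketch {
  // Assumed inference order: try Int, then Double, otherwise keep the String.
  def inferValue(raw: String): Any =
    Try(raw.toInt: Any).orElse(Try(raw.toDouble: Any)).getOrElse(raw)

  // Key=value layout such as /table/col1=a/col2=b: every directory below
  // basePath that contains '=' contributes one (column, value) pair.
  def parsePartition(basePath: String, path: String): Seq[(String, Any)] = {
    require(path.startsWith(basePath), s"$path is not under $basePath")
    path.stripPrefix(basePath)
      .split("/")
      .filter(_.contains("="))
      .map { segment =>
        val Array(column, value) = segment.split("=", 2)
        column -> inferValue(value)
      }
      .toSeq
  }

  // User-defined positional layout such as /table/2001/02/03: the caller
  // supplies the column names because the path does not carry them.
  def parsePositional(
      basePath: String,
      path: String,
      columns: Seq[String]): Seq[(String, Any)] = {
    require(path.startsWith(basePath), s"$path is not under $basePath")
    val values = path.stripPrefix(basePath).split("/").filter(_.nonEmpty)
    require(values.length == columns.length, "one value per column expected")
    columns.zip(values.map(inferValue).toSeq)
  }
}
```

Under these assumptions, PartitionPathSketch.parsePartition("/table", "/table/col1=a/col2=1") yields the pairs col1 -> "a" and col2 -> 1, with the second value inferred as an Int.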

Proposed interface:
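{code}
// Package names as of Spark 1.3.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

// One partition: its partition column values and the path holding its data.
case class Partition(values: Row, path: String)

// The partition columns (with types) and all discovered partitions.
case class PartitionSpec(
    partitionColumns: StructType,
    partitions: Array[Partition])

// Declared abstract because the members below have no bodies in this proposal.
abstract class PartitionedRelation {
  // Has default implementation (body not specified in this proposal)
  def parsePartitions(paths: Array[String]): PartitionSpec

  def basePath: String

  // Receives only the partitions that should be read.
  def buildScan(
      partitions: Array[Partition],
      requiredColumns: Array[String],
      filters: Array[Filter]): RDD[Row]
}
{code}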
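
One way to read the interface above is that predicates are evaluated against partition values once, outside the data source, and only the surviving partitions are handed to buildScan. The following is a rough sketch of that idea in plain Scala, not the actual implementation: SketchPartition and PruningSketch are hypothetical names, and a Map stands in for Row and StructType so that the example needs no Spark dependency.

```scala
// Hypothetical stand-ins for illustration; not part of Spark or the proposal.
object PruningSketch {
  // Simplified partition: a Map replaces the Row of partition values.
  case class SketchPartition(values: Map[String, Any], path: String)

  // Keep only the partitions whose values satisfy the predicate. The
  // predicate is an ordinary function, so it could wrap arbitrary logic
  // (including a UDF) without the data source having to understand it.
  def prune(
      partitions: Seq[SketchPartition],
      predicate: Map[String, Any] => Boolean): Seq[SketchPartition] =
    partitions.filter(p => predicate(p.values))
}
```

For example, given partitions for year=2014 and year=2015, calling PruningSketch.prune with the predicate values => values("year") == 2015 keeps only the 2015 partition, and only its path would need to be scanned.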
Open Questions:
* Is it okay to store all of the partition metadata in memory initially? Or 
should we consider storing this data locally in something like BDB?
* Should we be using metastore partitioning instead?

