[jira] [Commented] (SPARK-15689) Data source API v2
[ https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16636307#comment-16636307 ]

Geoff Freeman commented on SPARK-15689:
---------------------------------------

Thanks Wenchen. I was hoping that we might be able to somehow plug into the bucketing that Spark is doing. I'll keep looking into ways to plug in.

> Data source API v2
> ------------------
>
>                 Key: SPARK-15689
>                 URL: https://issues.apache.org/jira/browse/SPARK-15689
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Reynold Xin
>            Assignee: Wenchen Fan
>            Priority: Major
>              Labels: SPIP, releasenotes
>             Fix For: 2.3.0
>
>         Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress in creating the v2 of the data source API. This new API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility for a long time. Ideally, this API should survive architectural rewrites and user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. Convenience methods should exist to convert row-oriented formats into column batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and sampling.
> Note that both 1 and 2 are problems the current data source API (v1) suffers from. The current data source API has a wide surface and depends on DataFrame/SQLContext, making its compatibility depend on the upper-level API. The current data source API is also row oriented only, and has to go through an expensive conversion from external data types to internal data types.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
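[Editor's note: goal 3 in the description above (filter push down) can be illustrated with a small sketch of how a reader might split incoming filters into a pushed set and a rejected set. The `Filter` and `Reader` types below are hypothetical stand-ins loosely modeled on the shape of Spark's source-filter contract, not the actual Spark classes.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PushDownSketch {

    // Stand-in for a source filter, e.g. an equality predicate on a column.
    static final class Filter {
        final String column;
        final Object value;
        Filter(String column, Object value) { this.column = column; this.value = value; }
    }

    // Stand-in reader: accepts filters on columns it can evaluate natively
    // and returns the rest, which the engine must re-apply after the scan.
    static final class Reader {
        private final List<String> pushableColumns;
        private final List<Filter> pushed = new ArrayList<>();

        Reader(List<String> pushableColumns) { this.pushableColumns = pushableColumns; }

        Filter[] pushFilters(Filter[] filters) {
            List<Filter> rejected = new ArrayList<>();
            for (Filter f : filters) {
                if (pushableColumns.contains(f.column)) pushed.add(f);
                else rejected.add(f);
            }
            return rejected.toArray(new Filter[0]); // engine evaluates these itself
        }

        Filter[] pushedFilters() { return pushed.toArray(new Filter[0]); }
    }

    public static void main(String[] args) {
        Reader r = new Reader(Arrays.asList("id"));
        Filter[] remaining = r.pushFilters(new Filter[] {
            new Filter("id", 1), new Filter("name", "a")
        });
        System.out.println(remaining.length);         // 1 (the "name" filter stays with the engine)
        System.out.println(r.pushedFilters().length); // 1 (the "id" filter is handled by the source)
    }
}
```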
[jira] [Commented] (SPARK-15689) Data source API v2
[ https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634650#comment-16634650 ]

Geoff Freeman commented on SPARK-15689:
---------------------------------------

I'm having trouble figuring out how to expose a custom hash function from my DataSourceV2. I'm trying to implement SupportsReportPartitioning, but I don't see how I can convert the physical.Partitioning that's required by outputPartitioning() into a HashPartitioning. I'd like to figure out how to pass this to DataSourceScanExec so that we can avoid shuffles. [~rxin] [~cloud_fan] Thanks!
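[Editor's note: the contract being asked about can be sketched as follows. In the 2.3-era DataSourceV2 API, a reader implementing SupportsReportPartitioning does not hand Spark a HashPartitioning directly; its outputPartitioning() returns a Partitioning that reports a partition count and answers whether it satisfies a requested Distribution. The types below are local stand-ins written to illustrate that shape; they are not the actual classes from org.apache.spark.sql.sources.v2.reader.partitioning, and the `ClusteredByColumn` implementation is hypothetical.]

```java
public class PartitioningSketch {

    // Stand-in for Spark's Distribution marker interface.
    interface Distribution {}

    // Stand-in for a clustered distribution: rows sharing the same values
    // of these columns are expected to land in the same partition.
    static final class ClusteredDistribution implements Distribution {
        final String[] clusteredColumns;
        ClusteredDistribution(String... cols) { this.clusteredColumns = cols; }
    }

    // Stand-in for the Partitioning interface a reader would report:
    // a partition count plus a yes/no answer to a requested distribution.
    interface Partitioning {
        int numPartitions();
        boolean satisfy(Distribution distribution);
    }

    // Hypothetical data-source partitioning clustered on a single column.
    static final class ClusteredByColumn implements Partitioning {
        private final String column;
        private final int parts;

        ClusteredByColumn(String column, int parts) {
            this.column = column;
            this.parts = parts;
        }

        public int numPartitions() { return parts; }

        public boolean satisfy(Distribution d) {
            // Claim satisfaction only when clustering is requested on
            // exactly the column this source is clustered by; that is what
            // lets the planner skip a shuffle for matching plans.
            if (d instanceof ClusteredDistribution) {
                String[] cols = ((ClusteredDistribution) d).clusteredColumns;
                return cols.length == 1 && cols[0].equals(column);
            }
            return false;
        }
    }

    public static void main(String[] args) {
        Partitioning p = new ClusteredByColumn("id", 8);
        System.out.println(p.numPartitions());                          // 8
        System.out.println(p.satisfy(new ClusteredDistribution("id"))); // true
        System.out.println(p.satisfy(new ClusteredDistribution("x")));  // false
    }
}
```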
[jira] [Comment Edited] (SPARK-15689) Data source API v2
[ https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634650#comment-16634650 ]

Geoff Freeman edited comment on SPARK-15689 at 10/1/18 9:24 PM:
----------------------------------------------------------------

I'm having trouble figuring out how to expose a custom hash function from my DataSourceV2. I'm trying to implement SupportsReportPartitioning, but I don't see how I can convert the physical.Partitioning that's required by outputPartitioning() into a HashPartitioning. I'd like to figure out how to pass this to DataSourceScanExec so that we can avoid shuffles. Are there any examples of where it's been implemented that I could look at? [~rxin] [~cloud_fan] Thanks!

was (Author: gfreeman):
I'm having trouble figuring out how to expose a custom hash function from my DataSourceV2. I'm trying to implement SupportsReportPartitioning, but I don't see how I can convert the physical.Partitioning that's required by outputPartitioning() into a HashPartitioning. I'd like to figure out how to pass this to DataSourceScanExec so that we can avoid shuffles. [~rxin] [~cloud_fan] Thanks!