[ https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130355#comment-16130355 ]
Russell Spitzer commented on SPARK-15689:
-----------------------------------------

Thanks [~cloud_fan] for posting the design doc. It was a great read, and I
like a lot of the direction this is going in. It would be helpful if we could
have access to the doc as a Google Doc or some other editable/commentable
form, though, to encourage discussion.

I left some comments on the prototype, but one thing I think could be a great
addition would be a joinInterface. I ended up writing one of these
specifically for Cassandra and had to do a lot of plumbing to get it to fit
into the rest of the Catalyst ecosystem, so I think this would be a great time
to plan ahead in the Spark design.

The join interface would look a lot like a combination of the read and write
APIs: given a row input and a set of expressions, the relation should return
the rows that match those expressions, OR fall back to being a plain read
relation if none of the expressions can be satisfied by the join (leaving all
the expressions to be evaluated in Spark).

> Data source API v2
> ------------------
>
>                 Key: SPARK-15689
>                 URL: https://issues.apache.org/jira/browse/SPARK-15689
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>              Labels: releasenotes
>         Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress in creating v2 of the data source API. This new
> API should:
> 1. Have a small surface so it is easy to freeze and maintain compatibility
> for a long time. Ideally, this API should survive architectural rewrites and
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance.
> Convenience methods should exist to convert row-oriented formats into column
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1)
> suffers from. The current data source API has a wide surface with a
> dependency on DataFrame/SQLContext, which makes data source API
> compatibility depend on the upper-level API. The current data source API is
> also row-oriented only and has to go through an expensive conversion from
> external data types to internal data types.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
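
For illustration only, the join interface described in the comment above could be sketched roughly as follows. This is a toy, engine-agnostic model in Python and not Spark or Catalyst code: `EqualTo`, `KeyValueRelation`, `push_join_predicates`, `join`, `scan`, and `execute_join` are all hypothetical names invented for this sketch. It assumes a source that can only look rows up by a single key column, which loosely mirrors a key-value store; the source accepts the join expressions it can satisfy, and the engine evaluates whatever is left, falling back to a plain read when nothing can be pushed.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, object]


@dataclass(frozen=True)
class EqualTo:
    """Hypothetical join expression: input_row[left_column] == source_row[right_column]."""
    left_column: str
    right_column: str


class KeyValueRelation:
    """Hypothetical source that can only look rows up by one key column."""

    def __init__(self, key_column: str, rows: List[Row]) -> None:
        self.key_column = key_column
        self.scans = 0  # counts full reads, to make the fallback visible
        self._rows = list(rows)
        self._index: Dict[object, List[Row]] = {}
        for row in self._rows:
            self._index.setdefault(row[key_column], []).append(row)

    def scan(self) -> Iterator[Row]:
        """Plain read path: return every row."""
        self.scans += 1
        return iter(self._rows)

    def push_join_predicates(
        self, predicates: List[EqualTo]
    ) -> Tuple[List[EqualTo], List[EqualTo]]:
        """Split predicates into (handled by the source, left for the engine)."""
        handled: List[EqualTo] = []
        residual: List[EqualTo] = []
        for p in predicates:
            # This toy source handles at most one equality on its key column.
            if p.right_column == self.key_column and not handled:
                handled.append(p)
            else:
                residual.append(p)
        return handled, residual

    def join(
        self, input_rows: Iterable[Row], handled: List[EqualTo]
    ) -> Iterator[Tuple[Row, Row]]:
        """Join path: one keyed lookup per input row instead of a full read."""
        lookup = handled[0]
        for left in input_rows:
            for right in self._index.get(left[lookup.left_column], ()):
                yield left, right


def execute_join(
    input_rows: Iterable[Row], relation: KeyValueRelation, predicates: List[EqualTo]
) -> List[Tuple[Row, Row]]:
    """Engine side: use the join path if possible, otherwise fall back to a read."""
    handled, residual = relation.push_join_predicates(predicates)
    if handled:
        candidates = relation.join(input_rows, handled)
    else:
        # Fallback: the source is just a read relation and the engine
        # evaluates every expression itself over all row pairs.
        source_rows = list(relation.scan())
        candidates = ((l, r) for l in input_rows for r in source_rows)
    return [
        (l, r)
        for l, r in candidates
        if all(l[p.left_column] == r[p.right_column] for p in residual)
    ]
```

The design choice worth noting is that the source reports back which expressions it did not handle, so the engine can stay correct without knowing anything about the source's lookup capabilities.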