xianyinxin commented on issue #25626: [SPARK-28892][SQL] Add UPDATE support for DataSource V2
URL: https://github.com/apache/spark/pull/25626#issuecomment-531479558

@rdblue, thank you very much for your valuable comments.

> Implementations that call into Spark would not share code (or, possibly, behavior) for a standard SQL operation. I would much rather work with people building these sources to get the implementation into Spark so that everyone can benefit, so that the code gets reviewed, and so that we can test thoroughly.

I generally agree with you. From Spark's point of view, we don't want to add something the framework cannot control. However, the essence of Spark is the RDD, a large-volume data processing abstraction, while what we are working on is the data source API, which we want to handle as many sources as possible. Not all sources fit the RDD model, JDBC being one of them, so we have to make a compromise: let data sources do what Spark is not good at. As for implementations calling into Spark, we don't want them to do that, but we cannot prohibit (or prevent) it either. We already have a push-down DELETE there.

> It seems strange to me to add the push-down API first, when it would be used to proxy UPDATE calls to data stores that already support those operations. I know that other stores could also implement the UPDATE call, but a file-based store would need to do all of the reading, processing, and writing on the driver where this call is made. That's a surprising implementation given that Spark is intended for distributed SQL operations.

Yes, that's a problem I hadn't considered before.

> Adding the push-down API first doesn't necessarily mean that implementations would call into Spark like I'm concerned about. That's why I want to understand your use case. If you really need Spark to push operations to JDBC right away, then it may make sense to add this. But if you don't have an urgent need for it, I think Spark users would be better off designing and building the Spark-based implementation first.

I don't urgently need the API. The situation I'm facing is that some of our users also use an on-cloud distributed data warehouse, and they naturally don't want to use two interfaces for Spark tasks over that warehouse. But if you prefer that we build the Spark-based implementation first, I'm OK with that. I do wonder: if we eventually need this API anyway, and we target both (this API and the Spark-based implementation) for Spark 3.0, what difference does it make which comes first? Sorry for the question, as I don't know the roadmap of the 3.0 release.
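To make the push-down idea concrete: the existing push-down DELETE in DataSource V2 works by handing the source the WHERE predicates so it can execute the delete natively. A push-down UPDATE could mirror that shape. The sketch below is purely illustrative — `SupportsUpdate`, `updateWhere`, and `EqualTo` are hypothetical names (not Spark's actual API), and the in-memory table only stands in for a source like a JDBC warehouse that would run the UPDATE itself instead of Spark rewriting data through the driver.

```java
import java.util.*;
import java.util.stream.*;

public class PushDownUpdateDemo {

    // Simplified stand-in for Spark's Filter: equality predicates only.
    static final class EqualTo {
        final String column;
        final Object value;
        EqualTo(String column, Object value) { this.column = column; this.value = value; }
    }

    // Hypothetical mixin, modeled on the SupportsDelete pattern: a source that
    // can execute UPDATE natively receives the assignments and filters,
    // instead of Spark reading, rewriting, and writing the data itself.
    interface SupportsUpdate {
        void updateWhere(Map<String, Object> assignments, List<EqualTo> filters);
    }

    // Toy in-memory "table" that shows the delegation flow.
    static class InMemoryTable implements SupportsUpdate {
        private final List<Map<String, Object>> rows;

        InMemoryTable(List<Map<String, Object>> initial) {
            this.rows = initial.stream()
                    .map(m -> new HashMap<String, Object>(m))
                    .collect(Collectors.toList());
        }

        @Override
        public void updateWhere(Map<String, Object> assignments, List<EqualTo> filters) {
            for (Map<String, Object> row : rows) {
                boolean matches = filters.stream()
                        .allMatch(f -> Objects.equals(row.get(f.column), f.value));
                if (matches) {
                    row.putAll(assignments); // apply SET clause to matching rows
                }
            }
        }

        List<Map<String, Object>> data() { return rows; }
    }

    // Runs the equivalent of: UPDATE t SET status = 'closed' WHERE id = 1
    static String runDemo() {
        InMemoryTable table = new InMemoryTable(List.of(
                Map.of("id", 1, "status", "open"),
                Map.of("id", 2, "status", "open")));
        table.updateWhere(Map.of("status", "closed"), List.of(new EqualTo("id", 1)));
        return table.data().stream()
                .map(r -> r.get("status").toString())
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // first row updated, second untouched
    }
}
```

The concern in the thread is exactly the second half of this picture: for a file-based store there is no native UPDATE to delegate to, so an implementation behind such an interface would end up doing all the work on the driver, which is why building the distributed, Spark-based implementation first is being argued for.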