xianyinxin commented on issue #25626: [SPARK-28892][SQL] Add UPDATE support for DataSource V2
URL: https://github.com/apache/spark/pull/25626#issuecomment-531479558

@rdblue, thank you very much for your valuable comments.

> Implementations that call into Spark would not share code (or, possibly, behavior) for a standard SQL operation. I would much rather work with people building these sources to get the implementation into Spark so that everyone can benefit, so that the code gets reviewed, and so that we can test thoroughly.

I generally agree with you. From Spark's point of view, we don't want to add something the framework cannot control. However, the essence of Spark is the RDD, a large-volume data processing abstraction, while what we are working on is the data source API, which we want to handle as many sources as possible. Not all sources fit the RDD model, JDBC being one of them, so we have to make a compromise: let data sources do what Spark is not good at. As for implementations calling into Spark, we don't want them to do that, but we cannot prohibit (or prevent) it either. We already have a push-down DELETE there.

> It seems strange to me to add the push-down API first, when it would be used to proxy UPDATE calls to data stores that already support those operations. I know that other stores could also implement the UPDATE call, but a file-based store would need to do all of the reading, processing, and writing on the driver where this call is made. That's a surprising implementation given that Spark is intended for distributed SQL operations.

Yes, that's a problem I hadn't considered before.

> Adding the push-down API first doesn't necessarily mean that implementations would call into Spark like I'm concerned about. That's why I want to understand your use case. If you really need Spark to push operations to JDBC right away, then it may make sense to add this. But if you don't have an urgent need for it, I think Spark users would be better off designing and building the Spark-based implementation first.

I don't urgently need the API. The situation I'm facing is that some of our users also use an on-cloud distributed data warehouse, and they naturally don't want to use two interfaces for Spark tasks over that warehouse. But if you prefer that we build the Spark-based implementation first, I'm OK with that. I do wonder: if we eventually need this API anyway, and we target both (this API and the Spark-based implementation) for Spark 3.0, what difference does it make which comes first? Sorry for the question, as I don't know the roadmap of the 3.0 release.
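To make the push-down idea concrete: the existing push-down DELETE in DataSource V2 works by handing the source the WHERE predicates so it can execute the delete natively. A push-down UPDATE could mirror that shape. The sketch below is purely illustrative — `SupportsUpdate`, `updateWhere`, and `EqualTo` are hypothetical names (not Spark's actual API), and the in-memory table only stands in for a source like a JDBC warehouse that would run the UPDATE itself instead of Spark rewriting data through the driver.

```java
import java.util.*;
import java.util.stream.*;

public class PushDownUpdateDemo {

    // Simplified stand-in for Spark's Filter: equality predicates only.
    static final class EqualTo {
        final String column;
        final Object value;
        EqualTo(String column, Object value) { this.column = column; this.value = value; }
    }

    // Hypothetical mixin, modeled on the SupportsDelete pattern: a source that
    // can execute UPDATE natively receives the assignments and filters,
    // instead of Spark reading, rewriting, and writing the data itself.
    interface SupportsUpdate {
        void updateWhere(Map<String, Object> assignments, List<EqualTo> filters);
    }

    // Toy in-memory "table" that shows the delegation flow.
    static class InMemoryTable implements SupportsUpdate {
        private final List<Map<String, Object>> rows;

        InMemoryTable(List<Map<String, Object>> initial) {
            this.rows = initial.stream()
                    .map(m -> new HashMap<String, Object>(m))
                    .collect(Collectors.toList());
        }

        @Override
        public void updateWhere(Map<String, Object> assignments, List<EqualTo> filters) {
            for (Map<String, Object> row : rows) {
                boolean matches = filters.stream()
                        .allMatch(f -> Objects.equals(row.get(f.column), f.value));
                if (matches) {
                    row.putAll(assignments); // apply SET clause to matching rows
                }
            }
        }

        List<Map<String, Object>> data() { return rows; }
    }

    // Runs the equivalent of: UPDATE t SET status = 'closed' WHERE id = 1
    static String runDemo() {
        InMemoryTable table = new InMemoryTable(List.of(
                Map.of("id", 1, "status", "open"),
                Map.of("id", 2, "status", "open")));
        table.updateWhere(Map.of("status", "closed"), List.of(new EqualTo("id", 1)));
        return table.data().stream()
                .map(r -> r.get("status").toString())
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // first row updated, second untouched
    }
}
```

The concern in the thread is exactly the second half of this picture: for a file-based store there is no native UPDATE to delegate to, so an implementation behind such an interface would end up doing all the work on the driver, which is why building the distributed, Spark-based implementation first is being argued for.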