[GitHub] [spark] xianyinxin commented on issue #25626: [SPARK-28892][SQL] Add UPDATE support for DataSource V2

2019-09-22 Thread GitBox
xianyinxin commented on issue #25626: [SPARK-28892][SQL] Add UPDATE support for 
DataSource V2
URL: https://github.com/apache/spark/pull/25626#issuecomment-533959893
 
 
   OK, it looks like a nice plan. I'll update the code soon.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xianyinxin commented on issue #25626: [SPARK-28892][SQL] Add UPDATE support for DataSource V2

2019-09-14 Thread GitBox
xianyinxin commented on issue #25626: [SPARK-28892][SQL] Add UPDATE support for 
DataSource V2
URL: https://github.com/apache/spark/pull/25626#issuecomment-531479558
 
 
   @rdblue , thank you very much for your valuable comments.
   > Implementations that call into Spark would not share code (or, possibly, behavior) for a standard SQL operation. I would much rather work with people building these sources to get the implementation into Spark so that everyone can benefit, so that the code gets reviewed, and so that we can test thoroughly.
   Generally, I agree with you. From Spark's point of view, we don't want to add something that cannot be controlled by the framework. However, the essence of Spark is the RDD, a framework for processing large volumes of data, while what we're working on is the data source API, which we want to handle as many sources as possible. Not all sources fit into the RDD model, though; JDBC is one of them, so we have to compromise: let data sources do what Spark is not good at. As for implementations calling into Spark, we don't want implementations to do that, but we cannot prohibit (or avoid) it either. We already have a push-down DELETE there.
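   As a toy illustration of the push-down contract mentioned above (not the actual Spark interface; the class and method names here are made up), a source receiving a DELETE with filters removes the matching rows itself, instead of Spark scanning and rewriting the data:

```python
# Toy, in-memory model (NOT the real Spark API) of the push-down DELETE
# contract: Spark hands the source a set of filters, and the source
# removes every row matching all of them on its own.
class ToySource:
    def __init__(self, rows):
        self.rows = rows  # list of dicts, standing in for table rows

    def delete_where(self, filters):
        """Delete rows that match every filter; keep the rest."""
        self.rows = [r for r in self.rows
                     if not all(f(r) for f in filters)]

src = ToySource([{"id": 1, "a": 10}, {"id": 2, "a": 20}, {"id": 3, "a": 30}])
src.delete_where([lambda r: r["a"] >= 20])  # DELETE FROM t WHERE a >= 20
print([r["id"] for r in src.rows])  # -> [1]
```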
   > It seems strange to me to add the push-down API first, when it would be 
used to proxy UPDATE calls to data stores that already support those 
operations. I know that other stores could also implement the UPDATE call, but 
a file-based store would need to do all of the reading, processing, and writing 
on the driver where this call is made. That's a surprising implementation given 
that Spark is intended for distributed SQL operations.
   Yeah, that's a problem I hadn't considered before.
   > Adding the push-down API first doesn't necessarily mean that 
implementations would call into Spark like I'm concerned about. That's why I 
want to understand your use case. If you really need Spark to push operations 
to JDBC right away, then it may make sense to add this. But if you don't have 
an urgent need for it, I think Spark users would be better off designing and 
building the Spark-based implementation first.
   The API is not very urgent for me. The situation I'm facing is that some of our users also use an on-cloud distributed data warehouse, and of course they don't want to use two interfaces for Spark tasks over the data warehouse. But if you prefer that we build the Spark-based implementation first, I'm OK with that. I do wonder: if we ultimately need this API and we target both (this API and the "Spark-based" implementation) for Spark 3.0, what difference does it make which comes first? Sorry for the question; I don't know the roadmap for the 3.0 release.





[GitHub] [spark] xianyinxin commented on issue #25626: [SPARK-28892][SQL] Add UPDATE support for DataSource V2

2019-09-11 Thread GitBox
xianyinxin commented on issue #25626: [SPARK-28892][SQL] Add UPDATE support for 
DataSource V2
URL: https://github.com/apache/spark/pull/25626#issuecomment-530662601
 
 
   @rdblue, JDBC is the most important use case I've seen for this API. Building this API out for JDBC has not started yet, as the public expressions need to be improved before we can do that.
   
   Could you explain your concern about this API? Do you feel it is not needed, or that we must implement the "row-based" API first? Any comments or suggestions are welcome.





[GitHub] [spark] xianyinxin commented on issue #25626: [SPARK-28892][SQL] Add UPDATE support for DataSource V2

2019-09-11 Thread GitBox
xianyinxin commented on issue #25626: [SPARK-28892][SQL] Add UPDATE support for 
DataSource V2
URL: https://github.com/apache/spark/pull/25626#issuecomment-530425287
 
 
   It seems the test failure is not introduced by this PR.





[GitHub] [spark] xianyinxin commented on issue #25626: [SPARK-28892][SQL] Add UPDATE support for DataSource V2

2019-09-10 Thread GitBox
xianyinxin commented on issue #25626: [SPARK-28892][SQL] Add UPDATE support for 
DataSource V2
URL: https://github.com/apache/spark/pull/25626#issuecomment-529989458
 
 
   Thank you very much @rdblue. Your comments are exactly right: the API here is just a kind of "push-down" API. We also considered a "row-based" API, which might involve a builder, start a Spark job to calculate which rows should be deleted, and issue the deletions from Spark tasks.
   Both the "push-down" API and the "row-based" API are needed, and they serve different scenarios. For some data sources like JDBC, a simple "push-down" API may suffice, but for cases like multi-table delete/update, a "row-based" API is required. Sometimes a "push-down" API is preferable to a "row-based" one even when both can do the work: with JDBC, pushing down the predicates (and the updated values) performs much better than deleting/updating the data row by row.
   Since the design of the "push-down" API is simpler than that of the "row-based" API, we propose this API first. Later we can consider adding a "row-based" API in `SupportsDelete`/`SupportsUpdate`.
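   To sketch why push-down pays off for JDBC (a hypothetical helper, not Spark code; all names here are made up), the assignments and predicate collapse into a single statement that the database executes, rather than one round trip per row:

```python
# Toy sketch (not Spark code) of the JDBC push-down path: render the
# assignment map and a predicate string into ONE UPDATE statement that
# the remote database runs, instead of Spark reading the rows back and
# updating them one by one.
def pushdown_update_sql(table, sets, predicate):
    """sets maps a column name to a SQL expression string."""
    assignments = ", ".join(f"{col} = {expr}" for col, expr in sets.items())
    return f"UPDATE {table} SET {assignments} WHERE {predicate}"

sql = pushdown_update_sql("tbl", {"a": "a + 1"}, "id > 10")
print(sql)  # UPDATE tbl SET a = a + 1 WHERE id > 10
```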





[GitHub] [spark] xianyinxin commented on issue #25626: [SPARK-28892][SQL] Add UPDATE support for DataSource V2

2019-09-04 Thread GitBox
xianyinxin commented on issue #25626: [SPARK-28892][SQL] Add UPDATE support for 
DataSource V2
URL: https://github.com/apache/spark/pull/25626#issuecomment-527759665
 
 
   > @xianyinxin, can you explain the required semantics for your proposed API?
   > 
   > ```
   > void updateWhere(Map sets, Filter[] filters);
   > ```
   > 
   > It isn't clear what the requirement for a source would be.
   > 
   > In addition, `Expression` is internal to catalyst and should be removed 
from the API.
   
   Thank you @rdblue. Here `Expression` is the public data source expression, `org.apache.spark.sql.catalog.v2.expressions.Expression`. The data source needs to know the value each field should be updated to, so the key of `sets` specifies the field to be updated and the value of `sets` is the new value. The value can be an expression, not just a literal, IMHO, as in `UPDATE tbl SET a = a + 1 WHERE ...`.
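   To make the intended semantics concrete, here is a minimal in-memory sketch (not the real API; the helper and its names are hypothetical) of how `updateWhere`-style assignments and filters could behave, including an expression value like `a = a + 1`:

```python
# Toy illustration (NOT the proposed Java interface) of updateWhere
# semantics: each key of `sets` names a field, each value is an
# expression evaluated against the current row, and only rows matching
# all filters are updated.
def update_where(rows, sets, filters):
    for row in rows:
        if all(f(row) for f in filters):
            # Evaluate every assignment against the OLD row first, so
            # "SET a = b, b = a"-style updates see consistent inputs.
            new_values = {col: expr(row) for col, expr in sets.items()}
            row.update(new_values)
    return rows

rows = [{"id": 1, "a": 1}, {"id": 2, "a": 5}]
update_where(rows,
             sets={"a": lambda r: r["a"] + 1},   # SET a = a + 1
             filters=[lambda r: r["id"] == 2])   # WHERE id = 2
print(rows)  # [{'id': 1, 'a': 1}, {'id': 2, 'a': 6}]
```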





[GitHub] [spark] xianyinxin commented on issue #25626: [SPARK-28892][SQL] Add UPDATE support for DataSource V2

2019-09-02 Thread GitBox
xianyinxin commented on issue #25626: [SPARK-28892][SQL] Add UPDATE support for 
DataSource V2
URL: https://github.com/apache/spark/pull/25626#issuecomment-527123585
 
 
   > The test coverage is not good enough. For example, all the test cases are 
just updating a single column? Try to check the test cases in the other open 
source databases?
   
   Hi @gatorsmile, I added more test cases for UPDATE, including multi-field updates, nested-field updates, etc. Please review.
   I also checked the test cases in PostgreSQL. There are many cases for other syntax, such as updating via a sub-query and updating multiple fields with one assignment, which are not currently supported by Spark SQL.
   As for updating a field with an expression, we plan to add more public expressions for data sources, after which we can support that. I will add more test cases then.
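   As a rough illustration of the kinds of cases involved (a toy helper, not the PR's actual test code; the dotted-path convention is invented here), multi-field and nested-field updates can be modeled like this:

```python
# Hypothetical sketch of the UPDATE test cases discussed: multi-field
# and nested-field assignments, applied to one in-memory row. A dotted
# path like "address.city" stands in for a nested-field reference.
def apply_update(row, assignments):
    for path, value in assignments.items():
        *parents, leaf = path.split(".")
        target = row
        for p in parents:          # walk down to the enclosing struct
            target = target[p]
        target[leaf] = value       # assign the leaf field
    return row

row = {"name": "x", "age": 1, "address": {"city": "a", "zip": "0"}}
apply_update(row, {"age": 2, "address.city": "b"})  # multi-field + nested
print(row)  # {'name': 'x', 'age': 2, 'address': {'city': 'b', 'zip': '0'}}
```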





[GitHub] [spark] xianyinxin commented on issue #25626: [SPARK-28892][SQL] Add UPDATE support for DataSource V2

2019-08-29 Thread GitBox
xianyinxin commented on issue #25626: [SPARK-28892][SQL] Add UPDATE support for 
DataSource V2
URL: https://github.com/apache/spark/pull/25626#issuecomment-526436021
 
 
   cc @cloud-fan @rdblue 

