hudi-bot opened a new issue, #14518:
URL: https://github.com/apache/hudi/issues/14518
As we know, Hudi uses the Spark datasource API to upsert data. For example,
if we want to update a row, we need to fetch the old row's data first and
then use the upsert method to update it.
But there is another situation where someone just wants to update one column.
Described in SQL, it is `update table set col1 = X where col2 = Y`. This is
something Hudi cannot handle directly at present; we can only read all the
data involved as a dataset first and then merge it.
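To make the cost of the current workaround concrete, here is a minimal sketch of the read-modify-upsert flow in plain Python over dicts. The function name and data shape are illustrative only, not real Hudi APIs: the point is that every matching record must be materialized in full even though only one column changes.

```python
# Toy model of the current workaround: to run
# "update table set col1 = X where col2 = Y" we must first read every
# matching row in full, rewrite the one column, and upsert the whole
# rows back. Names here are illustrative, not Hudi APIs.
def update_via_upsert(rows, set_col, set_val, where_col, where_val):
    """Return the full rows that would need to be upserted."""
    to_upsert = []
    for row in rows:
        if row.get(where_col) == where_val:
            updated = dict(row)  # the whole record, not just one column
            updated[set_col] = set_val
            to_upsert.append(updated)
    return to_upsert

table = [
    {"id": 1, "col1": "a", "col2": "Y"},
    {"id": 2, "col1": "b", "col2": "N"},
]
# update table set col1 = 'X' where col2 = 'Y'
changed = update_via_upsert(table, "col1", "X", "col2", "Y")
```

Note that `changed` carries entire records back to the writer; a SQL-style update API would hide this read-merge-write cycle from the user.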
So I think we could create a new subproject that processes batch data in a
SQL-like way. For example:
```scala
val hudiTable = new HudiTable(path)
hudiTable.update.set("col1 = X").where("col2 = Y")
hudiTable.delete.where("col3 = Z")
hudiTable.commit
```
We might also extend this functionality to support JDBC-like schemes such as
the incremental puller RFC:
[https://cwiki.apache.org/confluence/display/HUDI/RFC+-+14+%3A+JDBC+incremental+puller]
I hope everyone can provide suggestions on whether this plan is feasible.
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-481
- Type: Improvement
- Epic: https://issues.apache.org/jira/browse/HUDI-4141
---
## Comments
**vinoth (30/Dec/19 18:17):** I am not sure if `CLI` is the right component
for this. A few questions before I can triage it:
* Is this intended to be a Spark API? We have thought about adding support
in Spark SQL to specify the merge logic instead of the HoodieRecordPayload
interface; this sounds similar.
* I think we need to move towards the Spark Datasource V2 API first, and
then rethink how this will fit into HUDI-30.
---
**chenxiang (07/Jan/20 02:57):** [~vinoth]
I checked the Spark project. It seems that the Spark SQL syntax tree only
supports the *DELETE* keyword at present; *UPDATE* and *MERGE* are not
supported yet. I think this may be because Spark's design centers on
relationships between datasets, and the existing operators can solve
similar problems, just not in a SQL-like way.
My current idea is to build a SQL syntax layer on top of *hudi-core*, using
antlr4 to process the semantics. For example, an UPDATE statement can be
parsed into first filtering the data according to the WHERE conditions and
then upserting the result into Hudi.
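The decomposition described above (UPDATE → filter + upsert) can be sketched with a toy parser. A real implementation would use an ANTLR4 grammar as proposed; the regex and plan shape below are only stand-ins to show the intended split between the WHERE clause (the read filter) and the SET clause (applied before the upsert).

```python
import re

# Toy sketch of the proposed SQL layer: parse an UPDATE statement into a
# (filter, assignment) plan that could be executed as "filter the dataset
# by the WHERE clause, rewrite the SET column, upsert the result".
# A real implementation would use an ANTLR4 grammar; this regex only
# handles the single-assignment, single-predicate case for illustration.
UPDATE_RE = re.compile(
    r"update\s+(\w+)\s+set\s+(\w+)\s*=\s*(\S+)\s+where\s+(\w+)\s*=\s*(\S+)",
    re.IGNORECASE,
)

def parse_update(sql):
    m = UPDATE_RE.fullmatch(sql.strip())
    if m is None:
        raise ValueError(f"unsupported statement: {sql!r}")
    table, set_col, set_val, where_col, where_val = m.groups()
    return {
        "table": table,
        "set": (set_col, set_val),        # applied after filtering
        "where": (where_col, where_val),  # pushed down as the read filter
    }

plan = parse_update("update t set col1 = X where col2 = Y")
```

The plan dict mirrors the two-phase execution chenxiang describes: first read only the rows matching `where`, then apply `set` and hand the rows to the upsert path.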
---
**vinoth (08/Jan/20 02:28):** Hi [~chenxiang],
[https://github.com/apache/spark/blob/master/docs/sql-keywords.md] does list
the DELETE and UPDATE keywords in the language itself. I think it's up to
the datasource to implement this. Can we consider this once we move to
Datasource V2 first? Isn't that a prerequisite for this?
---
**chenxiang (09/Jan/20 01:11):** [~vinoth]
Oh, I've seen in
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4]
that Spark can recognize these keywords.
I have built a project to try whether it is feasible first. I don't know
whether it is related to V1 or V2. If I have a result, I will let you know
as soon as possible.
---
**chenxiang (10/Jan/20 00:39):** [~vinoth]
I opened Spark's GitHub again this morning and suddenly realized that
yesterday I was looking at the master branch (Spark 3.0). When I switched
to version 2.4, there were no *UPDATE* or *MERGE* keywords. This shows that
Spark does not support these keywords in version 2.4, which may be a
problem.
---
**vinoth (12/Jan/20 07:09):** I see. So if the future 3.x versions will
have it, that's fine, right? We can just build based off that?
---
**chenxiang (14/Oct/20 00:35):** I have created a GitHub project at
https://github.com/shangyuantech/hudi-sql . Later, I can show some design
ideas and usage scenarios in this project.
---
**x1q1j1 (14/Oct/20 02:13):** Hi [~chenxiang], there is a part of the
syntax that needs to be extended. We can further refine it. Compare with
https://docs.delta.io/latest/delta-update.html#update-a-table.
---
**309637554 (14/Oct/20 03:40):** [~chenxiang] [~x1q1j1] Hi, I also have a
plan about this: https://issues.apache.org/jira/browse/HUDI-1341. We can
discuss it often :D
---
**chenxiang (14/Oct/20 05:19):** [~309637554] Glad to see your attention.
I've added a *relates to* link to HUDI-1341.
---
**vinoth (20/Oct/20 22:06):** > If we use a sql to describe, it is `update
table set col1 = X where col2 = Y`. This is something hudi cannot deal with
directly at present, we can only get all the data involved as a dataset
first and then merge it.

I don't think we can avoid getting the dataset first, i.e. reading the
older parquet file to merge the record. In fact, I would argue that Hudi
uniquely lets you deal with a single-column update scenario now, by
allowing custom payloads to specify the merging: the base file can contain
the entire record, the log can contain just the updated column value, and
we will be able to merge them.
What we are missing is the SQL support for merges, which we should build
out under HUDI-1297's scope. wdyt?
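The base-plus-log merge vinoth describes can be sketched as a simple overlay: the base file holds the full record, the log holds only the changed column(s), and a custom payload combines them at read or compaction time. This dict-based version is illustrative only; real Hudi payloads implement the `HoodieRecordPayload` interface in Java.

```python
# Sketch of the merge described above: the base record is complete, the
# log record carries only the updated column(s), and merging overlays the
# log on top of the base. Illustrative stand-in for a custom
# HoodieRecordPayload, not Hudi code.
def merge_partial(base, partial):
    """Overlay only the columns present in the partial (log) record."""
    merged = dict(base)
    for col, val in partial.items():
        merged[col] = val
    return merged

base = {"id": 1, "col1": "old", "col2": "Y", "col3": "z"}
log = {"id": 1, "col1": "X"}  # only the updated column was written
merged = merge_partial(base, log)
```

Because the log entry never needs the untouched columns, a single-column update writes only the changed value, which is exactly the advantage over rewriting whole records.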
---
**309637554 (21/Oct/20 14:32):** [~vinoth] I agree with you.
1. At present we cannot avoid getting the dataset first. I agree that the
log can contain just the updated column value and we will be able to merge
it. If we have column statistics or a clustering index like z-ordering,
this scenario can be optimized.
2. I see Hudi's Spark 3.0 support will land soon. We can build the SQL API
on the Spark DataSource V2 API, under HUDI-1297's scope.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.