hudi-bot opened a new issue, #14518:
URL: https://github.com/apache/hudi/issues/14518

   As we know, Hudi uses the Spark DataSource API to upsert data. For example, 
to update a row we first need to read the old row's data and then write it 
back through the upsert method.
   But there is another situation where someone just wants to update a single 
column. Expressed as SQL it would be {{update table set col1 = X where col2 = 
Y}}. Hudi cannot handle this directly at present; we can only read all the 
affected data as a dataset first and then merge it.
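   Concretely, today's workaround looks roughly like the following sketch 
(hedged: the table path, table name, and column names are made up, and the 
exact write options vary by Hudi version):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;

public class SingleColumnUpdate {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("single-column-update").getOrCreate();
    String basePath = "/tmp/hudi_table"; // hypothetical table location

    // 1. Read back the full rows matched by the WHERE clause (col2 = Y).
    Dataset<Row> affected = spark.read().format("hudi").load(basePath)
        .filter(col("col2").equalTo("Y"));

    // 2. Rewrite only the target column; every other column is carried along.
    Dataset<Row> updated = affected.withColumn("col1", lit("X"));

    // 3. Upsert the full rows back; Hudi merges them by record key.
    updated.write().format("hudi")
        .option("hoodie.datasource.write.operation", "upsert")
        .option("hoodie.table.name", "hudi_table") // hypothetical table name
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
```

   Note that the entire affected rows travel through the pipeline even though 
only {{col1}} changes, which is exactly the overhead an SQL-like front end 
would hide.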
   So I think we could create a new subproject that processes batch data in an 
SQL-like way. For example:
   
    {code}
   val hudiTable = new HudiTable(path)
   hudiTable.update.set("col1 = X").where("col2 = Y")
   hudiTable.delete.where("col3 = Z")
   hudiTable.commit
   {code}
   
   It could also be extended to support JDBC-like schemes such as this RFC: 
[https://cwiki.apache.org/confluence/display/HUDI/RFC+-+14+%3A+JDBC+incremental+puller]
   
   I hope everyone can offer some suggestions on whether this plan is feasible.
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-481
   - Type: Improvement
   - Epic: https://issues.apache.org/jira/browse/HUDI-4141
   
   
   ---
   
   
   ## Comments
   
   30/Dec/19 18:17;vinoth;I am not sure if `CLI` is the right component for 
this. A few questions before I can triage this:
   
    * Is this intended to be a Spark API? We have thought about adding support 
in Spark SQL to specify the merge logic instead of the HoodieRecordPayload 
interface; this sounds similar. 
    * I think we need to move towards the Spark DataSource V2 API first, and 
then rethink how this will fit in HUDI-30;;;
   
   ---
   
   07/Jan/20 02:57;chenxiang;[~vinoth]
   I checked the Spark project. It seems that the Spark SQL syntax tree only 
supports the *DELETE* keyword at present; *UPDATE* and *MERGE* are not 
supported yet. I think this may be because Spark is designed to deal with the 
relationship between one dataset and another: the existing operators can solve 
similar problems, but not in an SQL-like way.
   My current idea is to build a SQL syntax layer on top of *hudi-core* and 
use antlr4 to process the semantics. For example, an update statement can be 
parsed into first filtering the data by the where condition and then upserting 
the result into Hudi.;;;
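   The parsing step described here can be sketched without any grammar 
machinery. A real implementation would use an ANTLR4 grammar, but this 
regex-based toy (all class and field names are invented) shows the intended 
plan shape: an UPDATE statement reduced to a table, a set of assignments, and 
a predicate to filter by before upserting.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy parse result: "filter rows by `predicate`, rewrite the columns in
// `assignments`, then upsert the result back into `table`".
class UpdateStatement {
  final String table;
  final Map<String, String> assignments;
  final String predicate;

  private static final Pattern UPDATE = Pattern.compile(
      "(?i)\\s*update\\s+(\\w+)\\s+set\\s+(.+?)\\s+where\\s+(.+?)\\s*");

  UpdateStatement(String table, Map<String, String> assignments, String predicate) {
    this.table = table;
    this.assignments = assignments;
    this.predicate = predicate;
  }

  static UpdateStatement parse(String sql) {
    Matcher m = UPDATE.matcher(sql);
    if (!m.matches()) {
      throw new IllegalArgumentException("not an UPDATE statement: " + sql);
    }
    // Split "col1 = X, col2 = Z" into an ordered column -> expression map.
    Map<String, String> assignments = new LinkedHashMap<>();
    for (String kv : m.group(2).split(",")) {
      String[] parts = kv.split("=", 2);
      assignments.put(parts[0].trim(), parts[1].trim());
    }
    return new UpdateStatement(m.group(1), assignments, m.group(3).trim());
  }
}
```

   A runtime would then translate the plan into the read-filter-rewrite-upsert 
sequence discussed earlier in this issue.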
   
   ---
   
   08/Jan/20 02:28;vinoth;Hi [~chenxiang], 
[https://github.com/apache/spark/blob/master/docs/sql-keywords.md] does list 
the DELETE and UPDATE keywords in the language itself. I think it's up to the 
datasource to implement them. Can we consider this once we move to DataSource 
V2 first? Isn't that a prerequisite for this?;;;
   
   ---
   
   09/Jan/20 01:11;chenxiang;[~vinoth]
   
   Oh, I've seen in 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4]
   that Spark can recognize these keywords.
   
   I have built a project to try this out and see whether it is feasible. I 
don't know yet whether it depends on V1 or V2. Once I have a result, I will 
let you know as soon as possible.;;;
   
   ---
   
   10/Jan/20 00:39;chenxiang;[~vinoth]
   
   I opened Spark's GitHub again this morning and realized that yesterday I 
was looking at the master branch (Spark 3.0). When I switched to version 2.4, 
there were no *UPDATE* or *MERGE* keywords. This means Spark does not support 
these keywords in version 2.4, which may be a problem.;;;
   
   ---
   
   12/Jan/20 07:09;vinoth;I see. So if the future 3.x versions will have it, 
it's fine, right? We can just build based on that?;;;
   
   ---
   
   14/Oct/20 00:35;chenxiang;I have created a GitHub project at 
https://github.com/shangyuantech/hudi-sql . Later, I can show some design 
ideas and usage scenarios in this project.;;;
   
   ---
   
   14/Oct/20 02:13;x1q1j1;hi [~chenxiang], there is a part of the syntax that 
needs to be extended; we can refine it further. Compare 
https://docs.delta.io/latest/delta-update.html#update-a-table.;;;
   
   ---
   
   14/Oct/20 03:40;309637554;[~chenxiang] [~x1q1j1] hi, I also have a plan 
related to this: https://issues.apache.org/jira/browse/HUDI-1341. We can 
discuss it together :D;;;
   
   ---
   
   14/Oct/20 05:19;chenxiang;[~309637554] Glad to see your interest. I've 
added a *relates to* link to HUDI-1341;;;
   
   ---
   
   20/Oct/20 22:06;vinoth;> If we use a sql to describe, it is {{update table 
set col1 = X where col2 = Y}}. This is something hudi cannot deal with 
directly at present, we can only get all the data involved as a dataset first 
and then merge it.
   
   I don't think we can avoid getting the dataset first, i.e. reading the 
older parquet file to merge the record. In fact, I would argue that Hudi 
uniquely lets you deal with the single-column-update scenario today by 
allowing custom payloads to specify the merging: the base file can contain the 
entire record, the log can contain just the updated column value, and we will 
be able to merge the two.
   
   What we are missing is SQL support for merges, which we should build out 
under HUDI-1297's scope. wdyt?;;;
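   The custom-payload merge described above can be sketched as follows. This 
is a hedged illustration, not Hudi's shipped code: the class is invented, and 
the exact HoodieRecordPayload package names and method signatures vary across 
Hudi versions. The incoming log record carries only the changed column; every 
field it leaves null falls back to the stored base-file value.

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.IndexedRecord;
import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
import org.apache.hudi.common.util.Option;

// Hypothetical partial-update payload: merge an incoming record that holds
// only the changed column(s) onto the current base-file record.
public class PartialColumnUpdatePayload extends OverwriteWithLatestAvroPayload {

  public PartialColumnUpdatePayload(GenericRecord record, Comparable orderingVal) {
    super(record, orderingVal);
  }

  @Override
  public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema)
      throws IOException {
    Option<IndexedRecord> incoming = getInsertValue(schema);
    if (!incoming.isPresent()) {
      return Option.empty(); // treat an empty payload as a delete
    }
    // Start from a copy of the stored record, then overlay non-null fields.
    GenericRecord merged = new GenericData.Record((GenericData.Record) currentValue, true);
    GenericRecord update = (GenericRecord) incoming.get();
    for (Schema.Field field : schema.getFields()) {
      Object value = update.get(field.pos());
      if (value != null) { // only the columns the update actually set
        merged.put(field.pos(), value);
      }
    }
    return Option.of(merged);
  }
}
```

   Hudi's built-in {{OverwriteNonDefaultsWithLatestAvroPayload}} follows a 
similar pattern.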
   
   ---
   
   21/Oct/20 14:32;309637554;[~vinoth] I agree with you.
   
   1. At present we cannot avoid getting the dataset first, and I agree that 
the log can contain just the updated column value and we will be able to 
merge it. If we had column statistics or a clustering index like z-ordering, 
this scenario could be optimized.
   
   2. I see that Hudi's Spark 3.0 support will land soon. We can build the 
SQL API (HUDI-1297) on the DataSource V2 API.;;;


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
