[jira] [Commented] (HUDI-6144) [Spark][Flink]bucket index and then insert data in bulk, the correct file cannot be created

xi chaomin (Jira) Sat, 06 May 2023 06:23:04 -0700


    [ 
https://issues.apache.org/jira/browse/HUDI-6144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720215#comment-17720215
 ]


xi chaomin commented on HUDI-6144:
----------------------------------

Currently, the bucket index does not support bulk insert. This may be resolved 
by HUDI-5994.

> [Spark][Flink]bucket index and then insert data in bulk, the correct file 
> cannot be created
> -------------------------------------------------------------------------------------------
>
>                 Key: HUDI-6144
>                 URL: https://issues.apache.org/jira/browse/HUDI-6144
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: flink-sql, spark-sql
>    Affects Versions: 0.14.0
>            Reporter: lizhiqiang
>            Priority: Blocker
>             Fix For: 0.14.0, 1.0.0
>
>         Attachments: image-2023-04-27-14-49-12-731.png
>
>
> When I use the bucket index and then bulk insert data, the correct files are 
> not created: the file name prefix is not replaced with the bucket ID.
> I have an idea:
> 1. When creating a table, create all files up front, so that the number of 
> files equals the number of buckets, and replace the prefix of each file name 
> with the bucket ID. 
> 2. Build a hash table in memory whose keys are bucket IDs, each mapped to the 
> path of the corresponding file. Values are cached in the hash table first, 
> and when the configured threshold is reached, they are flushed to the file 
> mapped to that key. 
> 3. The values buffered under each key of the hash table can be sorted in 
> memory before they are flushed.
> 1. Create a table and insert data:
>  
> {code:sql}
> create table xxx.B (
>   id int,
>   name string,
>   price double,
>   ts long,
>   dt string
> ) using hudi
> tblproperties (
>   type = 'mor',
>   primaryKey = 'id',
>   preCombineField = 'ts',
>   hoodie.index.type = 'BUCKET',
>   hoodie.bucket.index.num.buckets = '4'
> );
>  
> insert into xxx.B values (5, 'a', 35, 1000, '2021-01-05');
> {code}
> 2. Create a table and insert data in the same statement; the default 
> operation is bulk insert:
> {code:sql}
> -- create table and insert some data.
> create table xxx.A using hudi
> tblproperties (
>   type = 'mor',
>   primaryKey = 'id',
>   preCombineField = 'ts',
>   hoodie.index.type = 'BUCKET',
>   hoodie.sql.bulk.insert.enable = 'false',
>   hoodie.datasource.write.operation = 'upsert',
>   hoodie.bucket.index.num.buckets = '4'
> ) as select id, name, price, ts, dt from xxx.B;
> {code}
> The default is bulk insert.
> 3. The file name prefix is not replaced with the bucket ID:
> !image-2023-04-27-14-49-12-731.png!
>  
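
The buffering scheme proposed in the quoted report (pre-create one file per bucket, buffer records per bucket ID in a hash table, then sort and flush at a threshold) might be sketched roughly as follows. This is a toy in-memory model in Python, not Hudi code: `BucketBuffer`, `bucket_id`, `file_prefix`, the CRC32 hash, the 8-digit zero-padded prefix, and the threshold value are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4       # same value as hoodie.bucket.index.num.buckets in the report
FLUSH_THRESHOLD = 3   # hypothetical per-bucket record limit before a flush


def bucket_id(record_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a record key to a bucket. CRC32 is a stand-in hash (assumption)."""
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def file_prefix(bucket: int) -> str:
    """Bucket ID rendered as a file name prefix (width 8 is an assumption)."""
    return f"{bucket:08d}"


class BucketBuffer:
    """Toy model of the proposal; lists stand in for files on storage."""

    def __init__(self, num_buckets=NUM_BUCKETS, threshold=FLUSH_THRESHOLD):
        self.num_buckets = num_buckets
        self.threshold = threshold
        # Step 1: one "file" per bucket, created up front and keyed by prefix.
        self.files = {file_prefix(b): [] for b in range(num_buckets)}
        # Step 2: hash table from bucket ID to records not yet flushed.
        self.pending = defaultdict(list)

    def write(self, record: dict) -> None:
        b = bucket_id(str(record["id"]), self.num_buckets)
        self.pending[b].append(record)
        if len(self.pending[b]) >= self.threshold:
            self.flush(b)

    def flush(self, b: int) -> None:
        # Step 3: sort the buffered batch in memory before writing it out.
        rows = sorted(self.pending.pop(b, []), key=lambda r: r["id"])
        self.files[file_prefix(b)].extend(rows)

    def close(self) -> None:
        # Flush whatever is still buffered, bucket by bucket.
        for b in list(self.pending):
            self.flush(b)
```

Note that sorting happens per flushed batch, so each file is ordered only within a batch; a real writer would need a merge step, or a single flush per bucket, to produce a fully sorted file.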



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
