[ https://issues.apache.org/jira/browse/HUDI-6144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720215#comment-17720215 ]
xi chaomin commented on HUDI-6144:
----------------------------------

Currently bucket index doesn't support bulk insert. This may be solved by HUDI-5994.

> [Spark][Flink]bucket index and then insert data in bulk, the correct file cannot be created
> -------------------------------------------------------------------------------------------
>
>                 Key: HUDI-6144
>                 URL: https://issues.apache.org/jira/browse/HUDI-6144
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: flink-sql, spark-sql
>    Affects Versions: 0.14.0
>            Reporter: lizhiqiang
>            Priority: Blocker
>             Fix For: 0.14.0, 1.0.0
>
>         Attachments: image-2023-04-27-14-49-12-731.png
>
>
> When I use bucket index and then insert data in bulk, the correct file cannot be created, and the prefix of the file cannot be replaced with the bucket ID.
> I have an idea
> 1. When creating a table, all files are created, and the number of files is equal to the number of buckets. And replace the prefix of the file with the bucket id.
> 2. Build a hash table in memory, the key of this hash table corresponds to the bucket ID, and maps to the path of the file, the value is cached in the hash table first, and when the configured threshold is reached, you can flush the key mapped file.
> 3. This part of the value below the hash table can be sorted in memory first.
>
> 1. create table and insert data
>
> {code:java}
> create table xxx.B (
>   id int,
>   name string,
>   price double,
>   ts long,
>   dt string
> ) using hudi
> tblproperties (
>   type = 'mor',
>   primaryKey = 'id',
>   preCombineField = 'ts',
>   hoodie.index.type = 'BUCKET',
>   hoodie.bucket.index.num.buckets = '4'
> );
>
> insert into xxx.B values (5, 'a', 35, 1000, '2021-01-05');{code}
>
> 2. Insert data at the same time as creating a table, the default is bulk insert
>
> {code:java}
> -- create table and insert some data.
> create table xxx.A using hudi
> tblproperties (
>   type = 'mor',
>   primaryKey = 'id',
>   preCombineField = 'ts',
>   hoodie.index.type = 'BUCKET',
>   hoodie.sql.bulk.insert.enable = 'false',
>   hoodie.datasource.write.operation = 'upsert',
>   hoodie.bucket.index.num.buckets = '4'
> ) as select id,name,price,ts,dt from xxx.B;{code}
> -- default is bulk insert.
>
> 3. the prefix of the file cannot be replaced with the bucket ID
>
> !image-2023-04-27-14-49-12-731.png!
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
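
The buffering idea in the quoted report (pre-assign one file per bucket, keep a bucket-ID-keyed hash table in memory, sort and flush once a threshold is reached) can be sketched roughly as below. This is only an illustration, not Hudi code: `zlib.crc32` stands in for whatever hash the bucket index really uses, the 8-digit zero-padded prefix is an assumption about the file naming, and `BucketBuffer`, `FLUSH_THRESHOLD` and the in-memory `flushed` dict (instead of real files) are hypothetical names made up for the example.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4       # mirrors hoodie.bucket.index.num.buckets = '4' from the report
FLUSH_THRESHOLD = 2   # hypothetical per-bucket buffer limit


def bucket_id(record_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a record key to a bucket. crc32 is a stand-in, not Hudi's hash."""
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def file_prefix(bucket: int) -> str:
    """Zero-padded bucket ID used as the file-name prefix (width 8 is assumed)."""
    return f"{bucket:08d}"


class BucketBuffer:
    """Hash table keyed by bucket ID; rows are sorted and flushed per bucket."""

    def __init__(self, num_buckets: int = NUM_BUCKETS,
                 threshold: int = FLUSH_THRESHOLD) -> None:
        self.num_buckets = num_buckets
        self.threshold = threshold
        self.buffers = defaultdict(list)   # bucket ID -> rows not yet flushed
        self.flushed = defaultdict(list)   # file prefix -> list of sorted batches

    def add(self, row: dict) -> None:
        b = bucket_id(str(row["id"]), self.num_buckets)
        self.buffers[b].append(row)
        # Step 2 of the idea: flush once the configured threshold is reached.
        if len(self.buffers[b]) >= self.threshold:
            self.flush(b)

    def flush(self, bucket: int) -> None:
        rows = self.buffers.pop(bucket, [])
        if rows:
            # Step 3 of the idea: sort in memory before writing the batch out.
            batch = sorted(rows, key=lambda r: r["id"])
            self.flushed[file_prefix(bucket)].append(batch)

    def close(self) -> None:
        # Flush whatever is still buffered below the threshold.
        for b in list(self.buffers):
            self.flush(b)


if __name__ == "__main__":
    buf = BucketBuffer()
    for i in (5, 3, 9, 1, 7, 2):
        buf.add({"id": i, "name": "a", "price": 35.0, "ts": 1000, "dt": "2021-01-05"})
    buf.close()
    for prefix, batches in sorted(buf.flushed.items()):
        print(prefix, [[r["id"] for r in batch] for batch in batches])
```

Every batch ends up under the prefix of the bucket its keys hash to, which is the property the report says is missing on the bulk-insert path. A real writer would append each batch to the file for that bucket rather than to a dict.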