[jira] [Updated] (HUDI-6144) [Spark][Flink]bucket index and then insert data in bulk, the correct file cannot be created
[ https://issues.apache.org/jira/browse/HUDI-6144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-6144:
    Labels: pull-request-available  (was: )

             Key: HUDI-6144
             URL: https://issues.apache.org/jira/browse/HUDI-6144
         Project: Apache Hudi
      Issue Type: Bug
      Components: flink-sql, spark-sql
Affects Versions: 0.14.0
        Reporter: lizhiqiang
        Priority: Blocker
          Labels: pull-request-available
         Fix For: 0.14.0
     Attachments: image-2023-04-27-14-49-12-731.png

When I use a bucket index and then bulk insert data, the correct files cannot be created: the file name prefix is not replaced with the bucket ID.

I have an idea:
1. When creating the table, create all files up front, so that the number of files equals the number of buckets, and use the bucket ID as each file name's prefix.
2. Build a hash table in memory whose keys correspond to bucket IDs and map to file paths. Values are cached in the hash table first; when the configured threshold is reached, flush them to the file mapped by the key.
3. The values under each key can be sorted in memory before flushing.

Steps to reproduce:

1. Create a table and insert data:

{code:sql}
create table xxx.B (
  id int,
  name string,
  price double,
  ts long,
  dt string
) using hudi
tblproperties (
  type = 'mor',
  primaryKey = 'id',
  preCombineField = 'ts',
  hoodie.index.type = 'BUCKET',
  hoodie.bucket.index.num.buckets = '4'
);

insert into xxx.B values (5, 'a', 35, 1000, '2021-01-05');
{code}

2. Insert data at the same time as creating the table (CTAS); the default write operation is bulk insert:

{code:sql}
-- create table and insert some data; the default is bulk insert.
create table xxx.A using hudi
tblproperties (
  type = 'mor',
  primaryKey = 'id',
  preCombineField = 'ts',
  hoodie.index.type = 'BUCKET',
  hoodie.sql.bulk.insert.enable = 'false',
  hoodie.datasource.write.operation = 'upsert',
  hoodie.bucket.index.num.buckets = '4'
) as select id, name, price, ts, dt from xxx.B;
{code}

3. The prefix of the file is not replaced with the bucket ID:

!image-2023-04-27-14-49-12-731.png!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
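The three-step proposal above can be sketched roughly as follows. This is an illustrative sketch only, not Hudi code: the class and method names (BucketBuffer, fileNameFor, flushThreshold) are hypothetical, record keys stand in for full records, and an in-memory map stands in for the actual per-bucket file writes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the proposal: buffer records per bucket in an in-memory
// hash table keyed by bucket ID, flush a bucket once a configured
// threshold is reached, sorting the buffered values before the flush.
class BucketBuffer {
    private final int numBuckets;
    private final int flushThreshold;
    // bucket ID -> buffered record keys (stand-ins for full records)
    private final Map<Integer, List<String>> buffers = new HashMap<>();
    // bucket ID -> records flushed so far (simulates the per-bucket file)
    private final Map<Integer, List<String>> flushed = new HashMap<>();

    BucketBuffer(int numBuckets, int flushThreshold) {
        this.numBuckets = numBuckets;
        this.flushThreshold = flushThreshold;
    }

    // Step 1: one file per bucket, with a zero-padded bucket ID as the
    // file name prefix (the naming the reporter wants; format assumed).
    String fileNameFor(int bucketId) {
        return String.format("%08d-bucket.parquet", bucketId);
    }

    // Bucket assignment: hash of the record key modulo the bucket count.
    int bucketFor(String recordKey) {
        return Math.floorMod(recordKey.hashCode(), numBuckets);
    }

    // Step 2: cache the record under its bucket ID; flush when the
    // configured threshold is reached.
    void write(String recordKey) {
        int bucket = bucketFor(recordKey);
        List<String> buf = buffers.computeIfAbsent(bucket, b -> new ArrayList<>());
        buf.add(recordKey);
        if (buf.size() >= flushThreshold) {
            flush(bucket);
        }
    }

    // Step 3: sort the buffered values in memory before writing them out.
    void flush(int bucket) {
        List<String> buf = buffers.remove(bucket);
        if (buf == null || buf.isEmpty()) {
            return;
        }
        Collections.sort(buf);
        flushed.computeIfAbsent(bucket, b -> new ArrayList<>()).addAll(buf);
    }

    List<String> flushedRecords(int bucket) {
        return flushed.getOrDefault(bucket, Collections.emptyList());
    }
}
```

Because every record key deterministically maps to a bucket, two records with the same key always land in the same buffer, and each bucket's file receives only the records hashed to it.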
Vinoth Chandar updated HUDI-6144:
    Fix Version/s: (was: 1.0.0)
Ethan Guo updated HUDI-6144:
    Fix Version/s: 0.14.0, 1.0.0
Ethan Guo updated HUDI-6144:
    Priority: Blocker  (was: Major)
lizhiqiang updated HUDI-6144:
    Affects Version/s: 0.14.0
lizhiqiang updated HUDI-6144:
    Component/s: flink-sql, spark-sql
lizhiqiang updated HUDI-6144:
    Description: updated (code blocks converted from markdown fences to {code} macros)