[jira] [Updated] (HUDI-7143) schema evolution triggers a CDC query exception
[ https://issues.apache.org/jira/browse/HUDI-7143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

loukey_j updated HUDI-7143:
---------------------------

Description:
{code:sql}
sparkSession.sql("CREATE TABLE if not exists hudi_ut_schema_evolution (id INT, version INT, name STRING, birthDate TIMESTAMP, inc_day STRING) USING HUDI PARTITIONED BY (inc_day) TBLPROPERTIES (hoodie.table.cdc.enabled='true', type='cow', primaryKey='id')");

-- 20231127201042503.commit:
sparkSession.sql("merge into hudi_ut_schema_evolution t using ( select 1 as id, 1 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, '2023-10-01' as inc_day) s on t.id=s.id when matched THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *;");

-- 20231127201113131.commit:
sparkSession.sql("ALTER TABLE hudi_ut_schema_evolution ADD COLUMNS (add1 String AFTER id);");

-- 20231127201124255.commit:
sparkSession.sql("merge into hudi_ut_schema_evolution t using ( select '1' as add1, 2 as id, 1 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, '2023-10-01' as inc_day) s on t.id=s.id when matched THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *;");

-- 20231127201146659.commit:
sparkSession.sql("ALTER TABLE hudi_ut_schema_evolution DROP COLUMN add1");

-- 20231127201157382.commit:
sparkSession.sql("ALTER TABLE hudi_ut_schema_evolution ADD COLUMNS (add1 int)");

-- 20231127201208532.commit:
sparkSession.sql("merge into hudi_ut_schema_evolution t using ( select 1 as add1, 3 as id, 1 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, '2023-10-01' as inc_day) s on t.id=s.id when matched THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *;");

sparkSession.sql("select * from hudi_ut_schema_evolution").show(100, false);

+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+---+-------+-----+-------------------+----+----------+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                         |id |version|name |birthDate          |add1|inc_day   |
+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+---+-------+-----+-------------------+----+----------+
|20231127201042503  |20231127201042503_0_0|1                 |inc_day=2023-10-01    |2fe30d70-daa3-4ebc-8dab-313116e1f8f3-0_0-103-89_20231127201208532.parquet |1  |1      |str_1|2023-01-01 12:12:12|null|2023-10-01|
|20231127201124255  |20231127201124255_0_1|2                 |inc_day=2023-10-01    |2fe30d70-daa3-4ebc-8dab-313116e1f8f3-0_0-103-89_20231127201208532.parquet |2  |1      |str_1|2023-01-01 12:12:12|null|2023-10-01|
|20231127201208532  |20231127201208532_0_2|3                 |inc_day=2023-10-01    |2fe30d70-daa3-4ebc-8dab-313116e1f8f3-0_0-103-89_20231127201208532.parquet |3  |1      |str_1|2023-01-01 12:12:12|1   |2023-10-01|
+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+---+-------+-----+-------------------+----+----------+

sparkSession.sql("select * from hudi_table_changes('hudi_ut_schema_evolution','cdc','20231127201042503','20231127201208532')").show(100, false);

exception:
org.apache.avro.AvroTypeException: Found string, expecting union
  at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:308)
  at org.apache.avro.io.parsing.Parser.advance(Parser.java:86)
  at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:275)
  at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:187)
  at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:160)
  at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:259)
  at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:247)
  at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
  at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:160)
  at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:187)
  at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:160)
  at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:259)
  at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:247)
  at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
  at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:160)
{code}
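The "Found string, expecting union" failure can be reproduced with plain Avro, independent of Hudi. Below is a minimal sketch (class and schema names are illustrative, not Hudi's): data written while add1 was a STRING is resolved against the latest table schema, where the re-added add1 is a nullable INT, which is what a CDC read spanning these commits has to do.

{code:java}
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class CdcSchemaMismatch {
  public static void main(String[] args) throws Exception {
    // schema in force at commit 20231127201124255: add1 is a STRING
    Schema writer = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"rec\",\"fields\":["
        + "{\"name\":\"add1\",\"type\":\"string\"}]}");
    // latest table schema after DROP + re-ADD: add1 is a nullable INT
    Schema reader = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"rec\",\"fields\":["
        + "{\"name\":\"add1\",\"type\":[\"null\",\"int\"],\"default\":null}]}");

    // write a record with the old (string) schema
    GenericRecord rec = new GenericData.Record(writer);
    rec.put("add1", "1");
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(writer).write(rec, enc);
    enc.flush();

    // resolve it against the new (nullable-int) schema
    BinaryDecoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    // throws org.apache.avro.AvroTypeException: Found string, expecting union
    new GenericDatumReader<GenericRecord>(writer, reader).read(null, dec);
  }
}
{code}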
[jira] [Created] (HUDI-7143) schema evolution triggers a CDC query exception
loukey_j created HUDI-7143:
-------------------------------

             Summary: schema evolution triggers a CDC query exception
                 Key: HUDI-7143
                 URL: https://issues.apache.org/jira/browse/HUDI-7143
             Project: Apache Hudi
          Issue Type: Bug
          Components: spark
            Reporter: loukey_j

{code:sql}
sparkSession.sql("CREATE TABLE if not exists hudi_ut_schema_evolution (id INT, version INT, name STRING, birthDate TIMESTAMP, inc_day STRING) USING HUDI PARTITIONED BY (inc_day) TBLPROPERTIES (delta.enableChangeDataFeed='true', hoodie.table.cdc.enabled='true', type='cow', primaryKey='id')");

-- 20231127201042503.commit:
sparkSession.sql("merge into hudi_ut_schema_evolution t using ( select 1 as id, 1 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, '2023-10-01' as inc_day) s on t.id=s.id when matched THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *;");

-- 20231127201113131.commit:
sparkSession.sql("ALTER TABLE hudi_ut_schema_evolution ADD COLUMNS (add1 String AFTER id);");

-- 20231127201124255.commit:
sparkSession.sql("merge into hudi_ut_schema_evolution t using ( select '1' as add1, 2 as id, 1 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, '2023-10-01' as inc_day) s on t.id=s.id when matched THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *;");

-- 20231127201146659.commit:
sparkSession.sql("ALTER TABLE hudi_ut_schema_evolution DROP COLUMN add1");

-- 20231127201157382.commit:
sparkSession.sql("ALTER TABLE hudi_ut_schema_evolution ADD COLUMNS (add1 int)");

-- 20231127201208532.commit:
sparkSession.sql("merge into hudi_ut_schema_evolution t using ( select 1 as add1, 3 as id, 1 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, '2023-10-01' as inc_day) s on t.id=s.id when matched THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *;");

sparkSession.sql("select * from hudi_ut_schema_evolution").show(100, false);

+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+---+-------+-----+-------------------+----+----------+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                         |id |version|name |birthDate          |add1|inc_day   |
+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+---+-------+-----+-------------------+----+----------+
|20231127201042503  |20231127201042503_0_0|1                 |inc_day=2023-10-01    |2fe30d70-daa3-4ebc-8dab-313116e1f8f3-0_0-103-89_20231127201208532.parquet |1  |1      |str_1|2023-01-01 12:12:12|null|2023-10-01|
|20231127201124255  |20231127201124255_0_1|2                 |inc_day=2023-10-01    |2fe30d70-daa3-4ebc-8dab-313116e1f8f3-0_0-103-89_20231127201208532.parquet |2  |1      |str_1|2023-01-01 12:12:12|null|2023-10-01|
|20231127201208532  |20231127201208532_0_2|3                 |inc_day=2023-10-01    |2fe30d70-daa3-4ebc-8dab-313116e1f8f3-0_0-103-89_20231127201208532.parquet |3  |1      |str_1|2023-01-01 12:12:12|1   |2023-10-01|
+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+---+-------+-----+-------------------+----+----------+

sparkSession.sql("select * from hudi_table_changes('hudi_ut_schema_evolution','cdc','20231127201042503','20231127201208532')").show(100, false);

org.apache.avro.AvroTypeException: Found string, expecting union
  at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:308)
  at org.apache.avro.io.parsing.Parser.advance(Parser.java:86)
  at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:275)
  at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:187)
  at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:160)
  at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:259)
  at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:247)
  at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
  at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:160)
  at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:187)
  at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:160)
  at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:259)
  at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:247)
{code}
[jira] [Commented] (HUDI-7131) The requested schema is not compatible with the file schema
[ https://issues.apache.org/jira/browse/HUDI-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17789922#comment-17789922 ]

loukey_j commented on HUDI-7131:
--------------------------------

[~xushiyan] please take a look

> The requested schema is not compatible with the file schema
> ------------------------------------------------------------
>
>                 Key: HUDI-7131
>                 URL: https://issues.apache.org/jira/browse/HUDI-7131
>             Project: Apache Hudi
>          Issue Type: Bug
>    Affects Versions: 0.14.0
>            Reporter: loukey_j
>            Priority: Critical
>              Labels: core, merge, spark
>
> Using a global index, when a record's partition changes, the merge reports an error: "The requested schema is not compatible with the file schema...". Why not use the schema from org.apache.hudi.common.table.TableSchemaResolver#getTableAvroSchemaInternal to read hudi data?
>
> CREATE TABLE if not exists unisql.hudi_ut_time_traval
> (id INT, version INT, name STRING, birthDate TIMESTAMP, inc_day STRING) USING HUDI
> PARTITIONED BY (inc_day) TBLPROPERTIES (type='cow', primaryKey='id');
>
> insert into unisql.hudi_ut_time_traval
> select 1 as id, 1 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, cast('2023-10-01' as date) as inc_day;
>
> select * from hudi_ut_time_traval;
> +-------------------+---------------------+------------------+----------------------+-------------------------------------------------------------------------+---+-------+-----+-------------------+----------+
> |_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                        |id |version|name |birthDate          |inc_day   |
> +-------------------+---------------------+------------------+----------------------+-------------------------------------------------------------------------+---+-------+-----+-------------------+----------+
> |20231122100234339  |20231122100234339_0_0|1                 |inc_day=2023-10-01    |8a510742-c060-4d12-898e-70bbd122f2e3-0_0-19-16_20231122100234339.parquet |1  |1      |str_1|2023-01-01 12:12:12|2023-10-01|
> +-------------------+---------------------+------------------+----------------------+-------------------------------------------------------------------------+---+-------+-----+-------------------+----------+
>
> merge into hudi_ut_time_traval t using (
> select 1 as id, 2 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, cast('2023-10-02' as date) as inc_day
> ) s on t.id=s.id when matched THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *
>
> Caused by: org.apache.parquet.io.ParquetDecodingException: The requested schema is not compatible with the file schema. incompatible types: required int32 id != optional int32 id
>   at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:101)
>   at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:81)
>   at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:57)
>   at org.apache.parquet.schema.MessageType.accept(MessageType.java:55)
>   at org.apache.parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:162)
>   at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:135)
>   at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:225);
>
> parquet schema:
> {
>   "type" : "record",
>   "name" : "hudi_ut_time_traval_record",
>   "namespace" : "hoodie.hudi_ut_time_traval",
>   "fields" : [
>     { "name" : "_hoodie_commit_time", "type" : [ "null", "string" ], "doc" : "", "default" : null },
>     { "name" : "_hoodie_commit_seqno", "type" : [ "null", "string" ], "doc" : "", "default" : null },
>     { "name" : "_hoodie_record_key", "type" : [ "null", "string" ], "doc" : "", "default" : null },
>     { "name" : "_hoodie_partition_path", "type" : [ "null", "string" ], "doc" : "", "default" : null },
>     { "name" : "_hoodie_file_name", "type" : [ "null", "string" ], "doc" : "", "default" : null },
>     { "name" : "id", "type" : [ "null", "int" ], "default" : null },
>     { "name" : "version", "type" : [ "null", "int" ], "default" : null },
>     { "name" : "name", "type" : [ "null", "string" ], "default" : null },
>     { "name" : "birthDate", "type" : [ "null", { "type" : "long", "logicalType" : "timestamp-micros" } ], "default" : null },
>     { "name" : "inc_day", "type" : [ "null", "string" ], "default" : null }
>   ]
> }
>
> org.apache.hudi.io.HoodieMergedReadHandle#readerSchema:
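For context on the "required int32 id != optional int32 id" message: an Avro non-null field maps to a Parquet `required` column, while a `["null", ...]` union maps to `optional`. A minimal sketch with parquet-avro (record and field names illustrative) showing how a reader schema with a non-null id cannot be paired with a file whose id column is optional:

{code:java}
import org.apache.avro.Schema;
import org.apache.parquet.avro.AvroSchemaConverter;
import org.apache.parquet.schema.MessageType;

public class RequiredVsOptional {
  public static void main(String[] args) {
    // table/file schema: nullable id -> parquet "optional int32 id"
    Schema nullableId = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"rec\",\"fields\":["
        + "{\"name\":\"id\",\"type\":[\"null\",\"int\"],\"default\":null}]}");
    // reader schema derived from the write config: non-null id -> "required int32 id"
    Schema requiredId = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"rec\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"int\"}]}");

    AvroSchemaConverter converter = new AvroSchemaConverter();
    MessageType fileType = converter.convert(nullableId);
    MessageType requestedType = converter.convert(requiredId);
    System.out.println(fileType);      // ... optional int32 id ...
    System.out.println(requestedType); // ... required int32 id ...
    // parquet's ColumnIOFactory rejects exactly this pairing:
    // "incompatible types: required int32 id != optional int32 id"
  }
}
{code}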
[jira] [Updated] (HUDI-7134) After deleting the field and re-executing the merge, the result is not as expected.
[ https://issues.apache.org/jira/browse/HUDI-7134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

loukey_j updated HUDI-7134:
---------------------------

Description:
You can reproduce the problem with the steps below; the value of add1 in step 7 is not as expected.

1. CREATE TABLE if not exists hudi_ut_schema_evolution (id INT, version INT, name STRING, birthDate TIMESTAMP, inc_day STRING) USING HUDI PARTITIONED BY (inc_day) TBLPROPERTIES (delta.enableChangeDataFeed='true', type='cow', primaryKey='id')
2. merge into hudi_ut_schema_evolution t using ( select 1 as id, 2 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, '2023-10-02' as inc_day) s on t.id=s.id when matched THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *
3. ALTER TABLE hudi_ut_schema_evolution ADD COLUMNS (add1 String AFTER id);
4. merge into hudi_ut_schema_evolution t using ( select '1' as add1, 2 as id, 2 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, '2023-10-02' as inc_day) s on t.id=s.id when matched THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *
5. ALTER TABLE hudi_ut_schema_evolution DROP COLUMN add1;
6. merge into hudi_ut_schema_evolution t using ( select {color:red}'1' as add1{color}, 3 as id, 2 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, '2023-10-02' as inc_day) s on t.id=s.id when matched THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *;
7. select * from hudi_ut_schema_evolution;

+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+----+---+-------+-----+-------------------+----------+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                         |add1|id |version|name |birthDate          |inc_day   |
+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+----+---+-------+-----+-------------------+----------+
|20231122164141030  |20231122164141030_0_0|1                 |inc_day=2023-10-02    |9fa5823c-7e29-4330-9b05-dd72e6088d62-0_0-112-98_20231122165413036.parquet |null|1  |2      |str_1|2023-01-01 12:12:12|2023-10-02|
|20231122165045413  |20231122165045413_0_1|2                 |inc_day=2023-10-02    |9fa5823c-7e29-4330-9b05-dd72e6088d62-0_0-112-98_20231122165413036.parquet |null|2  |2      |str_1|2023-01-01 12:12:12|2023-10-02|
|20231122165413036  |20231122165413036_0_2|3                 |inc_day=2023-10-02    |9fa5823c-7e29-4330-9b05-dd72e6088d62-0_0-112-98_20231122165413036.parquet |{color:red}null{color}|3  |2      |str_1|2023-01-01 12:12:12|2023-10-02|
+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+----+---+-------+-----+-------------------+----------+

8. show create table hudi_ut_schema_evolution;

CREATE TABLE unisql.hudi_ut_schema_evolution (
  `_hoodie_commit_time` STRING COMMENT '',
  `_hoodie_commit_seqno` STRING COMMENT '',
  `_hoodie_record_key` STRING COMMENT '',
  `_hoodie_partition_path` STRING COMMENT '',
  `_hoodie_file_name` STRING COMMENT '',
  {color:red}`add1` STRING,
  `id` INT,{color}
  `version` INT,
  `name` STRING,
  `birthDate` TIMESTAMP,
  `inc_day` STRING)
PARTITIONED BY (inc_day)
TBLPROPERTIES (
  'hoodie.query.as.ro.table' = 'false',
  'last_commit_completion_time_sync' = '20231122171640801',
  'last_commit_time_sync' = '20231122171627218',
  'primaryKey' = 'id',
  'type' = 'cow')

was:
{code:java}
1. CREATE TABLE if not exists hudi_ut_schema_evolution (id INT, version INT, name STRING, birthDate TIMESTAMP, inc_day STRING) USING HUDI PARTITIONED BY (inc_day) TBLPROPERTIES (delta.enableChangeDataFeed='true', type='cow', primaryKey='id')
2. merge into hudi_ut_schema_evolution t using ( select 1 as id, 2 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, '2023-10-02' as inc_day) s on t.id=s.id when matched THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *
3. ALTER TABLE hudi_ut_schema_evolution ADD COLUMNS (add1 String AFTER id);
4. merge into hudi_ut_schema_evolution t using ( select '1' as add1, 2 as id, 2 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, '2023-10-02' as inc_day) s on t.id=s.id when matched THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *
5. ALTER TABLE hudi_ut_schema_evolution DROP COLUMN add1;
6. merge into hudi_ut_schema_evolution t using ( select {color:red}'1' as add1{color}, 3 as id, 2 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, '2023-10-02' as inc_day) s on t.id=s.id when matched THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *;
7. select * from hudi_ut_schema_evolution;
{code}
[jira] [Created] (HUDI-7134) After deleting the field and re-executing the merge, the result is not as expected.
loukey_j created HUDI-7134:
-------------------------------

             Summary: After deleting the field and re-executing the merge, the result is not as expected.
                 Key: HUDI-7134
                 URL: https://issues.apache.org/jira/browse/HUDI-7134
             Project: Apache Hudi
          Issue Type: Bug
          Components: spark
    Affects Versions: 0.14.0
         Environment: hudi 0.14, spark 3.2.1
            Reporter: loukey_j

{code:java}
1. CREATE TABLE if not exists hudi_ut_schema_evolution (id INT, version INT, name STRING, birthDate TIMESTAMP, inc_day STRING) USING HUDI PARTITIONED BY (inc_day) TBLPROPERTIES (delta.enableChangeDataFeed='true', type='cow', primaryKey='id')
2. merge into hudi_ut_schema_evolution t using ( select 1 as id, 2 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, '2023-10-02' as inc_day) s on t.id=s.id when matched THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *
3. ALTER TABLE hudi_ut_schema_evolution ADD COLUMNS (add1 String AFTER id);
4. merge into hudi_ut_schema_evolution t using ( select '1' as add1, 2 as id, 2 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, '2023-10-02' as inc_day) s on t.id=s.id when matched THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *
5. ALTER TABLE hudi_ut_schema_evolution DROP COLUMN add1;
6. merge into hudi_ut_schema_evolution t using ( select {color:red}'1' as add1{color}, 3 as id, 2 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, '2023-10-02' as inc_day) s on t.id=s.id when matched THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *;
7. select * from hudi_ut_schema_evolution;

+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+----+---+-------+-----+-------------------+----------+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                         |add1|id |version|name |birthDate          |inc_day   |
+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+----+---+-------+-----+-------------------+----------+
|20231122164141030  |20231122164141030_0_0|1                 |inc_day=2023-10-02    |9fa5823c-7e29-4330-9b05-dd72e6088d62-0_0-112-98_20231122165413036.parquet |null|1  |2      |str_1|2023-01-01 12:12:12|2023-10-02|
|20231122165045413  |20231122165045413_0_1|2                 |inc_day=2023-10-02    |9fa5823c-7e29-4330-9b05-dd72e6088d62-0_0-112-98_20231122165413036.parquet |null|2  |2      |str_1|2023-01-01 12:12:12|2023-10-02|
|20231122165413036  |20231122165413036_0_2|3                 |inc_day=2023-10-02    |9fa5823c-7e29-4330-9b05-dd72e6088d62-0_0-112-98_20231122165413036.parquet |{color:red}null{color}|3  |2      |str_1|2023-01-01 12:12:12|2023-10-02|
+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+----+---+-------+-----+-------------------+----------+

8. show create table hudi_ut_schema_evolution;

CREATE TABLE unisql.hudi_ut_schema_evolution (
  `_hoodie_commit_time` STRING COMMENT '',
  `_hoodie_commit_seqno` STRING COMMENT '',
  `_hoodie_record_key` STRING COMMENT '',
  `_hoodie_partition_path` STRING COMMENT '',
  `_hoodie_file_name` STRING COMMENT '',
  {color:red}`add1` STRING,
  `id` INT,{color}
  `version` INT,
  `name` STRING,
  `birthDate` TIMESTAMP,
  `inc_day` STRING)
PARTITIONED BY (inc_day)
TBLPROPERTIES (
  'hoodie.query.as.ro.table' = 'false',
  'last_commit_completion_time_sync' = '20231122171640801',
  'last_commit_time_sync' = '20231122171627218',
  'primaryKey' = 'id',
  'type' = 'cow')
{code}

--
This message was sent by Atlassian Jira (v8.20.10#820010)
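For rows written before a column existed, null is the expected outcome: Avro resolves a reader-side field that is absent from the writer schema to its default. A minimal sketch (schema names illustrative) of why rows 1 and 2 legitimately read add1 = null; the ticket's complaint is that row 3, whose merge explicitly supplied add1 = '1', also comes back null, and that show create table still lists add1 after the DROP COLUMN.

{code:java}
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class ReAddedColumnDefault {
  public static void main(String[] args) throws Exception {
    // writer schema: data written before add1 existed
    Schema writer = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"rec\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"int\"}]}");
    // reader schema: add1 present with default null
    Schema reader = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"rec\",\"fields\":["
        + "{\"name\":\"add1\",\"type\":[\"null\",\"string\"],\"default\":null},"
        + "{\"name\":\"id\",\"type\":\"int\"}]}");

    GenericRecord rec = new GenericData.Record(writer);
    rec.put("id", 1);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(writer).write(rec, enc);
    enc.flush();

    BinaryDecoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord result = new GenericDatumReader<GenericRecord>(writer, reader).read(null, dec);
    System.out.println(result); // {"add1": null, "id": 1} -- default fills the gap
  }
}
{code}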
[jira] [Commented] (HUDI-7131) The requested schema is not compatible with the file schema
[ https://issues.apache.org/jira/browse/HUDI-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788667#comment-17788667 ]

loukey_j commented on HUDI-7131:
--------------------------------

Sorry, I didn't notice that I had cast inc_day to the date type. I corrected the SQL and got the same error. Execute the following sqls to reproduce. The root cause is that hoodieWriteConfig.getSchema() is incompatible with the schema of the hudi table.

1. CREATE TABLE if not exists hudi_ut_time_traval (id INT, version INT, name STRING, birthDate TIMESTAMP, inc_day STRING) USING HUDI PARTITIONED BY (inc_day) TBLPROPERTIES (type='cow', primaryKey='id');

2. merge into hudi_ut_time_traval t using (select 1 as id, 2 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, {color:red}'2023-10-01'{color} as inc_day) s on t.id=s.id when matched THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *

3. merge into hudi_ut_time_traval t using (select 1 as id, 2 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, {color:red}'2023-10-02'{color} as inc_day) s on t.id=s.id when matched THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *

> The requested schema is not compatible with the file schema
> ------------------------------------------------------------
>
>                 Key: HUDI-7131
>                 URL: https://issues.apache.org/jira/browse/HUDI-7131
>             Project: Apache Hudi
>          Issue Type: Bug
>    Affects Versions: 0.14.0
>            Reporter: loukey_j
>            Priority: Critical
>              Labels: core, merge, spark
>
> Using a global index, when a record's partition changes, the merge reports an error: "The requested schema is not compatible with the file schema...". Why not use the schema from org.apache.hudi.common.table.TableSchemaResolver#getTableAvroSchemaInternal to read hudi data?
>
> CREATE TABLE if not exists unisql.hudi_ut_time_traval
> (id INT, version INT, name STRING, birthDate TIMESTAMP, inc_day STRING) USING HUDI
> PARTITIONED BY (inc_day) TBLPROPERTIES (type='cow', primaryKey='id');
>
> insert into unisql.hudi_ut_time_traval
> select 1 as id, 1 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, cast('2023-10-01' as date) as inc_day;
>
> select * from hudi_ut_time_traval;
> +-------------------+---------------------+------------------+----------------------+-------------------------------------------------------------------------+---+-------+-----+-------------------+----------+
> |_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                        |id |version|name |birthDate          |inc_day   |
> +-------------------+---------------------+------------------+----------------------+-------------------------------------------------------------------------+---+-------+-----+-------------------+----------+
> |20231122100234339  |20231122100234339_0_0|1                 |inc_day=2023-10-01    |8a510742-c060-4d12-898e-70bbd122f2e3-0_0-19-16_20231122100234339.parquet |1  |1      |str_1|2023-01-01 12:12:12|2023-10-01|
> +-------------------+---------------------+------------------+----------------------+-------------------------------------------------------------------------+---+-------+-----+-------------------+----------+
>
> merge into hudi_ut_time_traval t using (
> select 1 as id, 2 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, cast('2023-10-02' as date) as inc_day
> ) s on t.id=s.id when matched THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *
>
> Caused by: org.apache.parquet.io.ParquetDecodingException: The requested schema is not compatible with the file schema. incompatible types: required int32 id != optional int32 id
>   at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:101)
>   at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:81)
>   at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:57)
>   at org.apache.parquet.schema.MessageType.accept(MessageType.java:55)
>   at org.apache.parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:162)
>   at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:135)
>   at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:225);
>
> parquet schema:
> {
>   "type" : "record",
>   "name" : "hudi_ut_time_traval_record",
>   "namespace" : "hoodie.hudi_ut_time_traval",
>   "fields" : [
>     { "name" : "_hoodie_commit_time", "type" : [ "null", "string" ], "doc" : "", "default" : null },
>     { "name" : "_hoodie_commit_seqno", "type" : [ "null", "string" ], "doc" : "", "default" : null },
>     { "name" : "_hoodie_record_key", "type" : [ "null", "string" ], "doc" : "", "default" : null },
[jira] [Commented] (HUDI-7131) The requested schema is not compatible with the file schema
[ https://issues.apache.org/jira/browse/HUDI-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788607#comment-17788607 ]

loukey_j commented on HUDI-7131:
--------------------------------

The schema of the table has not changed, only the partition value of the data has changed.

> The requested schema is not compatible with the file schema
> ------------------------------------------------------------
>
>                 Key: HUDI-7131
>                 URL: https://issues.apache.org/jira/browse/HUDI-7131
>             Project: Apache Hudi
>          Issue Type: Bug
>    Affects Versions: 0.14.0
>            Reporter: loukey_j
>            Priority: Critical
>              Labels: core, merge, spark
>
> Using a global index, when a record's partition changes, the merge reports an error: "The requested schema is not compatible with the file schema...". Why not use the schema from org.apache.hudi.common.table.TableSchemaResolver#getTableAvroSchemaInternal to read hudi data?
>
> CREATE TABLE if not exists unisql.hudi_ut_time_traval
> (id INT, version INT, name STRING, birthDate TIMESTAMP, inc_day STRING) USING HUDI
> PARTITIONED BY (inc_day) TBLPROPERTIES (type='cow', primaryKey='id');
>
> insert into unisql.hudi_ut_time_traval
> select 1 as id, 1 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, cast('2023-10-01' as date) as inc_day;
>
> select * from hudi_ut_time_traval;
> +-------------------+---------------------+------------------+----------------------+-------------------------------------------------------------------------+---+-------+-----+-------------------+----------+
> |_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                        |id |version|name |birthDate          |inc_day   |
> +-------------------+---------------------+------------------+----------------------+-------------------------------------------------------------------------+---+-------+-----+-------------------+----------+
> |20231122100234339  |20231122100234339_0_0|1                 |inc_day=2023-10-01    |8a510742-c060-4d12-898e-70bbd122f2e3-0_0-19-16_20231122100234339.parquet |1  |1      |str_1|2023-01-01 12:12:12|2023-10-01|
> +-------------------+---------------------+------------------+----------------------+-------------------------------------------------------------------------+---+-------+-----+-------------------+----------+
>
> merge into hudi_ut_time_traval t using (
> select 1 as id, 2 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, cast('2023-10-02' as date) as inc_day
> ) s on t.id=s.id when matched THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *
>
> Caused by: org.apache.parquet.io.ParquetDecodingException: The requested schema is not compatible with the file schema. incompatible types: required int32 id != optional int32 id
>   at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:101)
>   at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:81)
>   at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:57)
>   at org.apache.parquet.schema.MessageType.accept(MessageType.java:55)
>   at org.apache.parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:162)
>   at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:135)
>   at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:225);
>
> parquet schema:
> {
>   "type" : "record",
>   "name" : "hudi_ut_time_traval_record",
>   "namespace" : "hoodie.hudi_ut_time_traval",
>   "fields" : [
>     { "name" : "_hoodie_commit_time", "type" : [ "null", "string" ], "doc" : "", "default" : null },
>     { "name" : "_hoodie_commit_seqno", "type" : [ "null", "string" ], "doc" : "", "default" : null },
>     { "name" : "_hoodie_record_key", "type" : [ "null", "string" ], "doc" : "", "default" : null },
>     { "name" : "_hoodie_partition_path", "type" : [ "null", "string" ], "doc" : "", "default" : null },
>     { "name" : "_hoodie_file_name", "type" : [ "null", "string" ], "doc" : "", "default" : null },
>     { "name" : "id", "type" : [ "null", "int" ], "default" : null },
>     { "name" : "version", "type" : [ "null", "int" ], "default" : null },
>     { "name" : "name", "type" : [ "null", "string" ], "default" : null },
>     { "name" : "birthDate", "type" : [ "null", { "type" : "long", "logicalType" : "timestamp-micros" } ], "default" : null },
>     { "name" : "inc_day", "type" : [ "null", "string" ], "default" : null }
>   ]
> }
>
> org.apache.hudi.io.HoodieMergedReadHandle#readerSchema:
[jira] [Updated] (HUDI-7131) The requested schema is not compatible with the file schema
[ https://issues.apache.org/jira/browse/HUDI-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

loukey_j updated HUDI-7131:
---------------------------

Affects Version/s: 0.14.0

> The requested schema is not compatible with the file schema
> ------------------------------------------------------------
>
>                 Key: HUDI-7131
>                 URL: https://issues.apache.org/jira/browse/HUDI-7131
>             Project: Apache Hudi
>          Issue Type: Bug
>    Affects Versions: 0.14.0
>            Reporter: loukey_j
>            Priority: Critical
>              Labels: core, merge, spark
>
> Using a global index, when a record's partition changes, the merge reports an error: "The requested schema is not compatible with the file schema...". Why not use the schema from org.apache.hudi.common.table.TableSchemaResolver#getTableAvroSchemaInternal to read hudi data?
>
> CREATE TABLE if not exists unisql.hudi_ut_time_traval
> (id INT, version INT, name STRING, birthDate TIMESTAMP, inc_day STRING) USING HUDI
> PARTITIONED BY (inc_day) TBLPROPERTIES (type='cow', primaryKey='id');
>
> insert into unisql.hudi_ut_time_traval
> select 1 as id, 1 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, cast('2023-10-01' as date) as inc_day;
>
> select * from hudi_ut_time_traval;
> +-------------------+---------------------+------------------+----------------------+-------------------------------------------------------------------------+---+-------+-----+-------------------+----------+
> |_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                        |id |version|name |birthDate          |inc_day   |
> +-------------------+---------------------+------------------+----------------------+-------------------------------------------------------------------------+---+-------+-----+-------------------+----------+
> |20231122100234339  |20231122100234339_0_0|1                 |inc_day=2023-10-01    |8a510742-c060-4d12-898e-70bbd122f2e3-0_0-19-16_20231122100234339.parquet |1  |1      |str_1|2023-01-01 12:12:12|2023-10-01|
> +-------------------+---------------------+------------------+----------------------+-------------------------------------------------------------------------+---+-------+-----+-------------------+----------+
>
> merge into hudi_ut_time_traval t using (
> select 1 as id, 2 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, cast('2023-10-02' as date) as inc_day
> ) s on t.id=s.id when matched THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *
>
> Caused by: org.apache.parquet.io.ParquetDecodingException: The requested schema is not compatible with the file schema. incompatible types: required int32 id != optional int32 id
>   at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:101)
>   at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:81)
>   at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:57)
>   at org.apache.parquet.schema.MessageType.accept(MessageType.java:55)
>   at org.apache.parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:162)
>   at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:135)
>   at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:225);
>
> parquet schema:
> {
>   "type" : "record",
>   "name" : "hudi_ut_time_traval_record",
>   "namespace" : "hoodie.hudi_ut_time_traval",
>   "fields" : [
>     { "name" : "_hoodie_commit_time", "type" : [ "null", "string" ], "doc" : "", "default" : null },
>     { "name" : "_hoodie_commit_seqno", "type" : [ "null", "string" ], "doc" : "", "default" : null },
>     { "name" : "_hoodie_record_key", "type" : [ "null", "string" ], "doc" : "", "default" : null },
>     { "name" : "_hoodie_partition_path", "type" : [ "null", "string" ], "doc" : "", "default" : null },
>     { "name" : "_hoodie_file_name", "type" : [ "null", "string" ], "doc" : "", "default" : null },
>     { "name" : "id", "type" : [ "null", "int" ], "default" : null },
>     { "name" : "version", "type" : [ "null", "int" ], "default" : null },
>     { "name" : "name", "type" : [ "null", "string" ], "default" : null },
>     { "name" : "birthDate", "type" : [ "null", { "type" : "long", "logicalType" : "timestamp-micros" } ], "default" : null },
>     { "name" : "inc_day", "type" : [ "null", "string" ], "default" : null }
>   ]
> }
>
> org.apache.hudi.io.HoodieMergedReadHandle#readerSchema:
[jira] [Created] (HUDI-7131) The requested schema is not compatible with the file schema
loukey_j created HUDI-7131:
-------------------------------

             Summary: The requested schema is not compatible with the file schema
                 Key: HUDI-7131
                 URL: https://issues.apache.org/jira/browse/HUDI-7131
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: loukey_j

Using a global index, when a record's partition changes, the merge reports an error: "The requested schema is not compatible with the file schema...". Why not use the schema from org.apache.hudi.common.table.TableSchemaResolver#getTableAvroSchemaInternal to read hudi data?

CREATE TABLE if not exists unisql.hudi_ut_time_traval
(id INT, version INT, name STRING, birthDate TIMESTAMP, inc_day STRING) USING HUDI
PARTITIONED BY (inc_day) TBLPROPERTIES (type='cow', primaryKey='id');

insert into unisql.hudi_ut_time_traval
select 1 as id, 1 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, cast('2023-10-01' as date) as inc_day;

select * from hudi_ut_time_traval;
+-------------------+---------------------+------------------+----------------------+-------------------------------------------------------------------------+---+-------+-----+-------------------+----------+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                        |id |version|name |birthDate          |inc_day   |
+-------------------+---------------------+------------------+----------------------+-------------------------------------------------------------------------+---+-------+-----+-------------------+----------+
|20231122100234339  |20231122100234339_0_0|1                 |inc_day=2023-10-01    |8a510742-c060-4d12-898e-70bbd122f2e3-0_0-19-16_20231122100234339.parquet |1  |1      |str_1|2023-01-01 12:12:12|2023-10-01|
+-------------------+---------------------+------------------+----------------------+-------------------------------------------------------------------------+---+-------+-----+-------------------+----------+

merge into hudi_ut_time_traval t using (
select 1 as id, 2 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, cast('2023-10-02' as date) as inc_day
) s on t.id=s.id when matched THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *

Caused by: org.apache.parquet.io.ParquetDecodingException: The requested schema is not compatible with the file schema. incompatible types: required int32 id != optional int32 id
  at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:101)
  at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:81)
  at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:57)
  at org.apache.parquet.schema.MessageType.accept(MessageType.java:55)
  at org.apache.parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:162)
  at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:135)
  at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:225);

parquet schema:
{
  "type" : "record",
  "name" : "hudi_ut_time_traval_record",
  "namespace" : "hoodie.hudi_ut_time_traval",
  "fields" : [
    { "name" : "_hoodie_commit_time", "type" : [ "null", "string" ], "doc" : "", "default" : null },
    { "name" : "_hoodie_commit_seqno", "type" : [ "null", "string" ], "doc" : "", "default" : null },
    { "name" : "_hoodie_record_key", "type" : [ "null", "string" ], "doc" : "", "default" : null },
    { "name" : "_hoodie_partition_path", "type" : [ "null", "string" ], "doc" : "", "default" : null },
    { "name" : "_hoodie_file_name", "type" : [ "null", "string" ], "doc" : "", "default" : null },
    { "name" : "id", "type" : [ "null", "int" ], "default" : null },
    { "name" : "version", "type" : [ "null", "int" ], "default" : null },
    { "name" : "name", "type" : [ "null", "string" ], "default" : null },
    { "name" : "birthDate", "type" : [ "null", { "type" : "long", "logicalType" : "timestamp-micros" } ], "default" : null },
    { "name" : "inc_day", "type" : [ "null", "string" ], "default" : null }
  ]
}

org.apache.hudi.io.HoodieMergedReadHandle#readerSchema:
{
  "type" : "record",
  "name" : "hudi_ut_time_traval_record",
  "namespace" : "hoodie.hudi_ut_time_traval",
  "fields" : [
    { "name" : "id", "type" : "int" },
    { "name" : "version", "type" : "int" },
    { "name" : "name", "type" : "string" },
    { "name" : "birthDate", "type" : [ "null", { "type" : "long", "logicalType" : "timestamp-micros" } ], "default" : null },
    { "name" : "inc_day", "type" : [ "null", { "type" : "int", "logicalType" : "date" } ], "default" : null }
  ]
}

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-5837) Spark ctas error
[ https://issues.apache.org/jira/browse/HUDI-5837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

loukey_j reassigned HUDI-5837:
------------------------------

Assignee: loukey_j

> Spark ctas error
> ----------------
>
>                 Key: HUDI-5837
>                 URL: https://issues.apache.org/jira/browse/HUDI-5837
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: loukey_j
>            Assignee: loukey_j
>            Priority: Major
>
> Error in query: Can not create the managed table('default.tt_waybill_info_hudi_01428095'). The associated location('hdfs://xx/warehouse/table') already exists.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-5837) Spark ctas error
loukey_j created HUDI-5837:
-------------------------------

             Summary: Spark ctas error
                 Key: HUDI-5837
                 URL: https://issues.apache.org/jira/browse/HUDI-5837
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: loukey_j

Error in query: Can not create the managed table('default.tt_waybill_info_hudi_01428095'). The associated location('hdfs://xx/warehouse/table') already exists.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-5801) Speed metaTable initializeFileGroups
loukey_j created HUDI-5801:
-------------------------------

             Summary: Speed metaTable initializeFileGroups
                 Key: HUDI-5801
                 URL: https://issues.apache.org/jira/browse/HUDI-5801
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: loukey_j

org.apache.hudi.metadata.HoodieBackedTableMetadataWriter#initializeFileGroups is too slow when there are many file groups.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-5671) BucketIndexPartitioner partition algorithm skew
loukey_j created HUDI-5671:
-------------------------------

             Summary: BucketIndexPartitioner partition algorithm skew
                 Key: HUDI-5671
                 URL: https://issues.apache.org/jira/browse/HUDI-5671
             Project: Apache Hudi
          Issue Type: Improvement
          Components: flink, index
            Reporter: loukey_j
         Attachments: image-2023-02-01-14-45-33-116.png, image-2023-02-01-14-50-29-703.png, image-2023-02-01-15-00-14-889.png, image-2023-02-01-15-05-15-491.png

An online job had been running for 13 days when we found subtasks that were processing no data, as shown in the figure below. The job partitions by update time, uses the bucket index with 128 buckets, and writes with a parallelism of 128. The keys are uniform, since from the storage point of view the file sizes of the buckets differ little. After investigation, there is a skew in the shuffle algorithm.

!image-2023-02-01-14-45-33-116.png!

Potential disadvantages of the skew:
1. Memory usage is uneven; some nodes may have high JVM pressure, and TM nodes are prone to timeouts.
2. Checkpoints may time out, because data is flushed to hdfs while snapshotting state; if the skew is serious, some nodes take too long and cause the timeout.

Current algorithm:
!image-2023-02-01-14-50-29-703.png!

Algorithm flaws:
1. curBucket ∈ [0, numBuckets - 1]
2. Within one partition there are at most numBuckets distinct globalHash values; globalHash is scattered, so mod(globalHash, numPartitions) collides easily.
3. When numBuckets is relatively large, shuffleIndex collides easily, resulting in skew.

Algorithm optimization:
!image-2023-02-01-15-00-14-889.png!

kb = key % b; kb ∈ [0, b-1]
pw = pt % w; pw ∈ [0, w-1]
shuffleIndex = (pw + kb) % w; shuffleIndex ∈ [0, w-1]

In effect, first compute a slot pw for the partition; pw can be understood as a base slot Wn allocated to that partition, so different partitions get different slots. Then, starting from that slot, move forward up to b slots for the data of this partition (see the sketch after this message).

!image-2023-02-01-15-05-15-491.png!

--
This message was sent by Atlassian Jira (v8.20.10#820010)
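A minimal sketch of the proposed computation, with assumed parameter names (b = bucket count, w = write parallelism, pt = a hash of the partition path, key = a hash of the record key, per the formulas above):

{code:java}
public class BucketShuffle {
  // buckets of one partition spread over consecutive slots starting at the
  // partition's base slot, so no slot is systematically left idle
  static int shuffleIndex(int keyHash, int partitionHash, int numBuckets, int parallelism) {
    int kb = Math.floorMod(keyHash, numBuckets);        // bucket inside the partition
    int pw = Math.floorMod(partitionHash, parallelism); // base slot for the partition
    return Math.floorMod(pw + kb, parallelism);         // walk forward from that slot
  }

  public static void main(String[] args) {
    // 128 buckets, 128 writers: the first buckets of one partition land on
    // consecutive subtask indices rather than colliding
    for (int bucket = 0; bucket < 4; bucket++) {
      System.out.println(shuffleIndex(bucket, "2023-10-01".hashCode(), 128, 128));
    }
  }
}
{code}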
[jira] [Updated] (HUDI-5568) incorrect use of fileSystemView
[ https://issues.apache.org/jira/browse/HUDI-5568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

loukey_j updated HUDI-5568:
---------------------------

Summary: incorrect use of fileSystemView (was: incorrect use of FileSystemView)

> incorrect use of fileSystemView
> -------------------------------
>
>                 Key: HUDI-5568
>                 URL: https://issues.apache.org/jira/browse/HUDI-5568
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: flink
>            Reporter: loukey_j
>            Assignee: loukey_j
>            Priority: Major
>
> writeClient.getHoodieTable().getFileSystemView() always returns the local fileSystemView;
> writeClient.getHoodieTable().getHoodieView() should be used to determine the fileSystemView.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-5568) incorrect use of FileSystemView
[ https://issues.apache.org/jira/browse/HUDI-5568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

loukey_j reassigned HUDI-5568:
------------------------------

Assignee: loukey_j

> incorrect use of FileSystemView
> -------------------------------
>
>                 Key: HUDI-5568
>                 URL: https://issues.apache.org/jira/browse/HUDI-5568
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: flink
>            Reporter: loukey_j
>            Assignee: loukey_j
>            Priority: Major
>
> writeClient.getHoodieTable().getFileSystemView() always returns the local fileSystemView;
> writeClient.getHoodieTable().getHoodieView() should be used to determine the fileSystemView.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-5568) incorrect use of FileSystemView
loukey_j created HUDI-5568:
-------------------------------

             Summary: incorrect use of FileSystemView
                 Key: HUDI-5568
                 URL: https://issues.apache.org/jira/browse/HUDI-5568
             Project: Apache Hudi
          Issue Type: Improvement
          Components: flink
            Reporter: loukey_j

writeClient.getHoodieTable().getFileSystemView() always returns the local fileSystemView; writeClient.getHoodieTable().getHoodieView() should be used to determine the fileSystemView.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
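A minimal sketch of the change the ticket suggests, written against HoodieTable's API (method names as in Hudi; the helper is illustrative):

{code:java}
import org.apache.hudi.common.table.view.SyncableFileSystemView;
import org.apache.hudi.table.HoodieTable;

public class FileSystemViews {
  // table.getFileSystemView() always builds the in-memory/local view;
  // table.getHoodieView() goes through the view manager and honors the
  // configured view storage type (memory, spillable, remote/embedded
  // timeline server, ...), which is what callers should rely on
  static SyncableFileSystemView viewFor(HoodieTable<?, ?, ?, ?> table) {
    return table.getHoodieView();
  }
}
{code}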
[jira] [Created] (HUDI-5441) different buckets for different partitions
loukey_j created HUDI-5441:
-------------------------------

             Summary: different buckets for different partitions
                 Key: HUDI-5441
                 URL: https://issues.apache.org/jira/browse/HUDI-5441
             Project: Apache Hudi
          Issue Type: Improvement
          Components: flink
            Reporter: loukey_j

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-5373) Different fileids are assigned to the same bucket
loukey_j created HUDI-5373:
-------------------------------

             Summary: Different fileids are assigned to the same bucket
                 Key: HUDI-5373
                 URL: https://issues.apache.org/jira/browse/HUDI-5373
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: loukey_j

partition = 30,  bucketNum = 11  ->  bucketId = 3011
partition = 301, bucketNum = 1   ->  bucketId = 3011

Different fileids are assigned to the same bucket:

{code:java}
final String bucketId = partition + bucketNum;
if (incBucketIndex.contains(bucketId)) {
  location = new HoodieRecordLocation("I", bucketToFileId.get(bucketNum));
} else if (bucketToFileId.containsKey(bucketNum)) {
  location = new HoodieRecordLocation("U", bucketToFileId.get(bucketNum));
} else {
  String newFileId = BucketIdentifier.newBucketFileIdPrefix(bucketNum);
  location = new HoodieRecordLocation("I", newFileId);
  bucketToFileId.put(bucketNum, newFileId);
  incBucketIndex.add(bucketId);
}
{code}

--
This message was sent by Atlassian Jira (v8.20.10#820010)
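The collision is plain string concatenation running the two values together. A minimal fix sketch (the helper name is illustrative): pad the bucket number to a fixed width, or add a delimiter, so distinct (partition, bucket) pairs cannot produce the same key.

{code:java}
public class BucketIds {
  public static void main(String[] args) {
    // the collision from this ticket:
    System.out.println("30" + 11);  // 3011
    System.out.println("301" + 1);  // 3011 -- same key, different buckets!

    System.out.println(bucketKey("30", 11)); // 30_00000011
    System.out.println(bucketKey("301", 1)); // 301_00000001 -- distinct
  }

  // delimiter plus fixed-width pad; Hudi's BucketIdentifier similarly
  // zero-pads bucket numbers when building file id prefixes
  static String bucketKey(String partition, int bucketNum) {
    return partition + "_" + String.format("%08d", bucketNum);
  }
}
{code}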
[jira] [Assigned] (HUDI-5373) Different fileids are assigned to the same bucket
[ https://issues.apache.org/jira/browse/HUDI-5373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

loukey_j reassigned HUDI-5373:
------------------------------

Assignee: loukey_j

> Different fileids are assigned to the same bucket
> --------------------------------------------------
>
>                 Key: HUDI-5373
>                 URL: https://issues.apache.org/jira/browse/HUDI-5373
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: loukey_j
>            Assignee: loukey_j
>            Priority: Major
>
> partition = 30,  bucketNum = 11  ->  bucketId = 3011
> partition = 301, bucketNum = 1   ->  bucketId = 3011
>
> Different fileids are assigned to the same bucket:
>
> final String bucketId = partition + bucketNum;
> if (incBucketIndex.contains(bucketId)) {
>   location = new HoodieRecordLocation("I", bucketToFileId.get(bucketNum));
> } else if (bucketToFileId.containsKey(bucketNum)) {
>   location = new HoodieRecordLocation("U", bucketToFileId.get(bucketNum));
> } else {
>   String newFileId = BucketIdentifier.newBucketFileIdPrefix(bucketNum);
>   location = new HoodieRecordLocation("I", newFileId);
>   bucketToFileId.put(bucketNum, newFileId);
>   incBucketIndex.add(bucketId);
> }

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5240) Clean content when recursive Invocation inflate
[ https://issues.apache.org/jira/browse/HUDI-5240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

loukey_j updated HUDI-5240:
---------------------------

Attachment: image-2022-11-18-14-57-06-393.png
Description: !image-2022-11-18-14-57-06-393.png!

> Clean content when recursive Invocation inflate
> -----------------------------------------------
>
>                 Key: HUDI-5240
>                 URL: https://issues.apache.org/jira/browse/HUDI-5240
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: loukey_j
>            Assignee: loukey_j
>            Priority: Major
>         Attachments: image-2022-11-18-14-57-06-393.png
>
> !image-2022-11-18-14-57-06-393.png!

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-5240) Clean content when recursive Invocation inflate
[ https://issues.apache.org/jira/browse/HUDI-5240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

loukey_j reassigned HUDI-5240:
------------------------------

Assignee: loukey_j

> Clean content when recursive Invocation inflate
> -----------------------------------------------
>
>                 Key: HUDI-5240
>                 URL: https://issues.apache.org/jira/browse/HUDI-5240
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: loukey_j
>            Assignee: loukey_j
>            Priority: Major
>

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-5240) Clean content when recursive Invocation inflate
loukey_j created HUDI-5240:
-------------------------------

             Summary: Clean content when recursive Invocation inflate
                 Key: HUDI-5240
                 URL: https://issues.apache.org/jira/browse/HUDI-5240
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: loukey_j

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-4869) Fix test for HUDI-4780
[ https://issues.apache.org/jira/browse/HUDI-4869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

loukey_j reassigned HUDI-4869:
------------------------------

Assignee: loukey_j

> Fix test for HUDI-4780
> ----------------------
>
>                 Key: HUDI-4869
>                 URL: https://issues.apache.org/jira/browse/HUDI-4869
>             Project: Apache Hudi
>          Issue Type: Test
>            Reporter: loukey_j
>            Assignee: loukey_j
>            Priority: Major
>              Labels: pull-request-available
>

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4869) Fix test for HUDI-4780
[ https://issues.apache.org/jira/browse/HUDI-4869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

loukey_j updated HUDI-4869:
---------------------------

Summary: Fix test for HUDI-4780 (was: Add test for HUDI-4780)

> Fix test for HUDI-4780
> ----------------------
>
>                 Key: HUDI-4869
>                 URL: https://issues.apache.org/jira/browse/HUDI-4869
>             Project: Apache Hudi
>          Issue Type: Test
>            Reporter: loukey_j
>            Priority: Major
>              Labels: pull-request-available
>

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-4869) Add test for HUDI-4780
loukey_j created HUDI-4869:
-------------------------------

             Summary: Add test for HUDI-4780
                 Key: HUDI-4869
                 URL: https://issues.apache.org/jira/browse/HUDI-4869
             Project: Apache Hudi
          Issue Type: Test
            Reporter: loukey_j

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-4790) A more effective HoodieMergeHandler for COW table with parquet
[ https://issues.apache.org/jira/browse/HUDI-4790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

loukey_j reassigned HUDI-4790:
------------------------------

Assignee: loukey_j

> A more effective HoodieMergeHandler for COW table with parquet
> --------------------------------------------------------------
>
>                 Key: HUDI-4790
>                 URL: https://issues.apache.org/jira/browse/HUDI-4790
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: loukey_j
>            Assignee: loukey_j
>            Priority: Major
>

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-4790) A more effective HoodieMergeHandler for COW table with parquet
loukey_j created HUDI-4790:
-------------------------------

             Summary: A more effective HoodieMergeHandler for COW table with parquet
                 Key: HUDI-4790
                 URL: https://issues.apache.org/jira/browse/HUDI-4790
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: loukey_j

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-4780) hoodie.logfile.max.size It does not take effect, causing the log file to be too large
[ https://issues.apache.org/jira/browse/HUDI-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

loukey_j reassigned HUDI-4780:
------------------------------

Assignee: loukey_j

> hoodie.logfile.max.size It does not take effect, causing the log file to be too large
> --------------------------------------------------------------------------------------
>
>                 Key: HUDI-4780
>                 URL: https://issues.apache.org/jira/browse/HUDI-4780
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: loukey_j
>            Assignee: loukey_j
>            Priority: Major
>

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-4780) hoodie.logfile.max.size It does not take effect, causing the log file to be too large
loukey_j created HUDI-4780:
-------------------------------

             Summary: hoodie.logfile.max.size It does not take effect, causing the log file to be too large
                 Key: HUDI-4780
                 URL: https://issues.apache.org/jira/browse/HUDI-4780
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: loukey_j

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4259) flink create avro schema not conformance to standards
[ https://issues.apache.org/jira/browse/HUDI-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

loukey_j updated HUDI-4259:
---------------------------

Description:
[https://avro.apache.org/docs/current/spec.html#schema_complex]
* A name and namespace are both specified. For example, one might use "name": "X", "namespace": "org.foo" to indicate the fullname org.foo.X.
* org.apache.hudi.util.AvroSchemaConverter#convertToSchema(org.apache.flink.table.types.logical.LogicalType, java.lang.String)

was:
[https://avro.apache.org/docs/current/spec.html#schema_complex]
* A name and namespace are both specified. For example, one might use "name": "X", "namespace": "org.foo" to indicate the fullname org.foo.X.

> flink create avro schema not conformance to standards
> ------------------------------------------------------
>
>                 Key: HUDI-4259
>                 URL: https://issues.apache.org/jira/browse/HUDI-4259
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: loukey_j
>            Priority: Major
>
> [https://avro.apache.org/docs/current/spec.html#schema_complex]
> * A name and namespace are both specified. For example, one might use "name": "X", "namespace": "org.foo" to indicate the fullname org.foo.X.
> * org.apache.hudi.util.AvroSchemaConverter#convertToSchema(org.apache.flink.table.types.logical.LogicalType, java.lang.String)

--
This message was sent by Atlassian Jira (v8.20.7#820007)
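Per the linked spec section, a record's fullname is its namespace joined with its name. A quick check with Avro's SchemaBuilder:

{code:java}
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class AvroNames {
  public static void main(String[] args) {
    // "name": "X", "namespace": "org.foo" -> fullname org.foo.X
    Schema s = SchemaBuilder.record("X").namespace("org.foo").fields().endRecord();
    System.out.println(s.getFullName()); // org.foo.X
    System.out.println(s.getName());     // X
  }
}
{code}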
[jira] [Updated] (HUDI-4259) flink create avro schema not conformance to standards
[ https://issues.apache.org/jira/browse/HUDI-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

loukey_j updated HUDI-4259:
---------------------------

Summary: flink create avro schema not conformance to standards (was: hudi flink convert avro schema not conformance to standards)

> flink create avro schema not conformance to standards
> ------------------------------------------------------
>
>                 Key: HUDI-4259
>                 URL: https://issues.apache.org/jira/browse/HUDI-4259
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: loukey_j
>            Priority: Major
>
> [https://avro.apache.org/docs/current/spec.html#schema_complex]
> * A name and namespace are both specified. For example, one might use "name": "X", "namespace": "org.foo" to indicate the fullname org.foo.X.

--
This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (HUDI-4260) change FlinkOptions#KEYGEN_CLASS_NAME to no default value
[ https://issues.apache.org/jira/browse/HUDI-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

loukey_j updated HUDI-4260:
---------------------------

Summary: change FlinkOptions#KEYGEN_CLASS_NAME to no default value (was: change FlinkOptions#KEYGEN_CLASS_NAME without a default value)

> change FlinkOptions#KEYGEN_CLASS_NAME to no default value
> ----------------------------------------------------------
>
>                 Key: HUDI-4260
>                 URL: https://issues.apache.org/jira/browse/HUDI-4260
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: loukey_j
>            Priority: Major
>

--
This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (HUDI-4260) change FlinkOptions#KEYGEN_CLASS_NAME without a default value
loukey_j created HUDI-4260:
-------------------------------

             Summary: change FlinkOptions#KEYGEN_CLASS_NAME without a default value
                 Key: HUDI-4260
                 URL: https://issues.apache.org/jira/browse/HUDI-4260
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: loukey_j

--
This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (HUDI-4259) hudi flink convert avro schema not conformance to standards
loukey_j created HUDI-4259:
-------------------------------

             Summary: hudi flink convert avro schema not conformance to standards
                 Key: HUDI-4259
                 URL: https://issues.apache.org/jira/browse/HUDI-4259
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: loukey_j

[https://avro.apache.org/docs/current/spec.html#schema_complex]
* A name and namespace are both specified. For example, one might use "name": "X", "namespace": "org.foo" to indicate the fullname org.foo.X.

--
This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (HUDI-4133) Spark query mor by snapshot query lost data
[ https://issues.apache.org/jira/browse/HUDI-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

loukey_j updated HUDI-4133:
---------------------------

Summary: Spark query mor by snapshot query lost data (was: Sprak query mor by snapshot query lost data)

> Spark query mor by snapshot query lost data
> --------------------------------------------
>
>                 Key: HUDI-4133
>                 URL: https://issues.apache.org/jira/browse/HUDI-4133
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: core, flink, spark-sql
>            Reporter: loukey_j
>            Assignee: loukey_j
>            Priority: Major
>
> Suppose there are two non-intersecting batches of data written in turn by flink to a new hudi mor non-partitioned table.
> Hoodie timeline and log files are as follows:
>
> hdfs dfs -ls hdfs://xxx/mor_test/.hoodie
> 0    2022-05-21 16:41 hdfs://xxx/mor_test/.hoodie/.aux
> 0    2022-05-21 16:41 hdfs://xxx/mor_test/.hoodie/.schema
> 0    2022-05-21 16:42 hdfs://xxx/mor_test/.hoodie/.temp
> 5291 2022-05-21 16:42 hdfs://xxx/mor_test/.hoodie/20220521164201245.deltacommit
> 0    2022-05-21 16:42 hdfs://xxx/mor_test/.hoodie/20220521164201245.deltacommit.inflight
> 0    2022-05-21 16:42 hdfs://xxx/mor_test/.hoodie/20220521164201245.deltacommit.requested
> 5291 2022-05-21 16:42 hdfs://xxx/mor_test/.hoodie/20220521164214473.deltacommit
> 0    2022-05-21 16:42 hdfs://xxx/mor_test/.hoodie/20220521164214473.deltacommit.inflight
> 0    2022-05-21 16:42 hdfs://xxx/mor_test/.hoodie/20220521164214473.deltacommit.requested
> 0    2022-05-21 16:41 hdfs://xxx/mor_test/.hoodie/archived
> 798  2022-05-21 16:41 hdfs://xxx/mor_test/.hoodie/hoodie.properties
>
> hdfs dfs -ls hdfs://xxx/mor_test/
> 13316 2022-05-21 16:42 hdfs://xxx/mor_test/.-1dd6-4395-9c90-53f8a6c6eed3_20220521164201245.log.1_0-2-0
> 28395 2022-05-21 16:42 hdfs://xxx/mor_test/.-1dd6-4395-9c90-53f8a6c6eed3_20220521164214473.log.1_0-2-0
> 0     2022-05-21 16:42 hdfs://xxx/mor_test/.hoodie
> 100   2022-05-21 16:42 hdfs://xxx/mor_test/.hoodie_partition_metadata
>
> Use a spark snapshot query to execute the sql 'select distinct _hoodie_commit_time from mor_test_rt'.
> The expected result is 20220521164201245 and 20220521164214473, but the actual result is only 20220521164214473.

--
This message was sent by Atlassian Jira (v8.20.7#820007)
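A minimal sketch of the check described above (table path as in the ticket; 'snapshot' is the default query type for such a read, spelled out here for clarity):

{code:java}
import org.apache.spark.sql.SparkSession;

public class MorSnapshotCheck {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("mor-snapshot-check").getOrCreate();
    spark.read().format("hudi")
        .option("hoodie.datasource.query.type", "snapshot")
        .load("hdfs://xxx/mor_test")        // table base path from the ticket
        .select("_hoodie_commit_time")
        .distinct()
        .show(false);
    // expected: both 20220521164201245 and 20220521164214473;
    // the ticket reports only 20220521164214473 coming back
  }
}
{code}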
[jira] [Updated] (HUDI-4133) Sprak query mor by snapshot query lost data
[ https://issues.apache.org/jira/browse/HUDI-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

loukey_j updated HUDI-4133:
---------------------------

Component/s: flink
             spark-sql

> Sprak query mor by snapshot query lost data
> --------------------------------------------
>
>                 Key: HUDI-4133
>                 URL: https://issues.apache.org/jira/browse/HUDI-4133
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: core, flink, spark-sql
>            Reporter: loukey_j
>            Priority: Major
>
> Suppose there are two non-intersecting batches of data written in turn by flink to a new hudi mor non-partitioned table.
> Hoodie timeline and log files are as follows:
>
> hdfs dfs -ls hdfs://xxx/mor_test/.hoodie
> 0    2022-05-21 16:41 hdfs://xxx/mor_test/.hoodie/.aux
> 0    2022-05-21 16:41 hdfs://xxx/mor_test/.hoodie/.schema
> 0    2022-05-21 16:42 hdfs://xxx/mor_test/.hoodie/.temp
> 5291 2022-05-21 16:42 hdfs://xxx/mor_test/.hoodie/20220521164201245.deltacommit
> 0    2022-05-21 16:42 hdfs://xxx/mor_test/.hoodie/20220521164201245.deltacommit.inflight
> 0    2022-05-21 16:42 hdfs://xxx/mor_test/.hoodie/20220521164201245.deltacommit.requested
> 5291 2022-05-21 16:42 hdfs://xxx/mor_test/.hoodie/20220521164214473.deltacommit
> 0    2022-05-21 16:42 hdfs://xxx/mor_test/.hoodie/20220521164214473.deltacommit.inflight
> 0    2022-05-21 16:42 hdfs://xxx/mor_test/.hoodie/20220521164214473.deltacommit.requested
> 0    2022-05-21 16:41 hdfs://xxx/mor_test/.hoodie/archived
> 798  2022-05-21 16:41 hdfs://xxx/mor_test/.hoodie/hoodie.properties
>
> hdfs dfs -ls hdfs://xxx/mor_test/
> 13316 2022-05-21 16:42 hdfs://xxx/mor_test/.-1dd6-4395-9c90-53f8a6c6eed3_20220521164201245.log.1_0-2-0
> 28395 2022-05-21 16:42 hdfs://xxx/mor_test/.-1dd6-4395-9c90-53f8a6c6eed3_20220521164214473.log.1_0-2-0
> 0     2022-05-21 16:42 hdfs://xxx/mor_test/.hoodie
> 100   2022-05-21 16:42 hdfs://xxx/mor_test/.hoodie_partition_metadata
>
> Use a spark snapshot query to execute the sql 'select distinct _hoodie_commit_time from mor_test_rt'.
> The expected result is 20220521164201245 and 20220521164214473, but the actual result is only 20220521164214473.

--
This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Assigned] (HUDI-4133) Sprak query mor by snapshot query lost data
[ https://issues.apache.org/jira/browse/HUDI-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] loukey_j reassigned HUDI-4133:
---
Assignee: loukey_j
[jira] [Created] (HUDI-4133) Sprak query mor by snapshot query lost data
loukey_j created HUDI-4133:
---
Summary: Sprak query mor by snapshot query lost data
Key: HUDI-4133
URL: https://issues.apache.org/jira/browse/HUDI-4133
Project: Apache Hudi
Issue Type: Bug
Components: core
Reporter: loukey_j
[jira] [Resolved] (HUDI-4108) Clean the marker files before starting new flink compaction
[ https://issues.apache.org/jira/browse/HUDI-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] loukey_j resolved HUDI-4108.
---

> Clean the marker files before starting new flink compaction
> ------------------------------------------------------------
>
>          Key: HUDI-4108
>          URL: https://issues.apache.org/jira/browse/HUDI-4108
>      Project: Apache Hudi
>   Issue Type: Bug
>     Reporter: loukey_j
>     Assignee: loukey_j
>     Priority: Major
>       Labels: pull-request-available
>
> Caused by: org.apache.hadoop.ipc.RemoteException:
> /xxx/.hoodie/.temp/20220513175804790/kafka_ts=20220513/0041-c9e0-42a3-b267-28f2ada94f83_1-4-1_20220513175804790.parquet.marker.MERGE for client already exists
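The summary suggests deleting leftover markers for the instant before a compaction retry. Below is a hedged sketch of that idea using the Hadoop FileSystem API; it is not the actual Hudi patch, and the helper name is hypothetical. Only the marker-directory layout .hoodie/.temp/<instant> is taken from the exception path above.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class MarkerCleanupSketch {
  // Recursively delete any marker directory left by a failed attempt of the
  // same compaction instant, so a retry does not fail with
  // "... .parquet.marker.MERGE for client already exists".
  static void cleanMarkers(String basePath, String instantTime, Configuration hadoopConf) throws Exception {
    Path markerDir = new Path(basePath, ".hoodie/.temp/" + instantTime);
    FileSystem fs = markerDir.getFileSystem(hadoopConf);
    if (fs.exists(markerDir)) {
      fs.delete(markerDir, true);
    }
  }
}
{code}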
[jira] [Updated] (HUDI-4108) Clean the marker files before staring new flink compaction
[ https://issues.apache.org/jira/browse/HUDI-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] loukey_j updated HUDI-4108:
---
Summary: Clean the marker files before staring new flink compaction (was: Mor table haven't deletes the marker directory when finished compaction in flink)
[jira] [Updated] (HUDI-4108) Clean the marker files before starting new flink compaction
[ https://issues.apache.org/jira/browse/HUDI-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] loukey_j updated HUDI-4108:
---
Summary: Clean the marker files before starting new flink compaction (was: Clean the marker files before staring new flink compaction)
[jira] [Updated] (HUDI-4108) Mor table haven't deletes the marker directory when finished compaction in flink
[ https://issues.apache.org/jira/browse/HUDI-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] loukey_j updated HUDI-4108:
---
Description: Caused by: org.apache.hadoop.ipc.RemoteException: /xxx/.hoodie/.temp/20220513175804790/kafka_ts=20220513/0041-c9e0-42a3-b267-28f2ada94f83_1-4-1_20220513175804790.parquet.marker.MERGE for client already exists
[jira] [Assigned] (HUDI-4108) Mor table haven't deletes the marker directory when finished compaction in flink
[ https://issues.apache.org/jira/browse/HUDI-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] loukey_j reassigned HUDI-4108:
---
Assignee: loukey_j
[jira] [Created] (HUDI-4108) Mor table haven't deletes the marker directory when finished compaction in flink
loukey_j created HUDI-4108:
---
Summary: Mor table haven't deletes the marker directory when finished compaction in flink
Key: HUDI-4108
URL: https://issues.apache.org/jira/browse/HUDI-4108
Project: Apache Hudi
Issue Type: Bug
Reporter: loukey_j
[jira] [Assigned] (HUDI-3962) flink cdc sink hudi failed to add hive partition fields for hive sync
[ https://issues.apache.org/jira/browse/HUDI-3962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] loukey_j reassigned HUDI-3962:
---
Assignee: loukey_j

> flink cdc sink hudi failed to add hive partition fields for hive sync
> ----------------------------------------------------------------------
>
>          Key: HUDI-3962
>          URL: https://issues.apache.org/jira/browse/HUDI-3962
>      Project: Apache Hudi
>   Issue Type: Bug
>     Reporter: yuehanwang
>     Assignee: loukey_j
>     Priority: Major
>       Labels: pull-request-available
>   Original Estimate: 1h
>   Remaining Estimate: 1h
>
> Steps to reproduce the behavior:
> 1. Create a MySQL table:
> {code:sql}
> CREATE TABLE `timeTypeTest` (
>   `id` int(11) NOT NULL AUTO_INCREMENT,
>   `datetime1` datetime DEFAULT NULL,
>   `date1` date DEFAULT NULL,
>   `datetime16` datetime(6) DEFAULT NULL,
>   `time16` time DEFAULT NULL,
>   `timestamp16` timestamp(6) NULL DEFAULT NULL,
>   `timestamp16Partition` varchar(45) DEFAULT NULL,
>   PRIMARY KEY (`id`),
>   UNIQUE KEY `id_UNIQUE` (`id`)
> ) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=latin1
> {code}
> 2. Insert a row:
> {code:sql}
> insert into mydb.timeTypeTest values ('2', '2020-07-30 10:08:22', '2020-07-30', '2020-07-30 10:08:22.00', '10:08:22', '2020-07-30 10:08:22.00', '2020-07-30')
> {code}
> 4. Start a Flink CDC job sinking into Hudi with these config properties:
> {code}
> --hive-sync-enable=true
> --hive-sync-jdbc-url=jdbc:hive2://localhost:1
> --hive-sync-db=testDb
> --hive-sync-table=testTable
> --record-key-field=id
> --partition-path-field=timestamp16
> --hive-sync-partition-fields=inc_day
> --hive-style-partitioning=true
> --hive-sync-mode=jdbc
> --hive-sync-username=hive
> --hive-sync-password=hive
> hoodie.deltastreamer.keygen.timebased.timestamp.type=EPOCHMILLISECONDS
> hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy-MM-dd
> hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled=true
> hive_sync.partition_extractor_class=org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator
> {code}
> Expected behavior: a Hive table testTable is created with the string partition field _inc_day_ and a partition "2020-07-30" is added. Actually, the partition field is _timestamp16_ with bigint type:
> {code:sql}
> show partitions testTable;          -- "2020-07-30"
> select timestamp16 from testTable;  -- null
> {code}
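As a side note on the key generator settings above: with timestamp.type=EPOCHMILLISECONDS and output.dateformat=yyyy-MM-dd, the partition value is the formatted date of an epoch-millisecond timestamp. A small, self-contained illustration with plain java.time follows; this is not Hudi's TimestampBasedAvroKeyGenerator, and the UTC zone is an assumption.

{code:java}
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class PartitionFormatDemo {
  public static void main(String[] args) {
    long epochMillis = 1596103702000L; // 2020-07-30T10:08:22Z
    // Format the epoch-milliseconds value with the configured date pattern.
    String partition = DateTimeFormatter.ofPattern("yyyy-MM-dd")
        .withZone(ZoneOffset.UTC)
        .format(Instant.ofEpochMilli(epochMillis));
    System.out.println(partition); // prints 2020-07-30
  }
}
{code}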
[jira] [Assigned] (HUDI-3962) flink cdc sink hudi failed to add hive partition fields for hive sync
[ https://issues.apache.org/jira/browse/HUDI-3962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] loukey_j reassigned HUDI-3962:
---
Assignee: (was: loukey_j)
[jira] [Assigned] (HUDI-3962) flink cdc sink hudi failed to add hive partition fields for hive sync
[ https://issues.apache.org/jira/browse/HUDI-3962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] loukey_j reassigned HUDI-3962:
---
Assignee: loukey_j
[jira] [Created] (HUDI-3440) HoodieSortedMergeHandle#close write data disorder
loukey_j created HUDI-3440:
---
Summary: HoodieSortedMergeHandle#close write data disorder
Key: HUDI-3440
URL: https://issues.apache.org/jira/browse/HUDI-3440
Project: Apache Hudi
Issue Type: Task
Reporter: loukey_j
Assignee: loukey_j
Fix For: 0.11.0

{code:java}
newRecordKeysSorted = new PriorityQueue<>();
newRecordKeysSorted.addAll(keyToNewRecords.keySet());
newRecordKeysSorted.stream().forEach(key -> { ... });
{code}
Iterating newRecordKeysSorted with stream().forEach does not visit the keys in sorted order, so the records are written out of order.
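The root cause is a documented property of java.util.PriorityQueue: only poll() respects the comparator; iterator() and stream() traverse the backing heap array in no particular order. A self-contained demonstration (not the Hudi code itself):

{code:java}
import java.util.List;
import java.util.PriorityQueue;

public class PriorityQueueOrderDemo {
  public static void main(String[] args) {
    PriorityQueue<String> keys = new PriorityQueue<>();
    keys.addAll(List.of("key3", "key1", "key4", "key2"));

    // Traverses the backing heap array, e.g. key1, key2, key4, key3 -- not sorted.
    keys.stream().forEach(System.out::println);

    // Draining with poll() yields ascending order: key1, key2, key3, key4.
    while (!keys.isEmpty()) {
      System.out.println(keys.poll());
    }
  }
}
{code}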
[jira] [Closed] (HUDI-1421) Improvement of failure recovery for HoodieFlinkStreamer
[ https://issues.apache.org/jira/browse/HUDI-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] loukey_j closed HUDI-1421.
---
Resolution: Fixed

> Improvement of failure recovery for HoodieFlinkStreamer
> --------------------------------------------------------
>
>          Key: HUDI-1421
>          URL: https://issues.apache.org/jira/browse/HUDI-1421
>      Project: Apache Hudi
>   Issue Type: Task
>     Reporter: wangxianghu#1
>     Assignee: loukey_j
>     Priority: Major
>       Labels: pull-request-available
[jira] [Created] (HUDI-1931) BucketAssignFunction use wrong state
loukey_j created HUDI-1931:
---
Summary: BucketAssignFunction use wrong state
Key: HUDI-1931
URL: https://issues.apache.org/jira/browse/HUDI-1931
Project: Apache Hudi
Issue Type: Improvement
Components: Flink Integration
Reporter: loukey_j
Assignee: loukey_j
Fix For: 0.9.0

org.apache.hudi.sink.partitioner.BucketAssignFunction#partitionLoadState and org.apache.hudi.sink.partitioner.BucketAssignFunction#indexState use the wrong state type: the stream produced by RowDataToHoodieFunction is keyed by record key, so indexState should be a ValueState.
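A hedged sketch of the suggested shape, not Hudi's actual BucketAssignFunction (the location value is simplified to a String placeholder and the assignment logic is invented): once the stream is keyed by record key, each key scope holds exactly one index entry, so ValueState fits where a map keyed again by record key would be redundant.

{code:java}
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class BucketAssignSketch extends KeyedProcessFunction<String, String, String> {
  private transient ValueState<String> indexState;

  @Override
  public void open(Configuration parameters) {
    // One value per record key; Flink scopes the state to the current key.
    indexState = getRuntimeContext().getState(
        new ValueStateDescriptor<>("indexState", String.class));
  }

  @Override
  public void processElement(String record, Context ctx, Collector<String> out) throws Exception {
    String location = indexState.value(); // null on the first record for this key
    if (location == null) {
      location = "file-group-for-" + ctx.getCurrentKey(); // hypothetical assignment
      indexState.update(location);
    }
    out.collect(record + " -> " + location);
  }
}
{code}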
[jira] [Assigned] (HUDI-1421) Improvement of failure recovery for HoodieFlinkStreamer
[ https://issues.apache.org/jira/browse/HUDI-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] loukey_j reassigned HUDI-1421:
---
Assignee: loukey_j

> Improvement of failure recovery for HoodieFlinkStreamer
> --------------------------------------------------------
>
>          Key: HUDI-1421
>          URL: https://issues.apache.org/jira/browse/HUDI-1421
>      Project: Apache Hudi
>   Issue Type: Sub-task
>     Reporter: wangxianghu
>     Assignee: loukey_j
>     Priority: Major