[jira] [Updated] (HUDI-7143) schema evolution triggers a CDC query exception

2023-11-27 Thread loukey_j (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loukey_j updated HUDI-7143:
---
Description: 
{code:sql}
sparkSession.sql("CREATE TABLE if not exists hudi_ut_schema_evolution (id INT, 
version INT, name STRING, birthDate TIMESTAMP, inc_day STRING) USING HUDI 
PARTITIONED BY (inc_day) TBLPROPERTIES (hoodie.table.cdc.enabled='true', 
type='cow', primaryKey='id')");

20231127201042503.commit:
sparkSession.sql("merge into hudi_ut_schema_evolution t using ( select 1 as id, 
1 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as 
birthDate, '2023-10-01' as inc_day) s  on t.id=s.id when matched THEN UPDATE 
SET * WHEN NOT MATCHED THEN INSERT *; ");

20231127201113131.commit:
sparkSession.sql("ALTER TABLE hudi_ut_schema_evolution ADD COLUMNS (add1 String 
AFTER id); ");

20231127201124255.commit:
sparkSession.sql("merge into hudi_ut_schema_evolution t using ( select '1' as 
add1, 2 as id, 1 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as 
timestamp) as birthDate, '2023-10-01' as inc_day) s  on t.id=s.id when matched 
THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *; ");

20231127201146659.commit:
sparkSession.sql("ALTER TABLE hudi_ut_schema_evolution DROP COLUMN add1");

20231127201157382.commit:
sparkSession.sql("ALTER TABLE hudi_ut_schema_evolution ADD COLUMNS (add1 int)");

20231127201208532.commit:
sparkSession.sql("merge into hudi_ut_schema_evolution t using ( select 1 as 
add1, 3 as id, 1 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as 
timestamp) as birthDate, '2023-10-01' as inc_day) s  on t.id=s.id when matched 
THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *; ");

sparkSession.sql("select * from hudi_ut_schema_evolution").show(100, false);
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                        |id |version|name |birthDate          |add1|inc_day   |
|20231127201042503  |20231127201042503_0_0|1                 |inc_day=2023-10-01    |2fe30d70-daa3-4ebc-8dab-313116e1f8f3-0_0-103-89_20231127201208532.parquet|1  |1      |str_1|2023-01-01 12:12:12|null|2023-10-01|
|20231127201124255  |20231127201124255_0_1|2                 |inc_day=2023-10-01    |2fe30d70-daa3-4ebc-8dab-313116e1f8f3-0_0-103-89_20231127201208532.parquet|2  |1      |str_1|2023-01-01 12:12:12|null|2023-10-01|
|20231127201208532  |20231127201208532_0_2|3                 |inc_day=2023-10-01    |2fe30d70-daa3-4ebc-8dab-313116e1f8f3-0_0-103-89_20231127201208532.parquet|3  |1      |str_1|2023-01-01 12:12:12|1   |2023-10-01|

sparkSession.sql("select * from 
hudi_table_changes('hudi_ut_schema_evolution','cdc','20231127201042503','20231127201208532')").show(100,
 false);
exception:
org.apache.avro.AvroTypeException: Found string, expecting union
    at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:308)
    at org.apache.avro.io.parsing.Parser.advance(Parser.java:86)
    at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:275)
    at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:187)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:160)
    at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:259)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:247)
    at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:160)
    at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:187)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:160)
    at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:259)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:247)
    at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:160)
    at ...
{code}
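
The failure can be reproduced with plain Avro, independent of Hudi: a record serialized while add1 was a nullable string cannot be decoded against the evolved reader schema in which add1 is a nullable int. A minimal sketch (the one-column record below is an illustration, not Hudi's actual CDC payload schema):

{code:java}
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class UnionEvolutionRepro {
  public static void main(String[] args) throws Exception {
    // schema in force at commit 20231127201124255: add1 is a nullable string
    Schema writer = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"r\",\"fields\":["
            + "{\"name\":\"add1\",\"type\":[\"null\",\"string\"],\"default\":null}]}");
    // schema after DROP COLUMN + ADD COLUMNS (add1 int): add1 is a nullable int
    Schema reader = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"r\",\"fields\":["
            + "{\"name\":\"add1\",\"type\":[\"null\",\"int\"],\"default\":null}]}");

    GenericRecord rec = new GenericData.Record(writer);
    rec.put("add1", "1"); // written while add1 was a string

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(writer).write(rec, enc);
    enc.flush();

    // string has no matching branch in ["null","int"], so decoding throws
    // org.apache.avro.AvroTypeException: Found string, expecting union
    BinaryDecoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    new GenericDatumReader<GenericRecord>(writer, reader).read(null, dec);
  }
}
{code}
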
[jira] [Created] (HUDI-7143) schema evolution triggers a CDC query exception

2023-11-27 Thread loukey_j (Jira)
loukey_j created HUDI-7143:
--

 Summary: schema evolution triggers a CDC query exception
 Key: HUDI-7143
 URL: https://issues.apache.org/jira/browse/HUDI-7143
 Project: Apache Hudi
  Issue Type: Bug
  Components: spark
Reporter: loukey_j



{code:sql}
sparkSession.sql("CREATE TABLE if not exists hudi_ut_schema_evolution (id INT, 
version INT, name STRING, birthDate TIMESTAMP, inc_day STRING) USING HUDI 
PARTITIONED BY (inc_day) TBLPROPERTIES (delta.enableChangeDataFeed='true', 
hoodie.table.cdc.enabled='true', type='cow', primaryKey='id')");
20231127201042503.commit:
sparkSession.sql("merge into hudi_ut_schema_evolution t using ( select 1 as id, 
1 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as 
birthDate, '2023-10-01' as inc_day) s  on t.id=s.id when matched THEN UPDATE 
SET * WHEN NOT MATCHED THEN INSERT *; ");
20231127201113131.commit:
sparkSession.sql("ALTER TABLE hudi_ut_schema_evolution ADD COLUMNS (add1 String 
AFTER id); ");
20231127201124255.commit:
sparkSession.sql("merge into hudi_ut_schema_evolution t using ( select '1' as 
add1, 2 as id, 1 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as 
timestamp) as birthDate, '2023-10-01' as inc_day) s  on t.id=s.id when matched 
THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *; ");
20231127201146659.commit:
sparkSession.sql("ALTER TABLE hudi_ut_schema_evolution DROP COLUMN add1");
20231127201157382.commit:
sparkSession.sql("ALTER TABLE hudi_ut_schema_evolution ADD COLUMNS (add1 int)");
20231127201208532.commit:
sparkSession.sql("merge into hudi_ut_schema_evolution t using ( select 1 as 
add1, 3 as id, 1 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as 
timestamp) as birthDate, '2023-10-01' as inc_day) s  on t.id=s.id when matched 
THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *; ");

sparkSession.sql("select * from hudi_ut_schema_evolution").show(100, false);
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                        |id |version|name |birthDate          |add1|inc_day   |
|20231127201042503  |20231127201042503_0_0|1                 |inc_day=2023-10-01    |2fe30d70-daa3-4ebc-8dab-313116e1f8f3-0_0-103-89_20231127201208532.parquet|1  |1      |str_1|2023-01-01 12:12:12|null|2023-10-01|
|20231127201124255  |20231127201124255_0_1|2                 |inc_day=2023-10-01    |2fe30d70-daa3-4ebc-8dab-313116e1f8f3-0_0-103-89_20231127201208532.parquet|2  |1      |str_1|2023-01-01 12:12:12|null|2023-10-01|
|20231127201208532  |20231127201208532_0_2|3                 |inc_day=2023-10-01    |2fe30d70-daa3-4ebc-8dab-313116e1f8f3-0_0-103-89_20231127201208532.parquet|3  |1      |str_1|2023-01-01 12:12:12|1   |2023-10-01|

sparkSession.sql("select * from 
hudi_table_changes('hudi_ut_schema_evolution','cdc','20231127201042503','20231127201208532')").show(100,
 false);

org.apache.avro.AvroTypeException: Found string, expecting union
    at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:308)
    at org.apache.avro.io.parsing.Parser.advance(Parser.java:86)
    at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:275)
    at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:187)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:160)
    at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:259)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:247)
    at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:160)
    at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:187)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:160)
    at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:259)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:247)
    at ...
{code}

[jira] [Commented] (HUDI-7131) The requested schema is not compatible with the file schema

2023-11-26 Thread loukey_j (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789922#comment-17789922
 ] 

loukey_j commented on HUDI-7131:


[~xushiyan] please take a look 

> The requested schema is not compatible with the file schema
> ---
>
> Key: HUDI-7131
> URL: https://issues.apache.org/jira/browse/HUDI-7131
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.14.0
>Reporter: loukey_j
>Priority: Critical
>  Labels: core, merge, spark
>
> Using a global index, when the partition value of a record changes, the merge 
> reports the error: The requested schema is not compatible with the file schema...
> Why not use the schema from 
> org.apache.hudi.common.table.TableSchemaResolver#getTableAvroSchemaInternal 
> to read the Hudi data?
>  
> CREATE TABLE if not exists unisql.hudi_ut_time_traval
> (id INT, version INT, name STRING, birthDate TIMESTAMP, inc_day STRING) USING 
> HUDI
> PARTITIONED BY (inc_day) TBLPROPERTIES (type='cow', primaryKey='id');
> insert into unisql.hudi_ut_time_traval
> select 1 as id, 1 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' 
> as timestamp) as birthDate, cast('2023-10-01' as date) as inc_day;
> select * from hudi_ut_time_traval;
> |_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name|id |version|name |birthDate          |inc_day   |
> |20231122100234339  |20231122100234339_0_0|1                 |inc_day=2023-10-01    |8a510742-c060-4d12-898e-70bbd122f2e3-0_0-19-16_20231122100234339.parquet|1 |1 |str_1|2023-01-01 12:12:12|2023-10-01|
> merge into hudi_ut_time_traval t using (
> select 1 as id, 2 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' 
> as timestamp) as birthDate, cast('2023-10-02' as date) as inc_day
> ) s on t.id=s.id when matched THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *
> Caused by: org.apache.parquet.io.ParquetDecodingException: The requested schema is not compatible with the file schema. incompatible types: required int32 id != optional int32 id
> at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:101)
> at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:81)
> at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:57)
> at org.apache.parquet.schema.MessageType.accept(MessageType.java:55)
> at org.apache.parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:162)
> at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:135)
> at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:225);
> parquet schema:
> {
> "type" : "record",
> "name" : "hudi_ut_time_traval_record",
> "namespace" : "hoodie.hudi_ut_time_traval",
> "fields" : [ {
> "name" : "_hoodie_commit_time",
> "type" : [ "null", "string" ],
> "doc" : "",
> "default" : null
> }, {
> "name" : "_hoodie_commit_seqno",
> "type" : [ "null", "string" ],
> "doc" : "",
> "default" : null
> }, {
> "name" : "_hoodie_record_key",
> "type" : [ "null", "string" ],
> "doc" : "",
> "default" : null
> }, {
> "name" : "_hoodie_partition_path",
> "type" : [ "null", "string" ],
> "doc" : "",
> "default" : null
> }, {
> "name" : "_hoodie_file_name",
> "type" : [ "null", "string" ],
> "doc" : "",
> "default" : null
> }, {
> "name" : "id",
> "type" : [ "null", "int" ],
> "default" : null
> }, {
> "name" : "version",
> "type" : [ "null", "int" ],
> "default" : null
> }, {
> "name" : "name",
> "type" : [ "null", "string" ],
> "default" : null
> }, {
> "name" : "birthDate",
> "type" : [ "null", {
> "type" : "long",
> "logicalType" : "timestamp-micros"
> } ],
> "default" : null
> }, {
> "name" : "inc_day",
> "type" : [ "null", "string" ],
> "default" : null
> } ]
> }
> org.apache.hudi.io.HoodieMergedReadHandle#readerSchema:
> 

[jira] [Updated] (HUDI-7134) After deleting the field and re-executing the merge, the result is not as expected.

2023-11-22 Thread loukey_j (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loukey_j updated HUDI-7134:
---
Description: 
You can reproduce the problem by following the steps below. The value of add1 
in step 7 is not as expected.

1、CREATE TABLE if not exists hudi_ut_schema_evolution 
(id INT, version INT, name STRING, birthDate TIMESTAMP, inc_day STRING) USING 
HUDI 
PARTITIONED BY (inc_day) TBLPROPERTIES (delta.enableChangeDataFeed='true', 
type='cow', primaryKey='id')

2、merge into hudi_ut_schema_evolution t using ( select 1 as id, 2 as version, 
'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, 
'2023-10-02' as inc_day) s  on t.id=s.id when matched THEN UPDATE SET * WHEN 
NOT MATCHED THEN INSERT *

3、ALTER TABLE hudi_ut_schema_evolution ADD COLUMNS (add1 String AFTER id);

4、merge into hudi_ut_schema_evolution t using ( select '1' as add1, 2 as id, 2 
as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as 
birthDate, '2023-10-02' as inc_day) s  on t.id=s.id when matched THEN UPDATE 
SET * WHEN NOT MATCHED THEN INSERT *
 
5、ALTER TABLE hudi_ut_schema_evolution DROP COLUMN add1;

6、merge into hudi_ut_schema_evolution t using ( select {color:red}'1' as 
add1{color}, 3 as id, 2 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' 
as timestamp) as birthDate, '2023-10-02' as inc_day) s  on t.id=s.id when matched 
THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *;

7、select * from hudi_ut_schema_evolution;
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                        |add1|id |version|name |birthDate          |inc_day   |
|20231122164141030  |20231122164141030_0_0|1                 |inc_day=2023-10-02    |9fa5823c-7e29-4330-9b05-dd72e6088d62-0_0-112-98_20231122165413036.parquet|null|1  |2      |str_1|2023-01-01 12:12:12|2023-10-02|
|20231122165045413  |20231122165045413_0_1|2                 |inc_day=2023-10-02    |9fa5823c-7e29-4330-9b05-dd72e6088d62-0_0-112-98_20231122165413036.parquet|null|2  |2      |str_1|2023-01-01 12:12:12|2023-10-02|
|20231122165413036  |20231122165413036_0_2|3                 |inc_day=2023-10-02    |9fa5823c-7e29-4330-9b05-dd72e6088d62-0_0-112-98_20231122165413036.parquet|{color:red}null{color}|3  |2      |str_1|2023-01-01 12:12:12|2023-10-02|

8、show create table hudi_ut_schema_evolution;
CREATE TABLE unisql.hudi_ut_schema_evolution (
  `_hoodie_commit_time` STRING COMMENT '',
  `_hoodie_commit_seqno` STRING COMMENT '',
  `_hoodie_record_key` STRING COMMENT '',
  `_hoodie_partition_path` STRING COMMENT '',
  `_hoodie_file_name` STRING COMMENT '',
  {color:red}`add1` STRING,
  `id` INT,{color}
  `version` INT,
  `name` STRING,
  `birthDate` TIMESTAMP,
  `inc_day` STRING)
PARTITIONED BY (inc_day)
TBLPROPERTIES(
  'hoodie.query.as.ro.table' = 'false',
  'last_commit_completion_time_sync' = '20231122171640801',
  'last_commit_time_sync' = '20231122171627218',
  'primaryKey' = 'id',
  'type' = 'cow')
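
A minimal check of the reported symptom, assuming the same sparkSession as in the steps above (sketch): step 6 wrote add1 = '1' for id = 3, yet the snapshot read in step 7 returns null for it.

{code:java}
// sketch: read back the column written in step 6 for the row inserted there
Object observed = sparkSession
    .sql("select add1 from hudi_ut_schema_evolution where id = 3")
    .first().get(0);
System.out.println(observed); // expected "1"; observed null per step 7
{code}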

  was:
{code:java}
1、CREATE TABLE if not exists hudi_ut_schema_evolution 
(id INT, version INT, name STRING, birthDate TIMESTAMP, inc_day STRING) USING 
HUDI 
PARTITIONED BY (inc_day) TBLPROPERTIES (delta.enableChangeDataFeed='true', 
type='cow', primaryKey='id')

2、merge into hudi_ut_schema_evolution t using ( select 1 as id, 2 as version, 
'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, 
'2023-10-02' as inc_day) s  on t.id=s.id when matched THEN UPDATE SET * WHEN 
NOT MATCHED THEN INSERT *

3、ALTER TABLE hudi_ut_schema_evolution ADD COLUMNS (add1 String AFTER id);

4、merge into hudi_ut_schema_evolution t using ( select '1' as add1, 2 as id, 2 
as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as 
birthDate, '2023-10-02' as inc_day) s  on t.id=s.id when matched THEN UPDATE 
SET * WHEN NOT MATCHED THEN INSERT *
 
5、ALTER TABLE hudi_ut_schema_evolution DROP COLUMN add1;

6、merge into hudi_ut_schema_evolution t using ( select {color:red}'1' as 
add1{color}, 3 as id, 2 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' 
as timestamp) as birthDate, '2023-10-02' as inc_day) s  on t.id=s.id when matched 
THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *;

7、select * from hudi_ut_schema_evolution;

[jira] [Created] (HUDI-7134) After deleting the field and re-executing the merge, the result is not as expected.

2023-11-22 Thread loukey_j (Jira)
loukey_j created HUDI-7134:
--

 Summary: After deleting the field and re-executing the merge, the 
result is not as expected.
 Key: HUDI-7134
 URL: https://issues.apache.org/jira/browse/HUDI-7134
 Project: Apache Hudi
  Issue Type: Bug
  Components: spark
Affects Versions: 0.14.0
 Environment: hudi 0.14 spark 3.2.1
Reporter: loukey_j


{code:java}
1、CREATE TABLE if not exists hudi_ut_schema_evolution 
(id INT, version INT, name STRING, birthDate TIMESTAMP, inc_day STRING) USING 
HUDI 
PARTITIONED BY (inc_day) TBLPROPERTIES (delta.enableChangeDataFeed='true', 
type='cow', primaryKey='id')

2、merge into hudi_ut_schema_evolution t using ( select 1 as id, 2 as version, 
'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, 
'2023-10-02' as inc_day) s  on t.id=s.id when matched THEN UPDATE SET * WHEN 
NOT MATCHED THEN INSERT *

3、ALTER TABLE hudi_ut_schema_evolution ADD COLUMNS (add1 String AFTER id);

4、merge into hudi_ut_schema_evolution t using ( select '1' as add1, 2 as id, 2 
as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as 
birthDate, '2023-10-02' as inc_day) s  on t.id=s.id when matched THEN UPDATE 
SET * WHEN NOT MATCHED THEN INSERT *
 
5、ALTER TABLE hudi_ut_schema_evolution DROP COLUMN add1;

6、merge into hudi_ut_schema_evolution t using ( select {color:red}'1' as 
add1{color}, 3 as id, 2 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' 
as timestamp) as birthDate, '2023-10-02' as inc_day) s  on t.id=s.id when matched 
THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *;

7、select * from hudi_ut_schema_evolution;
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                        |add1|id |version|name |birthDate          |inc_day   |
|20231122164141030  |20231122164141030_0_0|1                 |inc_day=2023-10-02    |9fa5823c-7e29-4330-9b05-dd72e6088d62-0_0-112-98_20231122165413036.parquet|null|1  |2      |str_1|2023-01-01 12:12:12|2023-10-02|
|20231122165045413  |20231122165045413_0_1|2                 |inc_day=2023-10-02    |9fa5823c-7e29-4330-9b05-dd72e6088d62-0_0-112-98_20231122165413036.parquet|null|2  |2      |str_1|2023-01-01 12:12:12|2023-10-02|
|20231122165413036  |20231122165413036_0_2|3                 |inc_day=2023-10-02    |9fa5823c-7e29-4330-9b05-dd72e6088d62-0_0-112-98_20231122165413036.parquet|{color:red}null{color}|3  |2      |str_1|2023-01-01 12:12:12|2023-10-02|

8、show create table hudi_ut_schema_evolution;
CREATE TABLE unisql.hudi_ut_schema_evolution (
  `_hoodie_commit_time` STRING COMMENT '',
  `_hoodie_commit_seqno` STRING COMMENT '',
  `_hoodie_record_key` STRING COMMENT '',
  `_hoodie_partition_path` STRING COMMENT '',
  `_hoodie_file_name` STRING COMMENT '',
  {color:red}`add1` STRING,
  `id` INT,{color}
  `version` INT,
  `name` STRING,
  `birthDate` TIMESTAMP,
  `inc_day` STRING)
PARTITIONED BY (inc_day)
TBLPROPERTIES(
  'hoodie.query.as.ro.table' = 'false',
  'last_commit_completion_time_sync' = '20231122171640801',
  'last_commit_time_sync' = '20231122171627218',
  'primaryKey' = 'id',
  'type' = 'cow')
{code}






[jira] [Commented] (HUDI-7131) The requested schema is not compatible with the file schema

2023-11-22 Thread loukey_j (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17788667#comment-17788667
 ] 

loukey_j commented on HUDI-7131:


Sorry, I didn't notice that I had cast inc_day to the date type. I corrected the 
SQL and got the same error. Execute the following SQL statements to reproduce. 
The root cause is that hoodieWriteConfig.getSchema() is incompatible with the 
schema of the Hudi table.

1. CREATE TABLE if not exists hudi_ut_time_traval
(id INT, version INT, name STRING, birthDate TIMESTAMP, inc_day STRING) USING 
HUDI
PARTITIONED BY (inc_day) TBLPROPERTIES (type='cow', primaryKey='id');

2. merge into hudi_ut_time_traval t using (select 1 as id, 2 as version, 'str_1' 
as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, 
{color:red}'2023-10-01'{color} as inc_day) s on t.id=s.id when matched THEN 
UPDATE SET * WHEN NOT MATCHED THEN INSERT *

3. merge into hudi_ut_time_traval t using (select 1 as id, 2 as version, 'str_1' 
as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, 
{color:red}'2023-10-02'{color} as inc_day) s on t.id=s.id when matched THEN 
UPDATE SET * WHEN NOT MATCHED THEN INSERT *
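
A sketch of the suggestion above: derive the reader schema from the table's committed schema rather than from hoodieWriteConfig.getSchema(). metaClient is assumed to be the HoodieTableMetaClient of the table; exact method names can differ by Hudi version.

{code:java}
// resolve the schema the table actually committed (sketch, assumptions as noted)
TableSchemaResolver resolver = new TableSchemaResolver(metaClient);
Schema tableSchema = resolver.getTableAvroSchema();
{code}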

> The requested schema is not compatible with the file schema
> ---
>
> Key: HUDI-7131
> URL: https://issues.apache.org/jira/browse/HUDI-7131
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.14.0
>Reporter: loukey_j
>Priority: Critical
>  Labels: core, merge, spark
>
> Using a global index, when the partition value of a record changes, the merge 
> reports the error: The requested schema is not compatible with the file schema...
> Why not use the schema from 
> org.apache.hudi.common.table.TableSchemaResolver#getTableAvroSchemaInternal 
> to read the Hudi data?
>  
> CREATE TABLE if not exists unisql.hudi_ut_time_traval
> (id INT, version INT, name STRING, birthDate TIMESTAMP, inc_day STRING) USING 
> HUDI
> PARTITIONED BY (inc_day) TBLPROPERTIES (type='cow', primaryKey='id');
> insert into unisql.hudi_ut_time_traval
> select 1 as id, 1 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' 
> as timestamp) as birthDate, cast('2023-10-01' as date) as inc_day;
> select * from hudi_ut_time_traval;
> |_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name|id |version|name |birthDate          |inc_day   |
> |20231122100234339  |20231122100234339_0_0|1                 |inc_day=2023-10-01    |8a510742-c060-4d12-898e-70bbd122f2e3-0_0-19-16_20231122100234339.parquet|1 |1 |str_1|2023-01-01 12:12:12|2023-10-01|
> merge into hudi_ut_time_traval t using (
> select 1 as id, 2 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' 
> as timestamp) as birthDate, cast('2023-10-02' as date) as inc_day
> ) s on t.id=s.id when matched THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *
> Caused by: org.apache.parquet.io.ParquetDecodingException: The requested schema is not compatible with the file schema. incompatible types: required int32 id != optional int32 id
> at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:101)
> at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:81)
> at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:57)
> at org.apache.parquet.schema.MessageType.accept(MessageType.java:55)
> at org.apache.parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:162)
> at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:135)
> at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:225);
> parquet schema:
> {
> "type" : "record",
> "name" : "hudi_ut_time_traval_record",
> "namespace" : "hoodie.hudi_ut_time_traval",
> "fields" : [ {
> "name" : "_hoodie_commit_time",
> "type" : [ "null", "string" ],
> "doc" : "",
> "default" : null
> }, {
> "name" : "_hoodie_commit_seqno",
> "type" : [ "null", "string" ],
> "doc" : "",
> "default" : null
> }, {
> "name" : "_hoodie_record_key",
> "type" : [ "null", "string" ],
> "doc" : "",
> "default" : null
> }, {
> "name" : 

[jira] [Commented] (HUDI-7131) The requested schema is not compatible with the file schema

2023-11-21 Thread loukey_j (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17788607#comment-17788607
 ] 

loukey_j commented on HUDI-7131:


The schema of the table has not changed, only the partition value of the data 
has changed.

> The requested schema is not compatible with the file schema
> ---
>
> Key: HUDI-7131
> URL: https://issues.apache.org/jira/browse/HUDI-7131
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.14.0
>Reporter: loukey_j
>Priority: Critical
>  Labels: core, merge, spark
>
> Using a global index, when the partition value of a record changes, the merge 
> reports the error: The requested schema is not compatible with the file schema...
> Why not use the schema from 
> org.apache.hudi.common.table.TableSchemaResolver#getTableAvroSchemaInternal 
> to read the Hudi data?
>  
> CREATE TABLE if not exists unisql.hudi_ut_time_traval
> (id INT, version INT, name STRING, birthDate TIMESTAMP, inc_day STRING) USING 
> HUDI
> PARTITIONED BY (inc_day) TBLPROPERTIES (type='cow', primaryKey='id');
> insert into unisql.hudi_ut_time_traval
> select 1 as id, 1 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' 
> as timestamp) as birthDate, cast('2023-10-01' as date) as inc_day;
> select * from hudi_ut_time_traval;
> |_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name|id |version|name |birthDate          |inc_day   |
> |20231122100234339  |20231122100234339_0_0|1                 |inc_day=2023-10-01    |8a510742-c060-4d12-898e-70bbd122f2e3-0_0-19-16_20231122100234339.parquet|1 |1 |str_1|2023-01-01 12:12:12|2023-10-01|
> merge into hudi_ut_time_traval t using (
> select 1 as id, 2 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' 
> as timestamp) as birthDate, cast('2023-10-02' as date) as inc_day
> ) s on t.id=s.id when matched THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *
> Caused by: org.apache.parquet.io.ParquetDecodingException: The requested schema is not compatible with the file schema. incompatible types: required int32 id != optional int32 id
> at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:101)
> at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:81)
> at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:57)
> at org.apache.parquet.schema.MessageType.accept(MessageType.java:55)
> at org.apache.parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:162)
> at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:135)
> at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:225);
> parquet schema:
> {
> "type" : "record",
> "name" : "hudi_ut_time_traval_record",
> "namespace" : "hoodie.hudi_ut_time_traval",
> "fields" : [ {
> "name" : "_hoodie_commit_time",
> "type" : [ "null", "string" ],
> "doc" : "",
> "default" : null
> }, {
> "name" : "_hoodie_commit_seqno",
> "type" : [ "null", "string" ],
> "doc" : "",
> "default" : null
> }, {
> "name" : "_hoodie_record_key",
> "type" : [ "null", "string" ],
> "doc" : "",
> "default" : null
> }, {
> "name" : "_hoodie_partition_path",
> "type" : [ "null", "string" ],
> "doc" : "",
> "default" : null
> }, {
> "name" : "_hoodie_file_name",
> "type" : [ "null", "string" ],
> "doc" : "",
> "default" : null
> }, {
> "name" : "id",
> "type" : [ "null", "int" ],
> "default" : null
> }, {
> "name" : "version",
> "type" : [ "null", "int" ],
> "default" : null
> }, {
> "name" : "name",
> "type" : [ "null", "string" ],
> "default" : null
> }, {
> "name" : "birthDate",
> "type" : [ "null", {
> "type" : "long",
> "logicalType" : "timestamp-micros"
> } ],
> "default" : null
> }, {
> "name" : "inc_day",
> "type" : [ "null", "string" ],
> "default" : null
> } ]
> }
> org.apache.hudi.io.HoodieMergedReadHandle#readerSchema:
> 

[jira] [Updated] (HUDI-7131) The requested schema is not compatible with the file schema

2023-11-21 Thread loukey_j (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loukey_j updated HUDI-7131:
---
Affects Version/s: 0.14.0

> The requested schema is not compatible with the file schema
> ---
>
> Key: HUDI-7131
> URL: https://issues.apache.org/jira/browse/HUDI-7131
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.14.0
>Reporter: loukey_j
>Priority: Critical
>  Labels: core, merge, spark
>
> Using a global index, when the partition value of a record changes, the merge 
> reports the error: The requested schema is not compatible with the file schema...
> Why not use the schema from 
> org.apache.hudi.common.table.TableSchemaResolver#getTableAvroSchemaInternal 
> to read the Hudi data?
>  
> CREATE TABLE if not exists unisql.hudi_ut_time_traval
> (id INT, version INT, name STRING, birthDate TIMESTAMP, inc_day STRING) USING 
> HUDI
> PARTITIONED BY (inc_day) TBLPROPERTIES (type='cow', primaryKey='id');
> insert into unisql.hudi_ut_time_traval
> select 1 as id, 1 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' 
> as timestamp) as birthDate, cast('2023-10-01' as date) as inc_day;
> select * from hudi_ut_time_traval;
> |_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name|id |version|name |birthDate          |inc_day   |
> |20231122100234339  |20231122100234339_0_0|1                 |inc_day=2023-10-01    |8a510742-c060-4d12-898e-70bbd122f2e3-0_0-19-16_20231122100234339.parquet|1 |1 |str_1|2023-01-01 12:12:12|2023-10-01|
> merge into hudi_ut_time_traval t using (
> select 1 as id, 2 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' 
> as timestamp) as birthDate, cast('2023-10-02' as date) as inc_day
> ) s on t.id=s.id when matched THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *
> Caused by: org.apache.parquet.io.ParquetDecodingException: The requested schema is not compatible with the file schema. incompatible types: required int32 id != optional int32 id
> at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:101)
> at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:81)
> at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:57)
> at org.apache.parquet.schema.MessageType.accept(MessageType.java:55)
> at org.apache.parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:162)
> at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:135)
> at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:225);
> parquet schema:
> {
> "type" : "record",
> "name" : "hudi_ut_time_traval_record",
> "namespace" : "hoodie.hudi_ut_time_traval",
> "fields" : [ {
> "name" : "_hoodie_commit_time",
> "type" : [ "null", "string" ],
> "doc" : "",
> "default" : null
> }, {
> "name" : "_hoodie_commit_seqno",
> "type" : [ "null", "string" ],
> "doc" : "",
> "default" : null
> }, {
> "name" : "_hoodie_record_key",
> "type" : [ "null", "string" ],
> "doc" : "",
> "default" : null
> }, {
> "name" : "_hoodie_partition_path",
> "type" : [ "null", "string" ],
> "doc" : "",
> "default" : null
> }, {
> "name" : "_hoodie_file_name",
> "type" : [ "null", "string" ],
> "doc" : "",
> "default" : null
> }, {
> "name" : "id",
> "type" : [ "null", "int" ],
> "default" : null
> }, {
> "name" : "version",
> "type" : [ "null", "int" ],
> "default" : null
> }, {
> "name" : "name",
> "type" : [ "null", "string" ],
> "default" : null
> }, {
> "name" : "birthDate",
> "type" : [ "null", {
> "type" : "long",
> "logicalType" : "timestamp-micros"
> } ],
> "default" : null
> }, {
> "name" : "inc_day",
> "type" : [ "null", "string" ],
> "default" : null
> } ]
> }
> org.apache.hudi.io.HoodieMergedReadHandle#readerSchema:
> 

[jira] [Created] (HUDI-7131) The requested schema is not compatible with the file schema

2023-11-21 Thread loukey_j (Jira)
loukey_j created HUDI-7131:
--

 Summary: The requested schema is not compatible with the file 
schema
 Key: HUDI-7131
 URL: https://issues.apache.org/jira/browse/HUDI-7131
 Project: Apache Hudi
  Issue Type: Bug
Reporter: loukey_j


Using a global index, when the partition value of a record changes, the merge 
reports the error: The requested schema is not compatible with the file schema...

Why not use the schema from 
org.apache.hudi.common.table.TableSchemaResolver#getTableAvroSchemaInternal to 
read the Hudi data?

 
CREATE TABLE if not exists unisql.hudi_ut_time_traval
(id INT, version INT, name STRING, birthDate TIMESTAMP, inc_day STRING) USING 
HUDI
PARTITIONED BY (inc_day) TBLPROPERTIES (type='cow', primaryKey='id');

insert into unisql.hudi_ut_time_traval
select 1 as id, 1 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as 
timestamp) as birthDate, cast('2023-10-01' as date) as inc_day;

select * from hudi_ut_time_traval;
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name|id |version|name |birthDate          |inc_day   |
|20231122100234339  |20231122100234339_0_0|1                 |inc_day=2023-10-01    |8a510742-c060-4d12-898e-70bbd122f2e3-0_0-19-16_20231122100234339.parquet|1 |1 |str_1|2023-01-01 12:12:12|2023-10-01|

merge into hudi_ut_time_traval t using (
select 1 as id, 2 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as 
timestamp) as birthDate, cast('2023-10-02' as date) as inc_day
) s on t.id=s.id when matched THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *

Caused by: org.apache.parquet.io.ParquetDecodingException: The requested schema is not compatible with the file schema. incompatible types: required int32 id != optional int32 id
    at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:101)
    at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:81)
    at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:57)
    at org.apache.parquet.schema.MessageType.accept(MessageType.java:55)
    at org.apache.parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:162)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:135)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:225);

parquet schema:
{
"type" : "record",
"name" : "hudi_ut_time_traval_record",
"namespace" : "hoodie.hudi_ut_time_traval",
"fields" : [ {
"name" : "_hoodie_commit_time",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_commit_seqno",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_record_key",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_partition_path",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_file_name",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "id",
"type" : [ "null", "int" ],
"default" : null
}, {
"name" : "version",
"type" : [ "null", "int" ],
"default" : null
}, {
"name" : "name",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "birthDate",
"type" : [ "null", {
"type" : "long",
"logicalType" : "timestamp-micros"
} ],
"default" : null
}, {
"name" : "inc_day",
"type" : [ "null", "string" ],
"default" : null
} ]
}

org.apache.hudi.io.HoodieMergedReadHandle#readerSchema:

{
  "type" : "record",
  "name" : "hudi_ut_time_traval_record",
  "namespace" : "hoodie.hudi_ut_time_traval",
  "fields" : [
    { "name" : "id", "type" : "int" },
    { "name" : "version", "type" : "int" },
    { "name" : "name", "type" : "string" },
    { "name" : "birthDate", "type" : [ "null", { "type" : "long", "logicalType" : "timestamp-micros" } ], "default" : null },
    { "name" : "inc_day", "type" : [ "null", { "type" : "int", "logicalType" : "date" } ], "default" : null }
  ]
}
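
The incompatibility is visible when both Avro schemas are converted to Parquet message types: a nullable Avro union becomes an optional Parquet field, while a non-nullable field, as in the readerSchema above, becomes required. A minimal sketch with parquet-avro (class and record names are illustrative):

{code:java}
import org.apache.avro.Schema;
import org.apache.parquet.avro.AvroSchemaConverter;
import org.apache.parquet.schema.MessageType;

public class RequiredVsOptional {
  public static void main(String[] args) {
    // id as written to the file: nullable union -> "optional int32 id"
    Schema fileSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"r\",\"fields\":["
            + "{\"name\":\"id\",\"type\":[\"null\",\"int\"],\"default\":null}]}");
    // id as requested by the readerSchema: non-null -> "required int32 id"
    Schema readerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"r\",\"fields\":[{\"name\":\"id\",\"type\":\"int\"}]}");

    AvroSchemaConverter converter = new AvroSchemaConverter();
    MessageType file = converter.convert(fileSchema);
    MessageType requested = converter.convert(readerSchema);
    System.out.println(file);      // message r { optional int32 id; }
    System.out.println(requested); // message r { required int32 id; }
    // Parquet's ColumnIOFactory rejects reading an optional column through a
    // required projection -> "required int32 id != optional int32 id"
  }
}
{code}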
 





[jira] [Assigned] (HUDI-5837) Spark ctas error

2023-02-23 Thread loukey_j (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loukey_j reassigned HUDI-5837:
--

Assignee: loukey_j

> Spark ctas error
> 
>
> Key: HUDI-5837
> URL: https://issues.apache.org/jira/browse/HUDI-5837
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: loukey_j
>Assignee: loukey_j
>Priority: Major
>
> {color:#020f22}Error in query: Can not create the managed 
> table('default.tt_waybill_info_hudi_01428095'). The associated 
> location('hdfs://xx/warehouse/table') already exists.{color}





[jira] [Created] (HUDI-5837) Spark ctas error

2023-02-23 Thread loukey_j (Jira)
loukey_j created HUDI-5837:
--

 Summary: Spark ctas error
 Key: HUDI-5837
 URL: https://issues.apache.org/jira/browse/HUDI-5837
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: loukey_j


{color:#020f22}Error in query: Can not create the managed 
table('default.tt_waybill_info_hudi_01428095'). The associated 
location('hdfs://xx/warehouse/table') already exists.{color}





[jira] [Created] (HUDI-5801) Speed metaTable initializeFileGroups

2023-02-15 Thread loukey_j (Jira)
loukey_j created HUDI-5801:
--

 Summary: Speed metaTable initializeFileGroups
 Key: HUDI-5801
 URL: https://issues.apache.org/jira/browse/HUDI-5801
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: loukey_j


org.apache.hudi.metadata.HoodieBackedTableMetadataWriter#initializeFileGroups is 
too slow when there are many file groups.
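
For illustration, a generic sketch of the kind of speedup being asked for: overlapping the per-file-group initialization work instead of performing it serially. The helper below is hypothetical and stands in for whatever initializeFileGroups does per file group; it is not Hudi's code.

{code:java}
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelFileGroupInit {
  // hypothetical stand-in for the per-file-group setup done in initializeFileGroups
  static void initializeFileGroup(String fileGroupId) {
    // create the first log file for the file group, write its header, ...
  }

  // serial: one filesystem round trip after another, O(n) latency
  static void initSerial(List<String> fileGroupIds) {
    fileGroupIds.forEach(ParallelFileGroupInit::initializeFileGroup);
  }

  // parallel: a bounded pool overlaps the filesystem latency
  static void initParallel(List<String> fileGroupIds, int parallelism)
      throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(parallelism);
    fileGroupIds.forEach(id -> pool.submit(() -> initializeFileGroup(id)));
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
  }
}
{code}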





[jira] [Created] (HUDI-5671) BucketIndexPartitioner partition algorithm skew

2023-01-31 Thread loukey_j (Jira)
loukey_j created HUDI-5671:
--

 Summary: BucketIndexPartitioner partition algorithm skew
 Key: HUDI-5671
 URL: https://issues.apache.org/jira/browse/HUDI-5671
 Project: Apache Hudi
  Issue Type: Improvement
  Components: flink, index
Reporter: loukey_j
 Attachments: image-2023-02-01-14-45-33-116.png, 
image-2023-02-01-14-50-29-703.png, image-2023-02-01-15-00-14-889.png, 
image-2023-02-01-15-05-15-491.png

The job has been running online for 13 days, and we found subtasks that process 
no data at all, as shown in the figure below. The job uses the update time as the 
partition field and the bucket index, with 128 buckets and a write parallelism of 
128. The keys are uniformly distributed, since the file sizes of the buckets 
differ little on storage. After investigation, the shuffle algorithm is skewed.

!image-2023-02-01-14-45-33-116.png!

Potential drawbacks of the skew:
  1. Memory usage is uneven; some nodes come under high JVM pressure, and TM 
nodes are prone to timeouts.
  2. Checkpoints may time out, because data is flushed to HDFS while snapshotting 
state; with severe skew, some nodes take too long and the checkpoint times out.

current algorithm:

!image-2023-02-01-14-50-29-703.png!

Flaws of the current algorithm:
  1. curBucket ∈ [0, numBuckets - 1].
  2. A single partition holds at most numBuckets distinct globalHash values; 
globalHash is scattered, so mod(globalHash, numPartitions) collides easily.
  3. When numBuckets is relatively large, shuffleIndex collides frequently, 
resulting in skew.

Algorithm optimization:

!image-2023-02-01-15-00-14-889.png!

kb = key % b,                  kb ∈ [0, b-1]
pw = pt % w,                   pw ∈ [0, w-1]
shuffleIndex = (pw + kb) % w,  shuffleIndex ∈ [0, w-1]

In effect, a slot pw is first computed from the partition; pw can be understood 
as the slot Wn allocated to that partition, so each partition gets its own slot. 
The partition's data is then written starting from that slot, offset by up to b 
slots, as sketched in the code after the figure below.

!image-2023-02-01-15-05-15-491.png!
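
A sketch of the optimized computation described by the formulas above, with b = numBuckets, w = the write parallelism, key standing for the record's bucket number, and pt for a hash of the partition; floorMod and the partition-hash choice are assumptions of this illustration, not the patched operator.

{code:java}
public class BucketShuffleSketch {
  // shuffleIndex = (pw + kb) % w, with pw = pt % w and kb = key % b
  static int shuffleIndex(int bucketNum, int partitionHash, int numBuckets, int parallelism) {
    int kb = Math.floorMod(bucketNum, numBuckets);      // kb ∈ [0, b-1]
    int pw = Math.floorMod(partitionHash, parallelism); // the partition's slot, pw ∈ [0, w-1]
    return Math.floorMod(pw + kb, parallelism);         // shuffleIndex ∈ [0, w-1]
  }

  public static void main(String[] args) {
    // the same bucket number in two different partitions lands on different subtasks
    System.out.println(shuffleIndex(11, "2023-01-01".hashCode(), 128, 128));
    System.out.println(shuffleIndex(11, "2023-01-02".hashCode(), 128, 128));
  }
}
{code}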

 

 

 





[jira] [Updated] (HUDI-5568) incorrect use of fileSystemView

2023-01-16 Thread loukey_j (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loukey_j updated HUDI-5568:
---
Summary:  incorrect use of fileSystemView  (was:  incorrect use of 
FileSystemView)

>  incorrect use of fileSystemView
> 
>
> Key: HUDI-5568
> URL: https://issues.apache.org/jira/browse/HUDI-5568
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: loukey_j
>Assignee: loukey_j
>Priority: Major
>
> writeClient.getHoodieTable().getFileSystemView() always returns the local 
> fileSystemView;
> writeClient.getHoodieTable().getHoodieView() should be used instead to 
> determine the fileSystemView





[jira] [Assigned] (HUDI-5568) incorrect use of FileSystemView

2023-01-16 Thread loukey_j (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loukey_j reassigned HUDI-5568:
--

Assignee: loukey_j

>  incorrect use of FileSystemView
> 
>
> Key: HUDI-5568
> URL: https://issues.apache.org/jira/browse/HUDI-5568
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: loukey_j
>Assignee: loukey_j
>Priority: Major
>
> writeClient.getHoodieTable().getFileSystemView() always returns the local 
> fileSystemView;
> writeClient.getHoodieTable().getHoodieView() should be used instead to 
> determine the fileSystemView





[jira] [Created] (HUDI-5568) incorrect use of FileSystemView

2023-01-16 Thread loukey_j (Jira)
loukey_j created HUDI-5568:
--

 Summary:  incorrect use of FileSystemView
 Key: HUDI-5568
 URL: https://issues.apache.org/jira/browse/HUDI-5568
 Project: Apache Hudi
  Issue Type: Improvement
  Components: flink
Reporter: loukey_j


writeClient.getHoodieTable().getFileSystemView() always returns the local 
fileSystemView;

writeClient.getHoodieTable().getHoodieView() should be used instead to determine 
the fileSystemView





[jira] [Created] (HUDI-5441) different buckets for different partitions

2022-12-20 Thread loukey_j (Jira)
loukey_j created HUDI-5441:
--

 Summary: different buckets for different partitions
 Key: HUDI-5441
 URL: https://issues.apache.org/jira/browse/HUDI-5441
 Project: Apache Hudi
  Issue Type: Improvement
  Components: flink
Reporter: loukey_j








[jira] [Created] (HUDI-5373) Different fileids are assigned to the same bucket

2022-12-12 Thread loukey_j (Jira)
loukey_j created HUDI-5373:
--

 Summary:  Different fileids are assigned to the same bucket
 Key: HUDI-5373
 URL: https://issues.apache.org/jira/browse/HUDI-5373
 Project: Apache Hudi
  Issue Type: Bug
Reporter: loukey_j


partition = 30,  bucketNum = 11  ->  bucketId = "3011"
partition = 301, bucketNum = 1   ->  bucketId = "3011"

Different fileIds are assigned to the same bucket.

final String bucketId = partition + bucketNum;

if (incBucketIndex.contains(bucketId)) {
  location = new HoodieRecordLocation("I", bucketToFileId.get(bucketNum));
} else if (bucketToFileId.containsKey(bucketNum)) {
  location = new HoodieRecordLocation("U", bucketToFileId.get(bucketNum));
} else {
  String newFileId = BucketIdentifier.newBucketFileIdPrefix(bucketNum);
  location = new HoodieRecordLocation("I", newFileId);
  bucketToFileId.put(bucketNum, newFileId);
  incBucketIndex.add(bucketId);
}
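
A self-contained demonstration of the collision with the values from this report, together with one possible remedy: separating partition and bucket number with a delimiter (the delimiter fix is an assumption for illustration, not necessarily the committed patch).

{code:java}
public class BucketIdCollision {
  public static void main(String[] args) {
    // string concatenation: two different (partition, bucketNum) pairs, same key
    String a = "30" + 11;  // "3011"
    String b = "301" + 1;  // "3011"
    System.out.println(a.equals(b)); // true -> both map to the same bucketId

    // delimited form keeps the pairs distinct
    String a2 = "30" + "_" + 11;  // "30_11"
    String b2 = "301" + "_" + 1;  // "301_1"
    System.out.println(a2.equals(b2)); // false
  }
}
{code}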





[jira] [Assigned] (HUDI-5373) Different fileids are assigned to the same bucket

2022-12-12 Thread loukey_j (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loukey_j reassigned HUDI-5373:
--

Assignee: loukey_j

>  Different fileids are assigned to the same bucket
> --
>
> Key: HUDI-5373
> URL: https://issues.apache.org/jira/browse/HUDI-5373
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: loukey_j
>Assignee: loukey_j
>Priority: Major
>
> partition = 30,  bucketNum = 11  ->  bucketId = "3011"
> partition = 301, bucketNum = 1   ->  bucketId = "3011"
> Different fileIds are assigned to the same bucket.
> final String bucketId = partition + bucketNum;
> if (incBucketIndex.contains(bucketId)) {
>   location = new HoodieRecordLocation("I", bucketToFileId.get(bucketNum));
> } else if (bucketToFileId.containsKey(bucketNum)) {
>   location = new HoodieRecordLocation("U", bucketToFileId.get(bucketNum));
> } else {
>   String newFileId = BucketIdentifier.newBucketFileIdPrefix(bucketNum);
>   location = new HoodieRecordLocation("I", newFileId);
>   bucketToFileId.put(bucketNum, newFileId);
>   incBucketIndex.add(bucketId);
> }





[jira] [Updated] (HUDI-5240) Clean content when recursive Invocation inflate

2022-11-17 Thread loukey_j (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loukey_j updated HUDI-5240:
---
 Attachment: image-2022-11-18-14-57-06-393.png
Description: !image-2022-11-18-14-57-06-393.png!

> Clean content when recursive Invocation inflate
> ---
>
> Key: HUDI-5240
> URL: https://issues.apache.org/jira/browse/HUDI-5240
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: loukey_j
>Assignee: loukey_j
>Priority: Major
> Attachments: image-2022-11-18-14-57-06-393.png
>
>
> !image-2022-11-18-14-57-06-393.png!





[jira] [Assigned] (HUDI-5240) Clean content when recursive Invocation inflate

2022-11-17 Thread loukey_j (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loukey_j reassigned HUDI-5240:
--

Assignee: loukey_j

> Clean content when recursive Invocation inflate
> ---
>
> Key: HUDI-5240
> URL: https://issues.apache.org/jira/browse/HUDI-5240
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: loukey_j
>Assignee: loukey_j
>Priority: Major
>






[jira] [Created] (HUDI-5240) Clean content when recursive Invocation inflate

2022-11-17 Thread loukey_j (Jira)
loukey_j created HUDI-5240:
--

 Summary: Clean content when recursive Invocation inflate
 Key: HUDI-5240
 URL: https://issues.apache.org/jira/browse/HUDI-5240
 Project: Apache Hudi
  Issue Type: Bug
Reporter: loukey_j








[jira] [Assigned] (HUDI-4869) Fix test for HUDI-4780

2022-09-17 Thread loukey_j (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loukey_j reassigned HUDI-4869:
--

Assignee: loukey_j

> Fix test for HUDI-4780
> --
>
> Key: HUDI-4869
> URL: https://issues.apache.org/jira/browse/HUDI-4869
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: loukey_j
>Assignee: loukey_j
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Updated] (HUDI-4869) Fix test for HUDI-4780

2022-09-17 Thread loukey_j (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loukey_j updated HUDI-4869:
---
Summary: Fix test for HUDI-4780  (was: Add test for HUDI-4780)

> Fix test for HUDI-4780
> --
>
> Key: HUDI-4869
> URL: https://issues.apache.org/jira/browse/HUDI-4869
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: loukey_j
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Created] (HUDI-4869) Add test for HUDI-4780

2022-09-17 Thread loukey_j (Jira)
loukey_j created HUDI-4869:
--

 Summary: Add test for HUDI-4780
 Key: HUDI-4869
 URL: https://issues.apache.org/jira/browse/HUDI-4869
 Project: Apache Hudi
  Issue Type: Test
Reporter: loukey_j








[jira] [Assigned] (HUDI-4790) A more effective HoodieMergeHandler for COW table with parquet

2022-09-06 Thread loukey_j (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loukey_j reassigned HUDI-4790:
--

Assignee: loukey_j

>  A more effective HoodieMergeHandler for COW table with parquet
> ---
>
> Key: HUDI-4790
> URL: https://issues.apache.org/jira/browse/HUDI-4790
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: loukey_j
>Assignee: loukey_j
>Priority: Major
>






[jira] [Created] (HUDI-4790) A more effective HoodieMergeHandler for COW table with parquet

2022-09-06 Thread loukey_j (Jira)
loukey_j created HUDI-4790:
--

 Summary:  A more effective HoodieMergeHandler for COW table with 
parquet
 Key: HUDI-4790
 URL: https://issues.apache.org/jira/browse/HUDI-4790
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: loukey_j








[jira] [Assigned] (HUDI-4780) hoodie.logfile.max.size It does not take effect, causing the log file to be too large

2022-09-05 Thread loukey_j (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loukey_j reassigned HUDI-4780:
--

Assignee: loukey_j

> hoodie.logfile.max.size  It does not take effect, causing the log file to be 
> too large
> --
>
> Key: HUDI-4780
> URL: https://issues.apache.org/jira/browse/HUDI-4780
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: loukey_j
>Assignee: loukey_j
>Priority: Major
>






[jira] [Created] (HUDI-4780) hoodie.logfile.max.size It does not take effect, causing the log file to be too large

2022-09-05 Thread loukey_j (Jira)
loukey_j created HUDI-4780:
--

 Summary: hoodie.logfile.max.size  It does not take effect, causing 
the log file to be too large
 Key: HUDI-4780
 URL: https://issues.apache.org/jira/browse/HUDI-4780
 Project: Apache Hudi
  Issue Type: Bug
Reporter: loukey_j








[jira] [Updated] (HUDI-4259) flink create avro schema not conformance to standards

2022-06-15 Thread loukey_j (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loukey_j updated HUDI-4259:
---
Description: 
[https://avro.apache.org/docs/current/spec.html#schema_complex]
 * A name and namespace are both specified. For example, one might use "name": 
"X", "namespace": "org.foo" to indicate the fullname org.foo.X.
 * 
org.apache.hudi.util.AvroSchemaConverter#convertToSchema(org.apache.flink.table.types.logical.LogicalType,
 java.lang.String)

  was:
[https://avro.apache.org/docs/current/spec.html#schema_complex]


 * A name and namespace are both specified. For example, one might use "name": 
"X", "namespace": "org.foo" to indicate the fullname org.foo.X.


> flink create avro schema  not conformance to standards
> --
>
> Key: HUDI-4259
> URL: https://issues.apache.org/jira/browse/HUDI-4259
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: loukey_j
>Priority: Major
>
> [https://avro.apache.org/docs/current/spec.html#schema_complex]
>  * A name and namespace are both specified. For example, one might use 
> "name": "X", "namespace": "org.foo" to indicate the fullname org.foo.X.
>  * 
> org.apache.hudi.util.AvroSchemaConverter#convertToSchema(org.apache.flink.table.types.logical.LogicalType,
>  java.lang.String)





[jira] [Updated] (HUDI-4259) flink create avro schema not conformance to standards

2022-06-15 Thread loukey_j (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loukey_j updated HUDI-4259:
---
Summary: flink create avro schema  not conformance to standards  (was: hudi 
flink convert avro schema  not conformance to standards)

> flink create avro schema  not conformance to standards
> --
>
> Key: HUDI-4259
> URL: https://issues.apache.org/jira/browse/HUDI-4259
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: loukey_j
>Priority: Major
>
> [https://avro.apache.org/docs/current/spec.html#schema_complex]
>  * A name and namespace are both specified. For example, one might use 
> "name": "X", "namespace": "org.foo" to indicate the fullname org.foo.X.





[jira] [Updated] (HUDI-4260) change FlinkOptions#KEYGEN_CLASS_NAME to no default value

2022-06-15 Thread loukey_j (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loukey_j updated HUDI-4260:
---
Summary: change FlinkOptions#KEYGEN_CLASS_NAME to no default value  (was: 
change FlinkOptions#KEYGEN_CLASS_NAME without a default value)

> change FlinkOptions#KEYGEN_CLASS_NAME to no default value
> -
>
> Key: HUDI-4260
> URL: https://issues.apache.org/jira/browse/HUDI-4260
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: loukey_j
>Priority: Major
>






[jira] [Created] (HUDI-4260) change FlinkOptions#KEYGEN_CLASS_NAME without a default value

2022-06-15 Thread loukey_j (Jira)
loukey_j created HUDI-4260:
--

 Summary: change FlinkOptions#KEYGEN_CLASS_NAME without a default 
value
 Key: HUDI-4260
 URL: https://issues.apache.org/jira/browse/HUDI-4260
 Project: Apache Hudi
  Issue Type: Bug
Reporter: loukey_j








[jira] [Created] (HUDI-4259) hudi flink convert avro schema not conformance to standards

2022-06-15 Thread loukey_j (Jira)
loukey_j created HUDI-4259:
--

 Summary: hudi flink convert avro schema  not conformance to 
standards
 Key: HUDI-4259
 URL: https://issues.apache.org/jira/browse/HUDI-4259
 Project: Apache Hudi
  Issue Type: Bug
Reporter: loukey_j


[https://avro.apache.org/docs/current/spec.html#schema_complex]


 * A name and namespace are both specified. For example, one might use "name": 
"X", "namespace": "org.foo" to indicate the fullname org.foo.X.

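For reference, a minimal sketch of what a conformant conversion should produce, using Avro's SchemaBuilder (record and field names are illustrative):

{code:java}
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class FullNameExample {
  public static void main(String[] args) {
    // Per the Avro spec: "name": "X" with "namespace": "org.foo" => fullname org.foo.X
    Schema s = SchemaBuilder.record("X").namespace("org.foo")
        .fields().requiredInt("id").endRecord();
    System.out.println(s.getFullName()); // org.foo.X
    // A converter must keep name and namespace separate; emitting the dotted
    // fullname as the record "name" would not conform to the spec.
  }
}
{code}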




[jira] [Updated] (HUDI-4133) Spark query mor by snapshot query lost data

2022-05-22 Thread loukey_j (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loukey_j updated HUDI-4133:
---
Summary: Spark query  mor by snapshot query lost data   (was: Sprak query  
mor by snapshot query lost data )

> Spark query  mor by snapshot query lost data 
> -
>
> Key: HUDI-4133
> URL: https://issues.apache.org/jira/browse/HUDI-4133
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core, flink, spark-sql
>Reporter: loukey_j
>Assignee: loukey_j
>Priority: Major
>
> Suppose two non-intersecting batches of data are written in turn by Flink to a 
> new, non-partitioned Hudi MOR table.
> The Hoodie timeline and log files are as follows:
>  
> hdfs dfs -ls hdfs://xxx/mor_test/.hoodie
>      0 2022-05-21 16:41 hdfs://xxx/mor_test/.hoodie/.aux
>      0 2022-05-21 16:41 hdfs://xxx/mor_test/.hoodie/.schema
>      0 2022-05-21 16:42 hdfs://xxx/mor_test/.hoodie/.temp
>   5291 2022-05-21 16:42 
> hdfs://xxx/mor_test/.hoodie/20220521164201245.deltacommit
>      0 2022-05-21 16:42 
> hdfs://xxx/mor_test/.hoodie/20220521164201245.deltacommit.inflight
>      0 2022-05-21 16:42 
> hdfs://xxx/mor_test/.hoodie/20220521164201245.deltacommit.requested
>   5291 2022-05-21 16:42 
> hdfs://xxx/mor_test/.hoodie/20220521164214473.deltacommit
>      0 2022-05-21 16:42 
> hdfs://xxx/mor_test/.hoodie/20220521164214473.deltacommit.inflight
>      0 2022-05-21 16:42 
> hdfs://xxx/mor_test/.hoodie/20220521164214473.deltacommit.requested
>      0 2022-05-21 16:41 hdfs://xxx/mor_test/.hoodie/archived
>    798 2022-05-21 16:41 hdfs://xxx/mor_test/.hoodie/hoodie.properties
> hdfs dfs -ls hdfs://xxx/mor_test/
>  13316 2022-05-21 16:42 
> hdfs://xxx/mor_test/.-1dd6-4395-9c90-53f8a6c6eed3_20220521164201245.log.1_0-2-0
>  28395 2022-05-21 16:42 
> hdfs://xxx/mor_test/.-1dd6-4395-9c90-53f8a6c6eed3_20220521164214473.log.1_0-2-0
>      0 2022-05-21 16:42 hdfs://xxx/mor_test/.hoodie
>    100 2022-05-21 16:42 hdfs://xxx/mor_test/.hoodie_partition_metadata
>  
> Using a Spark snapshot query, run the SQL 'select distinct 
> _hoodie_commit_time from mor_test_rt'. 
> The expected result is 20220521164201245 and 20220521164214473, but the actual 
> result is only 20220521164214473.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4133) Sprak query mor by snapshot query lost data

2022-05-22 Thread loukey_j (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loukey_j updated HUDI-4133:
---
Component/s: flink
 spark-sql

> Sprak query  mor by snapshot query lost data 
> -
>
> Key: HUDI-4133
> URL: https://issues.apache.org/jira/browse/HUDI-4133
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core, flink, spark-sql
>Reporter: loukey_j
>Priority: Major
>
> Suppose two non-intersecting batches of data are written in turn by Flink to a 
> new Hudi MOR non-partitioned table.
> The Hoodie timeline and log files are as follows:
>  
> hdfs dfs -ls hdfs://xxx/mor_test/.hoodie
>      0 2022-05-21 16:41 hdfs://xxx/mor_test/.hoodie/.aux
>      0 2022-05-21 16:41 hdfs://xxx/mor_test/.hoodie/.schema
>      0 2022-05-21 16:42 hdfs://xxx/mor_test/.hoodie/.temp
>   5291 2022-05-21 16:42 
> hdfs://xxx/mor_test/.hoodie/20220521164201245.deltacommit
>      0 2022-05-21 16:42 
> hdfs://xxx/mor_test/.hoodie/20220521164201245.deltacommit.inflight
>      0 2022-05-21 16:42 
> hdfs://xxx/mor_test/.hoodie/20220521164201245.deltacommit.requested
>   5291 2022-05-21 16:42 
> hdfs://xxx/mor_test/.hoodie/20220521164214473.deltacommit
>      0 2022-05-21 16:42 
> hdfs://xxx/mor_test/.hoodie/20220521164214473.deltacommit.inflight
>      0 2022-05-21 16:42 
> hdfs://xxx/mor_test/.hoodie/20220521164214473.deltacommit.requested
>      0 2022-05-21 16:41 hdfs://xxx/mor_test/.hoodie/archived
>    798 2022-05-21 16:41 hdfs://xxx/mor_test/.hoodie/hoodie.properties
> hdfs dfs -ls hdfs://xxx/mor_test/
>  13316 2022-05-21 16:42 
> hdfs://xxx/mor_test/.-1dd6-4395-9c90-53f8a6c6eed3_20220521164201245.log.1_0-2-0
>  28395 2022-05-21 16:42 
> hdfs://xxx/mor_test/.-1dd6-4395-9c90-53f8a6c6eed3_20220521164214473.log.1_0-2-0
>      0 2022-05-21 16:42 hdfs://xxx/mor_test/.hoodie
>    100 2022-05-21 16:42 hdfs://xxx/mor_test/.hoodie_partition_metadata
>  
> Using a Spark snapshot query, run the SQL 'select distinct 
> _hoodie_commit_time from mor_test_rt'. 
> The expected result is 20220521164201245 and 20220521164214473, but the actual 
> result is only 20220521164214473.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (HUDI-4133) Sprak query mor by snapshot query lost data

2022-05-22 Thread loukey_j (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loukey_j reassigned HUDI-4133:
--

Assignee: loukey_j

> Sprak query  mor by snapshot query lost data 
> -
>
> Key: HUDI-4133
> URL: https://issues.apache.org/jira/browse/HUDI-4133
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core, flink, spark-sql
>Reporter: loukey_j
>Assignee: loukey_j
>Priority: Major
>
> Suppose two non-intersecting batches of data are written in turn by Flink to a 
> new Hudi MOR non-partitioned table.
> The Hoodie timeline and log files are as follows:
>  
> hdfs dfs -ls hdfs://xxx/mor_test/.hoodie
>      0 2022-05-21 16:41 hdfs://xxx/mor_test/.hoodie/.aux
>      0 2022-05-21 16:41 hdfs://xxx/mor_test/.hoodie/.schema
>      0 2022-05-21 16:42 hdfs://xxx/mor_test/.hoodie/.temp
>   5291 2022-05-21 16:42 
> hdfs://xxx/mor_test/.hoodie/20220521164201245.deltacommit
>      0 2022-05-21 16:42 
> hdfs://xxx/mor_test/.hoodie/20220521164201245.deltacommit.inflight
>      0 2022-05-21 16:42 
> hdfs://xxx/mor_test/.hoodie/20220521164201245.deltacommit.requested
>   5291 2022-05-21 16:42 
> hdfs://xxx/mor_test/.hoodie/20220521164214473.deltacommit
>      0 2022-05-21 16:42 
> hdfs://xxx/mor_test/.hoodie/20220521164214473.deltacommit.inflight
>      0 2022-05-21 16:42 
> hdfs://xxx/mor_test/.hoodie/20220521164214473.deltacommit.requested
>      0 2022-05-21 16:41 hdfs://xxx/mor_test/.hoodie/archived
>    798 2022-05-21 16:41 hdfs://xxx/mor_test/.hoodie/hoodie.properties
> hdfs dfs -ls hdfs://xxx/mor_test/
>  13316 2022-05-21 16:42 
> hdfs://xxx/mor_test/.-1dd6-4395-9c90-53f8a6c6eed3_20220521164201245.log.1_0-2-0
>  28395 2022-05-21 16:42 
> hdfs://xxx/mor_test/.-1dd6-4395-9c90-53f8a6c6eed3_20220521164214473.log.1_0-2-0
>      0 2022-05-21 16:42 hdfs://xxx/mor_test/.hoodie
>    100 2022-05-21 16:42 hdfs://xxx/mor_test/.hoodie_partition_metadata
>  
> Using a Spark snapshot query, run the SQL 'select distinct 
> _hoodie_commit_time from mor_test_rt'. 
> The expected result is 20220521164201245 and 20220521164214473, but the actual 
> result is only 20220521164214473.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HUDI-4133) Sprak query mor by snapshot query lost data

2022-05-22 Thread loukey_j (Jira)
loukey_j created HUDI-4133:
--

 Summary: Sprak query  mor by snapshot query lost data 
 Key: HUDI-4133
 URL: https://issues.apache.org/jira/browse/HUDI-4133
 Project: Apache Hudi
  Issue Type: Bug
  Components: core
Reporter: loukey_j


Suppose two non-intersecting batches of data are written in turn by Flink to a 
new Hudi MOR non-partitioned table.
The Hoodie timeline and log files are as follows:
 
hdfs dfs -ls hdfs://xxx/mor_test/.hoodie
     0 2022-05-21 16:41 hdfs://xxx/mor_test/.hoodie/.aux
     0 2022-05-21 16:41 hdfs://xxx/mor_test/.hoodie/.schema
     0 2022-05-21 16:42 hdfs://xxx/mor_test/.hoodie/.temp
  5291 2022-05-21 16:42 
hdfs://xxx/mor_test/.hoodie/20220521164201245.deltacommit
     0 2022-05-21 16:42 
hdfs://xxx/mor_test/.hoodie/20220521164201245.deltacommit.inflight
     0 2022-05-21 16:42 
hdfs://xxx/mor_test/.hoodie/20220521164201245.deltacommit.requested
  5291 2022-05-21 16:42 
hdfs://xxx/mor_test/.hoodie/20220521164214473.deltacommit
     0 2022-05-21 16:42 
hdfs://xxx/mor_test/.hoodie/20220521164214473.deltacommit.inflight
     0 2022-05-21 16:42 
hdfs://xxx/mor_test/.hoodie/20220521164214473.deltacommit.requested
     0 2022-05-21 16:41 hdfs://xxx/mor_test/.hoodie/archived
   798 2022-05-21 16:41 hdfs://xxx/mor_test/.hoodie/hoodie.properties

hdfs dfs -ls hdfs://xxx/mor_test/
 13316 2022-05-21 16:42 
hdfs://xxx/mor_test/.-1dd6-4395-9c90-53f8a6c6eed3_20220521164201245.log.1_0-2-0
 28395 2022-05-21 16:42 
hdfs://xxx/mor_test/.-1dd6-4395-9c90-53f8a6c6eed3_20220521164214473.log.1_0-2-0
     0 2022-05-21 16:42 hdfs://xxx/mor_test/.hoodie
   100 2022-05-21 16:42 hdfs://xxx/mor_test/.hoodie_partition_metadata
 

Using a Spark snapshot query, run the SQL 'select distinct _hoodie_commit_time 
from mor_test_rt'. 
The expected result is 20220521164201245 and 20220521164214473, but the actual 
result is only 20220521164214473.
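
For reference, a hedged sketch of the reading side (the table path and session 
setup are assumed, not taken from this report): load the MOR table with a Spark 
snapshot query and list the distinct commit times.

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MorSnapshotQueryDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("mor-snapshot-query")
            .getOrCreate();

        // Snapshot is the default query type; it is set explicitly for clarity.
        Dataset<Row> df = spark.read().format("hudi")
            .option("hoodie.datasource.query.type", "snapshot")
            .load("hdfs://xxx/mor_test");

        // Both 20220521164201245 and 20220521164214473 should appear;
        // the report observes only 20220521164214473.
        df.select("_hoodie_commit_time").distinct().show(false);
    }
}
{code}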



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (HUDI-4108) Clean the marker files before starting new flink compaction

2022-05-17 Thread loukey_j (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loukey_j resolved HUDI-4108.


> Clean the marker files before starting new flink compaction
> ---
>
> Key: HUDI-4108
> URL: https://issues.apache.org/jira/browse/HUDI-4108
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: loukey_j
>Assignee: loukey_j
>Priority: Major
>  Labels: pull-request-available
>
> Caused by: org.apache.hadoop.ipc.RemoteException: 
> /xxx/.hoodie/.temp/20220513175804790/kafka_ts=20220513/0041-c9e0-42a3-b267-28f2ada94f83_1-4-1_20220513175804790.parquet.marker.MERGE
>  for client  already exists



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4108) Clean the marker files before staring new flink compaction

2022-05-17 Thread loukey_j (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loukey_j updated HUDI-4108:
---
Summary: Clean the marker files before staring new flink compaction  (was: 
MOR table does not delete the marker directory when Flink compaction 
finishes)

> Clean the marker files before staring new flink compaction
> --
>
> Key: HUDI-4108
> URL: https://issues.apache.org/jira/browse/HUDI-4108
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: loukey_j
>Assignee: loukey_j
>Priority: Major
>
> Caused by: org.apache.hadoop.ipc.RemoteException: 
> /xxx/.hoodie/.temp/20220513175804790/kafka_ts=20220513/0041-c9e0-42a3-b267-28f2ada94f83_1-4-1_20220513175804790.parquet.marker.MERGE
>  for client  already exists



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4108) Clean the marker files before starting new flink compaction

2022-05-17 Thread loukey_j (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loukey_j updated HUDI-4108:
---
Summary: Clean the marker files before starting new flink compaction  (was: 
Clean the marker files before staring new flink compaction)

> Clean the marker files before starting new flink compaction
> ---
>
> Key: HUDI-4108
> URL: https://issues.apache.org/jira/browse/HUDI-4108
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: loukey_j
>Assignee: loukey_j
>Priority: Major
>
> Caused by: org.apache.hadoop.ipc.RemoteException: 
> /xxx/.hoodie/.temp/20220513175804790/kafka_ts=20220513/0041-c9e0-42a3-b267-28f2ada94f83_1-4-1_20220513175804790.parquet.marker.MERGE
>  for client  already exists



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4108) MOR table does not delete the marker directory when Flink compaction finishes

2022-05-17 Thread loukey_j (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loukey_j updated HUDI-4108:
---
Description: Caused by: org.apache.hadoop.ipc.RemoteException: 
/xxx/.hoodie/.temp/20220513175804790/kafka_ts=20220513/0041-c9e0-42a3-b267-28f2ada94f83_1-4-1_20220513175804790.parquet.marker.MERGE
 for client  already exists

> MOR table does not delete the marker directory when Flink compaction 
> finishes
> -
>
> Key: HUDI-4108
> URL: https://issues.apache.org/jira/browse/HUDI-4108
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: loukey_j
>Assignee: loukey_j
>Priority: Major
>
> Caused by: org.apache.hadoop.ipc.RemoteException: 
> /xxx/.hoodie/.temp/20220513175804790/kafka_ts=20220513/0041-c9e0-42a3-b267-28f2ada94f83_1-4-1_20220513175804790.parquet.marker.MERGE
>  for client  already exists



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (HUDI-4108) MOR table does not delete the marker directory when Flink compaction finishes

2022-05-17 Thread loukey_j (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loukey_j reassigned HUDI-4108:
--

Assignee: loukey_j

> MOR table does not delete the marker directory when Flink compaction 
> finishes
> -
>
> Key: HUDI-4108
> URL: https://issues.apache.org/jira/browse/HUDI-4108
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: loukey_j
>Assignee: loukey_j
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HUDI-4108) MOR table does not delete the marker directory when Flink compaction finishes

2022-05-17 Thread loukey_j (Jira)
loukey_j created HUDI-4108:
--

 Summary: MOR table does not delete the marker directory when Flink 
compaction finishes
 Key: HUDI-4108
 URL: https://issues.apache.org/jira/browse/HUDI-4108
 Project: Apache Hudi
  Issue Type: Bug
Reporter: loukey_j






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (HUDI-3962) flink cdc sink hudi failed to add hive partition fields for hive sync

2022-04-24 Thread loukey_j (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loukey_j reassigned HUDI-3962:
--

Assignee: loukey_j

> flink cdc sink hudi failed to add hive partition fields for hive sync
> -
>
> Key: HUDI-3962
> URL: https://issues.apache.org/jira/browse/HUDI-3962
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: yuehanwang
>Assignee: loukey_j
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> h1. flink cdc sink hudi failed to add hive partition fields for hive sync
>  
> Steps to reproduce the behavior:
> 1. Create a MySQL table like:
> ```
> CREATE TABLE `timeTypeTest` (
>   `id` int(11) NOT NULL AUTO_INCREMENT,
>   `datetime1` datetime DEFAULT NULL,
>   `date1` date DEFAULT NULL,
>   `datetime16` datetime(6) DEFAULT NULL,
>   `time16` time DEFAULT NULL,
>   `timestamp16` timestamp(6) NULL DEFAULT NULL,
>   `timestamp16Partition` varchar(45) DEFAULT NULL,
>   PRIMARY KEY (`id`),
>   UNIQUE KEY `id_UNIQUE` (`id`)
> ) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=latin1
> ```
> 2. Insert a row:
> `insert into mydb.timeTypeTest values ('2', '2020-07-30 10:08:22', 
> '2020-07-30', '2020-07-30 10:08:22.00', '10:08:22', '2020-07-30 
> 10:08:22.00', '2020-07-30')`
> 3. Start a Flink CDC job sinking to Hudi with the following config properties:
> ```
> --hive-sync-enable=true
> --hive-sync-jdbc-url=jdbc:hive2://localhost:1
> --hive-sync-db=testDb
> --hive-sync-table=testTable
> --record-key-field=id
> --partition-path-field=timestamp16
> --hive-sync-partition-fields=inc_day
> --hive-style-partitioning=true
> --hive-sync-mode=jdbc
> --hive-sync-username=hive
> --hive-sync-password=hive
> hoodie.deltastreamer.keygen.timebased.timestamp.type=EPOCHMILLISECONDS
> hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy-MM-dd
> hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled=true
> hive_sync.partition_extractor_class=org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator
> ```
> **Expected behavior**
> The job should create a Hive table testTable with a string partition field 
> _inc_day_ and add a partition "2020-07-30". But the actual partition field is 
> _timestamp16_ with bigint type.
> ```
> show partitions testTable;   -- "2020-07-30"
> select timestamp16 from testTable;   -- null
> ```
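
For illustration, a plain-JDK sketch of what the reporter expects the 
timestamp-based key generator to do with the settings above (hypothetical values; 
the real logic lives in org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator): 
turn an epoch-millis partition value into a yyyy-MM-dd partition path.

{code:java}
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class PartitionPathDemo {
    public static void main(String[] args) {
        // timestamp.type=EPOCHMILLISECONDS, output.dateformat=yyyy-MM-dd
        long timestamp16 = 1596103702000L; // 2020-07-30 10:08:22 UTC
        String incDay = DateTimeFormatter.ofPattern("yyyy-MM-dd")
            .withZone(ZoneId.of("UTC"))
            .format(Instant.ofEpochMilli(timestamp16));
        System.out.println(incDay); // 2020-07-30
    }
}
{code}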



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (HUDI-3962) flink cdc sink hudi failed to add hive partition fields for hive sync

2022-04-24 Thread loukey_j (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loukey_j reassigned HUDI-3962:
--

Assignee: (was: loukey_j)

> flink cdc sink hudi failed to add hive partition fields for hive sync
> -
>
> Key: HUDI-3962
> URL: https://issues.apache.org/jira/browse/HUDI-3962
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: yuehanwang
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> h1. flink cdc sink hudi failed to add hive partition fields for hive sync
>  
> Steps to reproduce the behavior:
> 1. Create a MySQL table like:
> ```
> CREATE TABLE `timeTypeTest` (
>   `id` int(11) NOT NULL AUTO_INCREMENT,
>   `datetime1` datetime DEFAULT NULL,
>   `date1` date DEFAULT NULL,
>   `datetime16` datetime(6) DEFAULT NULL,
>   `time16` time DEFAULT NULL,
>   `timestamp16` timestamp(6) NULL DEFAULT NULL,
>   `timestamp16Partition` varchar(45) DEFAULT NULL,
>   PRIMARY KEY (`id`),
>   UNIQUE KEY `id_UNIQUE` (`id`)
> ) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=latin1
> ```
> 2. Insert a row:
> `insert into mydb.timeTypeTest values ('2', '2020-07-30 10:08:22', 
> '2020-07-30', '2020-07-30 10:08:22.00', '10:08:22', '2020-07-30 
> 10:08:22.00', '2020-07-30')`
> 3. Start a Flink CDC job sinking to Hudi with the following config properties:
> ```
> --hive-sync-enable=true
> --hive-sync-jdbc-url=jdbc:hive2://localhost:1
> --hive-sync-db=testDb
> --hive-sync-table=testTable
> --record-key-field=id
> --partition-path-field=timestamp16
> --hive-sync-partition-fields=inc_day
> --hive-style-partitioning=true
> --hive-sync-mode=jdbc
> --hive-sync-username=hive
> --hive-sync-password=hive
> hoodie.deltastreamer.keygen.timebased.timestamp.type=EPOCHMILLISECONDS
> hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy-MM-dd
> hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled=true
> hive_sync.partition_extractor_class=org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator
> ```
> **Expected behavior**
> The job should create a Hive table testTable with a string partition field 
> _inc_day_ and add a partition "2020-07-30". But the actual partition field is 
> _timestamp16_ with bigint type.
> ```
> show partitions testTable;   -- "2020-07-30"
> select timestamp16 from testTable;   -- null
> ```



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (HUDI-3962) flink cdc sink hudi failed to add hive partition fields for hive sync

2022-04-24 Thread loukey_j (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loukey_j reassigned HUDI-3962:
--

Assignee: loukey_j

> flink cdc sink hudi failed to add hive partition fields for hive sync
> -
>
> Key: HUDI-3962
> URL: https://issues.apache.org/jira/browse/HUDI-3962
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: yuehanwang
>Assignee: loukey_j
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> h1. flink cdc sink hudi failed to add hive partition fields for hive sync
>  
> Steps to reproduce the behavior:
> 1. Create a MySQL table like:
> ```
> CREATE TABLE `timeTypeTest` (
>   `id` int(11) NOT NULL AUTO_INCREMENT,
>   `datetime1` datetime DEFAULT NULL,
>   `date1` date DEFAULT NULL,
>   `datetime16` datetime(6) DEFAULT NULL,
>   `time16` time DEFAULT NULL,
>   `timestamp16` timestamp(6) NULL DEFAULT NULL,
>   `timestamp16Partition` varchar(45) DEFAULT NULL,
>   PRIMARY KEY (`id`),
>   UNIQUE KEY `id_UNIQUE` (`id`)
> ) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=latin1
> ```
> 2. Insert a row:
> `insert into mydb.timeTypeTest values ('2', '2020-07-30 10:08:22', 
> '2020-07-30', '2020-07-30 10:08:22.00', '10:08:22', '2020-07-30 
> 10:08:22.00', '2020-07-30')`
> 3. Start a Flink CDC job sinking to Hudi with the following config properties:
> ```
> --hive-sync-enable=true
> --hive-sync-jdbc-url=jdbc:hive2://localhost:1
> --hive-sync-db=testDb
> --hive-sync-table=testTable
> --record-key-field=id
> --partition-path-field=timestamp16
> --hive-sync-partition-fields=inc_day
> --hive-style-partitioning=true
> --hive-sync-mode=jdbc
> --hive-sync-username=hive
> --hive-sync-password=hive
> hoodie.deltastreamer.keygen.timebased.timestamp.type=EPOCHMILLISECONDS
> hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy-MM-dd
> hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled=true
> hive_sync.partition_extractor_class=org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator
> ```
> **Expected behavior**
> The job should create a Hive table testTable with a string partition field 
> _inc_day_ and add a partition "2020-07-30". But the actual partition field is 
> _timestamp16_ with bigint type.
> ```
> show partitions testTable;   -- "2020-07-30"
> select timestamp16 from testTable;   -- null
> ```



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HUDI-3440) HoodieSortedMergeHandle#close writes data out of order

2022-02-16 Thread loukey_j (Jira)
loukey_j created HUDI-3440:
--

 Summary: HoodieSortedMergeHandle#close writes data out of order
 Key: HUDI-3440
 URL: https://issues.apache.org/jira/browse/HUDI-3440
 Project: Apache Hudi
  Issue Type: Task
Reporter: loukey_j
Assignee: loukey_j
 Fix For: 0.11.0


newRecordKeysSorted = new PriorityQueue<>();
newRecordKeysSorted.addAll(keyToNewRecords.keySet());
newRecordKeysSorted.stream().forEach(key -> {
  // ... write the record for this key ...
});

Iterating newRecordKeysSorted with stream().forEach() does not visit the keys in 
sorted order: a PriorityQueue only guarantees ordering when elements are removed 
with poll(), so close() writes the records out of key order.
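
A minimal sketch of the underlying JDK behavior (not the Hudi code): iterating a 
PriorityQueue, including via stream().forEach(), follows the internal heap layout, 
while only poll() yields elements in sorted order.

{code:java}
import java.util.List;
import java.util.PriorityQueue;

public class PriorityQueueOrderDemo {
    public static void main(String[] args) {
        PriorityQueue<String> keys = new PriorityQueue<>();
        keys.addAll(List.of("c", "a", "d", "b"));

        // Iteration order follows the heap layout, e.g. "a b d c" here.
        keys.stream().forEach(k -> System.out.print(k + " "));
        System.out.println();

        // Draining with poll() is what yields sorted order: "a b c d".
        while (!keys.isEmpty()) {
            System.out.print(keys.poll() + " ");
        }
    }
}
{code}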



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (HUDI-1421) Improvement of failure recovery for HoodieFlinkStreamer

2022-02-16 Thread loukey_j (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loukey_j closed HUDI-1421.
--
Resolution: Fixed

> Improvement of failure recovery for HoodieFlinkStreamer
> ---
>
> Key: HUDI-1421
> URL: https://issues.apache.org/jira/browse/HUDI-1421
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: wangxianghu#1
>Assignee: loukey_j
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-1931) BucketAssignFunction uses wrong state

2021-05-25 Thread loukey_j (Jira)
loukey_j created HUDI-1931:
--

 Summary: BucketAssignFunction uses wrong state
 Key: HUDI-1931
 URL: https://issues.apache.org/jira/browse/HUDI-1931
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Flink Integration
Reporter: loukey_j
Assignee: loukey_j
 Fix For: 0.9.0


org.apache.hudi.sink.partitioner.BucketAssignFunction#partitionLoadState

and 

org.apache.hudi.sink.partitioner.BucketAssignFunction#indexState

use the wrong state type: the output of RowDataToHoodieFunction is keyed by the 
record key, so indexState should be a ValueState.
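
A hedged Flink sketch of the point above (types simplified, not the Hudi code): 
once the stream is keyed by record key, per-key index state is naturally a 
ValueState, because Flink already scopes the state to the current key.

{code:java}
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Assumes the stream is keyed by record key (String); one location per key.
public class IndexStateSketch extends KeyedProcessFunction<String, String, String> {
    private transient ValueState<String> indexState;

    @Override
    public void open(Configuration parameters) {
        indexState = getRuntimeContext().getState(
            new ValueStateDescriptor<>("index-state", String.class));
    }

    @Override
    public void processElement(String record, Context ctx, Collector<String> out)
            throws Exception {
        // ValueState is implicitly scoped to ctx.getCurrentKey(), so no
        // map lookup by record key is needed.
        if (indexState.value() == null) {
            indexState.update("location-for-" + ctx.getCurrentKey());
        }
        out.collect(indexState.value());
    }
}
{code}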

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1421) Improvement of failure recovery for HoodieFlinkStreamer

2020-11-29 Thread loukey_j (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loukey_j reassigned HUDI-1421:
--

Assignee: loukey_j

> Improvement of failure recovery for HoodieFlinkStreamer
> ---
>
> Key: HUDI-1421
> URL: https://issues.apache.org/jira/browse/HUDI-1421
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: wangxianghu
>Assignee: loukey_j
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)