[jira] [Created] (HUDI-6532) Fix a typo in BaseFlinkCommitActionExecutor.

2023-07-14 Thread StarBoy1005 (Jira)
StarBoy1005 created HUDI-6532:
-

 Summary: Fix a typo in BaseFlinkCommitActionExecutor.
 Key: HUDI-6532
 URL: https://issues.apache.org/jira/browse/HUDI-6532
 Project: Apache Hudi
  Issue Type: Improvement
  Components: flink
Reporter: StarBoy1005
 Attachments: image-2023-07-14-18-06-04-273.png

An Iterator object is being created here; I think the word "upsetting" in the exception 
message is misleading (presumably it should be "upserting").

 !image-2023-07-14-18-06-04-273.png! 
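
For illustration, a minimal Java sketch of the kind of one-word change this seems to ask for; the class and helper methods below are hypothetical, and the exact wording should be checked against BaseFlinkCommitActionExecutor itself.

{code:java}
// Hypothetical sketch only; not the actual Hudi source.
public class UpsertMessageSketch {

  // Before: the exception text says "upsetting", which reads like a typo.
  static RuntimeException before(String bucketType, String partition) {
    return new IllegalStateException(
        "Error upsetting bucketType " + bucketType + " for partition :" + partition);
  }

  // After: "upserting" matches the upsert operation the executor is actually performing.
  static RuntimeException after(String bucketType, String partition) {
    return new IllegalStateException(
        "Error upserting bucketType " + bucketType + " for partition :" + partition);
  }
}
{code}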



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-6531) Minor adjustment to avoid creating an unneeded object in one case

2023-07-14 Thread StarBoy1005 (Jira)
StarBoy1005 created HUDI-6531:
-

 Summary: Minor adjustment to avoid creating an unneeded object in one case
 Key: HUDI-6531
 URL: https://issues.apache.org/jira/browse/HUDI-6531
 Project: Apache Hudi
  Issue Type: Improvement
  Components: metadata
Reporter: StarBoy1005
 Attachments: image-2023-07-14-15-07-52-617.png

hudi version:  0.14.0-SNAPSHOT

In the default configuration, "hoodie.assume.date.partitioning" is false, so this 
new object is not used by default. I guess we could move the spot where this 
object is created.

 !image-2023-07-14-15-07-52-617.png! 
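
A minimal sketch of the adjustment being proposed, under the assumption that the object in question is only needed on the hoodie.assume.date.partitioning=true path; the class and method names below are illustrative, not the actual Hudi metadata code.

{code:java}
// Hypothetical sketch only; not the actual Hudi source.
import java.text.SimpleDateFormat;
import java.util.Collections;
import java.util.Date;
import java.util.List;

public class LazyCreationSketch {

  // Before: the helper object is created unconditionally, even though it is only
  // needed when hoodie.assume.date.partitioning is true (false by default).
  static List<String> beforePartitions(boolean assumeDatePartitioning) {
    SimpleDateFormat fmt = new SimpleDateFormat("yyyy/MM/dd");
    if (assumeDatePartitioning) {
      return Collections.singletonList(fmt.format(new Date()));
    }
    return Collections.emptyList();
  }

  // After: creation is moved inside the branch, so the default path allocates nothing.
  static List<String> afterPartitions(boolean assumeDatePartitioning) {
    if (assumeDatePartitioning) {
      SimpleDateFormat fmt = new SimpleDateFormat("yyyy/MM/dd");
      return Collections.singletonList(fmt.format(new Date()));
    }
    return Collections.emptyList();
  }
}
{code}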



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6190) The default value of RECORD_KEY_FIELD leads to an inaccurate description in the checkRecordKey exception.

2023-05-08 Thread StarBoy1005 (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StarBoy1005 updated HUDI-6190:
--
Description: 
On the master branch, I use uuid as the record key field:

{code:java}
create table test_uuid_a (
   rid bigint,
   wr_returned_date_sk bigint
)
 WITH (
'hoodie.datasource.write.keygenerator.class'='org.apache.hudi.keygen.NonpartitionedKeyGenerator',
'hoodie.datasource.write.recordkey.field'='uuid',
'connector'='hudi',
'path'='/tmp/hudi_tpcds/test_uuid_a',
'table.type'='COPY_ON_WRITE',
'read.start-commit'='earliest',
'read.streaming.enabled'='false',
'changelog.enabled'='true',
'write.precombine'='false',
'write.precombine.field'='no_precombine',
'write.operation'='bulk_insert',
'read.tasks'='2',
'write.tasks'='2',
'index.type'='BUCKET',
'metadata.enabled'='false',
'hoodie.bucket.index.hash.field'='rid',
'hoodie.bucket.index.num.buckets'='2',
'index.global.enabled'='false'
);
{code}

And I got an exception:
 !image-2023-05-08-18-24-47-583.png! 

I guess the actual cause is that the "uuid" field is not in the table's schema.
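
A minimal sketch of the kind of check and message this seems to argue for, assuming the default value of RECORD_KEY_FIELD is "uuid" and that the clearer behaviour is to state explicitly that the configured record key field is missing from the table schema; the method below is illustrative only, not the actual checkRecordKey implementation.

{code:java}
// Hypothetical sketch only; not the actual Hudi source.
import java.util.List;

public class RecordKeyCheckSketch {

  static void checkRecordKey(String recordKeyField, List<String> schemaFields) {
    if (!schemaFields.contains(recordKeyField)) {
      // Pointing at the schema mismatch (and the default value) is clearer than a generic message.
      throw new IllegalArgumentException(
          "Record key field '" + recordKeyField + "' (default: 'uuid') does not exist in the "
              + "table schema " + schemaFields + "; declare the field in the table or set "
              + "'hoodie.datasource.write.recordkey.field' to an existing column.");
    }
  }

  public static void main(String[] args) {
    // Reproduces the reported case: the table only declares rid and wr_returned_date_sk.
    checkRecordKey("uuid", List.of("rid", "wr_returned_date_sk"));
  }
}
{code}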

  was:
On the master branch, I use uuid as the record key field:

{code:java}
create table test_uuid_a (
   rid bigint,
   wr_returned_date_sk bigint
)
 WITH (
'hoodie.datasource.write.recordkey.field'='uuid',
'connector'='hudi',
'path'='/tmp/hudi_tpcds/test_uuid_a',
'table.type'='COPY_ON_WRITE',
'read.start-commit'='earliest',
'read.streaming.enabled'='false',
'changelog.enabled'='true',
'write.precombine'='false',
'write.precombine.field'='no_precombine',
'write.operation'='bulk_insert',
'read.tasks'='2',
'write.tasks'='2',
'index.type'='BUCKET',
'metadata.enabled'='false',
'hoodie.bucket.index.hash.field'='rid',
'hoodie.bucket.index.num.buckets'='2',
'index.global.enabled'='false'
);
{code}

And I got an exception:
 !image-2023-05-08-18-24-47-583.png! 

I guess the actual cause is that the "uuid" field is not in the table's schema.


> The default value of RECORD_KEY_FIELD leads to an inaccurate description in 
> the checkRecordKey exception.
> 
>
> Key: HUDI-6190
> URL: https://issues.apache.org/jira/browse/HUDI-6190
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink-sql
>Reporter: StarBoy1005
>Priority: Minor
> Attachments: image-2023-05-08-18-24-47-583.png
>
>
> On the master branch, I use uuid as the record key field:
> {code:java}
> create table test_uuid_a (
>rid bigint,
>wr_returned_date_sk bigint
> )
>  WITH (
> 'hoodie.datasource.write.keygenerator.class'='org.apache.hudi.keygen.NonpartitionedKeyGenerator',
> 'hoodie.datasource.write.recordkey.field'='uuid',
> 'connector'='hudi',
> 'path'='/tmp/hudi_tpcds/test_uuid_a',
> 'table.type'='COPY_ON_WRITE',
> 'read.start-commit'='earliest',
> 'read.streaming.enabled'='false',
> 'changelog.enabled'='true',
> 'write.precombine'='false',
> 'write.precombine.field'='no_precombine',
> 'write.operation'='bulk_insert',
> 'read.tasks'='2',
> 'write.tasks'='2',
> 'index.type'='BUCKET',
> 'metadata.enabled'='false',
> 'hoodie.bucket.index.hash.field'='rid',
> 'hoodie.bucket.index.num.buckets'='2',
> 'index.global.enabled'='false'
> );
> {code}
> And I got an exception:
>  !image-2023-05-08-18-24-47-583.png! 
> I guess the actual cause is that the "uuid" field is not in the table's schema.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-6190) The default value of RECORD_KEY_FIELD leads to an inaccurate description in the checkRecordKey exception.

2023-05-08 Thread StarBoy1005 (Jira)
StarBoy1005 created HUDI-6190:
-

 Summary: The default value of RECORD_KEY_FIELD leads to an inaccurate 
description in the checkRecordKey exception.
 Key: HUDI-6190
 URL: https://issues.apache.org/jira/browse/HUDI-6190
 Project: Apache Hudi
  Issue Type: Improvement
  Components: flink-sql
Reporter: StarBoy1005
 Attachments: image-2023-05-08-18-24-47-583.png

On the master branch, I use uuid as the record key field:

{code:java}
create table test_uuid_a (
   rid bigint,
   wr_returned_date_sk bigint
)
 WITH (
'hoodie.datasource.write.recordkey.field'='uuid',
'connector'='hudi',
'path'='/tmp/hudi_tpcds/test_uuid_a',
'table.type'='COPY_ON_WRITE',
'read.start-commit'='earliest',
'read.streaming.enabled'='false',
'changelog.enabled'='true',
'write.precombine'='false',
'write.precombine.field'='no_precombine',
'write.operation'='bulk_insert',
'read.tasks'='2',
'write.tasks'='2',
'index.type'='BUCKET',
'metadata.enabled'='false',
'hoodie.bucket.index.hash.field'='rid',
'hoodie.bucket.index.num.buckets'='2',
'index.global.enabled'='false'
);
{code}

And I got an exception:
 !image-2023-05-08-18-24-47-583.png! 

I guess the actual cause is that the "uuid" field is not in the table's schema.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HUDI-4459) Corrupt parquet file created when syncing huge table with 4000+ fields, using hudi cow table with bulk_insert type

2023-04-18 Thread StarBoy1005 (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17713390#comment-17713390
 ] 

StarBoy1005 edited comment on HUDI-4459 at 4/18/23 6:32 AM:


Hi! I ran into a problem. I am using Flink 1.14.5 and Hudi 0.13.0 to read a CSV file 
from HDFS and sink it into a Hudi COW table. In both streaming and batch mode, if I 
use bulk_insert the job can't finish; the instant always stays in the inflight state.
This is my COW table DDL:

create table web_returns_cow (
   rid bigint PRIMARY KEY NOT ENFORCED,
   wr_returned_date_sk bigint,
   wr_returned_time_sk bigint,
   wr_item_sk bigint,
   wr_refunded_customer_sk bigint,
   wr_refunded_cdemo_sk bigint,
   wr_refunded_hdemo_sk bigint,
   wr_refunded_addr_sk bigint,
   wr_returning_customer_sk bigint,
   wr_returning_cdemo_sk bigint,
   wr_returning_hdemo_sk bigint,
   wr_returning_addr_sk bigint,
   wr_web_page_sk bigint,
   wr_reason_sk bigint,
   wr_order_number bigint,
   wr_return_quantity int,
   wr_return_amt float,
   wr_return_tax float,
   wr_return_amt_inc_tax float,
   wr_fee float,
   wr_return_ship_cost float,
   wr_refunded_cash float,
   wr_reversed_charge float,
   wr_account_credit float,
   wr_net_loss float
)
PARTITIONED BY (`wr_returned_date_sk`)
 WITH (
'connector'='hudi',
'path'='/tmp/data_gen/web_returns_cow',
'table.type'='COPY_ON_WRITE',
'read.start-commit'='earliest',
'read.streaming.enabled'='false',
'changelog.enabled'='true',
'write.precombine'='false',
'write.precombine.field'='no_precombine',
'write.operation'='bulk_insert',
'read.tasks'='5',
'write.tasks'='10',
'index.type'='BUCKET',
'metadata.enabled'='false',
'hoodie.bucket.index.hash.field'='rid',
'hoodie.bucket.index.num.buckets'='10',
'index.global.enabled'='false'
);


was (Author: JIRAUSER289640):
Hi! I ran into a problem. I am using Flink 1.14.5 and Hudi 1.13.0 to read a CSV file 
from HDFS and sink it into a Hudi COW table. In both streaming and batch mode, if I 
use bulk_insert the job can't finish; the instant always stays in the inflight state.
This is my COW table DDL:

create table web_returns_cow (
   rid bigint PRIMARY KEY NOT ENFORCED,
   wr_returned_date_sk bigint,
   wr_returned_time_sk bigint,
   wr_item_sk bigint,
   wr_refunded_customer_sk bigint,
   wr_refunded_cdemo_sk bigint,
   wr_refunded_hdemo_sk bigint,
   wr_refunded_addr_sk bigint,
   wr_returning_customer_sk bigint,
   wr_returning_cdemo_sk bigint,
   wr_returning_hdemo_sk bigint,
   wr_returning_addr_sk bigint,
   wr_web_page_sk bigint,
   wr_reason_sk bigint,
   wr_order_number bigint,
   wr_return_quantity int,
   wr_return_amt float,
   wr_return_tax float,
   wr_return_amt_inc_tax float,
   wr_fee float,
   wr_return_ship_cost float,
   wr_refunded_cash float,
   wr_reversed_charge float,
   wr_account_credit float,
   wr_net_loss float
)
PARTITIONED BY (`wr_returned_date_sk`)
 WITH (
'connector'='hudi',
'path'='/tmp/data_gen/web_returns_cow',
'table.type'='COPY_ON_WRITE',
'read.start-commit'='earliest',
'read.streaming.enabled'='false',
'changelog.enabled'='true',
'write.precombine'='false',
'write.precombine.field'='no_precombine',
'write.operation'='bulk_insert',
'read.tasks'='5',
'write.tasks'='10',
'index.type'='BUCKET',
'metadata.enabled'='false',
'hoodie.bucket.index.hash.field'='rid',
'hoodie.bucket.index.num.buckets'='10',
'index.global.enabled'='false'
);

> Corrupt parquet file created when syncing huge table with 4000+ fields, using 
> hudi cow table with bulk_insert type
> -
>
> Key: HUDI-4459
> URL: https://issues.apache.org/jira/browse/HUDI-4459
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Leo zhang
>Assignee: Rajesh Mahindra
>Priority: Major
> Attachments: statements.sql, table.ddl
>
>
> I am trying to sync a huge table with 4000+ fields into Hudi, using a COW table 
> with the bulk_insert operation type.
> The job finishes without any exception, but when I try to read data 
> from the table, I get an empty result. The parquet file is corrupted and can't be 
> read correctly.
> I tried to trace the problem and found it was caused by SortOperator. 
> After the record is serialized in the sorter, all the fields get out of order and 
> are deserialized into one field. Finally the wrong record is written into the 
> parquet file, making the file unreadable.
> Here are a few steps to reproduce the bug in the Flink sql-client:
> 1. Execute the table DDL (provided in the table.ddl file in the attachments)
> 2. Execute the insert statement (provided in the statement.sql file in the 
> attachments)
> 3. Execute a select statement to query the Hudi table (provided in the 
> statement.sql file in the attachments)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HUDI-4459) Corrupt parquet file created when syncing huge table with 4000+ fields, using hudi cow table with bulk_insert type

2023-04-17 Thread StarBoy1005 (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17713390#comment-17713390
 ] 

StarBoy1005 edited comment on HUDI-4459 at 4/18/23 5:08 AM:


Hi! I ran into a problem. I am using Flink 1.14.5 and Hudi 1.13.0 to read a CSV file 
from HDFS and sink it into a Hudi COW table. In both streaming and batch mode, if I 
use bulk_insert the job can't finish; the instant always stays in the inflight state.
This is my COW table DDL:

create table web_returns_cow (
   rid bigint PRIMARY KEY NOT ENFORCED,
   wr_returned_date_sk bigint,
   wr_returned_time_sk bigint,
   wr_item_sk bigint,
   wr_refunded_customer_sk bigint,
   wr_refunded_cdemo_sk bigint,
   wr_refunded_hdemo_sk bigint,
   wr_refunded_addr_sk bigint,
   wr_returning_customer_sk bigint,
   wr_returning_cdemo_sk bigint,
   wr_returning_hdemo_sk bigint,
   wr_returning_addr_sk bigint,
   wr_web_page_sk bigint,
   wr_reason_sk bigint,
   wr_order_number bigint,
   wr_return_quantity int,
   wr_return_amt float,
   wr_return_tax float,
   wr_return_amt_inc_tax float,
   wr_fee float,
   wr_return_ship_cost float,
   wr_refunded_cash float,
   wr_reversed_charge float,
   wr_account_credit float,
   wr_net_loss float
)
PARTITIONED BY (`wr_returned_date_sk`)
 WITH (
'connector'='hudi',
'path'='/tmp/data_gen/web_returns_cow',
'table.type'='COPY_ON_WRITE',
'read.start-commit'='earliest',
'read.streaming.enabled'='false',
'changelog.enabled'='true',
'write.precombine'='false',
'write.precombine.field'='no_precombine',
'write.operation'='bulk_insert',
'read.tasks'='5',
'write.tasks'='10',
'index.type'='BUCKET',
'metadata.enabled'='false',
'hoodie.bucket.index.hash.field'='rid',
'hoodie.bucket.index.num.buckets'='10',
'index.global.enabled'='false'
);


was (Author: JIRAUSER289640):
Hi! I ran into a problem. I am using Flink 1.14.5 and Hudi 1.13.0 to read a CSV file 
from HDFS and sink it into a Hudi COW table. In both streaming and batch mode, if I 
use bulk_insert the job can't finish; the instant always stays in the inflight state.
This is my COW table DDL:

create table web_returns_cow (
   rid bigint PRIMARY KEY NOT ENFORCED,
   wr_returned_date_sk bigint,
   wr_returned_time_sk bigint,
   wr_item_sk bigint,
   wr_refunded_customer_sk bigint,
   wr_refunded_cdemo_sk bigint,
   wr_refunded_hdemo_sk bigint,
   wr_refunded_addr_sk bigint,
   wr_returning_customer_sk bigint,
   wr_returning_cdemo_sk bigint,
   wr_returning_hdemo_sk bigint,
   wr_returning_addr_sk bigint,
   wr_web_page_sk bigint,
   wr_reason_sk bigint,
   wr_order_number bigint,
   wr_return_quantity int,
   wr_return_amt float,
   wr_return_tax float,
   wr_return_amt_inc_tax float,
   wr_fee float,
   wr_return_ship_cost float,
   wr_refunded_cash float,
   wr_reversed_charge float,
   wr_account_credit float,
   wr_net_loss float
)
PARTITIONED BY (`wr_returned_date_sk`)
 WITH (
'connector'='hudi',
'path'='/tmp/data_gen/web_returns_cow',
'table.type'='COPY_ON_WRITE',
'read.start-commit'='earliest',
'read.streaming.enabled'='false',
'changelog.enabled'='true',
'write.precombine'='false',
'write.precombine.field'='no_precombine',
'write.operation'='insert',
'read.tasks'='5',
'write.tasks'='10',
'index.type'='BUCKET',
'metadata.enabled'='false',
'hoodie.bucket.index.hash.field'='rid',
'hoodie.bucket.index.num.buckets'='10',
'index.global.enabled'='false'
);

> Corrupt parquet file created when syncing huge table with 4000+ fields, using 
> hudi cow table with bulk_insert type
> -
>
> Key: HUDI-4459
> URL: https://issues.apache.org/jira/browse/HUDI-4459
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Leo zhang
>Assignee: Rajesh Mahindra
>Priority: Major
> Attachments: statements.sql, table.ddl
>
>
> I am trying to sync a huge table with 4000+ fields into Hudi, using a COW table 
> with the bulk_insert operation type.
> The job finishes without any exception, but when I try to read data 
> from the table, I get an empty result. The parquet file is corrupted and can't be 
> read correctly.
> I tried to trace the problem and found it was caused by SortOperator. 
> After the record is serialized in the sorter, all the fields get out of order and 
> are deserialized into one field. Finally the wrong record is written into the 
> parquet file, making the file unreadable.
> Here are a few steps to reproduce the bug in the Flink sql-client:
> 1. Execute the table DDL (provided in the table.ddl file in the attachments)
> 2. Execute the insert statement (provided in the statement.sql file in the 
> attachments)
> 3. Execute a select statement to query the Hudi table (provided in the 
> statement.sql file in the attachments)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-4459) Corrupt parquet file created when syncing huge table with 4000+ fields, using hudi cow table with bulk_insert type

2023-04-17 Thread StarBoy1005 (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17713390#comment-17713390
 ] 

StarBoy1005 commented on HUDI-4459:
---

Hi! I ran into a problem. I am using Flink 1.14.5 and Hudi 1.13.0 to read a CSV file 
from HDFS and sink it into a Hudi COW table. In both streaming and batch mode, if I 
use bulk_insert the job can't finish; the instant always stays in the inflight state.
This is my COW table DDL:

create table web_returns_cow (
   rid bigint PRIMARY KEY NOT ENFORCED,
   wr_returned_date_sk bigint,
   wr_returned_time_sk bigint,
   wr_item_sk bigint,
   wr_refunded_customer_sk bigint,
   wr_refunded_cdemo_sk bigint,
   wr_refunded_hdemo_sk bigint,
   wr_refunded_addr_sk bigint,
   wr_returning_customer_sk bigint,
   wr_returning_cdemo_sk bigint,
   wr_returning_hdemo_sk bigint,
   wr_returning_addr_sk bigint,
   wr_web_page_sk bigint,
   wr_reason_sk bigint,
   wr_order_number bigint,
   wr_return_quantity int,
   wr_return_amt float,
   wr_return_tax float,
   wr_return_amt_inc_tax float,
   wr_fee float,
   wr_return_ship_cost float,
   wr_refunded_cash float,
   wr_reversed_charge float,
   wr_account_credit float,
   wr_net_loss float
)
PARTITIONED BY (`wr_returned_date_sk`)
 WITH (
'connector'='hudi',
'path'='/tmp/data_gen/web_returns_cow',
'table.type'='COPY_ON_WRITE',
'read.start-commit'='earliest',
'read.streaming.enabled'='false',
'changelog.enabled'='true',
'write.precombine'='false',
'write.precombine.field'='no_precombine',
'write.operation'='insert',
'read.tasks'='5',
'write.tasks'='10',
'index.type'='BUCKET',
'metadata.enabled'='false',
'hoodie.bucket.index.hash.field'='rid',
'hoodie.bucket.index.num.buckets'='10',
'index.global.enabled'='false'
);

> Corrupt parquet file created when syncing huge table with 4000+ fields, using 
> hudi cow table with bulk_insert type
> -
>
> Key: HUDI-4459
> URL: https://issues.apache.org/jira/browse/HUDI-4459
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Leo zhang
>Assignee: Rajesh Mahindra
>Priority: Major
> Attachments: statements.sql, table.ddl
>
>
> I am trying to sync a huge table with 4000+ fields into Hudi, using a COW table 
> with the bulk_insert operation type.
> The job finishes without any exception, but when I try to read data 
> from the table, I get an empty result. The parquet file is corrupted and can't be 
> read correctly.
> I tried to trace the problem and found it was caused by SortOperator. 
> After the record is serialized in the sorter, all the fields get out of order and 
> are deserialized into one field. Finally the wrong record is written into the 
> parquet file, making the file unreadable.
> Here are a few steps to reproduce the bug in the Flink sql-client:
> 1. Execute the table DDL (provided in the table.ddl file in the attachments)
> 2. Execute the insert statement (provided in the statement.sql file in the 
> attachments)
> 3. Execute a select statement to query the Hudi table (provided in the 
> statement.sql file in the attachments)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)