[jira] [Created] (HUDI-6532) Fix a typo in BaseFlinkCommitActionExecutor.
StarBoy1005 created HUDI-6532:
---------------------------------

             Summary: Fix a typo in BaseFlinkCommitActionExecutor.
                 Key: HUDI-6532
                 URL: https://issues.apache.org/jira/browse/HUDI-6532
             Project: Apache Hudi
          Issue Type: Improvement
          Components: flink
            Reporter: StarBoy1005
         Attachments: image-2023-07-14-18-06-04-273.png

An Iterator object is created here, and I think the word "upsetting" in the exception message is misleading.

!image-2023-07-14-18-06-04-273.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
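The change this asks for is a one-word fix in the exception text. The snippet below is only an illustrative sketch of that wording fix, not the actual BaseFlinkCommitActionExecutor code: the method name, parameters, and the plain RuntimeException standing in for Hudi's own exception type are all assumptions.

{code:java}
// Illustrative sketch only -- the method name, parameters and exception type are
// assumptions, not the real Hudi code. The point is the wording of the message:
// "upserting" (the operation being performed) rather than "upsetting".
static RuntimeException upsertFailure(String bucketType, String partitionPath, Throwable cause) {
  return new RuntimeException(
      "Error upserting bucketType " + bucketType + " for partition: " + partitionPath, cause);
}
{code}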
[jira] [Created] (HUDI-6531) Minor adjustment to avoid creating an unneeded object in one case
StarBoy1005 created HUDI-6531:
---------------------------------

             Summary: Minor adjustment to avoid creating an unneeded object in one case
                 Key: HUDI-6531
                 URL: https://issues.apache.org/jira/browse/HUDI-6531
             Project: Apache Hudi
          Issue Type: Improvement
          Components: metadata
            Reporter: StarBoy1005
         Attachments: image-2023-07-14-15-07-52-617.png

hudi version: 0.14.0-SNAPSHOT

In the default configuration, "hoodie.assume.date.partitioning" is false, so this new object is never used by default. I suggest moving the spot where the object is created.

!image-2023-07-14-15-07-52-617.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
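A minimal sketch of the adjustment the report suggests, under the assumption that the object is only needed on the date-partitioning code path: construct it inside the branch that uses it, so nothing is allocated when "hoodie.assume.date.partitioning" is false. All class and method names below are hypothetical stand-ins, not the real Hudi metadata code.

{code:java}
// Hypothetical sketch; DatePartitionHelper and PartitionResolver are made-up names
// that illustrate the idea, not actual Hudi classes.
class DatePartitionHelper {
  String normalize(String path) {
    // stand-in normalization logic for the sketch
    return path.replace('-', '/');
  }
}

class PartitionResolver {
  String resolve(boolean assumeDatePartitioning, String partitionPath) {
    if (assumeDatePartitioning) {
      // the helper is created only on the code path that needs it,
      // instead of unconditionally before the check
      DatePartitionHelper helper = new DatePartitionHelper();
      return helper.normalize(partitionPath);
    }
    return partitionPath;
  }
}
{code}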
[jira] [Updated] (HUDI-6190) The default value of RECORD_KEY_FIELD makes the exception message from checkRecordKey inaccurate.
[ https://issues.apache.org/jira/browse/HUDI-6190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

StarBoy1005 updated HUDI-6190:
------------------------------
    Description: 
On the master branch, I use uuid as the record key field:

{code:java}
create table test_uuid_a (
  rid bigint,
  wr_returned_date_sk bigint
) WITH (
  'hoodie.datasource.write.keygenerator.class'='org.apache.hudi.keygen.NonpartitionedKeyGenerator',
  'hoodie.datasource.write.recordkey.field'='uuid',
  'connector'='hudi',
  'path'='/tmp/hudi_tpcds/test_uuid_a',
  'table.type'='COPY_ON_WRITE',
  'read.start-commit'='earliest',
  'read.streaming.enabled'='false',
  'changelog.enabled'='true',
  'write.precombine'='false',
  'write.precombine.field'='no_precombine',
  'write.operation'='bulk_insert',
  'read.tasks'='2',
  'write.tasks'='2',
  'index.type'='BUCKET',
  'metadata.enabled'='false',
  'hoodie.bucket.index.hash.field'='rid',
  'hoodie.bucket.index.num.buckets'='2',
  'index.global.enabled'='false'
);
{code}

And I got an exception:

!image-2023-05-08-18-24-47-583.png!

I think the real reason is that the "uuid" field is not in the table's schema.

  was:
On the master branch, I use uuid as the record key field:

{code:java}
create table test_uuid_a (
  rid bigint,
  wr_returned_date_sk bigint
) WITH (
  'hoodie.datasource.write.recordkey.field'='uuid',
  'connector'='hudi',
  'path'='/tmp/hudi_tpcds/test_uuid_a',
  'table.type'='COPY_ON_WRITE',
  'read.start-commit'='earliest',
  'read.streaming.enabled'='false',
  'changelog.enabled'='true',
  'write.precombine'='false',
  'write.precombine.field'='no_precombine',
  'write.operation'='bulk_insert',
  'read.tasks'='2',
  'write.tasks'='2',
  'index.type'='BUCKET',
  'metadata.enabled'='false',
  'hoodie.bucket.index.hash.field'='rid',
  'hoodie.bucket.index.num.buckets'='2',
  'index.global.enabled'='false'
);
{code}

And I got an exception:

!image-2023-05-08-18-24-47-583.png!

I think the real reason is that the "uuid" field is not in the table's schema.


> The default value of RECORD_KEY_FIELD makes the exception message from checkRecordKey inaccurate.
>
>
>                 Key: HUDI-6190
>                 URL: https://issues.apache.org/jira/browse/HUDI-6190
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: flink-sql
>            Reporter: StarBoy1005
>            Priority: Minor
>         Attachments: image-2023-05-08-18-24-47-583.png
>
>
> On the master branch, I use uuid as the record key field:
> {code:java}
> create table test_uuid_a (
>   rid bigint,
>   wr_returned_date_sk bigint
> ) WITH (
>   'hoodie.datasource.write.keygenerator.class'='org.apache.hudi.keygen.NonpartitionedKeyGenerator',
>   'hoodie.datasource.write.recordkey.field'='uuid',
>   'connector'='hudi',
>   'path'='/tmp/hudi_tpcds/test_uuid_a',
>   'table.type'='COPY_ON_WRITE',
>   'read.start-commit'='earliest',
>   'read.streaming.enabled'='false',
>   'changelog.enabled'='true',
>   'write.precombine'='false',
>   'write.precombine.field'='no_precombine',
>   'write.operation'='bulk_insert',
>   'read.tasks'='2',
>   'write.tasks'='2',
>   'index.type'='BUCKET',
>   'metadata.enabled'='false',
>   'hoodie.bucket.index.hash.field'='rid',
>   'hoodie.bucket.index.num.buckets'='2',
>   'index.global.enabled'='false'
> );
> {code}
> And I got an exception:
> !image-2023-05-08-18-24-47-583.png!
> I think the real reason is that the "uuid" field is not in the table's schema.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Created] (HUDI-6190) The default value of RECORD_KEY_FIELD makes the exception message from checkRecordKey inaccurate.
StarBoy1005 created HUDI-6190:
---------------------------------

             Summary: The default value of RECORD_KEY_FIELD makes the exception message from checkRecordKey inaccurate.
                 Key: HUDI-6190
                 URL: https://issues.apache.org/jira/browse/HUDI-6190
             Project: Apache Hudi
          Issue Type: Improvement
          Components: flink-sql
            Reporter: StarBoy1005
         Attachments: image-2023-05-08-18-24-47-583.png

On the master branch, I use uuid as the record key field:

{code:java}
create table test_uuid_a (
  rid bigint,
  wr_returned_date_sk bigint
) WITH (
  'hoodie.datasource.write.recordkey.field'='uuid',
  'connector'='hudi',
  'path'='/tmp/hudi_tpcds/test_uuid_a',
  'table.type'='COPY_ON_WRITE',
  'read.start-commit'='earliest',
  'read.streaming.enabled'='false',
  'changelog.enabled'='true',
  'write.precombine'='false',
  'write.precombine.field'='no_precombine',
  'write.operation'='bulk_insert',
  'read.tasks'='2',
  'write.tasks'='2',
  'index.type'='BUCKET',
  'metadata.enabled'='false',
  'hoodie.bucket.index.hash.field'='rid',
  'hoodie.bucket.index.num.buckets'='2',
  'index.global.enabled'='false'
);
{code}

And I got an exception:

!image-2023-05-08-18-24-47-583.png!

I think the real reason is that the "uuid" field is not in the table's schema.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
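The clearer behaviour the report hints at would be for the record-key check to compare the configured field against the table schema and say exactly that when it is missing. A minimal sketch under that assumption (the class, method, and message text below are hypothetical, not the actual Hudi checkRecordKey implementation):

{code:java}
import java.util.List;

// Hypothetical sketch, not the actual Hudi checkRecordKey: report that the
// configured record key field is absent from the table schema, instead of a
// message shaped by the default value of RECORD_KEY_FIELD.
public class RecordKeyCheck {
  static void checkRecordKey(List<String> schemaFields, String recordKeyField) {
    if (!schemaFields.contains(recordKeyField)) {
      throw new IllegalArgumentException(
          "Record key field '" + recordKeyField + "' does not exist in the table schema "
              + schemaFields + "; check 'hoodie.datasource.write.recordkey.field'.");
    }
  }

  public static void main(String[] args) {
    // With the DDL above the schema only contains rid and wr_returned_date_sk,
    // so configuring recordkey.field='uuid' fails with the message above.
    checkRecordKey(List.of("rid", "wr_returned_date_sk"), "uuid");
  }
}
{code}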
[jira] [Comment Edited] (HUDI-4459) Corrupt parquet file created when syncing huge table with 4000+ fields, using hudi cow table with bulk_insert type
[ https://issues.apache.org/jira/browse/HUDI-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713390#comment-17713390 ]

StarBoy1005 edited comment on HUDI-4459 at 4/18/23 6:32 AM:
------------------------------------------------------------

Hi! I ran into a problem. I use flink 1.14.5 and hudi 0.13.0 to read a csv file from hdfs and sink it into a hudi cow table. In both streaming mode and batch mode, if I use bulk_insert, the job can't finish and the instant always stays in an inflight state.

This is my cow table ddl:

create table web_returns_cow (
  rid bigint PRIMARY KEY NOT ENFORCED,
  wr_returned_date_sk bigint,
  wr_returned_time_sk bigint,
  wr_item_sk bigint,
  wr_refunded_customer_sk bigint,
  wr_refunded_cdemo_sk bigint,
  wr_refunded_hdemo_sk bigint,
  wr_refunded_addr_sk bigint,
  wr_returning_customer_sk bigint,
  wr_returning_cdemo_sk bigint,
  wr_returning_hdemo_sk bigint,
  wr_returning_addr_sk bigint,
  wr_web_page_sk bigint,
  wr_reason_sk bigint,
  wr_order_number bigint,
  wr_return_quantity int,
  wr_return_amt float,
  wr_return_tax float,
  wr_return_amt_inc_tax float,
  wr_fee float,
  wr_return_ship_cost float,
  wr_refunded_cash float,
  wr_reversed_charge float,
  wr_account_credit float,
  wr_net_loss float
) PARTITIONED BY (`wr_returned_date_sk`) WITH (
  'connector'='hudi',
  'path'='/tmp/data_gen/web_returns_cow',
  'table.type'='COPY_ON_WRITE',
  'read.start-commit'='earliest',
  'read.streaming.enabled'='false',
  'changelog.enabled'='true',
  'write.precombine'='false',
  'write.precombine.field'='no_precombine',
  'write.operation'='bulk_insert',
  'read.tasks'='5',
  'write.tasks'='10',
  'index.type'='BUCKET',
  'metadata.enabled'='false',
  'hoodie.bucket.index.hash.field'='rid',
  'hoodie.bucket.index.num.buckets'='10',
  'index.global.enabled'='false'
);


was (Author: JIRAUSER289640):
Hi! I ran into a problem. I use flink 1.14.5 and hudi 1.13.0 to read a csv file from hdfs and sink it into a hudi cow table. In both streaming mode and batch mode, if I use bulk_insert, the job can't finish and the instant always stays in an inflight state.

This is my cow table ddl:

create table web_returns_cow (
  rid bigint PRIMARY KEY NOT ENFORCED,
  wr_returned_date_sk bigint,
  wr_returned_time_sk bigint,
  wr_item_sk bigint,
  wr_refunded_customer_sk bigint,
  wr_refunded_cdemo_sk bigint,
  wr_refunded_hdemo_sk bigint,
  wr_refunded_addr_sk bigint,
  wr_returning_customer_sk bigint,
  wr_returning_cdemo_sk bigint,
  wr_returning_hdemo_sk bigint,
  wr_returning_addr_sk bigint,
  wr_web_page_sk bigint,
  wr_reason_sk bigint,
  wr_order_number bigint,
  wr_return_quantity int,
  wr_return_amt float,
  wr_return_tax float,
  wr_return_amt_inc_tax float,
  wr_fee float,
  wr_return_ship_cost float,
  wr_refunded_cash float,
  wr_reversed_charge float,
  wr_account_credit float,
  wr_net_loss float
) PARTITIONED BY (`wr_returned_date_sk`) WITH (
  'connector'='hudi',
  'path'='/tmp/data_gen/web_returns_cow',
  'table.type'='COPY_ON_WRITE',
  'read.start-commit'='earliest',
  'read.streaming.enabled'='false',
  'changelog.enabled'='true',
  'write.precombine'='false',
  'write.precombine.field'='no_precombine',
  'write.operation'='bulk_insert',
  'read.tasks'='5',
  'write.tasks'='10',
  'index.type'='BUCKET',
  'metadata.enabled'='false',
  'hoodie.bucket.index.hash.field'='rid',
  'hoodie.bucket.index.num.buckets'='10',
  'index.global.enabled'='false'
);


> Corrupt parquet file created when syncing huge table with 4000+ fields, using
> hudi cow table with bulk_insert type
> -------------------------------------------------------------------------------
>
>                 Key: HUDI-4459
>                 URL: https://issues.apache.org/jira/browse/HUDI-4459
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Leo zhang
>            Assignee: Rajesh Mahindra
>            Priority: Major
>         Attachments: statements.sql, table.ddl
>
>
> I am trying to sync a huge table with 4000+ fields into hudi, using a cow table
> with the bulk_insert operation type.
> The job finishes without any exception, but when I try to read data from the
> table, I get an empty result. The parquet file is corrupted and can't be read
> correctly.
> I traced the problem and found it was caused by SortOperator. After a record is
> serialized in the sorter, all the fields get disordered and are deserialized
> into one field. Finally the wrong record is written into the parquet file,
> making it unreadable.
> Here are a few steps to reproduce the bug in the flink sql-client:
> 1. execute the table ddl (provided in the table.ddl file in the attachments)
> 2. execute the insert statement (provided in the statements.sql file in the attachments)
> 3. execute a select statement to query the hudi table (provided in the statements.sql file in the attachments)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Comment Edited] (HUDI-4459) Corrupt parquet file created when syncing huge table with 4000+ fields, using hudi cow table with bulk_insert type
[ https://issues.apache.org/jira/browse/HUDI-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713390#comment-17713390 ]

StarBoy1005 edited comment on HUDI-4459 at 4/18/23 5:08 AM:
------------------------------------------------------------

Hi! I ran into a problem. I use flink 1.14.5 and hudi 1.13.0 to read a csv file from hdfs and sink it into a hudi cow table. In both streaming mode and batch mode, if I use bulk_insert, the job can't finish and the instant always stays in an inflight state.

This is my cow table ddl:

create table web_returns_cow (
  rid bigint PRIMARY KEY NOT ENFORCED,
  wr_returned_date_sk bigint,
  wr_returned_time_sk bigint,
  wr_item_sk bigint,
  wr_refunded_customer_sk bigint,
  wr_refunded_cdemo_sk bigint,
  wr_refunded_hdemo_sk bigint,
  wr_refunded_addr_sk bigint,
  wr_returning_customer_sk bigint,
  wr_returning_cdemo_sk bigint,
  wr_returning_hdemo_sk bigint,
  wr_returning_addr_sk bigint,
  wr_web_page_sk bigint,
  wr_reason_sk bigint,
  wr_order_number bigint,
  wr_return_quantity int,
  wr_return_amt float,
  wr_return_tax float,
  wr_return_amt_inc_tax float,
  wr_fee float,
  wr_return_ship_cost float,
  wr_refunded_cash float,
  wr_reversed_charge float,
  wr_account_credit float,
  wr_net_loss float
) PARTITIONED BY (`wr_returned_date_sk`) WITH (
  'connector'='hudi',
  'path'='/tmp/data_gen/web_returns_cow',
  'table.type'='COPY_ON_WRITE',
  'read.start-commit'='earliest',
  'read.streaming.enabled'='false',
  'changelog.enabled'='true',
  'write.precombine'='false',
  'write.precombine.field'='no_precombine',
  'write.operation'='bulk_insert',
  'read.tasks'='5',
  'write.tasks'='10',
  'index.type'='BUCKET',
  'metadata.enabled'='false',
  'hoodie.bucket.index.hash.field'='rid',
  'hoodie.bucket.index.num.buckets'='10',
  'index.global.enabled'='false'
);


was (Author: JIRAUSER289640):
Hi! I ran into a problem. I use flink 1.14.5 and hudi 1.13.0 to read a csv file from hdfs and sink it into a hudi cow table. In both streaming mode and batch mode, if I use bulk_insert, the job can't finish and the instant always stays in an inflight state.

This is my cow table ddl:

create table web_returns_cow (
  rid bigint PRIMARY KEY NOT ENFORCED,
  wr_returned_date_sk bigint,
  wr_returned_time_sk bigint,
  wr_item_sk bigint,
  wr_refunded_customer_sk bigint,
  wr_refunded_cdemo_sk bigint,
  wr_refunded_hdemo_sk bigint,
  wr_refunded_addr_sk bigint,
  wr_returning_customer_sk bigint,
  wr_returning_cdemo_sk bigint,
  wr_returning_hdemo_sk bigint,
  wr_returning_addr_sk bigint,
  wr_web_page_sk bigint,
  wr_reason_sk bigint,
  wr_order_number bigint,
  wr_return_quantity int,
  wr_return_amt float,
  wr_return_tax float,
  wr_return_amt_inc_tax float,
  wr_fee float,
  wr_return_ship_cost float,
  wr_refunded_cash float,
  wr_reversed_charge float,
  wr_account_credit float,
  wr_net_loss float
) PARTITIONED BY (`wr_returned_date_sk`) WITH (
  'connector'='hudi',
  'path'='/tmp/data_gen/web_returns_cow',
  'table.type'='COPY_ON_WRITE',
  'read.start-commit'='earliest',
  'read.streaming.enabled'='false',
  'changelog.enabled'='true',
  'write.precombine'='false',
  'write.precombine.field'='no_precombine',
  'write.operation'='insert',
  'read.tasks'='5',
  'write.tasks'='10',
  'index.type'='BUCKET',
  'metadata.enabled'='false',
  'hoodie.bucket.index.hash.field'='rid',
  'hoodie.bucket.index.num.buckets'='10',
  'index.global.enabled'='false'
);


> Corrupt parquet file created when syncing huge table with 4000+ fields, using
> hudi cow table with bulk_insert type
> -------------------------------------------------------------------------------
>
>                 Key: HUDI-4459
>                 URL: https://issues.apache.org/jira/browse/HUDI-4459
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Leo zhang
>            Assignee: Rajesh Mahindra
>            Priority: Major
>         Attachments: statements.sql, table.ddl
>
>
> I am trying to sync a huge table with 4000+ fields into hudi, using a cow table
> with the bulk_insert operation type.
> The job finishes without any exception, but when I try to read data from the
> table, I get an empty result. The parquet file is corrupted and can't be read
> correctly.
> I traced the problem and found it was caused by SortOperator. After a record is
> serialized in the sorter, all the fields get disordered and are deserialized
> into one field. Finally the wrong record is written into the parquet file,
> making it unreadable.
> Here are a few steps to reproduce the bug in the flink sql-client:
> 1. execute the table ddl (provided in the table.ddl file in the attachments)
> 2. execute the insert statement (provided in the statements.sql file in the attachments)
> 3. execute a select statement to query the hudi table (provided in the statements.sql file in the attachments)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (HUDI-4459) Corrupt parquet file created when syncing huge table with 4000+ fields, using hudi cow table with bulk_insert type
[ https://issues.apache.org/jira/browse/HUDI-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713390#comment-17713390 ]

StarBoy1005 commented on HUDI-4459:
-----------------------------------

Hi! I ran into a problem. I use flink 1.14.5 and hudi 1.13.0 to read a csv file from hdfs and sink it into a hudi cow table. In both streaming mode and batch mode, if I use bulk_insert, the job can't finish and the instant always stays in an inflight state.

This is my cow table ddl:

create table web_returns_cow (
  rid bigint PRIMARY KEY NOT ENFORCED,
  wr_returned_date_sk bigint,
  wr_returned_time_sk bigint,
  wr_item_sk bigint,
  wr_refunded_customer_sk bigint,
  wr_refunded_cdemo_sk bigint,
  wr_refunded_hdemo_sk bigint,
  wr_refunded_addr_sk bigint,
  wr_returning_customer_sk bigint,
  wr_returning_cdemo_sk bigint,
  wr_returning_hdemo_sk bigint,
  wr_returning_addr_sk bigint,
  wr_web_page_sk bigint,
  wr_reason_sk bigint,
  wr_order_number bigint,
  wr_return_quantity int,
  wr_return_amt float,
  wr_return_tax float,
  wr_return_amt_inc_tax float,
  wr_fee float,
  wr_return_ship_cost float,
  wr_refunded_cash float,
  wr_reversed_charge float,
  wr_account_credit float,
  wr_net_loss float
) PARTITIONED BY (`wr_returned_date_sk`) WITH (
  'connector'='hudi',
  'path'='/tmp/data_gen/web_returns_cow',
  'table.type'='COPY_ON_WRITE',
  'read.start-commit'='earliest',
  'read.streaming.enabled'='false',
  'changelog.enabled'='true',
  'write.precombine'='false',
  'write.precombine.field'='no_precombine',
  'write.operation'='insert',
  'read.tasks'='5',
  'write.tasks'='10',
  'index.type'='BUCKET',
  'metadata.enabled'='false',
  'hoodie.bucket.index.hash.field'='rid',
  'hoodie.bucket.index.num.buckets'='10',
  'index.global.enabled'='false'
);


> Corrupt parquet file created when syncing huge table with 4000+ fields, using
> hudi cow table with bulk_insert type
> -------------------------------------------------------------------------------
>
>                 Key: HUDI-4459
>                 URL: https://issues.apache.org/jira/browse/HUDI-4459
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Leo zhang
>            Assignee: Rajesh Mahindra
>            Priority: Major
>         Attachments: statements.sql, table.ddl
>
>
> I am trying to sync a huge table with 4000+ fields into hudi, using a cow table
> with the bulk_insert operation type.
> The job finishes without any exception, but when I try to read data from the
> table, I get an empty result. The parquet file is corrupted and can't be read
> correctly.
> I traced the problem and found it was caused by SortOperator. After a record is
> serialized in the sorter, all the fields get disordered and are deserialized
> into one field. Finally the wrong record is written into the parquet file,
> making it unreadable.
> Here are a few steps to reproduce the bug in the flink sql-client:
> 1. execute the table ddl (provided in the table.ddl file in the attachments)
> 2. execute the insert statement (provided in the statements.sql file in the attachments)
> 3. execute a select statement to query the hudi table (provided in the statements.sql file in the attachments)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)