[ https://issues.apache.org/jira/browse/HUDI-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713390#comment-17713390 ]
StarBoy1005 edited comment on HUDI-4459 at 4/18/23 5:08 AM:
------------------------------------------------------------
Hi! I ran into a problem. I am using Flink 1.14.5 and Hudi 1.13.0 to read a CSV file from HDFS and sink it into a Hudi COW table. In both streaming and batch mode, if I use bulk_insert the job never finishes; the instant always stays in the inflight state.

This is my COW table DDL:

create table web_returns_cow (
  rid bigint PRIMARY KEY NOT ENFORCED,
  wr_returned_date_sk bigint,
  wr_returned_time_sk bigint,
  wr_item_sk bigint,
  wr_refunded_customer_sk bigint,
  wr_refunded_cdemo_sk bigint,
  wr_refunded_hdemo_sk bigint,
  wr_refunded_addr_sk bigint,
  wr_returning_customer_sk bigint,
  wr_returning_cdemo_sk bigint,
  wr_returning_hdemo_sk bigint,
  wr_returning_addr_sk bigint,
  wr_web_page_sk bigint,
  wr_reason_sk bigint,
  wr_order_number bigint,
  wr_return_quantity int,
  wr_return_amt float,
  wr_return_tax float,
  wr_return_amt_inc_tax float,
  wr_fee float,
  wr_return_ship_cost float,
  wr_refunded_cash float,
  wr_reversed_charge float,
  wr_account_credit float,
  wr_net_loss float
)
PARTITIONED BY (`wr_returned_date_sk`)
WITH (
  'connector' = 'hudi',
  'path' = '/tmp/data_gen/web_returns_cow',
  'table.type' = 'COPY_ON_WRITE',
  'read.start-commit' = 'earliest',
  'read.streaming.enabled' = 'false',
  'changelog.enabled' = 'true',
  'write.precombine' = 'false',
  'write.precombine.field' = 'no_precombine',
  'write.operation' = 'bulk_insert',
  'read.tasks' = '5',
  'write.tasks' = '10',
  'index.type' = 'BUCKET',
  'metadata.enabled' = 'false',
  'hoodie.bucket.index.hash.field' = 'rid',
  'hoodie.bucket.index.num.buckets' = '10',
  'index.global.enabled' = 'false'
);
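For reference, a minimal sketch of how such a job might be wired up, assuming a filesystem CSV source on HDFS. The source table name, path, and format options below are placeholders; only the web_returns_cow sink comes from the DDL above.

-- Hypothetical CSV source on HDFS; the column list mirrors the web_returns_cow sink.
create table web_returns_csv (
  rid bigint,
  wr_returned_date_sk bigint,
  wr_returned_time_sk bigint,
  wr_item_sk bigint,
  wr_refunded_customer_sk bigint,
  wr_refunded_cdemo_sk bigint,
  wr_refunded_hdemo_sk bigint,
  wr_refunded_addr_sk bigint,
  wr_returning_customer_sk bigint,
  wr_returning_cdemo_sk bigint,
  wr_returning_hdemo_sk bigint,
  wr_returning_addr_sk bigint,
  wr_web_page_sk bigint,
  wr_reason_sk bigint,
  wr_order_number bigint,
  wr_return_quantity int,
  wr_return_amt float,
  wr_return_tax float,
  wr_return_amt_inc_tax float,
  wr_fee float,
  wr_return_ship_cost float,
  wr_refunded_cash float,
  wr_reversed_charge float,
  wr_account_credit float,
  wr_net_loss float
) WITH (
  'connector' = 'filesystem',
  'path' = 'hdfs:///tmp/data_gen/web_returns.csv',
  'format' = 'csv'
);

-- With 'write.operation' = 'bulk_insert' on the sink, this is the statement that
-- never completes: the requested instant remains inflight.
insert into web_returns_cow
select * from web_returns_csv;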
> Corrupt parquet file created when syncing huge table with 4000+ fields, using
> hudi cow table with bulk_insert type
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-4459
>                 URL: https://issues.apache.org/jira/browse/HUDI-4459
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Leo zhang
>            Assignee: Rajesh Mahindra
>            Priority: Major
>         Attachments: statements.sql, table.ddl
>
> I am trying to sync a huge table with 4000+ fields into Hudi, using a COW table
> with the bulk_insert operation type.
> The job finishes without any exception, but when I try to read data from the
> table I get an empty result. The parquet file is corrupted and can't be read
> correctly.
> I traced the problem and found it is caused by the SortOperator. After a record
> is serialized in the sorter, all the fields get disordered and are deserialized
> into a single field, and the resulting wrong record is written into the parquet
> file, making the file unreadable.
> Here are a few steps to reproduce the bug in the Flink sql-client:
> 1. Execute the table DDL (provided in the table.ddl file in the attachments).
> 2. Execute the insert statement (provided in the statements.sql file in the attachments).
> 3. Execute a select statement to query the Hudi table (provided in the statements.sql file in the attachments).
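Since the description pins the corruption on the SortOperator stage of the bulk_insert pipeline, one way to isolate it (a sketch, not a confirmed fix) is to keep bulk_insert but disable its sort/shuffle input step. The table names below are placeholders for the tables in table.ddl and statements.sql, and the two write.bulk_insert.* option keys are assumptions to be verified against FlinkOptions in the Hudi release in use.

-- Isolation test in the Flink sql-client. hudi_cow_sink and wide_source are
-- placeholders; the two write.bulk_insert.* keys are assumed to exist in the
-- Hudi-Flink bundle being tested.
SET 'table.dynamic-table-options.enabled' = 'true';

insert into hudi_cow_sink /*+ OPTIONS(
  'write.bulk_insert.sort_input' = 'false',     -- skip the sort stage (SortOperator)
  'write.bulk_insert.shuffle_input' = 'false'   -- skip the preceding shuffle
) */
select * from wide_source;

-- If the resulting parquet files read back correctly with the sort disabled, the
-- fault is narrowed to record (de)serialization inside the sorter rather than the
-- parquet writer itself.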