[ https://issues.apache.org/jira/browse/HUDI-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713390#comment-17713390 ]

StarBoy1005 edited comment on HUDI-4459 at 4/18/23 5:08 AM:
------------------------------------------------------------

Hi! I ran into a problem. I am using Flink 1.14.5 and Hudi 1.13.0 to read a CSV file
from HDFS and sink it into a Hudi COW table. In both streaming and batch mode, if I
use bulk_insert the job never finishes; the instant always stays in the inflight state.
This is my COW table DDL:

create table web_returns_cow (
   rid bigint PRIMARY KEY NOT ENFORCED,
   wr_returned_date_sk bigint,
   wr_returned_time_sk bigint,
   wr_item_sk bigint,
   wr_refunded_customer_sk bigint,
   wr_refunded_cdemo_sk bigint,
   wr_refunded_hdemo_sk bigint,
   wr_refunded_addr_sk bigint,
   wr_returning_customer_sk bigint,
   wr_returning_cdemo_sk bigint,
   wr_returning_hdemo_sk bigint,
   wr_returning_addr_sk bigint,
   wr_web_page_sk bigint,
   wr_reason_sk bigint,
   wr_order_number bigint,
   wr_return_quantity int,
   wr_return_amt float,
   wr_return_tax float,
   wr_return_amt_inc_tax float,
   wr_fee float,
   wr_return_ship_cost float,
   wr_refunded_cash float,
   wr_reversed_charge float,
   wr_account_credit float,
   wr_net_loss float
)
PARTITIONED BY (`wr_returned_date_sk`)
WITH (
'connector'='hudi',
'path'='/tmp/data_gen/web_returns_cow',
'table.type'='COPY_ON_WRITE',
'read.start-commit'='earliest',
'read.streaming.enabled'='false',
'changelog.enabled'='true',
'write.precombine'='false',
'write.precombine.field'='no_precombine',
'write.operation'='bulk_insert',
'read.tasks'='5',
'write.tasks'='10',
'index.type'='BUCKET',
'metadata.enabled'='false',
'hoodie.bucket.index.hash.field'='rid',
'hoodie.bucket.index.num.buckets'='10',
'index.global.enabled'='false'
);
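
For context, a minimal sketch of the rest of the pipeline described above: a filesystem
CSV source table plus an INSERT INTO the Hudi table. The source table name, the HDFS
path, and the csv options here are placeholders, not the exact statements from the job:

create table web_returns_src (
   rid bigint,
   wr_returned_date_sk bigint,
   wr_returned_time_sk bigint,
   wr_item_sk bigint,
   wr_refunded_customer_sk bigint,
   wr_refunded_cdemo_sk bigint,
   wr_refunded_hdemo_sk bigint,
   wr_refunded_addr_sk bigint,
   wr_returning_customer_sk bigint,
   wr_returning_cdemo_sk bigint,
   wr_returning_hdemo_sk bigint,
   wr_returning_addr_sk bigint,
   wr_web_page_sk bigint,
   wr_reason_sk bigint,
   wr_order_number bigint,
   wr_return_quantity int,
   wr_return_amt float,
   wr_return_tax float,
   wr_return_amt_inc_tax float,
   wr_fee float,
   wr_return_ship_cost float,
   wr_refunded_cash float,
   wr_reversed_charge float,
   wr_account_credit float,
   wr_net_loss float
) WITH (
   'connector'='filesystem',
   'path'='hdfs:///tmp/data_gen/web_returns.dat',   -- placeholder HDFS path
   'format'='csv',
   'csv.field-delimiter'='|'                        -- adjust to the actual file layout
);

-- runs in batch or streaming mode depending on execution.runtime-mode
insert into web_returns_cow select * from web_returns_src;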


was (Author: JIRAUSER289640):
Hi! I ran into a problem. I am using Flink 1.14.5 and Hudi 1.13.0 to read a CSV file
from HDFS and sink it into a Hudi COW table. In both streaming and batch mode, if I
use bulk_insert the job never finishes; the instant always stays in the inflight state.
This is my COW table DDL:

create table web_returns_cow (
   rid bigint PRIMARY KEY NOT ENFORCED,
   wr_returned_date_sk bigint,
   wr_returned_time_sk bigint,
   wr_item_sk bigint,
   wr_refunded_customer_sk bigint,
   wr_refunded_cdemo_sk bigint,
   wr_refunded_hdemo_sk bigint,
   wr_refunded_addr_sk bigint,
   wr_returning_customer_sk bigint,
   wr_returning_cdemo_sk bigint,
   wr_returning_hdemo_sk bigint,
   wr_returning_addr_sk bigint,
   wr_web_page_sk bigint,
   wr_reason_sk bigint,
   wr_order_number bigint,
   wr_return_quantity int,
   wr_return_amt float,
   wr_return_tax float,
   wr_return_amt_inc_tax float,
   wr_fee float,
   wr_return_ship_cost float,
   wr_refunded_cash float,
   wr_reversed_charge float,
   wr_account_credit float,
   wr_net_loss float
)
PARTITIONED BY (`wr_returned_date_sk`)
WITH (
'connector'='hudi',
'path'='/tmp/data_gen/web_returns_cow',
'table.type'='COPY_ON_WRITE',
'read.start-commit'='earliest',
'read.streaming.enabled'='false',
'changelog.enabled'='true',
'write.precombine'='false',
'write.precombine.field'='no_precombine',
'write.operation'='insert',
'read.tasks'='5',
'write.tasks'='10',
'index.type'='BUCKET',
'metadata.enabled'='false',
'hoodie.bucket.index.hash.field'='rid',
'hoodie.bucket.index.num.buckets'='10',
'index.global.enabled'='false'
);

> Corrupt parquet file created when syncing huge table with 4000+ fields, using
> hudi cow table with bulk_insert type
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-4459
>                 URL: https://issues.apache.org/jira/browse/HUDI-4459
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Leo zhang
>            Assignee: Rajesh Mahindra
>            Priority: Major
>         Attachments: statements.sql, table.ddl
>
>
> I am trying to sync a huge table with 4000+ fields into Hudi, using a COW table
> with the bulk_insert operation type.
> The job finishes without any exception, but when I try to read data from the
> table I get an empty result. The parquet file is corrupted and cannot be read
> correctly.
> I traced the problem and found that it is caused by the SortOperator: after a
> record is serialized in the sorter, the fields get scrambled and are
> deserialized into a single field. The corrupted record is then written to the
> parquet file, making it unreadable.
> Here are a few steps to reproduce the bug in the Flink sql-client:
> 1. Execute the table DDL (provided in the table.ddl file in the attachments).
> 2. Execute the insert statement (provided in the statements.sql file in the
> attachments).
> 3. Execute a select statement to query the Hudi table (provided in the
> statements.sql file in the attachments).
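
Since the description above points at the sort step of the bulk_insert pipeline, one way
to narrow the problem down from the sql-client is to re-run the insert with the bulk-insert
sort stage disabled and check whether the written parquet files are still corrupt. This is
only a diagnostic sketch: the option key 'write.bulk_insert.sort_input' and the table names
are assumptions here and should be verified against the FlinkOptions of the Hudi version
in use:

-- allow per-statement table options via hints (the key exists in Flink; the default differs by version)
SET 'table.dynamic-table-options.enabled' = 'true';

-- 'hudi_target' and 'wide_source' stand in for the tables defined in the attached
-- table.ddl and statements.sql; the OPTIONS key is an assumed Hudi Flink option
insert into hudi_target /*+ OPTIONS(
   'write.operation'='bulk_insert',
   'write.bulk_insert.sort_input'='false') */
select * from wide_source;

If the files become readable with the sort disabled, that isolates the corruption to the
sort path described above; if not, the problem lies elsewhere in the bulk_insert write path.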



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
