[ 
https://issues.apache.org/jira/browse/HUDI-5392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17647735#comment-17647735
 ] 

Alexey Kudinkin edited comment on HUDI-5392 at 1/24/23 8:13 AM:
----------------------------------------------------------------

Another contributing issue is that when reading the Bootstrap file we don't
specify the expected schema, so records from the Bootstrap file are read in
the schema decoded from the Parquet file. This is problematic because
validating the Avro schemas also checks their names, and that produces
mismatches since Parquet schemas don't carry the names/namespaces of structs.
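
To illustrate the mismatch, here is a minimal sketch that models record schemas as plain Python dicts rather than real Avro objects. The helper `same_shape`, the record names (`tip`, `element`) and the namespace `hoodie.source` are all hypothetical and chosen only for illustration; this is not Hudi's actual validation code.

```python
# Sketch only: schemas are plain dicts, not real Avro Schema objects.
# All names below (same_shape, "tip", "element", "hoodie.source") are
# hypothetical and used purely for illustration.

def same_shape(a, b, check_names=True):
    """Compare two record schemas field by field.

    With check_names=True the record name and namespace must match as well,
    which mimics a name-sensitive schema comparison.
    """
    if check_names:
        if (a.get("name"), a.get("namespace")) != (b.get("name"), b.get("namespace")):
            return False
    fields_a, fields_b = a["fields"], b["fields"]
    if len(fields_a) != len(fields_b):
        return False
    for x, y in zip(fields_a, fields_b):
        if x["name"] != y["name"]:
            return False
        tx, ty = x["type"], y["type"]
        if isinstance(tx, dict) and isinstance(ty, dict):
            # Nested records are compared recursively.
            if not same_shape(tx, ty, check_names):
                return False
        elif tx != ty:
            return False
    return True


# Schema the reader expects: the struct has a proper name and namespace.
expected = {
    "type": "record",
    "name": "tip",
    "namespace": "hoodie.source",
    "fields": [
        {"name": "amount", "type": "double"},
        {"name": "currency", "type": "string"},
    ],
}

# Schema decoded from the file: same fields, but the struct name is only a
# placeholder because the Parquet schema does not carry the original one.
decoded = {
    "type": "record",
    "name": "element",
    "fields": [
        {"name": "amount", "type": "double"},
        {"name": "currency", "type": "string"},
    ],
}

strict_match = same_shape(expected, decoded)                     # names checked
loose_match = same_shape(expected, decoded, check_names=False)   # names ignored
```

In this sketch the two schemas are structurally identical, yet the name-sensitive comparison rejects them, which is the kind of mismatch that passing the expected schema to the reader would avoid.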


> Fix Bootstrap files reader to configure arrays to be read in the new format
> ---------------------------------------------------------------------------
>
>                 Key: HUDI-5392
>                 URL: https://issues.apache.org/jira/browse/HUDI-5392
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: bootstrap
>            Reporter: Alexey Kudinkin
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.13.0
>
>
> When writing the Bootstrap file we’re using the Spark writer, which writes
> arrays in the new format, while Hudi reads them in the old (Avro-compatible)
> format:
> {code:java}
> // Old
> optional group tip_history (LIST) {
>   repeated group array {
>     optional double amount;
>     optional binary currency (UTF8);
>   }
> }
> // New
> optional group tip_history (LIST) {
>   repeated group list {
>     optional group element {
>       optional double amount;
>       optional binary currency (UTF8);
>     }
>   }
> }
> {code}
>  
> To fix that we need to make sure that Bootstrap files are *always* read in
> the new format (the Spark default), unlike Hudi's own Parquet files, which
> are read in the old format.
> We also need to fix TestDataSourceForBootstrap, as it currently doesn't
> actually assert that the records are written correctly.
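
The difference between the two array layouts quoted above can be sketched as follows. This is a simplified model that represents the Parquet groups as plain Python dicts; the helper `list_element_fields` is hypothetical, handles only struct elements, and does not implement the full backward-compatibility rules of the Parquet LIST specification.

```python
# Sketch only: Parquet groups are modeled as plain dicts. The helper name and
# the dict layout are assumptions made for illustration, not a real API.

def list_element_fields(list_group):
    """Return the fields of the element struct of a LIST-annotated group.

    Handles the two layouts shown in the issue description:
      - old (2-level): the repeated group itself is the element
      - new (3-level): the repeated group "list" wraps a single "element"
    """
    repeated = list_group["repeated"]
    children = repeated["fields"]
    is_three_level = (
        repeated["name"] == "list"
        and len(children) == 1
        and children[0]["name"] == "element"
    )
    if is_three_level:
        # New layout: unwrap the "element" group to reach the struct fields.
        return children[0]["fields"]
    # Old layout: the repeated group's children are the struct fields.
    return children


amount = {"name": "amount", "type": "double"}
currency = {"name": "currency", "type": "binary"}

# Old layout: repeated group array { amount; currency }
old_layout = {
    "name": "tip_history",
    "repeated": {"name": "array", "fields": [amount, currency]},
}

# New layout: repeated group list { optional group element { amount; currency } }
new_layout = {
    "name": "tip_history",
    "repeated": {
        "name": "list",
        "fields": [{"name": "element", "fields": [amount, currency]}],
    },
}

old_fields = [f["name"] for f in list_element_fields(old_layout)]
new_fields = [f["name"] for f in list_element_fields(new_layout)]
```

A reader that assumes the old layout would treat `element` as a field of the struct rather than as a wrapper, which is why the reader for Bootstrap files has to be configured for the layout the files were actually written in.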



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
