[ https://issues.apache.org/jira/browse/HUDI-5392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17647735#comment-17647735 ]
Alexey Kudinkin edited comment on HUDI-5392 at 1/24/23 8:13 AM:
----------------------------------------------------------------

Another contributing issue is that when reading the Bootstrap file we don't specify the expected schema, so records from the Bootstrap file are read in the schema decoded from the Parquet file. This is problematic because when we validate the Avro schemas their names are checked, and this creates mismatches since Parquet schemas don't carry names/namespaces (of the structs).


was (Author: alexey.kudinkin):
    Another contributing issue is that when reading Bootstrap file we don't specify the expected schema and therefore records from the Bootstrap file are read in the schema decode from file's Parquet one. This is problematic b/c when we validate the Avro schemas their corresponding names are checked and this creates mismatches since Parquet schemas don't bear names/namespaces (of the structs)

> Fix Bootstrap files reader to configure arrays to be read in the new format
> ---------------------------------------------------------------------------
>
>                 Key: HUDI-5392
>                 URL: https://issues.apache.org/jira/browse/HUDI-5392
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: bootstrap
>            Reporter: Alexey Kudinkin
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.13.0
>
>
> When writing the Bootstrap file we’re using the Spark writer, which writes
> arrays in the new format, while Hudi reads them in the old (Avro-compatible)
> format:
> {code:java}
> // Old
> optional group tip_history (LIST) {
>   repeated group array {
>     optional double amount;
>     optional binary currency (UTF8);
>   }
> }
> // New
> optional group tip_history (LIST) {
>   repeated group list {
>     optional group element {
>       optional double amount;
>       optional binary currency (UTF8);
>     }
>   }
> }
> {code}
>
> To fix that we need to make sure that Bootstrap files are *always* read in the
> new format (Spark default), unlike Hudi's Parquet files.
> We also need to fix TestDataSourceForBootstrap, as it currently doesn't
> actually assert that the records are written correctly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
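
To make the difference between the two array layouts in the description concrete, here is a minimal Java sketch that tells them apart by shape. It does not use the parquet-mr or Hudi APIs: `SchemaNode`, `Layout`, and `detect` are hypothetical names introduced only for this illustration, and the check covers just the two shapes quoted in the issue, not the full Parquet backward-compatibility rules.

```java
import java.util.Arrays;
import java.util.List;

final class ListLayoutSketch {

    /** Hypothetical, simplified schema node; NOT the parquet-mr Type API. */
    static final class SchemaNode {
        final String repetition;          // "optional", "repeated", ...
        final String name;
        final List<SchemaNode> children;  // empty for primitive fields

        SchemaNode(String repetition, String name, SchemaNode... children) {
            this.repetition = repetition;
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    enum Layout { LEGACY_TWO_LEVEL, STANDARD_THREE_LEVEL, UNKNOWN }

    /** Classifies a LIST-annotated group by the two shapes shown in the issue. */
    static Layout detect(SchemaNode listGroup) {
        if (listGroup.children.size() != 1) {
            return Layout.UNKNOWN;
        }
        SchemaNode repeated = listGroup.children.get(0);
        if (!"repeated".equals(repeated.repetition)) {
            return Layout.UNKNOWN;
        }
        // New format: repeated group "list" wrapping a single "element" field.
        if ("list".equals(repeated.name)
                && repeated.children.size() == 1
                && "element".equals(repeated.children.get(0).name)) {
            return Layout.STANDARD_THREE_LEVEL;
        }
        // Old (Avro-compatible) format: the repeated group "array" is the element itself.
        if ("array".equals(repeated.name) && !repeated.children.isEmpty()) {
            return Layout.LEGACY_TWO_LEVEL;
        }
        return Layout.UNKNOWN;
    }

    /** The struct fields shared by both layouts in the issue description. */
    static SchemaNode[] tipFields() {
        return new SchemaNode[] {
            new SchemaNode("optional", "amount"),
            new SchemaNode("optional", "currency")
        };
    }

    static SchemaNode oldTipHistory() {
        return new SchemaNode("optional", "tip_history",
                new SchemaNode("repeated", "array", tipFields()));
    }

    static SchemaNode newTipHistory() {
        return new SchemaNode("optional", "tip_history",
                new SchemaNode("repeated", "list",
                        new SchemaNode("optional", "element", tipFields())));
    }

    public static void main(String[] args) {
        System.out.println("old: " + detect(oldTipHistory()));
        System.out.println("new: " + detect(newTipHistory()));
    }
}
```

A reader that assumes one layout while the file uses the other will look for the element struct one nesting level off, which is the mismatch the issue describes for Bootstrap files.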