[ 
https://issues.apache.org/jira/browse/HUDI-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7874:
---------------------------------
    Labels: pull-request-available  (was: )

> Fail to read 2-level structure Parquet
> --------------------------------------
>
>                 Key: HUDI-7874
>                 URL: https://issues.apache.org/jira/browse/HUDI-7874
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Vitali Makarevich
>            Priority: Major
>              Labels: pull-request-available
>
> If I explicitly set 
> {{"spark.hadoop.parquet.avro.write-old-list-structure", "false"}}, which is 
> the only way to write nulls inside arrays, Hudi starts to write Parquet files 
> with the following schema inside:
>  {{   required group internal_list (LIST) \{
>     repeated group list {
>       required int64 element;
>     }
>   }}}
>  
> But files produced before I set 
> {{"spark.hadoop.parquet.avro.write-old-list-structure", "false"}} have the 
> following schema inside:
>  {{  required group internal_list (LIST) \{
>     repeated int64 array;
>   }}}
>  
> Hudi 0.14.x, at least, fails to read records from such files, with the 
> exception
> {{Caused by: java.lang.RuntimeException: Null-value for required field: }}
> even though the contents of the arrays are {{not null}} (they cannot be null, 
> in fact, since Avro requires 
> {{spark.hadoop.parquet.avro.write-old-list-structure}} = {{false}} to write 
> {{null}}s).
> h3. Expected behavior
> Taken from Hudi 0.12.1 (not sure what exactly broke that):
>  # If I have a file with the 2-level structure and an update arrives with 
> "spark.hadoop.parquet.avro.write-old-list-structure", "false" (it does not 
> matter whether the array contains nulls; both cases behave the same), the 
> file is rewritten into the 3-level structure. ({*}fails in 0.14.1{*})
>  # If I have a file with the 3-level structure containing nulls and an update 
> arrives (with or without nulls), it is read and written correctly.
> A simple reproduction of the issue can be found here:
> [https://github.com/VitoMakarevich/hudi-issue-014]
> Most likely, the problem appeared after a change in Hudi that made values 
> from the Hadoop conf propagate into the Reader instance (likely they were 
> not propagated before).
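
To make the difference between the two layouts quoted in the description concrete, here is a minimal Python sketch that classifies a LIST-annotated group from its schema text as 2-level or 3-level. It loosely follows the backward-compatibility rules in the Parquet format specification and only handles the simple shapes shown above. The function name list_layout and the constants are hypothetical and are not part of Hudi or parquet-avro.

```python
import re

TWO_LEVEL = "2-level"
THREE_LEVEL = "3-level"


def list_layout(schema: str) -> str:
    """Classify the first LIST-annotated group in a Parquet schema string.

    Hypothetical helper for illustration only: it handles the flat shapes
    quoted in the issue, not arbitrary nested schemas.
    """
    outer = re.search(r"group\s+(\w+)\s+\(LIST\)", schema)
    if outer is None:
        raise ValueError("no LIST-annotated group found")
    list_name = outer.group(1)
    body = schema[outer.end():]

    repeated = re.search(r"repeated\s+(\w+)\s+(\w+)", body)
    if repeated is None:
        raise ValueError("LIST group has no repeated field")
    kind, name = repeated.groups()

    # A repeated primitive is itself the element: legacy 2-level layout.
    if kind != "group":
        return TWO_LEVEL

    inner = re.match(r"\s*\{([^{}]*)\}", body[repeated.end():])
    if inner is None:
        raise ValueError("nested groups are not handled by this sketch")
    field_count = inner.group(1).count(";")

    # A repeated group with several fields, or one named "array" or
    # "<list name>_tuple", is treated as the element type (2-level).
    if field_count != 1 or name in ("array", f"{list_name}_tuple"):
        return TWO_LEVEL
    return THREE_LEVEL


# The two schemas from the description above.
NEW_LAYOUT = """required group internal_list (LIST) {
  repeated group list {
    required int64 element;
  }
}"""
OLD_LAYOUT = """required group internal_list (LIST) {
  repeated int64 array;
}"""
```

Calling list_layout on the schema of each file in a table would show whether old and new layouts are mixed, which is the situation in which the read failure is reported.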



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
