voonhous commented on issue #6849:
URL: https://github.com/apache/hudi/issues/6849#issuecomment-1661734683

   # Root cause
   Was looking into this issue and didn't really understand the explanation. 
For anyone who's encountering this issue and looking for the root cause, I 
hope this is helpful:
   
   If your Hudi **MOR** table has multiple `ARRAY<STRUCT<xxx>>` columns, you'll 
encounter a `can't redefine: xxx` error when performing compaction. This error 
will ONLY be thrown if you are on Spark 3.1.
   
   This is because Spark 3.1 + Hudi uses `parquet-avro-1.10.1`. When creating a 
`ClosableIterator<GenericRecord> baseFileRecordIterator`, a `readerSchema` is 
obtained by converting the Parquet schema to Avro using 
`parquet-avro-1.10.1`'s `AvroSchemaConverter#convert`.
   
   This is done in the code snippet below in `ParquetUtils`.
   
   ```java
   @Override
   public Schema readAvroSchema(Configuration conf, Path parquetFilePath) {
     // Read the Parquet schema from the file footer and convert it to Avro
     MessageType parquetSchema = readSchema(conf, parquetFilePath);
     return new AvroSchemaConverter(conf).convert(parquetSchema);
   }
   ```
   
   In `parquet-avro-1.10.1`, the conversion will create UNION (array) fields 
containing structs/records with the same name, as shown below:
   
   # Column 0
   
   ```json
   
["null",{"type":"array","items":{"type":"record","name":"array","fields":[{"name":"field_0","type":["null","long"],"default":null},{"name":"field_1","type":["null","long"],"default":null},{"name":"field_2","type":["null","string"],"default":null},{"name":"field_3","type":["null","long"],"default":null}]}}]
   ```
   
   # Column 1
   
   ```json
   
["null",{"type":"array","items":{"type":"record","name":"array","fields":[{"name":"field_0","type":["null",{"type":"array","items":"long"}],"default":null},{"name":"field_1","type":["null","int"],"default":null}]}}]
   ```
   
   As can be seen, the name of the record is `array` for both columns 0 and 1, 
which is not legal per Avro's spec (record names must be unique within a 
schema) and is also not in line with the "standard" that Hudi is trying to 
adhere to: https://github.com/apache/hudi/pull/8587
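
   The failure mode can be illustrated without any Parquet/Avro dependencies. 
Below is a hypothetical, simplified model of the name registry that Avro's 
schema parser keeps while resolving a schema: it tracks each record's full 
name (namespace + "." + name) and rejects a duplicate, which is exactly what 
happens when both array columns define a record named `array`. The class and 
method names here are illustrative, not Avro's actual internals.

   ```java
   import java.util.HashMap;
   import java.util.Map;

   // Simplified stand-in for Avro's duplicate-name check. Real Avro throws
   // SchemaParseException("Can't redefine: <fullname>") in this situation.
   public class RedefineSketch {
     static final Map<String, String> DEFINED = new HashMap<>();

     static void define(String namespace, String name) {
       String fullName = (namespace == null || namespace.isEmpty())
           ? name
           : namespace + "." + name;
       if (DEFINED.containsKey(fullName)) {
         throw new IllegalStateException("Can't redefine: " + fullName);
       }
       DEFINED.put(fullName, "record");
     }

     public static void main(String[] args) {
       // Column 0's element record: name "array", no namespace.
       define(null, "array");
       // Column 1's element record is ALSO named "array" with no
       // namespace, so its full name collides with column 0's.
       try {
         define(null, "array");
       } catch (IllegalStateException e) {
         System.out.println(e.getMessage()); // Can't redefine: array
       }
     }
   }
   ```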
   
   
   # Fix
   While @ad1happy2go has mentioned that it works with Spark 3.2 (from my 
understanding, Spark 3.2 uses a version of `parquet-avro` that includes the 
required fix: https://issues.apache.org/jira/browse/PARQUET-1441), if Hudi 
wants to properly fix this, the `parquet-avro` version will need to be 
upgraded to `1.11.0`. 
   
   I am not sure what the risk here is, though, as we might need to shade 
`parquet-avro` in hudi-spark-bundle since there might be class conflicts with 
Spark 3.1.
   
   Using `parquet-avro-1.11.0`, the Avro schema strings will look something 
like this:
   
   # Column 0 fix
   
   ```json
   
["null",{"type":"array","items":{"type":"record","name":"array","fields":[{"name":"field_0","type":["null","long"],"default":null},{"name":"field_1","type":["null","long"],"default":null},{"name":"field_2","type":["null","string"],"default":null},{"name":"field_3","type":["null","long"],"default":null}]}}]
   ```
   
   # Column 1 fix
   
   ```json
   
["null",{"type":"array","items":{"type":"record","name":"array","namespace":"array2","fields":[{"name":"field_0","type":["null",{"type":"array","items":"long"}],"default":null},{"name":"field_1","type":["null","int"],"default":null}]}}]
   ```
   
   Notice the `namespace` key on column 1's record: it makes its full name 
(`array2.array`) distinct from column 0's (`array`), so the two records no 
longer collide.
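
   To show why the `namespace` key resolves the collision, here is a small 
sketch (same simplified full-name model as above, not Avro's actual code): 
with a namespace attached to column 1's record, the two full names differ 
even though the simple name `array` is reused.

   ```java
   import java.util.HashSet;
   import java.util.Set;

   // Full name = namespace + "." + name (or just name when there is no
   // namespace), mirroring how Avro disambiguates records.
   public class NamespaceSketch {
     static String fullName(String namespace, String name) {
       return (namespace == null || namespace.isEmpty())
           ? name
           : namespace + "." + name;
     }

     public static void main(String[] args) {
       Set<String> defined = new HashSet<>();
       // Column 0: record "array" with no namespace -> "array".
       boolean first = defined.add(fullName(null, "array"));
       // Column 1: record "array" in namespace "array2" -> "array2.array".
       boolean second = defined.add(fullName("array2", "array"));
       System.out.println(first && second); // prints "true"
     }
   }
   ```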

