mgstahl-sophos opened a new issue, #18242:
URL: https://github.com/apache/hudi/issues/18242

   ### Bug Description
   
   **What happened:**
   Jobs using Protobuf-derived Avro schemas fail during merge with:
   ```
   java.lang.ClassCastException: com.google.protobuf.Timestamp cannot be cast to
     org.apache.avro.generic.IndexedRecord
   ```
   hudi-utilities-bundle 1.1.1 ships 735 unshaded `com.google.protobuf.*` 
classes. When HoodieMergeHelper passes the writer schema to 
HoodieAvroParquetReader, parquet-avro's AvroReadSupport defaults to 
SpecificData, which resolves nested record types via 
`Class.forName(schema.getFullName())`. For a `google.protobuf.Timestamp` field 
the full Avro name is "com.google.protobuf.Timestamp" — the unshaded class is 
on the classpath, the lookup succeeds, and the returned Protobuf object does 
not implement IndexedRecord, triggering the exception.
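   The failure mode reduces to a classpath lookup. A minimal sketch (note: `resolveRecordClass` is a hypothetical stand-in for Avro's SpecificData resolution, not an actual Avro API):
   ```java
   // Minimal sketch of the class resolution that makes or breaks the merge.
   // `resolveRecordClass` is a hypothetical stand-in for what Avro's
   // SpecificData does with a record schema's full name; not a real Avro API.
   public class SpecificLookupSketch {

       public static Class<?> resolveRecordClass(String avroFullName) {
           try {
               // With unshaded protobuf on the classpath, this succeeds for
               // "com.google.protobuf.Timestamp" and returns a class that does
               // NOT implement org.apache.avro.generic.IndexedRecord.
               return Class.forName(avroFullName);
           } catch (ClassNotFoundException e) {
               // With the 1.0.2 shade relocation, the lookup fails here and
               // Avro falls back to GenericData.Record, which merges cleanly.
               return null;
           }
       }

       public static void main(String[] args) {
           // The relocated name is absent from a plain JVM classpath, so the
           // lookup fails and the generic fallback path is taken.
           System.out.println(
               resolveRecordClass("org.apache.hudi.com.google.protobuf.Timestamp"));
       }
   }
   ```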
   
   The `com.google.protobuf` shade relocation present in hudi-utilities-bundle 
in 1.0.2 relocated all protobuf classes to 
`org.apache.hudi.com.google.protobuf`, making 
Class.forName("com.google.protobuf.Timestamp") fail gracefully and fall back to 
GenericData.Record. This relocation was dropped in 1.1.1.
   
   **What you expected:**
   Merge should succeed, with the nested `com.google.protobuf.Timestamp` record decoded via the GenericData.Record fallback.
   
   **Steps to reproduce:**
   
   1. Use hudi-utilities-bundle 1.1.1 with a Hudi table whose records contain a 
Protobuf-derived Avro schema with a nested `com.google.protobuf.Timestamp` 
field (Avro full name "com.google.protobuf.Timestamp"). For example:
   ```
   {
     "type": "record",
     "name": "MyEvent",
     "namespace": "com.example",
     "fields": [
       {
         "name": "payload",
         "type": ["null", {
           "type": "record",
           "name": "MyRecord",
           "namespace": "com.example",
           "fields": [
             {"name": "id", "type": "string", "default": ""},
             {
               "name": "created_at",
               "type": ["null", {
                 "type": "record",
                 "name": "Timestamp",
                 "namespace": "com.google.protobuf",
                 "fields": [
                   {"name": "seconds", "type": "long"},
                   {"name": "nanos",   "type": "int"}
                 ]
               }],
               "default": null
             }
           ]
         }],
         "default": null
       }
     ]
   }
   ```
   2. Run a DeltaStreamer ingestion job that triggers a merge (e.g. 
COPY_ON_WRITE with updates).
   3. Observe `ClassCastException` in the Spark driver log during 
`HoodieMergeHelper.runMerge()`.

   **Workaround:**
   Set `parquet.avro.data.supplier=org.apache.parquet.avro.GenericDataSupplier` in the Hadoop config to bypass SpecificData entirely.
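   For DeltaStreamer on Spark, one way to apply that property is via a `spark.hadoop.`-prefixed conf, which Spark copies into the Hadoop Configuration. A sketch (the main class, jar name, and trailing arguments are from our setup and are illustrative only):
   ```shell
   # Force parquet-avro to use GenericData instead of SpecificData,
   # avoiding the Class.forName lookup of unshaded protobuf classes.
   spark-submit \
     --class org.apache.hudi.utilities.streamer.HoodieStreamer \
     --conf spark.hadoop.parquet.avro.data.supplier=org.apache.parquet.avro.GenericDataSupplier \
     hudi-utilities-bundle_2.12-1.1.1.jar \
     --table-type COPY_ON_WRITE ...
   ```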
   
   **Proposed Fix:**
   Restore the com.google.protobuf relocation rule in 
packaging/hudi-utilities-bundle/pom.xml:
   ```
   <relocation>
     <pattern>com.google.protobuf</pattern>
     <shadedPattern>org.apache.hudi.com.google.protobuf</shadedPattern>
   </relocation>
   ```
   
   ### Environment
   
   **Hudi version:** 1.1.1
   
   
   
   ### Logs and Stack Trace
   
   ```
   Caused by: java.lang.ClassCastException: class com.google.protobuf.Timestamp cannot be cast to class org.apache.avro.generic.IndexedRecord (com.google.protobuf.Timestamp and org.apache.avro.generic.IndexedRecord are in unnamed module of loader 'app')
       at org.apache.avro.generic.GenericData.setField(GenericData.java:851) ~[avro-1.11.4.jar:1.11.4]
       at org.apache.parquet.avro.AvroRecordConverter.set(AvroRecordConverter.java:474) ~[hudi-spark3.5-bundle_2.12-1.1.1-scwx-1.jar:1.1.1-scwx-1]
       at org.apache.parquet.avro.AvroRecordConverter$2.add(AvroRecordConverter.java:139) ~[hudi-spark3.5-bundle_2.12-1.1.1-scwx-1.jar:1.1.1-scwx-1]
       at org.apache.parquet.avro.ParentValueContainer.addLong(ParentValueContainer.java:62) ~[hudi-spark3.5-bundle_2.12-1.1.1-scwx-1.jar:1.1.1-scwx-1]
       at org.apache.parquet.avro.AvroConverters$FieldLongConverter.addLong(AvroConverters.java:158) ~[hudi-spark3.5-bundle_2.12-1.1.1-scwx-1.jar:1.1.1-scwx-1]
       at org.apache.parquet.column.impl.ColumnReaderBase$2$4.writeValue(ColumnReaderBase.java:325) ~[parquet-column-1.13.1.jar:1.13.1]
       at org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:440) ~[parquet-column-1.13.1.jar:1.13.1]
       at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30) ~[parquet-column-1.13.1.jar:1.13.1]
       at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406) ~[parquet-column-1.13.1.jar:1.13.1]
       at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:234) ~[parquet-hadoop-1.13.1.jar:1.13.1]
       at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132) ~[parquet-hadoop-1.13.1.jar:1.13.1]
       at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136) ~[parquet-hadoop-1.13.1.jar:1.13.1]
       at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:49) ~[hudi-spark3.5-bundle_2.12-1.1.1-scwx-1.jar:1.1.1-scwx-1]
       at org.apache.hudi.common.table.read.buffer.KeyBasedFileGroupRecordBuffer.doHasNext(KeyBasedFileGroupRecordBuffer.java:147) ~[hudi-spark3.5-bundle_2.12-1.1.1-scwx-1.jar:1.1.1-scwx-1]
       at org.apache.hudi.common.table.read.buffer.FileGroupRecordBuffer.hasNext(FileGroupRecordBuffer.java:153) ~[hudi-spark3.5-bundle_2.12-1.1.1-scwx-1.jar:1.1.1-scwx-1]
       at org.apache.hudi.common.table.read.HoodieFileGroupReader.hasNext(HoodieFileGroupReader.java:246) ~[hudi-spark3.5-bundle_2.12-1.1.1-scwx-1.jar:1.1.1-scwx-1]
       at org.apache.hudi.common.table.read.HoodieFileGroupReader$HoodieFileGroupReaderIterator.hasNext(HoodieFileGroupReader.java:333) ~[hudi-spark3.5-bundle_2.12-1.1.1-scwx-1.jar:1.1.1-scwx-1]
       at org.apache.hudi.common.util.collection.MappingIterator.hasNext(MappingIterator.java:39) ~[hudi-spark3.5-bundle_2.12-1.1.1-scwx-1.jar:1.1.1-scwx-1]
       at org.apache.hudi.io.FileGroupReaderBasedMergeHandle.doMerge(FileGroupReaderBasedMergeHandle.java:271) ~[hudi-spark3.5-bundle_2.12-1.1.1-scwx-1.jar:1.1.1-scwx-1]
       at org.apache.hudi.io.IOUtils.runMerge(IOUtils.java:121) ~[hudi-spark3.5-bundle_2.12-1.1.1-scwx-1.jar:1.1.1-scwx-1]
       at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:390) ~[hudi-spark3.5-bundle_2.12-1.1.1-scwx-1.jar:1.1.1-scwx-1]
       at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:356) ~[hudi-spark3.5-bundle_2.12-1.1.1-scwx-1.jar:1.1.1-scwx-1]
   ```

