mgstahl-sophos opened a new issue, #18242:
URL: https://github.com/apache/hudi/issues/18242
### Bug Description
**What happened:**
Jobs using Protobuf-derived Avro schemas fail during merge with:
```
java.lang.ClassCastException: com.google.protobuf.Timestamp cannot be cast to
org.apache.avro.generic.IndexedRecord
```
hudi-utilities-bundle 1.1.1 ships 735 unshaded `com.google.protobuf.*`
classes. When HoodieMergeHelper passes the writer schema to
HoodieAvroParquetReader, parquet-avro's AvroReadSupport defaults to
SpecificData, which resolves nested record types via
`Class.forName(schema.getFullName())`. For a `google.protobuf.Timestamp` field
the full Avro name is "com.google.protobuf.Timestamp" — the unshaded class is
on the classpath, the lookup succeeds, and the returned Protobuf object does
not implement IndexedRecord, triggering the exception.
In 1.0.2, hudi-utilities-bundle shaded these classes, relocating them to
`org.apache.hudi.com.google.protobuf`. With that relocation in place,
`Class.forName("com.google.protobuf.Timestamp")` threw
ClassNotFoundException and parquet-avro fell back to GenericData.Record.
The relocation rule was dropped in 1.1.1.
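The lookup-and-fallback behavior described above can be illustrated with a minimal, self-contained sketch (the helper `resolveRecordClass` is hypothetical, standing in for the resolution logic inside Avro's SpecificData; it is not Hudi or Avro source):

```java
// Hypothetical sketch of the class-resolution behavior described above:
// the record schema's full name is resolved via Class.forName. If a class
// by that name is on the classpath it is used even when it is not an Avro
// record (the 1.1.1 failure); if the lookup fails, the reader falls back
// to a generic representation (the 1.0.2 behavior under the relocation).
public class ProtobufShadeSketch {
    static String resolveRecordClass(String avroFullName) {
        try {
            Class<?> c = Class.forName(avroFullName);
            // In 1.1.1 the unshaded com.google.protobuf.Timestamp is found
            // here, and the materialized object later fails the cast to
            // IndexedRecord.
            return "specific:" + c.getName();
        } catch (ClassNotFoundException e) {
            // In 1.0.2 the relocation made this lookup fail, so parquet-avro
            // fell back to GenericData.Record.
            return "generic-fallback";
        }
    }

    public static void main(String[] args) {
        // On a classpath without a protobuf jar, the lookup fails and the
        // generic fallback is taken:
        System.out.println(resolveRecordClass("com.google.protobuf.Timestamp"));
        // A class that is present resolves to the specific class:
        System.out.println(resolveRecordClass("java.lang.String"));
    }
}
```

The bug only manifests because the bundle puts the unshaded classes on the classpath: the same schema reads back fine when the lookup fails.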
**What you expected:**
Merge should succeed.
**Steps to reproduce:**
1. Use hudi-utilities-bundle 1.1.1 with a Hudi table whose records contain a
Protobuf-derived Avro schema with a nested `com.google.protobuf.Timestamp`
field (Avro full name "com.google.protobuf.Timestamp"). For example:
```
{
  "type": "record",
  "name": "MyEvent",
  "namespace": "com.example",
  "fields": [
    {
      "name": "payload",
      "type": ["null", {
        "type": "record",
        "name": "MyRecord",
        "namespace": "com.example",
        "fields": [
          {"name": "id", "type": "string", "default": ""},
          {
            "name": "created_at",
            "type": ["null", {
              "type": "record",
              "name": "Timestamp",
              "namespace": "com.google.protobuf",
              "fields": [
                {"name": "seconds", "type": "long"},
                {"name": "nanos", "type": "int"}
              ]
            }],
            "default": null
          }
        ]
      }],
      "default": null
    }
  ]
}
```
2. Run a DeltaStreamer ingestion job that triggers a merge (e.g.
COPY_ON_WRITE with updates).
3. Observe `ClassCastException` in the Spark driver log during
`HoodieMergeHelper.runMerge()`.
4. Workaround: Set
`parquet.avro.data.supplier=org.apache.parquet.avro.GenericDataSupplier` in
Hadoop config to bypass SpecificData entirely.
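The workaround in step 4 can also be applied programmatically; a minimal sketch, assuming a Spark job where the Hadoop configuration is reachable through a SparkSession (the class name `GenericDataWorkaround` is illustrative, not Hudi source):

```java
import org.apache.spark.sql.SparkSession;

// Illustrative sketch: set parquet.avro.data.supplier on the job's Hadoop
// configuration before the merge runs, so parquet-avro materializes records
// with GenericData and never attempts the SpecificData Class.forName lookup.
public class GenericDataWorkaround {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().getOrCreate();
        spark.sparkContext().hadoopConfiguration().set(
            "parquet.avro.data.supplier",
            "org.apache.parquet.avro.GenericDataSupplier");
    }
}
```

Equivalently, the property can be passed on the command line via spark-submit's `--conf spark.hadoop.parquet.avro.data.supplier=org.apache.parquet.avro.GenericDataSupplier`, since Spark copies `spark.hadoop.*` entries into the Hadoop configuration.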
**Proposed Fix:**
Restore the com.google.protobuf relocation rule in
packaging/hudi-utilities-bundle/pom.xml:
```
<relocation>
  <pattern>com.google.protobuf</pattern>
  <shadedPattern>org.apache.hudi.com.google.protobuf</shadedPattern>
</relocation>
```
### Environment
**Hudi version:** 1.1.1
### Logs and Stack Trace
```
Caused by: java.lang.ClassCastException: class com.google.protobuf.Timestamp cannot be cast to class org.apache.avro.generic.IndexedRecord (com.google.protobuf.Timestamp and org.apache.avro.generic.IndexedRecord are in unnamed module of loader 'app')
    at org.apache.avro.generic.GenericData.setField(GenericData.java:851) ~[avro-1.11.4.jar:1.11.4]
    at org.apache.parquet.avro.AvroRecordConverter.set(AvroRecordConverter.java:474) ~[hudi-spark3.5-bundle_2.12-1.1.1-scwx-1.jar:1.1.1-scwx-1]
    at org.apache.parquet.avro.AvroRecordConverter$2.add(AvroRecordConverter.java:139) ~[hudi-spark3.5-bundle_2.12-1.1.1-scwx-1.jar:1.1.1-scwx-1]
    at org.apache.parquet.avro.ParentValueContainer.addLong(ParentValueContainer.java:62) ~[hudi-spark3.5-bundle_2.12-1.1.1-scwx-1.jar:1.1.1-scwx-1]
    at org.apache.parquet.avro.AvroConverters$FieldLongConverter.addLong(AvroConverters.java:158) ~[hudi-spark3.5-bundle_2.12-1.1.1-scwx-1.jar:1.1.1-scwx-1]
    at org.apache.parquet.column.impl.ColumnReaderBase$2$4.writeValue(ColumnReaderBase.java:325) ~[parquet-column-1.13.1.jar:1.13.1]
    at org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:440) ~[parquet-column-1.13.1.jar:1.13.1]
    at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30) ~[parquet-column-1.13.1.jar:1.13.1]
    at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406) ~[parquet-column-1.13.1.jar:1.13.1]
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:234) ~[parquet-hadoop-1.13.1.jar:1.13.1]
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132) ~[parquet-hadoop-1.13.1.jar:1.13.1]
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136) ~[parquet-hadoop-1.13.1.jar:1.13.1]
    at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:49) ~[hudi-spark3.5-bundle_2.12-1.1.1-scwx-1.jar:1.1.1-scwx-1]
    at org.apache.hudi.common.table.read.buffer.KeyBasedFileGroupRecordBuffer.doHasNext(KeyBasedFileGroupRecordBuffer.java:147) ~[hudi-spark3.5-bundle_2.12-1.1.1-scwx-1.jar:1.1.1-scwx-1]
    at org.apache.hudi.common.table.read.buffer.FileGroupRecordBuffer.hasNext(FileGroupRecordBuffer.java:153) ~[hudi-spark3.5-bundle_2.12-1.1.1-scwx-1.jar:1.1.1-scwx-1]
    at org.apache.hudi.common.table.read.HoodieFileGroupReader.hasNext(HoodieFileGroupReader.java:246) ~[hudi-spark3.5-bundle_2.12-1.1.1-scwx-1.jar:1.1.1-scwx-1]
    at org.apache.hudi.common.table.read.HoodieFileGroupReader$HoodieFileGroupReaderIterator.hasNext(HoodieFileGroupReader.java:333) ~[hudi-spark3.5-bundle_2.12-1.1.1-scwx-1.jar:1.1.1-scwx-1]
    at org.apache.hudi.common.util.collection.MappingIterator.hasNext(MappingIterator.java:39) ~[hudi-spark3.5-bundle_2.12-1.1.1-scwx-1.jar:1.1.1-scwx-1]
    at org.apache.hudi.io.FileGroupReaderBasedMergeHandle.doMerge(FileGroupReaderBasedMergeHandle.java:271) ~[hudi-spark3.5-bundle_2.12-1.1.1-scwx-1.jar:1.1.1-scwx-1]
    at org.apache.hudi.io.IOUtils.runMerge(IOUtils.java:121) ~[hudi-spark3.5-bundle_2.12-1.1.1-scwx-1.jar:1.1.1-scwx-1]
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:390) ~[hudi-spark3.5-bundle_2.12-1.1.1-scwx-1.jar:1.1.1-scwx-1]
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:356) ~[hudi-spark3.5-bundle_2.12-1.1.1-scwx-1.jar:1.1.1-scwx-1]
```