voonhous commented on code in PR #18938:
URL: https://github.com/apache/hudi/pull/18938#discussion_r3458800167
##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestVariantDataType.scala:
##########
@@ -142,6 +150,8 @@ class TestVariantDataType extends HoodieSparkSqlTestBase {
| primaryKey = 'id',
| type = 'mor',
| preCombineField = 'ts',
+ | hoodie.parquet.variant.write.shredding.enabled = 'true',
Review Comment:
`withRecordType` only swaps the record merger and log-block format, and
Spark compaction reads base files via `SparkReaderContextFactory` ->
`SparkFileFormatInternalRowReaderContext` (InternalRow), so
`HoodieAvroParquetReader` (where
`HoodieVariantReconstruction`/`rebuildVariantRecord` run) is never reached on
either record type.
I have added `TestSpark4VariantShreddingProvider`, a direct shred -> rebuild
round-trip over the provider covering scalar (numeric/string/boolean/decimal),
object, and array shapes - this exercises `rebuildVariantRecord` and the
`AvroVariantRow`/`AvroObjectRow`/`AvroArrayRow` accessors. The two `create()`
fail-fast branches are covered by `TestHoodieVariantReconstruction` (added for
your other comment). Also corrected the misleading comment on the SQL test to
say it covers the Spark InternalRow (native) read of a shredded base file, not
the Avro path.
Note on the Java-client suggestion: `Spark4VariantShreddingProvider` is the
only `VariantShreddingProvider`, so a Java/Flink compaction over a shredded
base file would hit the no-provider fail-fast branch rather than reconstruct -
which is why I went with the direct unit test for the reconstruction coverage.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]