github-actions[bot] commented on code in PR #64723:
URL: https://github.com/apache/doris/pull/64723#discussion_r3471309043
##########
be/src/format/transformer/vparquet_transformer.cpp:
##########
@@ -123,6 +123,14 @@ void ParquetBuildHelper::build_compression_type(
builder.compression(arrow::Compression::LZ4);
break;
}
+ case TParquetCompressionType::LZ4_HADOOP: {
+ // Hadoop-framed LZ4 -> Parquet thrift codec "LZ4" (deprecated). This matches what
+ // Spark/Iceberg writes for `write.parquet.compression-codec=lz4` and is readable by
+ // Trino/Spark/Doris. Note arrow::Compression::LZ4 (above) instead emits LZ4_RAW, which
+ // Trino cannot read.
+ builder.compression(arrow::Compression::LZ4_HADOOP);
Review Comment:
This still does not guarantee the Spark/Trino compatibility the comment is
relying on. Doris is pinned to Arrow 17.0.0, and Arrow's
`Lz4HadoopCodec::Compress` writes one Hadoop LZ4 block for the whole Parquet
page/dictionary page. Upstream Arrow issue apache/arrow#49641 documents that
JVM readers using parquet-mr/Hadoop fail once that block decompresses above
Hadoop's 256 KiB LZ4 buffer. Arrow's writer defaults both the data page size
and the dictionary page size limit to 1 MiB, so ordinary Hive/Iceberg LZ4 files
that contain a large page can still be unreadable by Spark/Trino even though
the footer says `LZ4`.
Please either patch/upgrade the Arrow Hadoop-LZ4 writer to split blocks, or
cap Doris' `LZ4_HADOOP` Parquet data and dictionary pages to a
Hadoop-compatible size, and add a large-page JVM/Spark/Trino coverage case. The
current tests only write one or three tiny rows and check footer metadata, so
they would not catch this compatibility failure.
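To make the two options concrete, here is a minimal, self-contained sketch of the size arithmetic involved. It is not Doris or Arrow code: `cap_page_size` and `split_into_blocks` are hypothetical helper names, and the 256 KiB constant is taken from the Hadoop buffer size mentioned above. In a real change the capped value would presumably be passed to the Parquet writer properties (Arrow exposes `data_pagesize` and `dictionary_pagesize_limit` on `parquet::WriterProperties::Builder`), and the page size there is most likely a soft threshold, so some headroom may be needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hadoop's LZ4 buffer size as cited in this review (256 KiB).
constexpr std::size_t kHadoopLz4BufferBytes = 256 * 1024;

// Hypothetical helper for the "cap the page size" option: never request a
// page larger than what a Hadoop-based reader can decompress in one block.
constexpr std::size_t cap_page_size(std::size_t requested_bytes) {
    return std::min(requested_bytes, kHadoopLz4BufferBytes);
}

// Hypothetical helper for the "split blocks" option: divide an uncompressed
// page of total_bytes into (offset, length) ranges of at most max_block_bytes.
// A patched writer would compress each range as its own Hadoop LZ4 block;
// the compression and framing themselves are out of scope for this sketch.
std::vector<std::pair<std::size_t, std::size_t>> split_into_blocks(
        std::size_t total_bytes,
        std::size_t max_block_bytes = kHadoopLz4BufferBytes) {
    assert(max_block_bytes > 0);
    std::vector<std::pair<std::size_t, std::size_t>> blocks;
    for (std::size_t offset = 0; offset < total_bytes; offset += max_block_bytes) {
        blocks.emplace_back(offset, std::min(max_block_bytes, total_bytes - offset));
    }
    return blocks;
}
```

With these assumptions, a default 1 MiB page is capped to 256 KiB by `cap_page_size`, or split into four 256 KiB ranges by `split_into_blocks`. A large-page regression test would need at least one page whose uncompressed size exceeds `kHadoopLz4BufferBytes`, which the current one-row and three-row cases do not produce.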
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]