iemejia opened a new pull request, #55930:
URL: https://github.com/apache/spark/pull/55930
### What changes were proposed in this pull request?
Documents and tests that Spark supports writing Parquet files with
BYTE_STREAM_SPLIT encoding for FLOAT and DOUBLE columns. This encoding
de-interleaves value bytes into per-position streams, making each stream highly
compressible -- float/double columns typically see 2-4x better compression than
PLAIN+zstd for time-series and scientific data.
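The byte layout described above can be sketched in a few lines of pure Python. This is only an illustration of the de-interleaving idea, not parquet-mr's encoder: the names `bss_encode` and `bss_decode` are hypothetical, values are assumed to be little-endian IEEE 754, and nulls (which Parquet tracks separately through definition levels) are not modeled.

```python
import struct


def bss_encode(values, fmt="f"):
    """De-interleave fixed-width floats into one byte stream per byte position.

    fmt is a struct code: "f" for 4-byte FLOAT, "d" for 8-byte DOUBLE.
    """
    width = struct.calcsize("<" + fmt)  # 4 for "f", 8 for "d"
    # PLAIN-style layout: little-endian values stored back to back.
    plain = struct.pack(f"<{len(values)}{fmt}", *values)
    # Stream k holds byte k of every value; streams are concatenated in order.
    return b"".join(plain[k::width] for k in range(width))


def bss_decode(data, fmt="f"):
    """Re-interleave the per-position streams back into the original values."""
    width = struct.calcsize("<" + fmt)
    count = len(data) // width
    plain = bytearray(len(data))
    for k in range(width):
        # Scatter stream k back to byte position k of each value.
        plain[k::width] = data[k * count:(k + 1) * count]
    return list(struct.unpack(f"<{count}{fmt}", bytes(plain)))


if __name__ == "__main__":
    encoded = bss_encode([1.0, 2.5, -0.5])
    print(encoded.hex())
    print(bss_decode(encoded))
```

For slowly varying data, the streams holding the sign, exponent, and high mantissa bytes tend to contain long runs of similar values, which is what a general-purpose compressor can exploit.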
No Spark code changes are needed because parquet-mr (1.17.0) already
includes the BYTE_STREAM_SPLIT encoder, and Spark's existing configuration
passthrough mechanism (`DataSourceUtils.mergeWriteOptionsIntoHadoopConf`)
already forwards arbitrary `parquet.*` properties to the writer. Setting
`parquet.enable.bytestreamsplit=true` (with dictionary disabled) activates BSS
encoding for FLOAT and DOUBLE columns.
This PR adds a test to `ParquetEncodingSuite` that:
1. Writes 8193 rows with INT32, INT64, FLOAT, DOUBLE, and nullable
FLOAT/DOUBLE columns using BSS encoding
2. Verifies via Parquet metadata that FLOAT/DOUBLE columns use
`BYTE_STREAM_SPLIT` encoding while INT32/INT64 columns do not (the boolean flag
only enables the `FLOATING_POINT` mode)
3. Reads data back and verifies round-trip correctness including null
handling
Users can enable BSS encoding via any of:
- `.option("parquet.enable.bytestreamsplit", "true")` on `DataFrameWriter`
- `withSQLConf("parquet.enable.bytestreamsplit" -> "true")`
- `spark.hadoop.parquet.enable.bytestreamsplit=true` in SparkConf
Dictionary encoding must be disabled (`parquet.enable.dictionary=false`) for
BSS to take effect, because BSS replaces the fallback PLAIN encoding, not
dictionary encoding.
### Why are the changes needed?
BYTE_STREAM_SPLIT encoding is particularly effective for floating-point data
in time-series, scientific, and IoT workloads. The encoding is already
supported by parquet-mr and already works through Spark's config passthrough,
but this was undocumented and untested. This PR adds test coverage to prevent
regressions and documents the capability.
The read-side vectorized decoder for BYTE_STREAM_SPLIT was added in
SPARK-56894 (PR #55921).
### Does this PR introduce _any_ user-facing change?
No. The capability already works through the existing config passthrough.
This PR only adds test coverage.
### How was this patch tested?
New test case in `ParquetEncodingSuite`: "BYTE_STREAM_SPLIT encoding for
float and double columns". It covers the write path, metadata verification, and
round-trip read correctness, including nullable columns.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: OpenCode with Claude Opus (claude-opus-4.6)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]