prtkgaur commented on code in PR #48345:
URL: https://github.com/apache/arrow/pull/48345#discussion_r2656886342

##########
cpp/src/arrow/util/alp/ALP_Encoding_Specification.md:
##########
@@ -0,0 +1,601 @@
+# ALP Encoding Specification
+
+*Adaptive Lossless floating-Point Compression*
+
+---
+
+## 1. Overview
+
+### 1.1 Supported Types
+
+| Data Type | Integer Type | Max Exponent | Value Range |
+|-----------|--------------|--------------|-------------|
+| FLOAT | INT32 | 10 | +/-2,147,483,520 |
+| DOUBLE | INT64 | 18 | +/-9.22 x 10^18 |
+
+This encoding is adapted from the Adaptive Lossless floating-Point (ALP) compression algorithm described in "ALP: Adaptive Lossless floating-Point Compression" (SIGMOD 2024, https://dl.acm.org/doi/10.1145/3626717).
+
+ALP works by converting floating-point values to integers using decimal scaling, then applying frame of reference (FOR) encoding and bit-packing. Values that cannot be losslessly converted are stored as exceptions. The encoding achieves high compression for decimal-like floating-point data (e.g., monetary values, sensor readings) while remaining fully lossless.
+
+---
+
+## 2. Data Layout
+
+ALP encoding consists of a page-level header followed by one or more encoded vectors. Each vector contains up to 1024 elements.

Review Comment:
   Replace 1024 with the constant specified in AlpConstant file.

##########
cpp/submodules/parquet-testing:
##########

Review Comment:
   umm ideally this submodule shouldn't be attached with this commit. Should revert the changes to this file.

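For readers following the quoted spec overview, the pipeline it describes (decimal scaling to integers, frame of reference, a bit width for packing, exceptions for values that do not round-trip) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code in this PR: `kSketchVectorSize`, `EncodedVectorSketch`, `EncodeSketch` and `DecodeSketch` are hypothetical names, only a single decimal exponent is used (no second factor), decoding divides by the scale instead of multiplying by an inverse, and the bits are not actually packed. The named constant only mirrors the review point that the vector size should come from one definition instead of a literal 1024.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical stand-in for the vector-size constant the review refers to;
// the real constant lives in the PR's ALP constants file under its own name.
constexpr std::size_t kSketchVectorSize = 1024;

struct EncodedVectorSketch {
  int exponent = 0;                // decimal scaling exponent e (scale = 10^e)
  int64_t frame_of_reference = 0;  // smallest encoded integer in the vector
  int bit_width = 0;               // bits needed for the largest FOR delta
  std::vector<uint64_t> deltas;    // encoded integer minus frame_of_reference
  std::vector<std::pair<std::size_t, double>> exceptions;  // (position, raw value)
};

// Bitwise comparison, so that -0.0 and NaN are never treated as round-tripping.
inline bool SameBits(double a, double b) {
  return std::memcmp(&a, &b, sizeof(double)) == 0;
}

// 10^exponent built by repeated multiplication (exact for small exponents).
inline double ScaleFor(int exponent) {
  double scale = 1.0;
  for (int i = 0; i < exponent; ++i) scale *= 10.0;
  return scale;
}

// Assumes 0 <= exponent <= 18 and values.size() <= kSketchVectorSize.
EncodedVectorSketch EncodeSketch(const std::vector<double>& values, int exponent) {
  assert(values.size() <= kSketchVectorSize);
  const double scale = ScaleFor(exponent);
  EncodedVectorSketch out;
  out.exponent = exponent;
  std::vector<int64_t> ints(values.size(), 0);
  std::vector<bool> encodable(values.size(), false);
  bool have_min = false;
  for (std::size_t i = 0; i < values.size(); ++i) {
    const double scaled = values[i] * scale;
    // Conservative magnitude bound keeps the integer exactly representable.
    if (std::isfinite(scaled) && std::fabs(scaled) < 9.0e15) {
      const int64_t candidate = std::llround(scaled);
      if (SameBits(static_cast<double>(candidate) / scale, values[i])) {
        ints[i] = candidate;
        encodable[i] = true;
        if (!have_min || candidate < out.frame_of_reference) {
          out.frame_of_reference = candidate;
          have_min = true;
        }
      }
    }
    if (!encodable[i]) out.exceptions.emplace_back(i, values[i]);
  }
  uint64_t max_delta = 0;
  out.deltas.resize(values.size(), 0);
  for (std::size_t i = 0; i < values.size(); ++i) {
    // Exception slots keep a zero delta as a placeholder.
    if (!encodable[i]) continue;
    out.deltas[i] = static_cast<uint64_t>(ints[i] - out.frame_of_reference);
    if (out.deltas[i] > max_delta) max_delta = out.deltas[i];
  }
  while ((max_delta >> out.bit_width) != 0) ++out.bit_width;
  return out;
}

std::vector<double> DecodeSketch(const EncodedVectorSketch& enc) {
  const double scale = ScaleFor(enc.exponent);
  std::vector<double> out(enc.deltas.size());
  for (std::size_t i = 0; i < enc.deltas.size(); ++i) {
    const int64_t value = enc.frame_of_reference + static_cast<int64_t>(enc.deltas[i]);
    out[i] = static_cast<double>(value) / scale;
  }
  // Patch the positions that could not be converted losslessly.
  for (const auto& exception : enc.exceptions) out[exception.first] = exception.second;
  return out;
}
```

With exponent 2, inputs such as 1.5, 2.25 and 100.01 become the integers 150, 225 and 10001, while a value like 0.30000000000000004 fails the round-trip check and is stored as an exception, so decoding still reproduces every input bit for bit.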
##########
cpp/src/parquet/decoder.cc:
##########
@@ -2323,6 +2327,121 @@ class ByteStreamSplitDecoder<FLBAType> : public ByteStreamSplitDecoderBase<FLBAT
   }
 };
 
+// ----------------------------------------------------------------------
+// ALP decoder (Adaptive Lossless floating-Point)
+
+template <typename DType>
+class AlpDecoder : public TypedDecoderImpl<DType> {
+ public:
+  using Base = TypedDecoderImpl<DType>;
+  using T = typename DType::c_type;
+
+  explicit AlpDecoder(const ColumnDescriptor* descr)
+      : Base(descr, Encoding::ALP), current_offset_{0}, needs_decode_{false} {
+    static_assert(std::is_same<T, float>::value || std::is_same<T, double>::value,
+                  "ALP only supports float and double types");
+  }
+
+  void SetData(int num_values, const uint8_t* data, int len) final {
+    Base::SetData(num_values, data, len);
+    current_offset_ = 0;
+    needs_decode_ = (len > 0 && num_values > 0);
+    decoded_buffer_.clear();
+  }
+
+  int Decode(T* buffer, int max_values) override {
+    // Fast path: decode directly into output buffer if requesting all values
+    if (needs_decode_ && max_values >= this->num_values_) {
+      ::arrow::util::alp::AlpWrapper<T>::Decode(
+          buffer, static_cast<uint64_t>(this->num_values_),
+          reinterpret_cast<const char*>(this->data_), this->len_);
+
+      const int decoded = this->num_values_;
+      this->num_values_ = 0;
+      needs_decode_ = false;
+      return decoded;
+    }
+
+    // Slow path: partial read - decode to intermediate buffer
+    // ALP Bit unpacker needs batches of 64
+    if (needs_decode_) {

Review Comment:
   TODO(prateek) : check with Antoine and other reviewers if there is a way to relax this constraint. Though this has negligible impact on performance.

##########
cpp/src/arrow/util/alp/ALP_Encoding_Specification.md:
##########

Review Comment:
   Please check cpp/src/arrow/util/alp/ALP_Encoding_Specification_terse.md for a more terse spec of the encoding.

-- 
This is an automated message from the Apache Git Service.
