prtkgaur commented on code in PR #48345:
URL: https://github.com/apache/arrow/pull/48345#discussion_r2656886342


##########
cpp/src/arrow/util/alp/ALP_Encoding_Specification.md:
##########
@@ -0,0 +1,601 @@
+# ALP Encoding Specification
+
+*Adaptive Lossless floating-Point Compression*
+
+---
+
+## 1. Overview
+
+### 1.1 Supported Types
+
+| Data Type | Integer Type | Max Exponent | Value Range |
+|-----------|--------------|--------------|-------------|
+| FLOAT     | INT32        | 10           | +/-2,147,483,520 |
+| DOUBLE    | INT64        | 18           | +/-9.22 x 10^18 |
+
+This encoding is adapted from the Adaptive Lossless floating-Point (ALP) compression algorithm described in "ALP: Adaptive Lossless floating-Point Compression" (SIGMOD 2024, https://dl.acm.org/doi/10.1145/3626717).
+
+ALP works by converting floating-point values to integers using decimal scaling, then applying frame of reference (FOR) encoding and bit-packing. Values that cannot be losslessly converted are stored as exceptions. The encoding achieves high compression for decimal-like floating-point data (e.g., monetary values, sensor readings) while remaining fully lossless.
+
+---
+
+## 2. Data Layout
+
+ALP encoding consists of a page-level header followed by one or more encoded vectors. Each vector contains up to 1024 elements.

Review Comment:
   Replace the hardcoded 1024 with the constant defined in the AlpConstant file.
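
   One way to read this suggestion, as a minimal sketch: the vector size lives in a single named constant, and anything that depends on it is derived from that name rather than from a literal. The identifier `kAlpVectorSize` and both helper functions are hypothetical; the actual name in the AlpConstant file may differ.

```cpp
#include <cstdint>

// Hypothetical name; stands in for whatever the AlpConstant file defines.
constexpr uint64_t kAlpVectorSize = 1024;

// Number of vectors needed to hold num_elements values (ceiling division).
constexpr uint64_t NumVectors(uint64_t num_elements) {
  return (num_elements + kAlpVectorSize - 1) / kAlpVectorSize;
}

// Element count of the final vector, which may be only partially filled.
constexpr uint64_t LastVectorSize(uint64_t num_elements) {
  return num_elements == 0 ? 0 : ((num_elements - 1) % kAlpVectorSize) + 1;
}

// Compile-time check that the helpers agree with the constant.
static_assert(NumVectors(kAlpVectorSize + 1) == 2, "one full vector plus one element");
```

   With this shape, changing the vector size means touching one definition, and the spec can name the constant instead of repeating the number.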



##########
cpp/submodules/parquet-testing:
##########


Review Comment:
   Ideally, this submodule change shouldn't be part of this commit.
   The changes to this file should be reverted.



##########
cpp/src/parquet/decoder.cc:
##########
@@ -2323,6 +2327,121 @@ class ByteStreamSplitDecoder<FLBAType> : public ByteStreamSplitDecoderBase<FLBAT
   }
 };
 
+// ----------------------------------------------------------------------
+// ALP decoder (Adaptive Lossless floating-Point)
+
+template <typename DType>
+class AlpDecoder : public TypedDecoderImpl<DType> {
+ public:
+  using Base = TypedDecoderImpl<DType>;
+  using T = typename DType::c_type;
+
+  explicit AlpDecoder(const ColumnDescriptor* descr)
+      : Base(descr, Encoding::ALP), current_offset_{0}, needs_decode_{false} {
+    static_assert(std::is_same<T, float>::value || std::is_same<T, double>::value,
+                  "ALP only supports float and double types");
+  }
+
+  void SetData(int num_values, const uint8_t* data, int len) final {
+    Base::SetData(num_values, data, len);
+    current_offset_ = 0;
+    needs_decode_ = (len > 0 && num_values > 0);
+    decoded_buffer_.clear();
+  }
+
+  int Decode(T* buffer, int max_values) override {
+    // Fast path: decode directly into output buffer if requesting all values
+    if (needs_decode_ && max_values >= this->num_values_) {
+      ::arrow::util::alp::AlpWrapper<T>::Decode(
+          buffer, static_cast<uint64_t>(this->num_values_),
+          reinterpret_cast<const char*>(this->data_), this->len_);
+
+      const int decoded = this->num_values_;
+      this->num_values_ = 0;
+      needs_decode_ = false;
+      return decoded;
+    }
+
+    // Slow path: partial read - decode to intermediate buffer
+    // ALP Bit unpacker needs batches of 64
+    if (needs_decode_) {

Review Comment:
   TODO(prateek): Check with Antoine and the other reviewers whether this 
constraint can be relaxed. That said, it has a negligible impact on performance.
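
   For context, a minimal sketch of what a batch-of-64 constraint usually implies for a partial read: the scratch buffer is sized to the value count rounded up to a multiple of 64, the decoder fills it once, and later calls copy out only what was requested. Everything here is an assumption for illustration. `kUnpackBatch`, `RoundUpToBatch`, `DecodeWholeBatches`, and `BufferedReader` are hypothetical names, and the stand-in decoder just writes the index of each slot; this is not the PR's implementation.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Assumed batch granularity of the bit unpacker (from the code comment).
constexpr int kUnpackBatch = 64;

// Round n up to the next multiple of kUnpackBatch.
constexpr int RoundUpToBatch(int n) {
  return (n + kUnpackBatch - 1) / kUnpackBatch * kUnpackBatch;
}

// Hypothetical stand-in for a decoder that always writes whole batches.
// It stores out[i] = i so the copy logic below can be checked.
template <typename T>
void DecodeWholeBatches(T* out, int padded_count) {
  for (int i = 0; i < padded_count; ++i) out[i] = static_cast<T>(i);
}

// Serves partial reads from a padded scratch buffer that is decoded once.
template <typename T>
class BufferedReader {
 public:
  explicit BufferedReader(int num_values) : num_values_(num_values) {}

  // Copies up to max_values values into dst and returns how many were copied.
  int Read(T* dst, int max_values) {
    if (scratch_.empty() && num_values_ > 0) {
      // Pad so the decoder may write a full final batch without overrunning.
      scratch_.resize(RoundUpToBatch(num_values_));
      DecodeWholeBatches(scratch_.data(), static_cast<int>(scratch_.size()));
    }
    const int n = std::max(0, std::min(max_values, num_values_ - offset_));
    std::copy_n(scratch_.begin() + offset_, n, dst);
    offset_ += n;
    return n;
  }

 private:
  int num_values_;
  int offset_ = 0;
  std::vector<T> scratch_;
};
```

   The padding costs at most 63 extra slots per page, which is consistent with the remark that the constraint has little performance impact.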



##########
cpp/src/arrow/util/alp/ALP_Encoding_Specification.md:
##########


Review Comment:
   See cpp/src/arrow/util/alp/ALP_Encoding_Specification_terse.md for a 
terser version of the encoding spec.
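
   As a compact illustration of the pipeline the overview describes (decimal scaling, frame of reference, exceptions), here is a simplified sketch. It is not the PR's code and not the full ALP algorithm: it assumes a single fixed decimal exponent `e` instead of ALP's sampled exponent/factor pair, keeps the deltas unpacked instead of bit-packing them, and every name in it (`Pow10`, `TryEncodeOne`, `EncodedVector`, and so on) is hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

// 10^e as a double; exact for 0 <= e <= 22.
inline double Pow10(int e) {
  double p = 1.0;
  for (int i = 0; i < e; ++i) p *= 10.0;
  return p;
}

inline double DecodeOne(int64_t digits, int e) {
  return static_cast<double>(digits) / Pow10(e);
}

// Returns round(v * 10^e) only if decoding reproduces v bit for bit.
inline std::optional<int64_t> TryEncodeOne(double v, int e) {
  const double scaled = v * Pow10(e);
  if (!std::isfinite(scaled) || std::fabs(scaled) > 9.0e18) return std::nullopt;
  const int64_t digits = std::llround(scaled);
  const double back = DecodeOne(digits, e);
  if (std::memcmp(&back, &v, sizeof(double)) != 0) return std::nullopt;
  return digits;
}

struct EncodedVector {
  int64_t frame_of_reference = 0;  // minimum encoded integer
  int bit_width = 0;               // bits needed for the largest delta
  std::vector<uint64_t> deltas;    // left unpacked in this sketch
  std::vector<size_t> exception_positions;
  std::vector<double> exception_values;
};

inline EncodedVector EncodeVector(const std::vector<double>& values, int e) {
  EncodedVector out;
  std::vector<std::optional<int64_t>> digits;
  bool any = false;
  int64_t lo = 0, hi = 0;
  for (double v : values) {
    digits.push_back(TryEncodeOne(v, e));
    if (!digits.back()) continue;
    lo = any ? std::min(lo, *digits.back()) : *digits.back();
    hi = any ? std::max(hi, *digits.back()) : *digits.back();
    any = true;
  }
  out.frame_of_reference = lo;
  const uint64_t range = static_cast<uint64_t>(hi) - static_cast<uint64_t>(lo);
  for (uint64_t r = range; r != 0; r >>= 1) ++out.bit_width;
  for (size_t i = 0; i < values.size(); ++i) {
    if (digits[i]) {
      out.deltas.push_back(static_cast<uint64_t>(*digits[i]) -
                           static_cast<uint64_t>(lo));
    } else {
      out.deltas.push_back(0);  // placeholder, patched on decode
      out.exception_positions.push_back(i);
      out.exception_values.push_back(values[i]);
    }
  }
  return out;
}

inline std::vector<double> DecodeVector(const EncodedVector& in, int e) {
  std::vector<double> out;
  for (uint64_t d : in.deltas) {
    const uint64_t raw = d + static_cast<uint64_t>(in.frame_of_reference);
    out.push_back(DecodeOne(static_cast<int64_t>(raw), e));
  }
  for (size_t k = 0; k < in.exception_positions.size(); ++k) {
    out[in.exception_positions[k]] = in.exception_values[k];
  }
  return out;
}
```

   The bitwise comparison in `TryEncodeOne` is what keeps the scheme lossless: any value that does not survive the round trip, including NaN and negative zero, becomes an exception and is stored verbatim.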



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
