Re: [PR] GH-48701: [C++][Parquet] Add ALPpd encoding [arrow]

via GitHub Wed, 06 May 2026 17:27:31 -0700


prtkgaur commented on code in PR #48345:
URL: https://github.com/apache/arrow/pull/48345#discussion_r3198257123



##########
cpp/src/arrow/util/alp/ALP_Encoding_Specification.md:
##########


Review Comment:
   Also this file will be removed once the spec in parquet format repository is 
merged.



##########
cpp/src/arrow/util/alp/alp_codec.h:
##########
@@ -0,0 +1,148 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// High-level wrapper interface for ALP compression
+
+#pragma once
+
+#include <cstddef>
+#include <optional>
+
+#include "arrow/util/alp/alp.h"
+
+namespace arrow {
+namespace util {
+namespace alp {
+
+// ----------------------------------------------------------------------
+// AlpWrapper
+
+/// \class AlpWrapper
+/// \brief High-level interface for ALP compression
+///
+/// AlpWrapper is an interface for Adaptive Lossless floating-Point Compression
+/// (ALP) (https://dl.acm.org/doi/10.1145/3626717). For encoding, it samples
+/// the data and applies decimal compression (Alp) to floating point values.
+/// This class acts as a wrapper around the vector-based interfaces of
+/// AlpSampler and Alp.
+///
+/// \tparam T the floating point type (float or double)
+template <typename T>
+class AlpWrapper {
+ public:
+  /// \brief Encode floating point values using ALP decimal compression
+  ///
+  /// \param[in] decomp pointer to the input that is to be encoded
+  /// \param[in] decomp_size size of decomp in bytes.
+  ///            This needs to be a multiple of sizeof(T).
+  /// \param[out] comp pointer to the memory region we will encode into.
+  ///             The caller is responsible for ensuring this is big enough.
+  /// \param[in,out] comp_size the actual size of the encoded data in bytes,
+  ///                expects the size of comp as input. If this is too small,
+  ///                this is set to 0 and we bail out.
+  /// \param[in] enforce_mode reserved for future use.
+  ///            Currently only AlpMode::kAlp is supported.
+  static void Encode(const T* decomp, size_t decomp_size, char* comp,
+                     size_t* comp_size,
+                     std::optional<AlpMode> enforce_mode = std::nullopt);
+
+  /// \brief Decode floating point values
+  ///
+  /// \param[out] decomp pointer to the memory region we will decode into.

Review Comment:
   yes done (taking a stab at this behavior at other places too)



##########
cpp/src/arrow/util/alp/ALP_Encoding_Specification.md:
##########


Review Comment:
   ALP is a general-purpose floating-point compression algorithm. Keeping it 
under arrow/util/alp/ makes it reusable for Arrow IPC or other formats in the 
future. The Parquet encoder/decoder in cpp/src/parquet/ already just calls into 
it as a library.
   
   With that said if we really want to move, I'd like to do that after every 
other comment is addressed (risk making all outstanding comments as outdated).



##########
cpp/src/parquet/encoder.cc:
##########
@@ -995,6 +999,90 @@ class ByteStreamSplitEncoder<FLBAType> : public 
ByteStreamSplitEncoderBase<FLBAT
   }
 };
 
+// ----------------------------------------------------------------------
+// ALP encoder (Adaptive Lossless floating-Point)
+
+template <typename DType>
+class AlpEncoder : public EncoderImpl, virtual public TypedEncoder<DType> {
+ public:
+  using T = typename DType::c_type;
+  using ArrowType = typename EncodingTraits<DType>::ArrowType;
+  using TypedEncoder<DType>::Put;
+
+  explicit AlpEncoder(const ColumnDescriptor* descr,
+                      ::arrow::MemoryPool* pool = 
::arrow::default_memory_pool())
+      : EncoderImpl(descr, Encoding::ALP, pool),
+        sink_{pool} {
+    static_assert(std::is_same<T, float>::value || std::is_same<T, 
double>::value,
+                  "ALP only supports float and double types");
+  }
+
+  int64_t EstimatedDataEncodedSize() override { return sink_.length(); }
+
+  std::shared_ptr<Buffer> FlushValues() override {
+    if (sink_.length() == 0) {
+      // Empty buffer case
+      PARQUET_ASSIGN_OR_THROW(auto buf, sink_.Finish());
+      return buf;
+    }
+
+    // Call AlpWrapper::Encode() - it handles sampling, preset selection, and 
compression
+    const size_t decompSize = sink_.length();
+    size_t compSize = 
::arrow::util::alp::AlpWrapper<T>::GetMaxCompressedSize(decompSize);
+
+    PARQUET_ASSIGN_OR_THROW(
+        auto compressed_buffer,
+        ::arrow::AllocateResizableBuffer(compSize, this->memory_pool()));
+
+    ::arrow::util::alp::AlpWrapper<T>::Encode(
+        reinterpret_cast<const T*>(sink_.data()),
+        decompSize,
+        reinterpret_cast<char*>(compressed_buffer->mutable_data()),
+        &compSize);
+
+    PARQUET_THROW_NOT_OK(compressed_buffer->Resize(compSize));
+    sink_.Reset();
+
+    return std::shared_ptr<Buffer>(std::move(compressed_buffer));
+  }
+
+  void Put(const T* buffer, int num_values) override {
+    if (num_values > 0) {
+      PARQUET_THROW_NOT_OK(

Review Comment:
   Ahh yes.
   This is on the encoding side. The current implementation buffers all Put() 
values into the sink and encodes in one shot in FlushValues().
   
   The tradeoff: ALP's sampling phase needs to see the full dataset (or a 
representative sample) to choose the best exponent/factor combination. One 
approach is to finalize the preset after the first N values (or first vector) 
arrive, then incrementally encode subsequent vectors using that preset.         
   
                                                                                
                                                                                
                                                                                
                                                                        
   I'll address this as a follow-up — the current batch approach is 
functionally correct and the incremental path requires rethinking the sampling 
strategy for streaming input.



##########
cpp/src/arrow/util/alp/alp_codec.h:
##########


Review Comment:
   done



##########
cpp/src/arrow/util/alp/alp_codec.h:
##########
@@ -0,0 +1,185 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// High-level wrapper interface for ALP compression
+
+#pragma once
+
+#include <cstddef>
+#include <optional>
+
+#include "arrow/result.h"
+#include "arrow/status.h"
+#include "arrow/util/alp/alp.h"
+#include "arrow/util/alp/alp_sampler.h"
+
+namespace arrow {
+namespace util {
+namespace alp {
+
+// ----------------------------------------------------------------------
+// AlpCodec
+
+/// \class AlpCodec
+/// \brief High-level interface for ALP compression
+///
+/// AlpCodec is an interface for Adaptive Lossless floating-Point Compression
+/// (ALP) (https://dl.acm.org/doi/10.1145/3626717). For encoding, it samples
+/// the data and applies decimal compression (Alp) to floating point values.
+/// This class acts as a wrapper around the vector-based interfaces of
+/// AlpSampler and Alp.
+///
+/// \tparam T the floating point type (float or double)
+template <typename T>
+class AlpCodec {
+ public:
+  /// Type alias for the sampler result containing encoding presets
+  using AlpSamplerResult = typename AlpSampler<T>::AlpSamplerResult;
+
+  /// \brief Create a sampling preset from input data
+  ///
+  /// This samples the input data and generates an encoding preset that can be
+  /// reused for encoding. This is useful when you want to pre-compute the 
preset
+  /// outside of the benchmark loop or encode multiple batches with the same 
preset.
+  ///
+  /// \param[in] decomp pointer to the input data to sample
+  /// \param[in] decomp_size size of decomp in bytes.
+  ///            This needs to be a multiple of sizeof(T).
+  /// \return the sampling result containing the encoding preset
+  static AlpSamplerResult CreateSamplingPreset(const T* decomp, size_t 
decomp_size);
+
+  /// \brief Encode floating point values using a pre-computed preset
+  ///
+  /// This encodes the data using a preset that was previously computed via
+  /// CreateSamplingPreset(). This avoids the sampling overhead during 
encoding.
+  ///
+  /// \param[in] decomp pointer to the input that is to be encoded
+  /// \param[in] decomp_size size of decomp in bytes.
+  ///            This needs to be a multiple of sizeof(T).
+  /// \param[out] comp pointer to the memory region we will encode into.
+  ///             Must be at least GetMaxCompressedSize(decomp_size) bytes.
+  /// \param[in,out] comp_size the actual size of the encoded data in bytes,
+  ///                expects the size of comp as input. If this is too small,
+  ///                this is set to 0 and we bail out.
+  /// \param[in] preset the pre-computed sampling result from 
CreateSamplingPreset()
+  static void EncodeWithPreset(const T* decomp, size_t decomp_size, char* comp,
+                               size_t* comp_size, const AlpSamplerResult& 
preset);
+
+  /// \brief Encode floating point values using ALP decimal compression
+  ///
+  /// \param[in] decomp pointer to the input that is to be encoded
+  /// \param[in] decomp_size size of decomp in bytes.
+  ///            This needs to be a multiple of sizeof(T).
+  /// \param[out] comp pointer to the memory region we will encode into.
+  ///             Must be at least GetMaxCompressedSize(decomp_size) bytes.
+  /// \param[in,out] comp_size the actual size of the encoded data in bytes,
+  ///                expects the size of comp as input. If this is too small,
+  ///                this is set to 0 and we bail out.
+  /// \param[in] enforce_mode reserved for future use.

Review Comment:
   Yes this is not needed. Removing



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] GH-48701: [C++][Parquet] Add ALPpd encoding [arrow]

Reply via email to