emkornfield commented on code in PR #557:
URL: https://github.com/apache/parquet-format/pull/557#discussion_r2984885794


##########
Encodings.md:
##########
@@ -391,3 +363,522 @@ After applying the transformation, the data has the 
following representation:
 ```
 Bytes  AA 00 A3 BB 11 B4 CC 22 C5 DD 33 D6
 ```
+
+<a name="ALP"></a>
+### Adaptive Lossless floating-Point: (ALP = 10)
+
+Supported Types: FLOAT, DOUBLE
+
+This encoding is adapted from the paper
+["ALP: Adaptive Lossless floating-Point 
Compression"](https://dl.acm.org/doi/10.1145/3626717)
+by Afroozeh and Boncz (SIGMOD 2024).
+
+ALP works by converting floating-point values to integers using decimal 
scaling,
+then applying Frame of Reference (FOR) encoding and bit-packing. Values that
+cannot be losslessly converted are stored as exceptions. The encoding achieves
+high compression for decimal-like floating-point data (e.g., monetary values,
+sensor readings) while remaining fully lossless.
+
+#### Overview
+
+ALP encoding consists of a page-level header followed by an offset array and 
one
+or more encoded vectors (batches of values). Each vector contains up to
+`vector_size` elements (default 1024).
+
+```
++-------------+-----------------------------+--------------------------------------+
+|   Header    |        Offset Array         |            Vector Data           
    |
+|  (7 bytes)  |   (num_vectors * 4 bytes)   |            (variable)            
    |
++-------------+------+------+-----+---------+----------+----------+-----+----------+
+| Page Header | off0 | off1 | ... | off N-1 | Vector 0 | Vector 1 | ... | Vec 
N-1  |
+|  (7 bytes)  | (4B) | (4B) |     |  (4B)   |(variable)|(variable)|     
|(variable)|
++-------------+------+------+-----+---------+----------+----------+-----+----------+
+```
+
+The compression pipeline for each vector is:
+
+```
+                    Input: float/double array
+                              |
+                              v
+    +----------------------------------------------------------+
+    |  1. SAMPLING & PRESET GENERATION                         |
+    |     Sample vectors from column chunk                     |
+    |     Try all (exponent, factor) combinations              |
+    |     Select best k combinations for preset                |
+    +----------------------------------------------------------+
+                              |
+                              v
+    +----------------------------------------------------------+
+    |  2. DECIMAL ENCODING                                     |
+    |     encoded[i] = round(value[i] * 10^e * 10^(-f))       |
+    |     Detect exceptions where decode(encode(v)) != v       |
+    +----------------------------------------------------------+
+                              |
+                              v
+    +----------------------------------------------------------+
+    |  3. FRAME OF REFERENCE (FOR)                             |
+    |     min_val = min(encoded[])                             |
+    |     delta[i] = encoded[i] - min_val                      |
+    +----------------------------------------------------------+
+                              |
+                              v
+    +----------------------------------------------------------+
+    |  4. BIT PACKING                                          |
+    |     bit_width = ceil(log2(max_delta + 1))                |
+    |     Pack each delta into bit_width bits                  |
+    +----------------------------------------------------------+
+                              |
+                              v
+                   Output: Serialized vector bytes
+```
+
+#### Page Layout
+
+##### Header (7 bytes)
+
+All multi-byte values are little-endian.
+
+```
+ Byte:    0              1               2              3    4    5    6
+       +----------------+---------------+--------------+----+----+----+----+
+       | compression    | integer       | log_vector   |     num_elements  |
+       | _mode          | _encoding     | _size        |     (int32 LE)    |
+       +----------------+---------------+--------------+----+----+----+----+
+```
+
+| Offset | Field | Size | Type | Description |
+|--------|-------|------|------|-------------|
+| 0 | compression_mode | 1 byte | uint8 | Compression mode (must be 0 = ALP) |
+| 1 | integer_encoding | 1 byte | uint8 | Integer encoding (must be 0 = FOR + 
bit-packing) |
+| 2 | log_vector_size | 1 byte | uint8 | log2(vector\_size). Must be in \[3, 
15\]. Default: 10 (vector size 1024) |

Review Comment:
   ```suggestion
   | 2 | log_vector_size | 1 byte | uint8 | log2(vector\_size). Must be in \[3, 
15\]. Recommended default: 10 (vector size 1024) |
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to