alamb commented on code in PR #557:
URL: https://github.com/apache/parquet-format/pull/557#discussion_r3154448378


##########
Encodings.md:
##########
@@ -391,3 +363,522 @@ After applying the transformation, the data has the 
following representation:
 ```
 Bytes  AA 00 A3 BB 11 B4 CC 22 C5 DD 33 D6
 ```
+
+<a name="ALP"></a>
+### Adaptive Lossless floating-Point: (ALP = 10)
+
+Supported Types: FLOAT, DOUBLE
+
+This encoding is adapted from the paper
+["ALP: Adaptive Lossless floating-Point 
Compression"](https://dl.acm.org/doi/10.1145/3626717)
+by Afroozeh and Boncz (SIGMOD 2024).
+
+ALP works by converting floating-point values to integers using decimal 
scaling,
+then applying Frame of Reference (FOR) encoding and bit-packing. Values that
+cannot be losslessly converted are stored as exceptions. The encoding achieves
+high compression for decimal-like floating-point data (e.g., monetary values,
+sensor readings) while remaining fully lossless.

Review Comment:
   You have more summary at the end of the encoding, but I think a few more 
sentences in the intro would help people understand this more easily
   
   ```suggestion
   ALP works by converting floating-point values to integers using decimal 
scaling,
   then applying Frame of Reference (FOR) encoding and bit-packing. Values that
   cannot be losslessly converted are stored as exceptions. The encoding 
achieves
   high compression for decimal-like floating-point data (e.g., monetary values,
   sensor readings) while remaining fully lossless. Values do not depend on
   each other, which enables quick random access and parallel encode/decode. 
   ```



##########
Encodings.md:
##########
@@ -391,3 +363,522 @@ After applying the transformation, the data has the 
following representation:
 ```
 Bytes  AA 00 A3 BB 11 B4 CC 22 C5 DD 33 D6
 ```
+
+<a name="ALP"></a>
+### Adaptive Lossless floating-Point: (ALP = 10)
+
+Supported Types: FLOAT, DOUBLE
+
+This encoding is adapted from the paper
+["ALP: Adaptive Lossless floating-Point 
Compression"](https://dl.acm.org/doi/10.1145/3626717)
+by Afroozeh and Boncz (SIGMOD 2024).
+
+ALP works by converting floating-point values to integers using decimal 
scaling,
+then applying Frame of Reference (FOR) encoding and bit-packing. Values that
+cannot be losslessly converted are stored as exceptions. The encoding achieves
+high compression for decimal-like floating-point data (e.g., monetary values,
+sensor readings) while remaining fully lossless.
+
+#### Overview
+
+ALP encoding consists of a page-level header followed by an offset array and 
one
+or more encoded vectors (batches of values). Each vector contains up to
+`vector_size` elements (default 1024).

Review Comment:
   I think it would help here to also define exponent (e) and factor (f) as 
they are referred to in the compression pipeline section as well as briefly 
mention the existence of exceptions here before they are described later in the 
spec
   
   ```suggestion
   ALP encoding consists of a page-level header followed by an offset array and 
one
   or more encoded vectors (batches of values). Each vector contains up to
   `vector_size` elements (default 1024), transformed first using an `exponent` 
(`e`) 
   and `factor` (`f`) and then encoded using frame-of-reference bit-packing.  
Values
   which can not be encoded losslessly, are stored as exceptions at the end of 
each vector. 
   ```



##########
Encodings.md:
##########
@@ -391,3 +363,522 @@ After applying the transformation, the data has the 
following representation:
 ```
 Bytes  AA 00 A3 BB 11 B4 CC 22 C5 DD 33 D6
 ```
+
+<a name="ALP"></a>
+### Adaptive Lossless floating-Point: (ALP = 10)
+
+Supported Types: FLOAT, DOUBLE
+
+This encoding is adapted from the paper
+["ALP: Adaptive Lossless floating-Point 
Compression"](https://dl.acm.org/doi/10.1145/3626717)
+by Afroozeh and Boncz (SIGMOD 2024).
+
+ALP works by converting floating-point values to integers using decimal 
scaling,
+then applying Frame of Reference (FOR) encoding and bit-packing. Values that
+cannot be losslessly converted are stored as exceptions. The encoding achieves
+high compression for decimal-like floating-point data (e.g., monetary values,
+sensor readings) while remaining fully lossless.
+
+#### Overview
+
+ALP encoding consists of a page-level header followed by an offset array and 
one
+or more encoded vectors (batches of values). Each vector contains up to
+`vector_size` elements (default 1024).
+
+```
++-------------+-----------------------------+--------------------------------------+
+|   Header    |        Offset Array         |            Vector Data           
    |
+|  (7 bytes)  |   (num_vectors * 4 bytes)   |            (variable)            
    |
++-------------+------+------+-----+---------+----------+----------+-----+----------+
+| Page Header | off0 | off1 | ... | off N-1 | Vector 0 | Vector 1 | ... | Vec 
N-1  |
+|  (7 bytes)  | (4B) | (4B) |     |  (4B)   |(variable)|(variable)|     
|(variable)|
++-------------+------+------+-----+---------+----------+----------+-----+----------+
+```
+
+The compression pipeline for each vector is:
+
+```
+                    Input: float/double array
+                              |
+                              v
+    +----------------------------------------------------------+
+    |  1. SAMPLING & PRESET GENERATION                         |
+    |     Sample vectors from column chunk                     |
+    |     Try all (exponent, factor) combinations              |
+    |     Select best k combinations for preset                |
+    +----------------------------------------------------------+

Review Comment:
   As I think was mentioned on the google doc spec, I think it would be better 
not to imply here that all implementations have to pick the exponent / factor 
by exhaustive search on a sample. It would be fine to suggest that as an 
implementation, but I think it is a particular algorithm that overly constrains 
the spec and could lead to future arguments if the spec requires this algorithm 
or not
   
   Also technically I think the sample / presets are done one per input column 
in the paper, not once per vector (as stated above)
   
   Also I think it would help to explicitly say exponent = e and factor = f 
here too.
   
   Something like this perhaps
   
   
   ```suggestion
       +----------------------------------------------------------+
       |  1. Choose exponent (e) and factor (f).                  |
       +----------------------------------------------------------+
   ```



##########
Encodings.md:
##########
@@ -391,3 +363,522 @@ After applying the transformation, the data has the 
following representation:
 ```
 Bytes  AA 00 A3 BB 11 B4 CC 22 C5 DD 33 D6
 ```
+
+<a name="ALP"></a>
+### Adaptive Lossless floating-Point: (ALP = 10)
+
+Supported Types: FLOAT, DOUBLE
+
+This encoding is adapted from the paper
+["ALP: Adaptive Lossless floating-Point 
Compression"](https://dl.acm.org/doi/10.1145/3626717)
+by Afroozeh and Boncz (SIGMOD 2024).
+
+ALP works by converting floating-point values to integers using decimal 
scaling,
+then applying Frame of Reference (FOR) encoding and bit-packing. Values that
+cannot be losslessly converted are stored as exceptions. The encoding achieves
+high compression for decimal-like floating-point data (e.g., monetary values,
+sensor readings) while remaining fully lossless.
+
+#### Overview
+
+ALP encoding consists of a page-level header followed by an offset array and 
one
+or more encoded vectors (batches of values). Each vector contains up to
+`vector_size` elements (default 1024).
+
+```
++-------------+-----------------------------+--------------------------------------+
+|   Header    |        Offset Array         |            Vector Data           
    |
+|  (7 bytes)  |   (num_vectors * 4 bytes)   |            (variable)            
    |
++-------------+------+------+-----+---------+----------+----------+-----+----------+
+| Page Header | off0 | off1 | ... | off N-1 | Vector 0 | Vector 1 | ... | Vec 
N-1  |
+|  (7 bytes)  | (4B) | (4B) |     |  (4B)   |(variable)|(variable)|     
|(variable)|
++-------------+------+------+-----+---------+----------+----------+-----+----------+
+```
+
+The compression pipeline for each vector is:
+
+```
+                    Input: float/double array
+                              |
+                              v
+    +----------------------------------------------------------+
+    |  1. SAMPLING & PRESET GENERATION                         |
+    |     Sample vectors from column chunk                     |
+    |     Try all (exponent, factor) combinations              |
+    |     Select best k combinations for preset                |
+    +----------------------------------------------------------+
+                              |
+                              v
+    +----------------------------------------------------------+
+    |  2. DECIMAL ENCODING                                     |
+    |     encoded[i] = round(value[i] * 10^e * 10^(-f))       |
+    |     Detect exceptions where decode(encode(v)) != v       |
+    +----------------------------------------------------------+
+                              |
+                              v
+    +----------------------------------------------------------+
+    |  3. FRAME OF REFERENCE (FOR)                             |
+    |     min_val = min(encoded[])                             |
+    |     delta[i] = encoded[i] - min_val                      |
+    +----------------------------------------------------------+
+                              |
+                              v
+    +----------------------------------------------------------+
+    |  4. BIT PACKING                                          |
+    |     bit_width = ceil(log2(max_delta + 1))                |
+    |     Pack each delta into bit_width bits                  |
+    +----------------------------------------------------------+
+                              |
+                              v
+                   Output: Serialized vector bytes
+```
+
+#### Page Layout
+
+##### Header (7 bytes)
+
+All multi-byte values are little-endian.
+
+```
+ Byte:    0              1               2              3    4    5    6
+       +----------------+---------------+--------------+----+----+----+----+
+       | compression    | integer       | log_vector   |     num_elements  |
+       | _mode          | _encoding     | _size        |     (int32 LE)    |
+       +----------------+---------------+--------------+----+----+----+----+
+```
+
+| Offset | Field | Size | Type | Description |
+|--------|-------|------|------|-------------|
+| 0 | compression_mode | 1 byte | uint8 | Compression mode (must be 0 = ALP) |
+| 1 | integer_encoding | 1 byte | uint8 | Integer encoding (must be 0 = FOR + 
bit-packing) |
+| 2 | log_vector_size | 1 byte | uint8 | log2(vector\_size). Must be in \[3, 
15\]. Default: 10 (vector size 1024) |
+| 3 | num_elements | 4 bytes | int32 | Total number of floating-point values 
in the page |
+
+The number of vectors is `ceil(num_elements / vector_size)`. The last vector 
may
+contain fewer than `vector_size` elements.
+
+**Note:** The number of elements per vector and the packed data size are NOT 
stored
+in the header. They are derived:
+* Elements per vector: `vector_size` for all vectors except the last, which 
may be smaller.

Review Comment:
   I agree this is redundant -- I think leaving the note is valuable 
clarification
   
   > **Note:** The number of elements per vector and the packed data size are 
NOT stored
   in the header, they are derived
   
   However the other items are not



##########
Encodings.md:
##########
@@ -391,3 +363,522 @@ After applying the transformation, the data has the 
following representation:
 ```
 Bytes  AA 00 A3 BB 11 B4 CC 22 C5 DD 33 D6
 ```
+
+<a name="ALP"></a>
+### Adaptive Lossless floating-Point: (ALP = 10)
+
+Supported Types: FLOAT, DOUBLE
+
+This encoding is adapted from the paper
+["ALP: Adaptive Lossless floating-Point 
Compression"](https://dl.acm.org/doi/10.1145/3626717)
+by Afroozeh and Boncz (SIGMOD 2024).
+
+ALP works by converting floating-point values to integers using decimal 
scaling,
+then applying Frame of Reference (FOR) encoding and bit-packing. Values that
+cannot be losslessly converted are stored as exceptions. The encoding achieves
+high compression for decimal-like floating-point data (e.g., monetary values,
+sensor readings) while remaining fully lossless.
+
+#### Overview
+
+ALP encoding consists of a page-level header followed by an offset array and 
one
+or more encoded vectors (batches of values). Each vector contains up to
+`vector_size` elements (default 1024).
+
+```
++-------------+-----------------------------+--------------------------------------+
+|   Header    |        Offset Array         |            Vector Data           
    |
+|  (7 bytes)  |   (num_vectors * 4 bytes)   |            (variable)            
    |
++-------------+------+------+-----+---------+----------+----------+-----+----------+
+| Page Header | off0 | off1 | ... | off N-1 | Vector 0 | Vector 1 | ... | Vec 
N-1  |
+|  (7 bytes)  | (4B) | (4B) |     |  (4B)   |(variable)|(variable)|     
|(variable)|
++-------------+------+------+-----+---------+----------+----------+-----+----------+
+```
+
+The compression pipeline for each vector is:
+
+```
+                    Input: float/double array
+                              |
+                              v
+    +----------------------------------------------------------+
+    |  1. SAMPLING & PRESET GENERATION                         |
+    |     Sample vectors from column chunk                     |
+    |     Try all (exponent, factor) combinations              |
+    |     Select best k combinations for preset                |
+    +----------------------------------------------------------+
+                              |
+                              v
+    +----------------------------------------------------------+
+    |  2. DECIMAL ENCODING                                     |
+    |     encoded[i] = round(value[i] * 10^e * 10^(-f))       |
+    |     Detect exceptions where decode(encode(v)) != v       |
+    +----------------------------------------------------------+
+                              |
+                              v
+    +----------------------------------------------------------+
+    |  3. FRAME OF REFERENCE (FOR)                             |
+    |     min_val = min(encoded[])                             |
+    |     delta[i] = encoded[i] - min_val                      |
+    +----------------------------------------------------------+
+                              |
+                              v
+    +----------------------------------------------------------+
+    |  4. BIT PACKING                                          |
+    |     bit_width = ceil(log2(max_delta + 1))                |
+    |     Pack each delta into bit_width bits                  |
+    +----------------------------------------------------------+
+                              |
+                              v
+                   Output: Serialized vector bytes
+```
+
+#### Page Layout
+
+##### Header (7 bytes)
+
+All multi-byte values are little-endian.
+
+```
+ Byte:    0              1               2              3    4    5    6
+       +----------------+---------------+--------------+----+----+----+----+
+       | compression    | integer       | log_vector   |     num_elements  |
+       | _mode          | _encoding     | _size        |     (int32 LE)    |
+       +----------------+---------------+--------------+----+----+----+----+
+```
+
+| Offset | Field | Size | Type | Description |
+|--------|-------|------|------|-------------|
+| 0 | compression_mode | 1 byte | uint8 | Compression mode (must be 0 = ALP) |
+| 1 | integer_encoding | 1 byte | uint8 | Integer encoding (must be 0 = FOR + 
bit-packing) |
+| 2 | log_vector_size | 1 byte | uint8 | log2(vector\_size). Must be in \[3, 
15\]. Default: 10 (vector size 1024) |
+| 3 | num_elements | 4 bytes | int32 | Total number of floating-point values 
in the page |
+
+The number of vectors is `ceil(num_elements / vector_size)`. The last vector 
may
+contain fewer than `vector_size` elements.
+
+**Note:** The number of elements per vector and the packed data size are NOT 
stored
+in the header. They are derived:
+* Elements per vector: `vector_size` for all vectors except the last, which 
may be smaller.
+* Packed data size: `ceil(num_elements_in_vector * bit_width / 8)`.

Review Comment:
   I also hihglighted this as confusing when reading the spec. I recommend 
removing this sentence as the packed data size is covered in the "Vector 
Format" section below



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to