gongxun0928 commented on code in PR #1344:
URL: https://github.com/apache/cloudberry/pull/1344#discussion_r2318188376
##########
contrib/pax_storage/src/cpp/storage/columns/pax_compress.cc:
##########
@@ -230,9 +234,14 @@ size_t PaxLZ4Compressor::GetCompressBound(size_t src_len) {
}
size_t PaxLZ4Compressor::Compress(void *dst_buff, size_t dst_cap,
- void *src_buff, size_t src_len, int /*lvl*/) {
- return LZ4_compress_default((char *)src_buff, (char *)dst_buff, src_len,
- dst_cap);
+ void *src_buff, size_t src_len, int lvl) {
+#define LZ4_MAX_ACC 65536
+ // acceleration is oppsite meaning of compress level
+ // map [19, 0] to [0, LZ4_MAX_ACC]
+ int acceleration = 19 - lvl;
+ acceleration = (int)(acceleration * LZ4_MAX_ACC / 20.0);
Review Comment:
FYI
I have little experience with acceleration settings. After searching Google
and getting feedback from claude-opus-4.1, I've identified several issues:
1. Incorrect parameter semantics: the acceleration parameter of
LZ4_compress_fast() is not the inverse of the compression level. It's an
independent parameter that controls the speed/compression-ratio tradeoff.
2. Problematic mapping logic: the code maps level [0, 19] to acceleration
roughly [65536, 0], which is wrong on several counts (see the sketch below):
* acceleration = 1 is the default and gives the best compression ratio
* acceleration > 1 speeds up compression but reduces the compression ratio
* acceleration values around 65536 are extreme and result in almost no compression
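For reference, a conservative fix would just clamp the knob into LZ4's useful range and pass it to LZ4_compress_fast(); a minimal sketch, where the helper name CompressWithLevel and the cap of 64 are purely illustrative (neither comes from the PR nor from liblz4):
```cpp
#include <algorithm>
#include <lz4.h>

// Illustrative helper (not the PR's function): map a 0-19 level knob onto
// LZ4_compress_fast()'s acceleration parameter. acceleration = 1 is the
// default / best-ratio setting; larger values trade ratio for speed.
// The cap of 64 is an assumption, not an LZ4 constant.
static size_t CompressWithLevel(void *dst, size_t dst_cap,
                                const void *src, size_t src_len, int lvl) {
  int clamped = std::min(std::max(lvl, 0), 19);
  // level 19 -> acceleration 1 (best ratio), level 0 -> acceleration 64
  int acceleration = 1 + ((19 - clamped) * 63) / 19;
  int n = LZ4_compress_fast(static_cast<const char *>(src),
                            static_cast<char *>(dst),
                            static_cast<int>(src_len),
                            static_cast<int>(dst_cap), acceleration);
  return n > 0 ? static_cast<size_t>(n) : 0;  // 0 means compression failed
}
```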
claude-opus-4.1 suggested two solutions:
Solution 1: Hybrid Fast and High Compression Modes
```cpp
size_t PaxLZ4Compressor::Compress(void *dst_buff, size_t dst_cap,
                                  void *src_buff, size_t src_len, int lvl) {
  // lvl: [0-19]
  //   0-9:  use fast mode with different acceleration values
  //   10-19: use high compression mode with different compression levels
  if (lvl < 10) {
    // Fast mode: map lvl [0-9] to acceleration [20, 2]
    // Lower lvl = higher acceleration = faster, but lower compression ratio
    int acceleration = 20 - (lvl * 2);
    return LZ4_compress_fast((char *)src_buff, (char *)dst_buff,
                             src_len, dst_cap, acceleration);
  } else {
    // High compression mode: map lvl [10-19] to compressionLevel [1-16]
    // Higher lvl = higher compressionLevel = better compression ratio
    int compressionLevel = (lvl - 9) * 16 / 10;
    compressionLevel = std::min(compressionLevel, 16);
    return LZ4_compress_HC((char *)src_buff, (char *)dst_buff,
                           src_len, dst_cap, compressionLevel);
  }
}
```
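One note on Solution 1: LZ4_compress_HC() is declared in <lz4hc.h> rather than <lz4.h>, and in current liblz4 releases LZ4HC_CLEVEL_MAX is 12, so (as far as I can tell) the upper end of the [1-16] mapping just gets clamped by the library. There is also a large speed step at lvl 10 where the code switches from fast mode to HC mode; that may be acceptable, but it is worth benchmarking.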
Solution 2: Adaptive Strategy Based on Block Size
```cpp
size_t PaxLZ4Compressor::Compress(void *dst_buff, size_t dst_cap,
                                  void *src_buff, size_t src_len, int lvl) {
  // For columnar storage, use fast compression for small blocks,
  // and allow higher compression overhead for larger blocks
  const size_t SMALL_BLOCK = 4 * 1024;    // 4KB threshold
  const size_t MEDIUM_BLOCK = 64 * 1024;  // 64KB threshold
  int acceleration;
  if (src_len < SMALL_BLOCK) {
    // Small blocks: prioritize speed over compression ratio
    // Map lvl [0-19] to acceleration [20, 1]
    acceleration = std::max(1, 20 - lvl);
  } else if (src_len < MEDIUM_BLOCK) {
    // Medium blocks: balance between speed and compression ratio
    // Map lvl [0-19] to acceleration [10, 1] with slower progression
    acceleration = std::max(1, 10 - lvl / 2);
  } else {
    // Large blocks: can afford higher compression overhead
    if (lvl < 15) {
      // For moderate compression levels, still use fast mode
      // but with lower acceleration values for better compression
      acceleration = std::max(1, 5 - lvl / 4);
      return LZ4_compress_fast((char *)src_buff, (char *)dst_buff,
                               src_len, dst_cap, acceleration);
    } else {
      // For high compression levels (15-19), switch to HC mode
      // Map lvl [15-19] to compressionLevel [4-16]
      int compressionLevel = (lvl - 14) * 4;
      compressionLevel = std::min(compressionLevel, 16);
      return LZ4_compress_HC((char *)src_buff, (char *)dst_buff,
                             src_len, dst_cap, compressionLevel);
    }
  }
  return LZ4_compress_fast((char *)src_buff, (char *)dst_buff,
                           src_len, dst_cap, acceleration);
}
```
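Whichever mapping gets picked, it is easy to add a round-trip sanity test, since HC and fast output use the same LZ4 block format and both decode with LZ4_decompress_safe(). A rough sketch (the helper name and scaffolding are made up; only the LZ4 calls are real):
```cpp
#include <cassert>
#include <string>
#include <vector>
#include <lz4.h>

// Round-trip check: compress with the fast path (swap in the HC call to cover
// the other branch), then decode with LZ4_decompress_safe() and compare.
static void RoundTripCheck(const std::string &input, int acceleration) {
  std::vector<char> compressed(LZ4_compressBound(static_cast<int>(input.size())));
  int csize = LZ4_compress_fast(input.data(), compressed.data(),
                                static_cast<int>(input.size()),
                                static_cast<int>(compressed.size()),
                                acceleration);
  assert(csize > 0);

  std::vector<char> restored(input.size() + 1);  // +1 avoids a zero-size buffer
  int dsize = LZ4_decompress_safe(compressed.data(), restored.data(), csize,
                                  static_cast<int>(restored.size()));
  assert(dsize == static_cast<int>(input.size()));
  assert(input.compare(0, input.size(), restored.data(), dsize) == 0);
}
```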
Also for Solution 2, opus-4.1 provided a variant that replaces the runtime
calculations with a precomputed lookup table, which also reduces branching:
```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <lz4.h>

class PaxLZ4Compressor {
 private:
  static constexpr int MAX_LEVEL = 20;
  // Size thresholds for different compression strategies
  static constexpr size_t SIZE_THRESHOLDS[] = {4 * 1024, 64 * 1024, SIZE_MAX};
  // Precomputed acceleration value table [size_category][level]
  // Optimization: use a lookup table to avoid runtime calculations
  static constexpr int ACCELERATION_TABLE[3][MAX_LEVEL] = {
      // Small chunks (< 4KB) - prioritize speed
      {20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1},
      // Medium chunks (4KB - 64KB) - balance speed and ratio
      {10, 10, 9, 9, 8, 8, 7, 7, 6, 6, 5, 5, 4, 4, 3, 3, 2, 2, 1, 1},
      // Large chunks (> 64KB) - prioritize compression ratio
      {5, 5, 4, 4, 4, 3, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1}};

 public:
  size_t Compress(void *dst_buff, size_t dst_cap,
                  void *src_buff, size_t src_len, int lvl) {
    // Summing the two comparisons picks the size category without branching:
    // 0 = small, 1 = medium, 2 = large
    int size_category = (src_len >= SIZE_THRESHOLDS[0]) +
                        (src_len >= SIZE_THRESHOLDS[1]);
    // Clamp level to the valid range and get acceleration from the
    // precomputed table
    int acceleration =
        ACCELERATION_TABLE[size_category][std::min(lvl, MAX_LEVEL - 1)];
    return LZ4_compress_fast((char *)src_buff, (char *)dst_buff,
                             src_len, dst_cap, acceleration);
  }
};

// Implementation Notes and Considerations:
// 1. This implementation is more reasonable as it provides a balanced approach
//    between compression speed and ratio based on data chunk size.
// 2. The acceleration values are constrained to a reasonable range (1-20),
//    avoiding the extreme values (0-65536) of the previous implementation.
// 3. The design follows the principle: "small chunks prioritize speed,
//    large chunks prioritize compression ratio."
// 4. The precomputed lookup table eliminates runtime calculations,
//    improving performance.
// 5. The level mapping is intuitive: level 0 = fastest, level 19 = best
//    compression.
//
// Important Considerations:
// 1. Validate the acceleration values in the table with real-world data
// 2. Consider adding more size categories for finer granularity
// 3. For known high-redundancy data types (e.g., booleans, enums),
//    consider using lower acceleration values regardless of size
// 4. For known low-redundancy data (e.g., floats, hashes),
//    consider using higher acceleration values
// 5. Add boundary checks to ensure the level parameter is within the valid range
// 6. Consider an adaptive adjustment mechanism based on historical
//    compression ratios for each data type (sketched after this block)
```
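On consideration 6, a very rough sketch of what a per-column feedback loop might look like; everything here (class name, thresholds, EMA weights) is hypothetical and not something the PR implements:
```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical per-column tuner: track a running compression ratio and raise
// the acceleration when data barely compresses (stop paying for work that
// does not help), lower it again when the data compresses well.
class AccelerationTuner {
 public:
  // Record one compression result and return the acceleration to use next time.
  int Update(size_t src_len, size_t compressed_len) {
    if (src_len == 0) return acceleration_;
    double ratio =
        static_cast<double>(compressed_len) / static_cast<double>(src_len);
    // Exponential moving average keeps the tuner cheap and smooth.
    avg_ratio_ = 0.9 * avg_ratio_ + 0.1 * ratio;
    if (avg_ratio_ > 0.95) {
      acceleration_ = std::min(acceleration_ * 2, 64);  // nearly incompressible
    } else if (avg_ratio_ < 0.7) {
      acceleration_ = std::max(acceleration_ / 2, 1);   // compresses well
    }
    return acceleration_;
  }

 private:
  double avg_ratio_ = 0.5;
  int acceleration_ = 1;
};
```
Whether the extra bookkeeping is worth it probably depends on how stable the data distribution is per column.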
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.