Fengzdadi opened a new issue, #721:
URL: https://github.com/apache/datasketches-java/issues/721

   ## Description
   In Java VarOptItemsSketch heapify, full‑preamble deserialization currently 
only checks that `rCount > 0`, but does not validate the `totalRWeight` field. 
C++ (`var_opt_sketch_impl.hpp`) rejects NaN or non‑positive `total_wt_r`, and 
Go was updated to follow C++ behavior. This creates a cross‑language 
inconsistency and allows corrupted or malicious bytes to silently produce 
NaN/negative tau and invalid estimates.
   
   ## Current Java code
   **File:** 
`src/main/java/org/apache/datasketches/sampling/VarOptItemsSketch.java`  
   **Method:** heapify (around where `numPreLongs == 
Family.VAROPT.getMaxPreLongs()`)
   
   Current logic:
   ```java
   double totalRWeight = 0.0;
   if (numPreLongs == Family.VAROPT.getMaxPreLongs()) {
     if (rCount > 0) {
       totalRWeight = extractTotalRWeight(srcSeg);
     } else {
       throw new SketchesArgumentException(
           "Possible Corruption: " + Family.VAROPT.getMaxPreLongs()
           + " preLongs but no items in R region");
     }
   }
   ```
   
   ## Behavior in other languages
   **C++** (`sampling/include/var_opt_sketch_impl.hpp`):
   - Full preamble → if `r == 0` or `total_wt_r` is NaN or <= 0, throw  
     ```
     "Possible corruption: deserializing in full mode but r = 0 or invalid R 
weight.
      Found r = ..., R region weight = ..."
     ```
   
   **Go** (`sampling/varopt_items_sketch.go`):
   - Full preamble → reject NaN or <= 0 totalRWeight (aligned with C++)
   
   ## Proposed change
   Add validation after `extractTotalRWeight` in Java:
   ```java
   double totalRWeight = 0.0;
   if (numPreLongs == Family.VAROPT.getMaxPreLongs()) {
     if (rCount > 0) {
       totalRWeight = extractTotalRWeight(srcSeg);
       if (Double.isNaN(totalRWeight) || totalRWeight <= 0.0) {
         throw new SketchesArgumentException(
             "Possible Corruption: deserializing in full mode but r = 0 or 
invalid R weight. "
             + "Found r = " + rCount + ", R region weight = " + totalRWeight);
       }
     } else {
       throw new SketchesArgumentException(
           "Possible Corruption: " + Family.VAROPT.getMaxPreLongs()
           + " preLongs but no items in R region");
     }
   }
   ```
   
   ## Why this matters
   - Prevents NaN/negative tau from silently propagating  
   - Aligns Java behavior with C++  
   - Only rejects invalid/corrupted bytes
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to