mbutrovich opened a new issue, #4279:
URL: https://github.com/apache/datafusion-comet/issues/4279

   ## What is the problem the feature request solves?
   
   `CometPlainVector.getBoolean(int)` already caches the last-read data-bitmap 
byte to avoid re-reading it eight times per byte of bit-packed booleans:
   
   ```java
   private byte booleanByteCache;
   private int booleanByteCacheIndex = -1;
   
   public boolean getBoolean(int rowId) {
     int byteIndex = rowId >> 3;
     if (byteIndex != booleanByteCacheIndex) {
       booleanByteCache = getByte(byteIndex);
       booleanByteCacheIndex = byteIndex;
     }
     return ((booleanByteCache >> (rowId & 7)) & 1) == 1;
   }
   ```
   
   The validity bitmap (checked by `isNullAt`) is also bit-packed one bit per 
row, but it has no equivalent cache. Sequential row reads on a nullable column 
hit the same validity-buffer byte eight times in a row. Every caller that reads 
a column row-by-row (including the native scan output path, shuffle read path, 
and the JVM UDF dispatch kernel in #4267) pays the duplicate reads.
   
   ## Describe the potential solution
   
   Add a sibling byte cache for the validity bitmap, using the same pattern 
already present for `booleanByteCache`:
   
   ```java
   private byte validityByteCache;
   private int validityByteCacheIndex = -1;
   
   @Override
   public boolean isNullAt(int rowId) {
     if (this.valueBufferAddress == -1) return true;
     int byteIndex = rowId >> 3;
     if (byteIndex != validityByteCacheIndex) {
       validityByteCache = /* read validity-buffer byte at byteIndex */;
       validityByteCacheIndex = byteIndex;
     }
     return ((validityByteCache >> (rowId & 7)) & 1) == 0;
   }
   
   @Override
   public void setNumNulls(int numNulls) {
     super.setNumNulls(numNulls);
     this.booleanByteCacheIndex = -1;
     this.validityByteCacheIndex = -1;  // invalidate on state transitions
   }
   ```
   
   Collapses the 8x duplicate validity-byte reads to one read per 8 rows on 
sequential access.
   
   ## Additional context
   
   - Mirrors the established pattern in the same class (`booleanByteCache`).
   - The existing `setNumNulls` override resets `booleanByteCacheIndex`; the 
validity cache would extend that.
   - No API change. No behavior change. Strictly reduces per-row work for any 
nullable column read sequentially.
   - Random-access patterns see no win but no regression either (cache miss on 
every call, same cost as today after the miss).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to