[PR] Fix BloomFilter buffer incompatibility between Spark and Comet [datafusion-comet]

via GitHub Sun, 28 Dec 2025 01:57:36 -0800


Shekharrajak opened a new pull request, #3003:
URL: https://github.com/apache/datafusion-comet/pull/3003


   Handle Spark's full serialization format (a 12-byte header followed by the 
bits data) in merge_filter() so that a Spark partial aggregate can feed a Comet 
final aggregate. The fix detects the buffer format automatically and extracts 
the bits data accordingly.
   
   Fixes #2889
   
   
   
   ## Rationale for this change
   
   Spark's serialize() returns the full format: a 12-byte header (version + 
numHashFunctions + numWords) followed by the bits data.
   Comet's state_as_bytes() returns the bits data only.
   When a Spark partial aggregate sends the full format, Comet's merge_filter() 
still expects bits-only input, so the buffer layout does not match.
   
   Ref 
https://github.com/apache/spark/blob/master/common/sketch/src/main/java/org/apache/spark/util/sketch/BitArray.java#L99
   
   Ref 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala#L219
   
   Spark format: BloomFilterImpl.writeTo() writes version and numHashFunctions 
(4 + 4 bytes), then BitArray.writeTo() writes numWords (4 bytes) followed by 
the bits.
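   For illustration, the header layout described above could be read as in the following sketch. It is not the code from this PR: the names `SparkBloomHeader`, `read_i32_be`, and `parse_spark_header` are hypothetical, and the sketch assumes the fields are big-endian 32-bit integers, which is how Java's DataOutputStream writes them.

```rust
/// Size of the Spark header: version + numHashFunctions + numWords, 4 bytes each.
const SPARK_HEADER_LEN: usize = 12;

/// Hypothetical holder for the three header fields (illustrative only).
#[derive(Debug, PartialEq)]
struct SparkBloomHeader {
    version: i32,
    num_hash_functions: i32,
    num_words: i32,
}

/// Reads a big-endian i32 at `offset`, or returns None if the buffer is too short.
fn read_i32_be(buf: &[u8], offset: usize) -> Option<i32> {
    let b = buf.get(offset..offset + 4)?;
    Some(i32::from_be_bytes([b[0], b[1], b[2], b[3]]))
}

/// Parses the 12-byte header that precedes the bits data.
/// Assumption: big-endian fields, as written by Java's DataOutputStream.
fn parse_spark_header(buf: &[u8]) -> Option<SparkBloomHeader> {
    if buf.len() < SPARK_HEADER_LEN {
        return None;
    }
    Some(SparkBloomHeader {
        version: read_i32_be(buf, 0)?,
        num_hash_functions: read_i32_be(buf, 4)?,
        num_words: read_i32_be(buf, 8)?,
    })
}
```

   Since numWords counts 64-bit words, the bits data that follows the header would be expected to occupy numWords * 8 bytes.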
   
   ## What changes are included in this PR?
   
   Detects the Spark format by buffer size (size = 12 + expected_bits_size).
   Extracts the bits data by skipping the 12-byte header when the buffer is in 
Spark format.
   Returns the buffer as-is when it is in Comet format.
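   The three steps above can be sketched roughly as follows. This is a minimal illustration rather than the actual change to merge_filter(): the function name `extract_bits` and the parameter `expected_bits_len` are hypothetical, and returning an error for any other buffer size is an assumption that the description does not spell out.

```rust
/// Size of the Spark header that precedes the bits data.
const SPARK_HEADER_LEN: usize = 12;

/// Returns the bits portion of an incoming Bloom filter buffer.
/// `expected_bits_len` is the size in bytes of the local filter's bits data.
fn extract_bits(buf: &[u8], expected_bits_len: usize) -> Result<&[u8], String> {
    if buf.len() == expected_bits_len {
        // Comet format: the buffer already holds the bits only.
        Ok(buf)
    } else if buf.len() == SPARK_HEADER_LEN + expected_bits_len {
        // Spark format: skip the 12-byte header and keep the bits.
        Ok(&buf[SPARK_HEADER_LEN..])
    } else {
        // Assumption: any other size is treated as an error.
        Err(format!(
            "unexpected Bloom filter buffer size {} (expected {} or {})",
            buf.len(),
            expected_bits_len,
            SPARK_HEADER_LEN + expected_bits_len
        ))
    }
}
```

   Detecting the format by size alone keeps the check cheap, since a size equal to exactly 12 + expected_bits_size is the only signal needed to tell the two layouts apart.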
   
   
   ## How are these changes tested?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

