Re: [PR] perf: Use batched row conversion for `array_has_any`, `array_has_all` [datafusion]

via GitHub Sun, 08 Mar 2026 19:25:30 -0700


neilconway commented on PR #20588:
URL: https://github.com/apache/datafusion/pull/20588#issuecomment-4020684821


   @martin-g Do you think you might have a chance to take another look at this 
PR? I believe it should be in fairly good shape.
   
   One point about the chunk size: the value I picked seems to work reasonably 
well for 500 element arrays on the (relatively puny) cloud box I ran benchmarks 
on. Of course, real-world arrays could have many more elements, but
   
   (1) typical prod hardware will have larger caches
   (2) the perf regression in that case is moderate, not catastrophic
   (3) it is hard to get optimal cache behavior for a wide range of workloads 
and hardware configs without doing something much fancier
   (4) we do N*M comparisons, which is not good for large arrays anyway; once 
this lands I'd like to take a look at using a hash map for large arrays, which 
I'm guessing will be a significant win
   
   Notwithstanding all that, we could certainly lower the chunk size to 256 or 
even smaller if we wanted to give ourselves more headroom for wide arrays.
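
   To make point (4) concrete, here is a rough sketch of the difference between the N*M nested-loop check and a hash-based one. This is not the code in this PR: the function names are hypothetical, it works on plain `i64` slices rather than Arrow arrays or row-encoded values, and it ignores null handling entirely.

```rust
use std::collections::HashSet;

/// Nested-loop version: up to N*M comparisons for a haystack of
/// N elements and M needles. (Hypothetical helper, for illustration only.)
fn has_any_nested(haystack: &[i64], needles: &[i64]) -> bool {
    needles.iter().any(|n| haystack.iter().any(|h| h == n))
}

/// Hash-based version: build a set from the haystack once, then probe it
/// per needle. Expected cost is roughly O(N + M), at the price of
/// allocating the set. (Hypothetical helper, for illustration only.)
fn has_any_hashed(haystack: &[i64], needles: &[i64]) -> bool {
    let set: HashSet<i64> = haystack.iter().copied().collect();
    needles.iter().any(|n| set.contains(n))
}

/// Same idea for the "all" variant: every needle must be in the set.
fn has_all_hashed(haystack: &[i64], needles: &[i64]) -> bool {
    let set: HashSet<i64> = haystack.iter().copied().collect();
    needles.iter().all(|n| set.contains(n))
}
```

   For small arrays the nested loop is likely still faster, since building the set has a fixed cost, so a real implementation would presumably switch strategies above some size threshold.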


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

