[I] Improve IN_LIST performance -- Implement specialized `StaticFilters` for different data types [datafusion]

via GitHub Wed, 19 Nov 2025 08:08:21 -0800


alamb opened a new issue, #18824:
URL: https://github.com/apache/datafusion/issues/18824

The idea is to improve IN_LIST performance by using a specialized HashSet
for each data type, thus avoiding dynamic dispatch on the element type.

In https://github.com/apache/datafusion/pull/18449 we implemented such a
specialization for `Int32`, but we should probably do it for all the types that
previously had a
[specialization](https://github.com/apache/datafusion/pull/18449/files#diff-ff8086fafbfe5021e5f7d51d96aaae2cf65f779ac3fae5fc182f87e956bb0550L186):
1. All primitive types (Int8, Int32, etc)
2. Boolean
3. Utf8/LargeUtf8/Utf8View
4. Binary/LargeBinary/BinaryView
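
To make the proposal concrete, here is a minimal sketch of what per-type filters could look like. It uses only the standard library: `Column`, `StaticFilter`, `Int32StaticFilter`, and `Utf8StaticFilter` are hypothetical stand-ins for illustration and are not DataFusion's or Arrow's actual types or signatures. Handling of NULLs inside the IN list itself is left out for brevity.

```rust
use std::collections::HashSet;

// Hypothetical stand-in for an Arrow array: one variant per physical type.
enum Column {
    Int32(Vec<Option<i32>>),
    Utf8(Vec<Option<String>>),
}

// Hypothetical trait; the name mirrors the issue title, not the real API.
trait StaticFilter {
    // Returns None when the column type does not match the filter's type.
    fn contains(&self, column: &Column) -> Option<Vec<Option<bool>>>;
}

// Filter specialized for Int32: the set stores native i32 values.
struct Int32StaticFilter {
    values: HashSet<i32>,
}

impl StaticFilter for Int32StaticFilter {
    fn contains(&self, column: &Column) -> Option<Vec<Option<bool>>> {
        // The type check happens once per batch...
        let Column::Int32(data) = column else {
            return None;
        };
        // ...so the per-element loop only does plain i32 hash lookups.
        // A null input yields a null result.
        Some(
            data.iter()
                .map(|v| v.map(|x| self.values.contains(&x)))
                .collect(),
        )
    }
}

// Filter specialized for strings: lookups borrow as &str, no allocation.
struct Utf8StaticFilter {
    values: HashSet<String>,
}

impl StaticFilter for Utf8StaticFilter {
    fn contains(&self, column: &Column) -> Option<Vec<Option<bool>>> {
        let Column::Utf8(data) = column else {
            return None;
        };
        Some(
            data.iter()
                .map(|v| v.as_ref().map(|s| self.values.contains(s.as_str())))
                .collect(),
        )
    }
}
```

A caller would hold a `Box<dyn StaticFilter>` chosen once when the expression is built. The only dynamic call is then the per-batch `contains`, and the inner loop is fully typed.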

As @adriangb says:

I'm surprised that doing dynamic dispatch once per batch we evaluate as
opposed to twice per batch we evaluate makes that much of a difference. What
would make sense that makes a difference to me is doing it once per element vs.
once per batch. But I guess that's what benchmarks say!

That does leave me with a question... could we squeeze out even more
performance if we specialize for ~ all scalar types? It wouldn't be that hard
to write a macro and have AI do the copy pasta of implementing it for all of
the types... I'll open a follow up ticket.

_Originally posted by @adriangb in
https://github.com/apache/datafusion/issues/18449#issuecomment-3546450771_
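
The macro idea from the quote could be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: the macro name `primitive_static_filter` and the generated struct names are hypothetical, and plain slices of `Option<T>` stand in for Arrow arrays.

```rust
use std::collections::HashSet;

// Hypothetical macro: generates one HashSet-backed filter per primitive type,
// so each type gets its own monomorphic lookup loop.
macro_rules! primitive_static_filter {
    ($name:ident, $native:ty) => {
        struct $name {
            values: HashSet<$native>,
        }

        impl $name {
            // Build the set once from the literal IN list.
            fn new(list: &[$native]) -> Self {
                Self {
                    values: list.iter().copied().collect(),
                }
            }

            // Null inputs stay null; non-null inputs map to set membership.
            fn contains(&self, data: &[Option<$native>]) -> Vec<Option<bool>> {
                data.iter()
                    .map(|v| v.map(|x| self.values.contains(&x)))
                    .collect()
            }
        }
    };
}

primitive_static_filter!(Int8StaticFilter, i8);
primitive_static_filter!(Int64StaticFilter, i64);
primitive_static_filter!(UInt32StaticFilter, u32);
```

Note that `f32` and `f64` implement neither `Eq` nor `Hash` in Rust, so floating-point types would need a wrapper (for example, one that hashes the bit pattern) rather than this macro as written.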

