alamb opened a new issue, #18824: URL: https://github.com/apache/datafusion/issues/18824
The idea is to improve INLIST performance by using specialized HashSets for different data types, thus avoiding dynamic dispatch for different types.

In https://github.com/apache/datafusion/pull/18449 we implemented such a specialization for `Int32`, but we should probably do it for all the types that previously had a [specialization](https://github.com/apache/datafusion/pull/18449/files#diff-ff8086fafbfe5021e5f7d51d96aaae2cf65f779ac3fae5fc182f87e956bb0550L186):

1. All primitive types (Int8, Int32, etc)
2. Boolean
3. Utf8/LargeUtf8/Utf8View
4. Binary/LargeBinary/BinaryView

As @adriangb says:

> I'm surprised that doing dynamic dispatch once per batch we evaluate as opposed to twice per batch we evaluate makes that much of a difference. What would make sense that makes a difference to me is doing it once per element vs. once per batch. But I guess that's what benchmarks say! That does leave me with a question... could we squeeze out even more performance if we specialize for ~ all scalar types? It wouldn't be that hard to write a macro and have AI do the copy pasta of implementing it for all of the types... I'll open a follow up ticket.

_Originally posted by @adriangb in https://github.com/apache/datafusion/issues/18449#issuecomment-3546450771_
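
To make the macro idea from the quoted comment concrete, here is a minimal sketch of what per-type specialization could look like. It is not the code from the linked PR and does not use the Arrow or DataFusion APIs: `Column`, `InListFilter`, `TypedSet`, and `impl_in_list_filter!` are all hypothetical names, and `Column` is only a stand-in for an Arrow array. The sketch assumes the goal is one type dispatch per batch followed by a monomorphized probe loop.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Hypothetical stand-in for an Arrow array: one variant per data type.
enum Column {
    Int32(Vec<i32>),
    Int64(Vec<i64>),
    Utf8(Vec<String>),
}

/// Hypothetical filter interface: called once per batch, so the
/// dynamic dispatch cost is paid per batch rather than per element.
trait InListFilter {
    /// Returns one bool per value, or None if the column type does not match.
    fn contains(&self, batch: &Column) -> Option<Vec<bool>>;
}

/// A set of IN-list values for one concrete type `T`.
struct TypedSet<T> {
    set: HashSet<T>,
}

impl<T: Hash + Eq + Clone> TypedSet<T> {
    fn new(items: &[T]) -> Self {
        Self { set: items.iter().cloned().collect() }
    }

    /// Probe loop; the compiler generates a separate copy for each `T`.
    fn probe(&self, values: &[T]) -> Vec<bool> {
        values.iter().map(|v| self.set.contains(v)).collect()
    }
}

/// Generates one `InListFilter` impl per (Rust type, Column variant) pair.
macro_rules! impl_in_list_filter {
    ($($ty:ty => $variant:ident),* $(,)?) => {
        $(
            impl InListFilter for TypedSet<$ty> {
                fn contains(&self, batch: &Column) -> Option<Vec<bool>> {
                    match batch {
                        // Type check happens once here, not inside the loop.
                        Column::$variant(values) => Some(self.probe(values)),
                        _ => None,
                    }
                }
            }
        )*
    };
}

impl_in_list_filter!(i32 => Int32, i64 => Int64, String => Utf8);

/// Usage: build a filter for `x IN (1, 3, 5)` and evaluate one batch.
fn demo() -> Option<Vec<bool>> {
    let filter: Box<dyn InListFilter> = Box::new(TypedSet::new(&[1i32, 3, 5]));
    filter.contains(&Column::Int32(vec![1, 2, 3, 4]))
}
```

This sketch ignores nulls entirely, and it would not cover floating-point types as written, since `f32` and `f64` implement neither `Hash` nor `Eq` in Rust and would need some wrapper or bit-level key.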
