davidhewitt opened a new pull request, #15354: URL: https://github.com/apache/datafusion/pull/15354
## Which issue does this PR close? - Closes #14533 ## Rationale for this change This PR indirectly addresses #14533 not by actually changing the `array_has` evaluation but instead by simplifying it to the equivalent `InList` expression where the haystack is not varying per-row. The `array_has` udf has to operate row-by-row because it may have a varying haystack. The `InList` expression, on the other hand, can operate in a columnar fashion by evaluating each of the N haystack items for equality against the needle and OR the results. It looks to me list `InList` also supports some kind of `Set` optimization. ## What changes are included in this PR? Add a `simplify` implementation to `array_has` UDF which will produce an `InList` expr when the haystack is a literal list. ## Are these changes tested? Yes, see test additions in the diff. I also reran the original example from #14533 and we see that now the last two statements are now on equivalent performance as the others. ``` > CREATE TABLE test AS (SELECT substr(md5(i)::text, 1, 32) as haystack FROM generate_series(1, 100000) t(i)); 0 row(s) fetched. Elapsed 0.027 seconds. > SELECT * FROM test limit 1; +----------------------------------+ | haystack | +----------------------------------+ | 7f4b18de3cfeb9b4ac78c381ee2ad278 | +----------------------------------+ 1 row(s) fetched. Elapsed 0.001 seconds. > SELECT count(*) FROM test WHERE haystack = '7f4b18de3cfeb9b4ac78c381ee2ad278'; +----------+ | count(*) | +----------+ | 1 | +----------+ 1 row(s) fetched. Elapsed 0.005 seconds. > SELECT count(*) FROM test WHERE haystack IN ('7f4b18de3cfeb9b4ac78c381ee2ad278'); +----------+ | count(*) | +----------+ | 1 | +----------+ 1 row(s) fetched. Elapsed 0.002 seconds. > SELECT count(*) FROM test WHERE haystack = ANY(['7f4b18de3cfeb9b4ac78c381ee2ad278']); +----------+ | count(*) | +----------+ | 1 | +----------+ 1 row(s) fetched. Elapsed 0.001 seconds. > SELECT count(*) FROM test WHERE array_has(['7f4b18de3cfeb9b4ac78c381ee2ad278'], haystack); +----------+ | count(*) | +----------+ | 1 | +----------+ 1 row(s) fetched. Elapsed 0.002 seconds. ``` I would be happy to contribute a benchmark, but because this involves first simplifying the UDF expression this looked somewhat nontrivial and I'd welcome advice on where to place it. ## Are there any user-facing changes? Simplification results will change. <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org