davidhewitt opened a new pull request, #15354:
URL: https://github.com/apache/datafusion/pull/15354

   ## Which issue does this PR close?
   
   - Closes #14533 
   
   ## Rationale for this change
   
   This PR indirectly addresses #14533 not by actually changing the `array_has` 
evaluation but instead by simplifying it to the equivalent `InList` expression 
where the haystack is not varying per-row.
   
   The `array_has` udf has to operate row-by-row because it may have a varying 
haystack. The `InList` expression, on the other hand, can operate in a columnar 
fashion by evaluating each of the N haystack items for equality against the 
needle and OR the results. It looks to me list `InList` also supports some kind 
of `Set` optimization.
   
   ## What changes are included in this PR?
   
   Add a `simplify` implementation to `array_has` UDF which will produce an 
`InList` expr when the haystack is a literal list.
   
   ## Are these changes tested?
   
   Yes, see test additions in the diff.
   
   I also reran the original example from #14533 and we see that now the last 
two statements are now on equivalent performance as the others.
   
   ```
   > CREATE TABLE test AS (SELECT substr(md5(i)::text, 1, 32) as haystack FROM 
generate_series(1, 100000) t(i));
   0 row(s) fetched. 
   Elapsed 0.027 seconds.
   
   > SELECT * FROM test limit 1;
   +----------------------------------+
   | haystack                         |
   +----------------------------------+
   | 7f4b18de3cfeb9b4ac78c381ee2ad278 |
   +----------------------------------+
   1 row(s) fetched. 
   Elapsed 0.001 seconds.
   
   > SELECT count(*) FROM test WHERE haystack = 
'7f4b18de3cfeb9b4ac78c381ee2ad278';
   +----------+
   | count(*) |
   +----------+
   | 1        |
   +----------+
   1 row(s) fetched. 
   Elapsed 0.005 seconds.
   
   > SELECT count(*) FROM test WHERE haystack IN 
('7f4b18de3cfeb9b4ac78c381ee2ad278');
   +----------+
   | count(*) |
   +----------+
   | 1        |
   +----------+
   1 row(s) fetched. 
   Elapsed 0.002 seconds.
   
   > SELECT count(*) FROM test WHERE haystack = 
ANY(['7f4b18de3cfeb9b4ac78c381ee2ad278']);
   +----------+
   | count(*) |
   +----------+
   | 1        |
   +----------+
   1 row(s) fetched. 
   Elapsed 0.001 seconds.
   
   > SELECT count(*) FROM test WHERE 
array_has(['7f4b18de3cfeb9b4ac78c381ee2ad278'], haystack);
   +----------+
   | count(*) |
   +----------+
   | 1        |
   +----------+
   1 row(s) fetched. 
   Elapsed 0.002 seconds.
   ```
   
   I would be happy to contribute a benchmark, but because this involves first 
simplifying the UDF expression this looked somewhat nontrivial and I'd welcome 
advice on where to place it.
   
   ## Are there any user-facing changes?
   
   Simplification results will change.
   
   <!--
   If there are any breaking changes to public APIs, please add the `api 
change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to