Dandandan opened a new issue, #3518:
URL: https://github.com/apache/arrow-datafusion/issues/3518

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   Currently by far the worst performing query in ClickBench for datafusion is 
query 28:
   
   ```
   SELECT REGEXP_REPLACE("Referer", '^https?://(?:www.)?([^/]+)/.*$', '1') AS 
k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer") FROM hits WHERE 
"Referer" <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;
   ```
   
   It seems this is due too `REGEXP_REPLACE` in this query.
   
   **Describe the solution you'd like**
   Introduce a benchmark and improve the performance of `REGEXP_REPLACE` (with 
a scalar) in DataFusion.
   
   **Describe alternatives you've considered**
   n/a
   **Additional context**
   
   There are some more efficient patterns in arrow-rs for writing 
(string/regex) kernels. This could be used here as well.
   Optimizations that might be beneficial:
   - reusing the null buffer from the input array (instead of rebuilding in the 
iterator)
   - special casing scalar type, so no hashmap has to be used for storing / 
retrieving compiled regexes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to