alamb opened a new issue, #18181:
URL: https://github.com/apache/datafusion/issues/18181

   ### Is your feature request related to a problem or challenge?
   
   While working on 
   - https://github.com/apache/datafusion/issues/18070
   
   @ianthetechie provided the following query, which executes quite slowly:
   ```sql
   CREATE EXTERNAL TABLE categories_raw STORED AS PARQUET
   LOCATION 's3://fsq-os-places-us-east-1/release/dt=2025-09-09/categories/parquet/';
   
   CREATE EXTERNAL TABLE places STORED AS PARQUET
   LOCATION 's3://fsq-os-places-us-east-1/release/dt=2025-09-09/places/parquet/';
   
   WITH categories_arr AS (
       SELECT array_agg(category_id) AS category_ids FROM categories_raw LIMIT 500
   )
   SELECT COUNT(*)
       FROM places p
       WHERE date_refreshed >= CURRENT_DATE - INTERVAL '365 days'
         AND array_has_any(p.fsq_category_ids, (SELECT category_ids FROM categories_arr));
   ```
   
   While the regression in https://github.com/apache/datafusion/issues/18070 
was fixed, there is still a lot of room to improve this query's performance.
   
   To reproduce, download [slow_array_has.zip](https://github.com/user-attachments/files/23004864/slow_array_has.zip) and run:
   
   ```shell
   datafusion-cli -f repro.sql
   ```
   
   60% of the overall query time is spent in `array_has`, as can be seen in 
this quick profile:
   
   <img width="1832" height="1364" alt="Image" src="https://github.com/user-attachments/assets/f5073c67-d8cf-40de-b563-b040f26072b4" />
   
   
   ### Describe the solution you'd like
   
   Make `array_has` go faster
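   
   As a rough illustration only: this issue does not say how `array_has` is currently implemented or which optimization should be chosen. One direction that might help is to exploit the fact that the second argument in the query above is the same list for every row of `places`. The standalone Rust sketch below is not DataFusion code; the function names `has_any_nested` and `has_any_hashed` are hypothetical, and plain `Vec<&str>` rows stand in for Arrow list arrays. It contrasts a per-row nested comparison with building a hash set from the constant list once and probing it per element.
   
   ```rust
   use std::collections::HashSet;
   
   /// Baseline (hypothetical): for every row, compare each element against
   /// every needle. Cost per row is O(row_len * needles.len()).
   fn has_any_nested<'a>(rows: &[Vec<&'a str>], needles: &[&'a str]) -> Vec<bool> {
       rows.iter()
           .map(|row| row.iter().any(|v| needles.contains(v)))
           .collect()
   }
   
   /// Sketch (hypothetical): hash the constant needle list once, then probe
   /// the set for each element. Cost per row is O(row_len) on average.
   fn has_any_hashed<'a>(rows: &[Vec<&'a str>], needles: &[&'a str]) -> Vec<bool> {
       // Built once for the whole batch, not once per row.
       let set: HashSet<&str> = needles.iter().copied().collect();
       rows.iter()
           .map(|row| row.iter().any(|v| set.contains(v)))
           .collect()
   }
   ```
   
   Both functions return one boolean per row and should agree on every input. Whether the hashed variant pays off in practice would depend on the list sizes and on how the planner evaluates the subquery, so it would need to be measured against the profile above.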
   
   
   
   ### Describe alternatives you've considered
   
   @jayzhan211 has some ideas in
   - https://github.com/apache/datafusion/issues/12163
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

