Re: [I] Datafusion 50 Performance Regression (array_has style filter/join for Parquet data set) [datafusion]

via GitHub Sun, 19 Oct 2025 22:13:10 -0700


2010YOUY01 commented on issue #18070:
URL: https://github.com/apache/datafusion/issues/18070#issuecomment-3420597191


   https://github.com/apache/datafusion/pull/18161 is the direct fix to the 
issue, after that the performance should be similar to DF49
   
   However, I think in some cases `unnest` can be used for the list column in 
the query, so that a much faster Hash Join operator can be applied, below is a 
query rewriting Q2 with `unnest` (I believe it needs an additional relatively 
cheap dedup step to make them equivalent), it's 300x faster on my machine
   ```
   > WITH allowed_categories AS (
       select * from categories_raw LIMIT 500
   ),
   unnested_places AS (
           select unnest(fsq_category_ids) as unnested_fsq_category_id
           from places
           WHERE date_refreshed >= CURRENT_DATE - INTERVAL '365 days'
   )
   SELECT count(*)
   FROM unnested_places up
   JOIN allowed_categories ac
   on up.unnested_fsq_category_id = ac.category_id;
   
   +----------+
   | count(*) |
   +----------+
   | 7675969  |
   +----------+
   1 row(s) fetched.
   Elapsed 0.435 seconds.
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] Datafusion 50 Performance Regression (array_has style filter/join for Parquet data set) [datafusion]

Reply via email to