2010YOUY01 commented on issue #18070: URL: https://github.com/apache/datafusion/issues/18070#issuecomment-3420597191
https://github.com/apache/datafusion/pull/18161 is the direct fix to the issue, after that the performance should be similar to DF49 However, I think in some cases `unnest` can be used for the list column in the query, so that a much faster Hash Join operator can be applied, below is a query rewriting Q2 with `unnest` (I believe it needs an additional relatively cheap dedup step to make them equivalent), it's 300x faster on my machine ``` > WITH allowed_categories AS ( select * from categories_raw LIMIT 500 ), unnested_places AS ( select unnest(fsq_category_ids) as unnested_fsq_category_id from places WHERE date_refreshed >= CURRENT_DATE - INTERVAL '365 days' ) SELECT count(*) FROM unnested_places up JOIN allowed_categories ac on up.unnested_fsq_category_id = ac.category_id; +----------+ | count(*) | +----------+ | 7675969 | +----------+ 1 row(s) fetched. Elapsed 0.435 seconds. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
