[I] Datafusion 50 Performance Regression (array_has style filter/join for Parquet data set) [datafusion]

via GitHub Sat, 18 Oct 2025 00:06:14 -0700


ianthetechie opened a new issue, #18070:
URL: https://github.com/apache/datafusion/issues/18070


   ### Describe the bug
   
   I discovered a significant performance regression in the DataFusion 50 
release. The essence of it is that filtering/joining based on array membership 
is no longer performant in the DF 50 series. This is a relative regression from 
DF 49, where this was reasonably performant.
   
   NOTE: I do realize that this is not a particularly great design, and that we 
can't effectively filter on arrays (as far as I know?) like this using pushdown 
techniques or probably even row group metadata. Unfortunately I'm just a 
consumer of this data set, and this is the only way I found to do the job. 
Suggestions to improve the query are of course also welcome :)
   
   ### To Reproduce
   
   Here is SQL to reproduce in the `datafusion-cli`.
   
   First, let's set up two external tables.
   
   ```sql
   CREATE EXTERNAL TABLE categories_raw STORED AS PARQUET LOCATION 
's3://fsq-os-places-us-east-1/release/dt=2025-09-09/categories/parquet/';
   
   CREATE EXTERNAL TABLE places STORED AS PARQUET LOCATION 
's3://fsq-os-places-us-east-1/release/dt=2025-09-09/places/parquet/';
   ```
   
   Then, let's try doing a simple count of the places matching some subset of 
the categories (notes on the CTE in the additional context below).
   
   Example query 1 (this is roughly our original query):
   
   ```sql
   WITH categories_arr AS (
       SELECT array_agg(category_id) AS category_ids FROM categories_raw LIMIT 
500
   )
   SELECT COUNT(*)
       FROM places p
       WHERE date_refreshed >= CURRENT_DATE - INTERVAL '365 days' AND 
array_has_any(p.fsq_category_ids, (SELECT category_ids FROM categories_arr));
   ```
   
   Example query 2:
   
   ```sql
   WITH allowed_categories AS (
       select * from categories_raw LIMIT 500
   )
   SELECT count(*)
   FROM places p
   JOIN allowed_categories ac ON array_has(p.fsq_category_ids, ac.category_id)
   WHERE date_refreshed >= CURRENT_DATE - INTERVAL '365 days';
   ```
   
   ### Expected behavior
   
   TL;DR, I _expect_ the queries to execute fairly quickly (**low single digit 
number of minutes**, with some reasonable variance for internet... I'm 
literally on the other side of the world from us-east-1 with a reasonably fast 
connection).
   
   I also expect the query execution time, at least for this particular case, 
to not significantly vary based on the number of categories filtered.
   
   Finally, I'd expect the CPU usage to either go down from current levels like 
DataFusion 49, OR if it's going to leverage more cores, to actually go faster 
;) (DataFusion 49 rarely used more than a single core computing the count as 
described in Query 1, and did it much faster.)
   
   Let me explain what I observed both queries doing under different DataFusion 
releases...
   
   ## Query 1
   
   **Query 1 takes approximately 2 and a half minutes for me on DataFusion 
49.0.2.** This is the baseline and seems reasonable. **Increasing the LIMIT to 
1000 (more realistic), the execution time remains constant.** CPU usage is 
moderate; it does not need to saturate all cores.
   
   Under DataFusion 50, this just immediately consumes 100% of all CPU (not 
even bouncing; straight up redline haha), and doesn't seem to do much in the 
way of data transfer. I killed the query after around 20 minutes, since it was 
clear that it wasn't going to be able to complete in a reasonable time without 
any data transfer.
   
   This behavior in DataFusion 50 occurs regardless of what the LIMIT value is. 
Even a `LIMIT 10` sends all CPUs spinning, and no significant data transfer 
ever occurs.
   
   I don't know how relevant this tidbit is, but unlike most flavors of high 
CPU utilization, this doesn't make my MacBook hot to the touch or even trigger 
fans. System load averages are around 100, but it's clearly not making any 
progress 🤷  No unusual memory activity either.
   
   ## Query 2
   
   I discovered Query 2 when looking for a workaround for DataFusion 50. It 
expresses a similar intent, and still uses an array operation (`array_has` in 
this case), but it's expressed more explicitly as a join condition rather than 
a weird subquery. However, it has _wildly_ different performance 
characteristics.
   
   On DataFusion 49, this **consumes a fair bit of CPU, than Query 1, and it 
took around 22 minutes**! That's like 10x longer than the original query.
   
   DataFusion 50 fares a bit better better. It uses _even more CPU_ than 49 
does initially, and seems to start pulling from the network sooner/faster. **It 
completes this query in just under 8 minutes.**
   
   So that's an improvement, right? Well... **if you try doubling the `LIMIT` 
to 1000 (more representative of my typical query), the run time (DataFusion 50) 
approximately doubles to 15 minutes!** This is a bit strange. But at least it 
finishes 😅 
   
   ## Reduced throughput streaming results
   
   What I've included above is just the count which we do beforehand by calling 
`.count()` with a more useful select query. Once we start actually streaming 
the results, throughput is unsurprisingly slower than before. I'm seeing 
somewhere between 10% and (absolute best case 50%) of the former throughput (in 
records/second) through our pipeline.
   
   Unfortunately the pipeline has too many downstream variables for me to 
cleanly disentangle that into an MRE without a lot of effort, but I guess I 
shouldn't be too surprised by this.
   
   ### Additional context
   
   ## Versions and test methods
   
   I initially observed the regression on several crates.io releases as a 
dependency in our internal application (release build). I then reproduced using 
the respective datafusion-cli versions to make things easier to write up.
   
   For DataFusion 49, I checked out the latest commit on `branch-49`, which is 
tagged as 49.0.2 and has hash f43df3f2ae3aafb347996c58e852cc378807095b. I then 
built datafusion-cli in release mode.
   
   For DataFusion 50, I used CLI version 50.2.0 via homebrew.
   
   All observations are on an M1 Max MacBook Pro running macOS Sequoia 15.6.1. 
   
   I have _not_ enabled any specific session context changes (code) or set any 
explicit options in the CLI.
   
   ## Notes on category filter
   
   The "real world" version of this query uses a computed VIEW with a long 
WHERE clause. A size-tunable simple CTE is simpler and better for the purpose 
of illustration though.
   
   The "real" category filter includes about 1100 categories, and is a 
significant portion of the raw category list (1245 entries total). I have 
tested multiple variations of this, and there is (as one would expect) no 
difference between referencing the view and any random sample of 1100 
categories.
   
   ## Size of the array seems to matter?
   
   `array_has_any` with a small cardinality list seems to be fine. For example, 
you can add `(unresolved_flags IS NULL OR NOT array_has_any(unresolved_flags, 
['inappropriate', 'duplicate']))` to the WHERE clause, it's fine. This actually 
may seem to contradict what I wrote above about `LIMIT 10` not working on Query 
1. Maybe there's something with static arrays vs computed ones? I have not had 
a chance to explore this yet.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] Datafusion 50 Performance Regression (array_has style filter/join for Parquet data set) [datafusion]

Reply via email to