ianthetechie opened a new issue, #18070:
URL: https://github.com/apache/datafusion/issues/18070
### Describe the bug
I discovered a significant performance regression in the DataFusion 50
release. The essence of it is that filtering/joining based on array membership
is no longer performant in the DF 50 series. This is a relative regression from
DF 49, where this was reasonably performant.
NOTE: I do realize that this is not a particularly great design, and that we
can't effectively filter on arrays (as far as I know?) like this using pushdown
techniques or probably even row group metadata. Unfortunately I'm just a
consumer of this data set, and this is the only way I found to do the job.
Suggestions to improve the query are of course also welcome :)
### To Reproduce
Here is SQL to reproduce in the `datafusion-cli`.
First, let's set up two external tables.
```sql
CREATE EXTERNAL TABLE categories_raw STORED AS PARQUET LOCATION
's3://fsq-os-places-us-east-1/release/dt=2025-09-09/categories/parquet/';
CREATE EXTERNAL TABLE places STORED AS PARQUET LOCATION
's3://fsq-os-places-us-east-1/release/dt=2025-09-09/places/parquet/';
```
Then, let's try doing a simple count of the places matching some subset of
the categories (notes on the CTE in the additional context below).
Example query 1 (this is roughly our original query):
```sql
WITH categories_arr AS (
SELECT array_agg(category_id) AS category_ids FROM categories_raw LIMIT
500
)
SELECT COUNT(*)
FROM places p
WHERE date_refreshed >= CURRENT_DATE - INTERVAL '365 days' AND
array_has_any(p.fsq_category_ids, (SELECT category_ids FROM categories_arr));
```
Example query 2:
```sql
WITH allowed_categories AS (
select * from categories_raw LIMIT 500
)
SELECT count(*)
FROM places p
JOIN allowed_categories ac ON array_has(p.fsq_category_ids, ac.category_id)
WHERE date_refreshed >= CURRENT_DATE - INTERVAL '365 days';
```
### Expected behavior
TL;DR, I _expect_ the queries to execute fairly quickly (**low single digit
number of minutes**, with some reasonable variance for internet... I'm
literally on the other side of the world from us-east-1 with a reasonably fast
connection).
I also expect the query execution time, at least for this particular case,
to not significantly vary based on the number of categories filtered.
Finally, I'd expect the CPU usage to either go down from current levels like
DataFusion 49, OR if it's going to leverage more cores, to actually go faster
;) (DataFusion 49 rarely used more than a single core computing the count as
described in Query 1, and did it much faster.)
Let me explain what I observed both queries doing under different DataFusion
releases...
## Query 1
**Query 1 takes approximately 2 and a half minutes for me on DataFusion
49.0.2.** This is the baseline and seems reasonable. **Increasing the LIMIT to
1000 (more realistic), the execution time remains constant.** CPU usage is
moderate; it does not need to saturate all cores.
Under DataFusion 50, this just immediately consumes 100% of all CPU (not
even bouncing; straight up redline haha), and doesn't seem to do much in the
way of data transfer. I killed the query after around 20 minutes, since it was
clear that it wasn't going to be able to complete in a reasonable time without
any data transfer.
This behavior in DataFusion 50 occurs regardless of what the LIMIT value is.
Even a `LIMIT 10` sends all CPUs spinning, and no significant data transfer
ever occurs.
I don't know how relevant this tidbit is, but unlike most flavors of high
CPU utilization, this doesn't make my MacBook hot to the touch or even trigger
fans. System load averages are around 100, but it's clearly not making any
progress 🤷 No unusual memory activity either.
## Query 2
I discovered Query 2 when looking for a workaround for DataFusion 50. It
expresses a similar intent, and still uses an array operation (`array_has` in
this case), but it's expressed more explicitly as a join condition rather than
a weird subquery. However, it has _wildly_ different performance
characteristics.
On DataFusion 49, this **consumes a fair bit of CPU, than Query 1, and it
took around 22 minutes**! That's like 10x longer than the original query.
DataFusion 50 fares a bit better better. It uses _even more CPU_ than 49
does initially, and seems to start pulling from the network sooner/faster. **It
completes this query in just under 8 minutes.**
So that's an improvement, right? Well... **if you try doubling the `LIMIT`
to 1000 (more representative of my typical query), the run time (DataFusion 50)
approximately doubles to 15 minutes!** This is a bit strange. But at least it
finishes 😅
## Reduced throughput streaming results
What I've included above is just the count which we do beforehand by calling
`.count()` with a more useful select query. Once we start actually streaming
the results, throughput is unsurprisingly slower than before. I'm seeing
somewhere between 10% and (absolute best case 50%) of the former throughput (in
records/second) through our pipeline.
Unfortunately the pipeline has too many downstream variables for me to
cleanly disentangle that into an MRE without a lot of effort, but I guess I
shouldn't be too surprised by this.
### Additional context
## Versions and test methods
I initially observed the regression on several crates.io releases as a
dependency in our internal application (release build). I then reproduced using
the respective datafusion-cli versions to make things easier to write up.
For DataFusion 49, I checked out the latest commit on `branch-49`, which is
tagged as 49.0.2 and has hash f43df3f2ae3aafb347996c58e852cc378807095b. I then
built datafusion-cli in release mode.
For DataFusion 50, I used CLI version 50.2.0 via homebrew.
All observations are on an M1 Max MacBook Pro running macOS Sequoia 15.6.1.
I have _not_ enabled any specific session context changes (code) or set any
explicit options in the CLI.
## Notes on category filter
The "real world" version of this query uses a computed VIEW with a long
WHERE clause. A size-tunable simple CTE is simpler and better for the purpose
of illustration though.
The "real" category filter includes about 1100 categories, and is a
significant portion of the raw category list (1245 entries total). I have
tested multiple variations of this, and there is (as one would expect) no
difference between referencing the view and any random sample of 1100
categories.
## Size of the array seems to matter?
`array_has_any` with a small cardinality list seems to be fine. For example,
you can add `(unresolved_flags IS NULL OR NOT array_has_any(unresolved_flags,
['inappropriate', 'duplicate']))` to the WHERE clause, it's fine. This actually
may seem to contradict what I wrote above about `LIMIT 10` not working on Query
1. Maybe there's something with static arrays vs computed ones? I have not had
a chance to explore this yet.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]