alamb commented on PR #18873: URL: https://github.com/apache/datafusion/pull/18873#issuecomment-3567920485
I did some more analysis. The idea is to isolate why filter pushdown is slowing down ClickBench q24. See more details here: https://github.com/apache/datafusion/pull/18873

This is after upgrading to arrow 57.1.0.

The only difference between the two binaries is whether filter pushdown is on by default:

```shell
-rwxr-xr-x@ 1 andrewlamb staff 81331152 Nov 23 07:31 datafusion-cli-alamb_upgrade_arrow_57.1.0
-rwxr-xr-x@ 1 andrewlamb staff 81331152 Nov 22 07:57 datafusion-cli-almab_pushdown_no_reorder
```

Using the hits partitioned dataset:

```shell
ln -s ~/Software/datafusion/benchmarks/data/hits_partitioned ./hits
```

Here is `q24.sql`:

```sql
set datafusion.execution.parquet.binary_as_string = true;
-- turn on pushdown (is hard coded)
-- set datafusion.execution.parquet.pushdown_filters = true;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
```

You can see that the pushdown binary is slightly slower (about 0.154 s per warm run, versus about 0.135 s without pushdown):

```shell
./datafusion-cli-almab_pushdown_no_reorder -f q24.sql | grep Elapsed
Elapsed 0.000 seconds.
Elapsed 0.183 seconds.
Elapsed 0.154 seconds.
Elapsed 0.155 seconds.
Elapsed 0.153 seconds.
Elapsed 0.154 seconds.
Elapsed 0.154 seconds.
Elapsed 0.150 seconds.
Elapsed 0.154 seconds.
Elapsed 0.156 seconds.
Elapsed 0.152 seconds.
```

```shell
./datafusion-cli-alamb_upgrade_arrow_57.1.0 -f q24.sql | grep Elapsed
Elapsed 0.002 seconds.
Elapsed 0.164 seconds.
Elapsed 0.137 seconds.
Elapsed 0.137 seconds.
Elapsed 0.133 seconds.
Elapsed 0.132 seconds.
Elapsed 0.135 seconds.
Elapsed 0.131 seconds.
Elapsed 0.137 seconds.
Elapsed 0.137 seconds.
Elapsed 0.133 seconds.
```

So let's profile what the pushdown one is doing:

<img width="1473" height="535" alt="Screenshot 2025-11-23 at 7 40 41 AM" src="https://github.com/user-attachments/assets/22da52fc-4b1a-485c-8ccc-529794d6ec7b" />

So more than 5% of the time is being spent converting filters back and forth.

Thus, this gives me more motivation to keep working on
- https://github.com/apache/arrow-rs/issues/8844
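To put a number on "slightly slower", here is a minimal Python sketch that summarizes the two `grep Elapsed` outputs above. It is not part of the original analysis. The helper name `warm_mean` and the choice to skip the first two timings (the `set` statement and the first, cold run of the query) are assumptions made for illustration.

```python
import re
from statistics import mean

# Matches lines such as "Elapsed 0.154 seconds." printed by datafusion-cli.
ELAPSED = re.compile(r"Elapsed (\d+\.\d+) seconds\.")


def warm_mean(output: str, skip: int = 2) -> float:
    """Mean elapsed time in seconds, ignoring the first `skip` timings.

    Assumption: timing 1 belongs to the `set` statement and timing 2 is
    the cold first run of the query, so both are excluded by default.
    """
    times = [float(t) for t in ELAPSED.findall(output)]
    return mean(times[skip:])


# Timings copied from the pushdown binary's output above.
pushdown_output = """
Elapsed 0.000 seconds. Elapsed 0.183 seconds. Elapsed 0.154 seconds.
Elapsed 0.155 seconds. Elapsed 0.153 seconds. Elapsed 0.154 seconds.
Elapsed 0.154 seconds. Elapsed 0.150 seconds. Elapsed 0.154 seconds.
Elapsed 0.156 seconds. Elapsed 0.152 seconds.
"""

# Timings copied from the binary without pushdown by default.
baseline_output = """
Elapsed 0.002 seconds. Elapsed 0.164 seconds. Elapsed 0.137 seconds.
Elapsed 0.137 seconds. Elapsed 0.133 seconds. Elapsed 0.132 seconds.
Elapsed 0.135 seconds. Elapsed 0.131 seconds. Elapsed 0.137 seconds.
Elapsed 0.137 seconds. Elapsed 0.133 seconds.
"""

pushdown_mean = warm_mean(pushdown_output)
baseline_mean = warm_mean(baseline_output)
# Relative slowdown of the pushdown binary compared with the baseline.
slowdown = pushdown_mean / baseline_mean - 1

print(f"pushdown: {pushdown_mean:.4f} s")
print(f"baseline: {baseline_mean:.4f} s")
print(f"slowdown: {slowdown:.1%}")
```

With these nine warm runs per binary, the pushdown build comes out roughly 14% slower. That is a single-machine measurement, so treat it as indicative only.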
