alamb commented on PR #18873:
URL: https://github.com/apache/datafusion/pull/18873#issuecomment-3567920485

   I did some more analysis:
   
   The goal is to isolate why filter pushdown slows down ClickBench q24.
   
   See more details here https://github.com/apache/datafusion/pull/18873
   
   This is after upgrading to arrow 57.1.0
   
   The only difference between the two binaries is whether filter pushdown is on by default:
   ```shell
   -rwxr-xr-x@ 1 andrewlamb  staff  81331152 Nov 23 07:31 datafusion-cli-alamb_upgrade_arrow_57.1.0
   -rwxr-xr-x@ 1 andrewlamb  staff  81331152 Nov 22 07:57 datafusion-cli-almab_pushdown_no_reorder
   ```
   
   Using the hits_partitioned dataset:
   ```shell
   ln -s ~/Software/datafusion/benchmarks/data/hits_partitioned ./hits
   ```
   
   Here is q24.sql
   
   ```sql
   set datafusion.execution.parquet.binary_as_string = true;
   
   -- turn on pushdown (is hard coded)
   -- set datafusion.execution.parquet.pushdown_filters = true;
   
   SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
   SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
   SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
   SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
   SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
   SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
   SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
   SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
   SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
   SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
   ```
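
If you want to vary the number of repetitions, the file above can be generated rather than written by hand. This is only a sketch: the helper name `build_script` and the `repeats` parameter are made up for illustration, and the output mirrors the q24.sql shown above minus the commented-out pushdown setting.

```python
from pathlib import Path

# The q24 query text, copied from the q24.sql file shown above.
QUERY = (
    'SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> \'\' '
    'ORDER BY "SearchPhrase" LIMIT 10;'
)


def build_script(repeats: int = 10) -> str:
    """Return the SQL script: one `set` statement, then `repeats` copies of q24.

    `build_script` and `repeats` are hypothetical names used only in this sketch.
    """
    lines = ["set datafusion.execution.parquet.binary_as_string = true;", ""]
    lines += [QUERY] * repeats
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Write the script next to the binaries so it can be passed with `-f q24.sql`.
    Path("q24.sql").write_text(build_script())
```

Repeating the same query several times in one file keeps everything in a single process, so the later runs are warm.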
   
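   You can see the pushdown binary is slightly slower: about 0.154 s versus 0.135 s per query once warm, or roughly 14% slower. (The first `Elapsed` line in each run is the `set` statement, and the second is the cold first query.)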
   
   ```shell
   ./datafusion-cli-almab_pushdown_no_reorder -f q24.sql  | grep Elapsed
   Elapsed 0.000 seconds.
   Elapsed 0.183 seconds.
   Elapsed 0.154 seconds.
   Elapsed 0.155 seconds.
   Elapsed 0.153 seconds.
   Elapsed 0.154 seconds.
   Elapsed 0.154 seconds.
   Elapsed 0.150 seconds.
   Elapsed 0.154 seconds.
   Elapsed 0.156 seconds.
   Elapsed 0.152 seconds.
   ```
   
   ```shell
   ./datafusion-cli-alamb_upgrade_arrow_57.1.0  -f q24.sql  | grep Elapsed
   Elapsed 0.002 seconds.
   Elapsed 0.164 seconds.
   Elapsed 0.137 seconds.
   Elapsed 0.137 seconds.
   Elapsed 0.133 seconds.
   Elapsed 0.132 seconds.
   Elapsed 0.135 seconds.
   Elapsed 0.131 seconds.
   Elapsed 0.137 seconds.
   Elapsed 0.137 seconds.
   Elapsed 0.133 seconds.
   ```
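
To put a number on the difference, the `Elapsed` lines from the two runs above can be averaged. The sketch below is one way to do that and assumes the first two lines of each run (the `set` statement and the cold first query) should be excluded; the helper name `warm_times` and the `skip` parameter are made up for illustration.

```python
import re
from statistics import mean

# Matches lines such as "Elapsed 0.154 seconds." printed by datafusion-cli.
ELAPSED = re.compile(r"Elapsed (\d+\.\d+) seconds\.")


def warm_times(output: str, skip: int = 2) -> list[float]:
    """Parse elapsed times and drop the first `skip` entries.

    By default this skips the `set` statement and the cold first query.
    """
    times = [float(m.group(1)) for m in ELAPSED.finditer(output)]
    return times[skip:]


# Output copied from the pushdown binary run above.
pushdown_output = """
Elapsed 0.000 seconds.
Elapsed 0.183 seconds.
Elapsed 0.154 seconds.
Elapsed 0.155 seconds.
Elapsed 0.153 seconds.
Elapsed 0.154 seconds.
Elapsed 0.154 seconds.
Elapsed 0.150 seconds.
Elapsed 0.154 seconds.
Elapsed 0.156 seconds.
Elapsed 0.152 seconds.
"""

# Output copied from the binary without pushdown on by default.
baseline_output = """
Elapsed 0.002 seconds.
Elapsed 0.164 seconds.
Elapsed 0.137 seconds.
Elapsed 0.137 seconds.
Elapsed 0.133 seconds.
Elapsed 0.132 seconds.
Elapsed 0.135 seconds.
Elapsed 0.131 seconds.
Elapsed 0.137 seconds.
Elapsed 0.137 seconds.
Elapsed 0.133 seconds.
"""

pushdown_mean = mean(warm_times(pushdown_output))
baseline_mean = mean(warm_times(baseline_output))
slowdown_pct = (pushdown_mean / baseline_mean - 1) * 100

print(f"pushdown mean: {pushdown_mean:.4f} s")
print(f"baseline mean: {baseline_mean:.4f} s")
print(f"slowdown:      {slowdown_pct:.1f}%")
```

On these numbers the warm means come out to about 0.154 s with pushdown and 0.135 s without, a slowdown of roughly 14%.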
   
   So let's profile what the pushdown one is doing
   
   <img width="1473" height="535" alt="Screenshot 2025-11-23 at 7 40 41 AM" src="https://github.com/user-attachments/assets/22da52fc-4b1a-485c-8ccc-529794d6ec7b" />
   
   So more than 5% of the time is being spent converting filters back and forth.
   
   
   Thus, this gives me more motivation to keep working on 
   - https://github.com/apache/arrow-rs/issues/8844
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

