Re: [I] Query execution time difference between deltatable QueryBuilder and using DataFusion directly. [datafusion-python]

via GitHub Mon, 23 Mar 2026 02:08:04 -0700


renato2099 commented on issue #1140:
URL: https://github.com/apache/datafusion-python/issues/1140#issuecomment-4109057625


   I gave this a try with the following versions:
   ```
   deltalake              1.5.0
   datafusion             52.3.0
   pyarrow                20.0.0
   ```
   I generated some data with the script [generate_delta_traces.py](https://github.com/user-attachments/files/26177626/generate_delta_traces.py) and ran some queries against that Delta table, with no special DataFusion configuration:
   
   ```
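   from datafusion import SessionConfig, SessionContext
   from deltalake import QueryBuilder

   # `dt` (the DeltaTable for the generated data) and `sql` (the query
   # shown below) are defined elsewhere.


   def run_delta_query_builder():
       # Run the query through deltalake's built-in QueryBuilder.
       QueryBuilder().register("tbl", dt).execute(sql).read_all()


   def run_datafusion():
       # Run the same query through a default DataFusion session.
       ctx = SessionContext()
       ctx.register_table("tbl", dt)
       ctx.sql(sql).collect()


   def run_datafusion_single_partition():
       # Same as above, but restricted to a single target partition.
       config = SessionConfig().with_target_partitions(1)
       ctx = SessionContext(config)
       ctx.register_table("tbl", dt)
       ctx.sql(sql).collect()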
   ```
   I ran the following query, which has 0.1% selectivity:
   ```
   sql = """
   SELECT
     *
   FROM tbl
   WHERE
     ("MlRepoId" = 1089)
     AND ("TracingProjectId" = '222fde49-1f7a-4752-8ec1-06bcdbf570c5')
     AND ("TraceId" = '8728990bd3d11fa91a688e9d9964bca1')
     AND ("SpanId" = '82c0a65e80000450')
   """
   ```
   The results I got are very similar: the QueryBuilder and default DataFusion averages are within 1% of each other, so there doesn't seem to be a performance regression on the Python side.
   ```
   Delta QueryBuilder: avg=0.002449s over 100 runs
   DataFusion: avg=0.002464s over 100 runs
   DataFusion (single partition): avg=0.001728s over 100 runs
   ```
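The timing harness itself isn't included in this comment. Averages in this format could be collected with something like the sketch below. The helper names `bench` and `report`, the use of `time.perf_counter`, and the warm-up calls are assumptions for illustration, not necessarily what produced the numbers above.

```python
import time
from statistics import mean


def bench(fn, runs=100, warmup=5):
    """Return the mean wall-clock seconds per call of `fn` over `runs` calls.

    The warm-up calls are an assumption of this sketch; they are executed
    but excluded from the average.
    """
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return mean(timings)


def report(label, fn, runs=100):
    """Print and return a line in the same format as the results above."""
    avg = bench(fn, runs=runs)
    line = f"{label}: avg={avg:.6f}s over {runs} runs"
    print(line)
    return line


if __name__ == "__main__":
    # Stand-in workload so the sketch runs on its own; in the real
    # comparison this would be e.g. run_delta_query_builder.
    report("Stand-in workload", lambda: sum(range(1000)), runs=10)
```

With the functions defined earlier, the calls would look like `report("Delta QueryBuilder", run_delta_query_builder)` and `report("DataFusion", run_datafusion)`.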
   Because I didn't see any performance issues, I haven't dug into the Rust side yet.
   
   @timsaucer @ion-elgreco: do you think there is something else we need to check here? Otherwise, I think we can close this issue.



