renato2099 commented on issue #1140: URL: https://github.com/apache/datafusion-python/issues/1140#issuecomment-4109057625
I gave this a try with the following versions:

```
deltalake 1.5.0
datafusion 52.3.0
pyarrow 20.0.0
```

I generated some data with the script [generate_delta_traces.py](https://github.com/user-attachments/files/26177626/generate_delta_traces.py) and ran some queries against the resulting Delta table, with no special DataFusion configuration:

```
from datafusion import SessionConfig, SessionContext
from deltalake import QueryBuilder

# `dt` is the Delta table loaded from the generated data; `sql` is the query shown below.

def run_delta_query_builder():
    QueryBuilder().register("tbl", dt).execute(sql).read_all()

def run_datafusion():
    ctx = SessionContext()
    ctx.register_table("tbl", dt)
    ctx.sql(sql).collect()

def run_datafusion_single_partition():
    config = SessionConfig().with_target_partitions(1)
    ctx = SessionContext(config)
    ctx.register_table("tbl", dt)
    ctx.sql(sql).collect()
```

I ran the following query, which has 0.1% selectivity:

```
sql = """
SELECT *
FROM tbl
WHERE ("MlRepoId" = 1089)
  AND ("TracingProjectId" = '222fde49-1f7a-4752-8ec1-06bcdbf570c5')
  AND ("TraceId" = '8728990bd3d11fa91a688e9d9964bca1')
  AND ("SpanId" = '82c0a65e80000450')
"""
```

The results I got are really similar, so there doesn't seem to be a performance regression on the Python side:

```
Delta QueryBuilder: avg=0.002449s over 100 runs
DataFusion: avg=0.002464s over 100 runs
DataFusion (single partition): avg=0.001728s over 100 runs
```

Because I didn't see any performance issues, I haven't dug into the Rust side yet.

@timsaucer @ion-elgreco: do you think there is something else we need to check here? Otherwise I think we can close this issue.
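The averages above are reported over 100 runs, but the comment does not show how they were timed. The following is a minimal sketch of one way to collect numbers in that format, assuming wall-clock timing with `time.perf_counter`. The helper names `bench` and `report` and the warm-up count are hypothetical and are not taken from the original script.

```python
import time
from statistics import mean
from typing import Callable


def bench(fn: Callable[[], object], runs: int = 100, warmup: int = 5) -> float:
    """Return the average wall-clock seconds per call of `fn` over `runs` calls."""
    # Warm-up calls are an assumption: they keep one-off setup costs
    # (imports, cold caches) out of the measured runs.
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return mean(timings)


def report(label: str, fn: Callable[[], object], runs: int = 100) -> str:
    """Print and return one summary line in the same shape as the results above."""
    line = f"{label}: avg={bench(fn, runs):.6f}s over {runs} runs"
    print(line)
    return line
```

With the three functions from the comment in scope, this could be called as `report("DataFusion", run_datafusion)`, and likewise for the other two variants.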
