wegamekinglc opened a new issue, #1186: URL: https://github.com/apache/datafusion-python/issues/1186
### Describe the bug

Hi team,

I have encountered a performance issue when running the same query on a large table with DataFusion compared with DuckDB: in the run below, DataFusion takes roughly 6.6x as long per query. The code below is a simplified reproduction of my case.

### To Reproduce

```python
import timeit

import numpy as np
import pyarrow as pa
import datafusion
from datafusion import SessionContext
import duckdb

print(duckdb.__version__)
print(datafusion.__version__)

# prepare data: 676 two-letter names repeated 100000 times (67.6M rows)
batches = 100000
names = list("abcdefghijklmnopqrstuvwxyz")
names = [n + m for n in names for m in names]
names_array = pa.concat_arrays([pa.array(names)] * batches)
values_array = pa.concat_arrays([pa.array(np.random.randint(1, 100, len(names))) for _ in range(batches)])
pa_table = pa.Table.from_arrays([names_array, values_array], names=["name", "value"])

# prepare query
sql = "select name, sum(value) as value FROM pa_table group by name;"
n_round = 10

# duckdb (resolves pa_table from the Python globals)
elapsed = timeit.timeit('duckdb.sql(sql).to_arrow_table()', number=n_round, globals=globals())
duckdb_per_round = elapsed / n_round

# datafusion (the table is registered under the name "pa_table")
ctx = SessionContext()
_ = ctx.from_arrow(pa_table, "pa_table")
elapsed = timeit.timeit('ctx.sql(sql).to_arrow_table()', number=n_round, globals=globals())
datafusion_per_round = elapsed / n_round

# result: average time per round in milliseconds
print(f"{'duckdb':<12}: {duckdb_per_round * 1000:.2f}ms")
print(f"{'datafusion':<12}: {datafusion_per_round * 1000:.2f}ms")
```

The output looks like this:

```bash
1.3.1
47.0.0
duckdb      : 152.15ms
datafusion  : 1002.04ms
```

### Expected behavior

_No response_

### Additional context

_No response_

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org