wegamekinglc opened a new issue, #1186:
URL: https://github.com/apache/datafusion-python/issues/1186

   ### Describe the bug
   
   Hi team, I have encountered a performance issue: the same query on a big table runs much slower with DataFusion than with DuckDB.
   
   I have simplified my case so that the code below replicates the issue.
   
   
   ### To Reproduce
   
   ```python
   import timeit
   import numpy as np
   import pyarrow as pa
   import datafusion
   from datafusion import SessionContext
   import duckdb
   
   print(duckdb.__version__)
   print(datafusion.__version__)
   
   # prepare data
   
   batches = 100000
   
   names = list("abcdefghijklmnopqrstuvwxyz")
   names = [n + m for n in names for m in names]
   
   names_array = pa.concat_arrays([pa.array(names)] * batches)
   values_array = pa.concat_arrays(
       [pa.array(np.random.randint(1, 100, len(names))) for _ in range(batches)]
   )
   
   pa_table = pa.Table.from_arrays([names_array, values_array], names=["name", "value"])
   
   # prepare query
   sql = "select name, sum(value) as value FROM pa_table group by name;"
   n_round = 10
   
   # duckdb
   elapsed = timeit.timeit('duckdb.sql(sql).to_arrow_table()', number=n_round, globals=globals())
   duckdb_per_round = elapsed / n_round
   
   # datafusion
   ctx = SessionContext()
   _ = ctx.from_arrow(pa_table, "pa_table")
   elapsed = timeit.timeit('ctx.sql(sql).to_arrow_table()', number=n_round, globals=globals())
   datafusion_per_round = elapsed / n_round
   
   # result
   print(f"{'duckdb':<12}: {duckdb_per_round * 1000:.2f}ms")
   print(f"{'datafusion':<12}: {datafusion_per_round * 1000:.2f}ms")
   ```
   
   The output looks like this:
   ```bash
   1.3.1
   47.0.0
   duckdb      : 152.15ms
   datafusion  : 1002.04ms
   ```
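
To put the numbers above in context, here is a minimal sketch that derives the table size and the relative slowdown. It uses only figures from the script and the quoted output; the variable names `total_rows` and `slowdown` are introduced here for illustration and are not part of the original report.

```python
# Figures taken from the reproduction script and the output quoted above.
n_names = 26 * 26      # two-letter names "aa" .. "zz"
batches = 100_000      # same value as `batches` in the script
total_rows = n_names * batches

duckdb_ms = 152.15     # per-round time reported for DuckDB
datafusion_ms = 1002.04  # per-round time reported for DataFusion
slowdown = datafusion_ms / duckdb_ms

print(f"rows in pa_table   : {total_rows:,}")
print(f"datafusion / duckdb: {slowdown:.1f}x")
```

On these figures the table holds 67.6 million rows and the DataFusion run takes roughly 6.6 times as long per round as the DuckDB run.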
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

