alamb commented on PR #6801:
URL: 
https://github.com/apache/arrow-datafusion/pull/6801#issuecomment-1622256185

   Hi @2010YOUY01 -- I am having trouble reproducing the benchmark results you
   reported. In my runs, this PR branch is slower than master.
   
   # Results
   Master:
   ```
   ❯ select count(*) from lineitem where l_quantity < 10;
   1 row in set. Query took 1.424 seconds.
   1 row in set. Query took 1.374 seconds.
   1 row in set. Query took 1.409 seconds.
   ```
   
   This PR branch:
   ```
   ❯ select count(*) from lineitem where l_quantity < 10;
   1 row in set. Query took 1.918 seconds.
   1 row in set. Query took 1.672 seconds.
   1 row in set. Query took 2.008 seconds.
   ```
   (I also merged master into your branch, and the performance was the same.)
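
As a rough way to quantify the gap, the three timings from each run above can be averaged and compared. The snippet below is only a sketch for summarizing the numbers already shown; it was not part of the measurements, the variable names are illustrative, and three samples per branch is too few to say much about noise.

```python
from statistics import mean

# Wall-clock timings in seconds, copied from the datafusion-cli output above.
master = [1.424, 1.374, 1.409]
pr_branch = [1.918, 1.672, 2.008]

master_mean = mean(master)
pr_mean = mean(pr_branch)

# Ratio > 1 means the PR branch took longer than master on average.
slowdown = pr_mean / master_mean

print(f"master mean:    {master_mean:.3f} s")
print(f"PR branch mean: {pr_mean:.3f} s")
print(f"ratio:          {slowdown:.2f}x")
```

On these six samples the PR branch averages roughly 1.87 s against roughly 1.40 s for master, so about 1.3x slower.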
   
   # Methodology
   
   I tested this branch using the TPC-H SF1 `lineitem` CSV file (about 6M rows,
   725 MB), created by running `./bench.sh data tpch` in
   `arrow-datafusion/benchmarks`:
   
   ```shell
   (arrow_dev) alamb@MacBook-Pro-8:~$ du -h /Users/alamb/Software/arrow-datafusion/benchmarks/data/lineitem.tbl
   725M /Users/alamb/Software/arrow-datafusion/benchmarks/data/lineitem.tbl
   
   (arrow_dev) alamb@MacBook-Pro-8:~$ wc -l /Users/alamb/Software/arrow-datafusion/benchmarks/data/lineitem.tbl
    6001215 /Users/alamb/Software/arrow-datafusion/benchmarks/data/lineitem.tbl
   ```
   
   I then ran the query in `datafusion-cli` (built via `cargo build --release`):
   
   ```sql
   CREATE EXTERNAL TABLE lineitem (
           l_orderkey BIGINT,
           l_partkey BIGINT,
           l_suppkey BIGINT,
           l_linenumber INTEGER,
           l_quantity DECIMAL(15, 2),
           l_extendedprice DECIMAL(15, 2),
           l_discount DECIMAL(15, 2),
           l_tax DECIMAL(15, 2),
           l_returnflag VARCHAR,
           l_linestatus VARCHAR,
           l_shipdate DATE,
           l_commitdate DATE,
           l_receiptdate DATE,
           l_shipinstruct VARCHAR,
           l_shipmode VARCHAR,
           l_comment VARCHAR,
           l_rev VARCHAR,
   ) STORED AS CSV DELIMITER '|' LOCATION 
'/Users/alamb/Software/arrow-datafusion/benchmarks/data/lineitem.tbl';
   
   --- Run a query that scans the entire CSV
   select count(*) from lineitem where l_quantity < 10;
   
   +-----------------+
   | COUNT(UInt8(1)) |
   +-----------------+
   | 1079240         |
   +-----------------+
   ```
   
   
   
   
   

