Re: [I] Optimize the join operators [datafusion]

via GitHub Wed, 16 Jul 2025 21:02:23 -0700


zhuqi-lucas commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3082400648


   > > > > [@UBarney](https://github.com/UBarney) - here are the 1e7 join 
results on my M3 Macbook with 16GB of RAM:
   > > > 
   > > > 
   > > > [@MrPowers](https://github.com/MrPowers) I am using the **1e8** 
dataset.
   > > > ```
   > > > target/release/dfbench h2o --iterations 5 --join-paths 
/home/lv/code/datafusion/benchmarks/data/h2o/J1_1e8_NA_0.csv,/home/lv/code/datafusion/benchmarks/data/h2o/J1_1e8_1e2_0.csv,/home/lv/code/datafusion/benchmarks/data/h2o/J1_1e8_1e5_0.csv,/home/lv/code/datafusion/benchmarks/data/h2o/J1_1e8_1e8_NA.csv
 --queries-path /home/lv/code/datafusion/benchmarks/queries/h2o/join.sql -o 
/home/lv/code/datafusion/benchmarks/results/join/h2o_join_exch_join_order.json
   > > > 
   > > > 
   > > >  devhomeinsp  ~/c/datafusion > wc benchmarks/data/h2o/J1_1e8_*         
   
   > > >        101        101       2841 benchmarks/data/h2o/J1_1e8_1e2_0.csv
   > > >     100001     100001    4420917 benchmarks/data/h2o/J1_1e8_1e5_0.csv
   > > >  100000001  100000001 6619091495 benchmarks/data/h2o/J1_1e8_1e8_NA.csv
   > > >        101        101       6661 benchmarks/data/h2o/J1_1e8_NA_0.csv
   > > >  100100204  100100204 6623521914 total
   > > > ```
   > > 
   > > 
   > > I got similar result for my Mac M4 48GB, i believe we need to limit 16GB 
to mock the 16GB result.
   > 
   > [@zhuqi-lucas](https://github.com/zhuqi-lucas) Even after manually 
limiting my VM's memory to 16G, the execution time didn't change much (the 
longer execution time for q1 on the first run might be because the CSV file 
wasn't in the cache ?).
   > 
   > Output
   > ```
   > 
   > √ devhomeinsp  ~/c/datafusion > /usr/bin/time -v target/release/dfbench 
h2o --iterations 5 --join-paths 
/home/lv/code/datafusion/benchmarks/data/h2o/J1_1e8_NA_0.csv,/home/lv/code/datafusion/benchmarks/data/h2o/J1_1e8_1e2_0.csv,/home/lv/code/datafusion/benchmarks/data/h2o/J1_1e8_1e5_0.csv,/home/lv/code/datafusion/benchmarks/data/h2o/J1_1e8_1e8_NA.csv
 --queries-path /home/lv/code/datafusion/benchmarks/queries/h2o/join.sql -o 
/dev/null                                                      Running 
benchmarks with the following options: RunOpt { query: None, common: CommonOpt 
{ iterations: 5, partitions: None, batch_size: None, mem_pool_type: "fair", 
memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, 
queries_path: "/home/lv/code/datafusion/benchmarks/queries/h2o/join.sql", path: 
"benchmarks/data/h2o/G1_1e7_1e7_100_0.csv", join_paths: 
"/home/lv/code/datafusion/benchmarks/data/h2o/J1_1e8_NA_0.csv,/home/lv/code/datafusion/benchmarks/data/h2o/J1_1e8_1e2_0.csv,/home/
 
lv/code/datafusion/benchmarks/data/h2o/J1_1e8_1e5_0.csv,/home/lv/code/datafusion/benchmarks/data/h2o/J1_1e8_1e8_NA.csv",
 output_path: Some("/dev/null") }
   > Q1: SELECT x.id1, x.id2, x.id3, x.id4 as xid4, small.id4 as smallid4, 
x.id5, x.id6, x.v1, small.v2 FROM x INNER JOIN small ON x.id1 = small.id1;
   > Query 1 iteration 1 took 34.4 ms and returned 90 rows
   > Query 1 iteration 2 took 1.9 ms and returned 90 rows
   > Query 1 iteration 3 took 1.8 ms and returned 90 rows
   > Query 1 iteration 4 took 1.6 ms and returned 90 rows
   > Query 1 iteration 5 took 1.3 ms and returned 90 rows
   > Query 1 avg time: 8.23 ms
   > Q2: SELECT x.id1 as xid1, medium.id1 as mediumid1, x.id2, x.id3, x.id4 as 
xid4, medium.id4 as mediumid4, x.id5 as xid5, medium.id5 as mediumid5, x.id6, 
x.v1, medium.v2 FROM x INNER JOIN medium ON x.id2 = medium.id2;
   > Query 2 iteration 1 took 15.1 ms and returned 89 rows
   > Query 2 iteration 2 took 12.8 ms and returned 89 rows
   > Query 2 iteration 3 took 12.7 ms and returned 89 rows
   > Query 2 iteration 4 took 12.4 ms and returned 89 rows
   > Query 2 iteration 5 took 12.5 ms and returned 89 rows
   > Query 2 avg time: 13.08 ms
   > Q3: SELECT x.id1 as xid1, medium.id1 as mediumid1, x.id2, x.id3, x.id4 as 
xid4, medium.id4 as mediumid4, x.id5 as xid5, medium.id5 as mediumid5, x.id6, 
x.v1, medium.v2 FROM x LEFT JOIN medium ON x.id2 = medium.id2;
   > Query 3 iteration 1 took 13.5 ms and returned 100 rows
   > Query 3 iteration 2 took 13.2 ms and returned 100 rows
   > Query 3 iteration 3 took 12.1 ms and returned 100 rows
   > Query 3 iteration 4 took 14.2 ms and returned 100 rows
   > Query 3 iteration 5 took 13.1 ms and returned 100 rows
   > Query 3 avg time: 13.19 ms
   > Q4: SELECT x.id1 as xid1, medium.id1 as mediumid1, x.id2, x.id3, x.id4 as 
xid4, medium.id4 as mediumid4, x.id5 as xid5, medium.id5 as mediumid5, x.id6, 
x.v1, medium.v2 FROM x JOIN medium ON x.id5 = medium.id5;
   > Query 4 iteration 1 took 12.1 ms and returned 89 rows
   > Query 4 iteration 2 took 11.6 ms and returned 89 rows
   > Query 4 iteration 3 took 12.0 ms and returned 89 rows
   > Query 4 iteration 4 took 11.5 ms and returned 89 rows
   > Query 4 iteration 5 took 11.8 ms and returned 89 rows
   > Query 4 avg time: 11.80 ms
   > Q5: SELECT x.id1 as xid1, large.id1 as largeid1, x.id2 as xid2, large.id2 
as largeid2, x.id3, x.id4 as xid4, large.id4 as largeid4, x.id5 as xid5, 
large.id5 as largeid5, x.id6 as xid6, large.id6 as largeid6, x.v1, large.v2 
FROM x JOIN large ON x.id3 = large.id3;
   > Query 5 iteration 1 took 3181.6 ms and returned 92 rows
   > Query 5 iteration 2 took 2234.5 ms and returned 92 rows
   > Query 5 iteration 3 took 2320.8 ms and returned 92 rows
   > Query 5 iteration 4 took 2246.6 ms and returned 92 rows
   > Query 5 iteration 5 took 2243.5 ms and returned 92 rows
   > Query 5 avg time: 2445.39 ms
   >         Command being timed: "target/release/dfbench h2o --iterations 5 
--join-paths 
/home/lv/code/datafusion/benchmarks/data/h2o/J1_1e8_NA_0.csv,/home/lv/code/datafusion/benchmarks/data/h2o/J1_1e8_1e2_0.csv,/home/lv/code/datafusion/benchmarks/data/h2o/J1_1e8_1e5_0.csv,/home/lv/code/datafusion/benchmarks/data/h2o/J1_1e8_1e8_NA.csv
 --queries-path /home/lv/code/datafusion/benchmarks/queries/h2o/join.sql -o 
/dev/null"
   >         User time (seconds): 247.85
   >         System time (seconds): 21.28
   >         Percent of CPU this job got: 2151%
   >         Elapsed (wall clock) time (h:mm:ss or m:ss): 0:12.51
   >         Average shared text size (kbytes): 0
   >         Average unshared data size (kbytes): 0
   >         Average stack size (kbytes): 0
   >         Average total size (kbytes): 0
   >         Maximum resident set size (kbytes): 322980
   >         Average resident set size (kbytes): 0
   >         Major (requiring I/O) page faults: 260
   >         Minor (reclaiming a frame) page faults: 33992
   >         Voluntary context switches: 58558
   >         Involuntary context switches: 7478
   >         Swaps: 0
   >         File system inputs: 13006232
   >         File system outputs: 0
   >         Socket messages sent: 0
   >         Socket messages received: 0
   >         Signals delivered: 0
   >         Page size (bytes): 4096
   >         Exit status: 0
   > 
   > √ devhomeinsp  ~/c/datafusion > free -h
   >                total        used        free      shared  buff/cache   
available
   > Mem:            11Gi       1.1Gi       3.2Gi        13Mi       7.1Gi       
 10Gi
   > Swap:             0B          0B          0B
   > ```
   
   Interesting @UBarney , i also can't reproduce the above data for my local 
run (Mac M4 48G), so we may need to wait the parquet format result, but in 
theory parquet should be faster than csv. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] Optimize the join operators [datafusion]

Reply via email to