zhuqi-lucas commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3082400648
> > > > [@UBarney](https://github.com/UBarney) - here are the 1e7 join results on my M3 Macbook with 16GB of RAM: > > > > > > > > > [@MrPowers](https://github.com/MrPowers) I am using the **1e8** dataset. > > > ``` > > > target/release/dfbench h2o --iterations 5 --join-paths /home/lv/code/datafusion/benchmarks/data/h2o/J1_1e8_NA_0.csv,/home/lv/code/datafusion/benchmarks/data/h2o/J1_1e8_1e2_0.csv,/home/lv/code/datafusion/benchmarks/data/h2o/J1_1e8_1e5_0.csv,/home/lv/code/datafusion/benchmarks/data/h2o/J1_1e8_1e8_NA.csv --queries-path /home/lv/code/datafusion/benchmarks/queries/h2o/join.sql -o /home/lv/code/datafusion/benchmarks/results/join/h2o_join_exch_join_order.json > > > > > > > > > devhomeinsp ~/c/datafusion > wc benchmarks/data/h2o/J1_1e8_* > > > 101 101 2841 benchmarks/data/h2o/J1_1e8_1e2_0.csv > > > 100001 100001 4420917 benchmarks/data/h2o/J1_1e8_1e5_0.csv > > > 100000001 100000001 6619091495 benchmarks/data/h2o/J1_1e8_1e8_NA.csv > > > 101 101 6661 benchmarks/data/h2o/J1_1e8_NA_0.csv > > > 100100204 100100204 6623521914 total > > > ``` > > > > > > I got similar result for my Mac M4 48GB, i believe we need to limit 16GB to mock the 16GB result. > > [@zhuqi-lucas](https://github.com/zhuqi-lucas) Even after manually limiting my VM's memory to 16G, the execution time didn't change much (the longer execution time for q1 on the first run might be because the CSV file wasn't in the cache ?). > > Output > ``` > > √ devhomeinsp ~/c/datafusion > /usr/bin/time -v target/release/dfbench h2o --iterations 5 --join-paths /home/lv/code/datafusion/benchmarks/data/h2o/J1_1e8_NA_0.csv,/home/lv/code/datafusion/benchmarks/data/h2o/J1_1e8_1e2_0.csv,/home/lv/code/datafusion/benchmarks/data/h2o/J1_1e8_1e5_0.csv,/home/lv/code/datafusion/benchmarks/data/h2o/J1_1e8_1e8_NA.csv --queries-path /home/lv/code/datafusion/benchmarks/queries/h2o/join.sql -o /dev/null Running benchmarks with the following options: RunOpt { query: None, common: CommonOpt { iterations: 5, partitions: None, batch_size: None, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, queries_path: "/home/lv/code/datafusion/benchmarks/queries/h2o/join.sql", path: "benchmarks/data/h2o/G1_1e7_1e7_100_0.csv", join_paths: "/home/lv/code/datafusion/benchmarks/data/h2o/J1_1e8_NA_0.csv,/home/lv/code/datafusion/benchmarks/data/h2o/J1_1e8_1e2_0.csv,/home/ lv/code/datafusion/benchmarks/data/h2o/J1_1e8_1e5_0.csv,/home/lv/code/datafusion/benchmarks/data/h2o/J1_1e8_1e8_NA.csv", output_path: Some("/dev/null") } > Q1: SELECT x.id1, x.id2, x.id3, x.id4 as xid4, small.id4 as smallid4, x.id5, x.id6, x.v1, small.v2 FROM x INNER JOIN small ON x.id1 = small.id1; > Query 1 iteration 1 took 34.4 ms and returned 90 rows > Query 1 iteration 2 took 1.9 ms and returned 90 rows > Query 1 iteration 3 took 1.8 ms and returned 90 rows > Query 1 iteration 4 took 1.6 ms and returned 90 rows > Query 1 iteration 5 took 1.3 ms and returned 90 rows > Query 1 avg time: 8.23 ms > Q2: SELECT x.id1 as xid1, medium.id1 as mediumid1, x.id2, x.id3, x.id4 as xid4, medium.id4 as mediumid4, x.id5 as xid5, medium.id5 as mediumid5, x.id6, x.v1, medium.v2 FROM x INNER JOIN medium ON x.id2 = medium.id2; > Query 2 iteration 1 took 15.1 ms and returned 89 rows > Query 2 iteration 2 took 12.8 ms and returned 89 rows > Query 2 iteration 3 took 12.7 ms and returned 89 rows > Query 2 iteration 4 took 12.4 ms and returned 89 rows > Query 2 iteration 5 took 12.5 ms and returned 89 rows > Query 2 avg time: 13.08 ms > Q3: SELECT x.id1 as xid1, medium.id1 as mediumid1, x.id2, x.id3, x.id4 as xid4, medium.id4 as mediumid4, x.id5 as xid5, medium.id5 as mediumid5, x.id6, x.v1, medium.v2 FROM x LEFT JOIN medium ON x.id2 = medium.id2; > Query 3 iteration 1 took 13.5 ms and returned 100 rows > Query 3 iteration 2 took 13.2 ms and returned 100 rows > Query 3 iteration 3 took 12.1 ms and returned 100 rows > Query 3 iteration 4 took 14.2 ms and returned 100 rows > Query 3 iteration 5 took 13.1 ms and returned 100 rows > Query 3 avg time: 13.19 ms > Q4: SELECT x.id1 as xid1, medium.id1 as mediumid1, x.id2, x.id3, x.id4 as xid4, medium.id4 as mediumid4, x.id5 as xid5, medium.id5 as mediumid5, x.id6, x.v1, medium.v2 FROM x JOIN medium ON x.id5 = medium.id5; > Query 4 iteration 1 took 12.1 ms and returned 89 rows > Query 4 iteration 2 took 11.6 ms and returned 89 rows > Query 4 iteration 3 took 12.0 ms and returned 89 rows > Query 4 iteration 4 took 11.5 ms and returned 89 rows > Query 4 iteration 5 took 11.8 ms and returned 89 rows > Query 4 avg time: 11.80 ms > Q5: SELECT x.id1 as xid1, large.id1 as largeid1, x.id2 as xid2, large.id2 as largeid2, x.id3, x.id4 as xid4, large.id4 as largeid4, x.id5 as xid5, large.id5 as largeid5, x.id6 as xid6, large.id6 as largeid6, x.v1, large.v2 FROM x JOIN large ON x.id3 = large.id3; > Query 5 iteration 1 took 3181.6 ms and returned 92 rows > Query 5 iteration 2 took 2234.5 ms and returned 92 rows > Query 5 iteration 3 took 2320.8 ms and returned 92 rows > Query 5 iteration 4 took 2246.6 ms and returned 92 rows > Query 5 iteration 5 took 2243.5 ms and returned 92 rows > Query 5 avg time: 2445.39 ms > Command being timed: "target/release/dfbench h2o --iterations 5 --join-paths /home/lv/code/datafusion/benchmarks/data/h2o/J1_1e8_NA_0.csv,/home/lv/code/datafusion/benchmarks/data/h2o/J1_1e8_1e2_0.csv,/home/lv/code/datafusion/benchmarks/data/h2o/J1_1e8_1e5_0.csv,/home/lv/code/datafusion/benchmarks/data/h2o/J1_1e8_1e8_NA.csv --queries-path /home/lv/code/datafusion/benchmarks/queries/h2o/join.sql -o /dev/null" > User time (seconds): 247.85 > System time (seconds): 21.28 > Percent of CPU this job got: 2151% > Elapsed (wall clock) time (h:mm:ss or m:ss): 0:12.51 > Average shared text size (kbytes): 0 > Average unshared data size (kbytes): 0 > Average stack size (kbytes): 0 > Average total size (kbytes): 0 > Maximum resident set size (kbytes): 322980 > Average resident set size (kbytes): 0 > Major (requiring I/O) page faults: 260 > Minor (reclaiming a frame) page faults: 33992 > Voluntary context switches: 58558 > Involuntary context switches: 7478 > Swaps: 0 > File system inputs: 13006232 > File system outputs: 0 > Socket messages sent: 0 > Socket messages received: 0 > Signals delivered: 0 > Page size (bytes): 4096 > Exit status: 0 > > √ devhomeinsp ~/c/datafusion > free -h > total used free shared buff/cache available > Mem: 11Gi 1.1Gi 3.2Gi 13Mi 7.1Gi 10Gi > Swap: 0B 0B 0B > ``` Interesting @UBarney , i also can't reproduce the above data for my local run (Mac M4 48G), so we may need to wait the parquet format result, but in theory parquet should be faster than csv. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org