Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]
kevinjqliu commented on PR #67: URL: https://github.com/apache/datafusion-site/pull/67#issuecomment-2795473094 https://datafusion.apache.org/blog/2025/04/10/fastest-tpch-generator/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]
scsmithr commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2031364896
## content/blog/2025-04-10-fastest-tpch-generator.md: ##
@@ -0,0 +1,613 @@
+---
+layout: post
+title: tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust
+date: 2025-04-10
+author: Andrew Lamb, Achraf B, and Sean Smith
+categories: [performance]
+---
+
+
+
+
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+
+
+3 members of the [Apache DataFusion] community used Rust and open source
+development to build [tpchgen-rs], a fully open TPC-H data generator over 20x
+faster than any other implementation we know of.
+
+It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 GB/s
+😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen`
+which takes 30 minutes1 (0.05GB/sec). On the same machine, it takes less than
+2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format, which takes 44 minutes using [DuckDB].
+It is finally convenient and efficient to run TPC-H queries locally when testing
+analytical engines such as DataFusion.
+
+
+
+**Figure 1**: Time to create TPC-H dataset for Scale Factor (see below) 1, 10,
+100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core GCP
+VM with 88GB of memory. For Scale Factor(SF) 100 `tpchgen` takes 1 minute and 14 seconds and
+[DuckDB] takes 17 minutes and 48 seconds. For SF=1000, `tpchgen` takes 10
+minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure
+DuckDB’s time as it [requires 647 GB of RAM], more than the 88 GB that was
+available on our test machine. The testing methodology is in the
+[documentation].
+
+[DuckDB]: https://duckdb.org
+[requires 647 GB of RAM]: https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator
+[documentation]: https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md
+
+This blog explains what TPC-H is, how we ported the vintage C data generator to
+Rust (yes, [RWIR]) and optimized its performance over the course of a few weeks
+of part-time work. We began this project so we can easily generate TPC-H data in
+[Apache DataFusion] and [GlareDB].
+
+[RWIR]: https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/
+[Apache DataFusion]: https://datafusion.apache.org/
+[GlareDB]: https://glaredb.com/
+
+# Try it for yourself
+
+The tool is entirely open source under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0). Visit the [tpchgen-rs repository](https://github.com/clflushopt/tpchgen-rs) or try it for yourself by run the following commands after [installing Rust](https://www.rust-lang.org/tools/install):
+
+```shell
+$ cargo install tpchgen-cli
+
+# create SF=1 in classic TBL format
+$ tpchgen-cli -s 1
+
+# create SF=10 in Parquet
+$ tpchgen-cli -s 10 --format=parquet
+```
+
+# What is TPC-H / dbgen?
+
+The popular [TPC-H] benchmark (often referred to as TPCH) helps evaluate the
+performance of database systems on [OLAP] queries*, *the kind used to build BI
+dashboards.
+
+TPC-H has become a de facto standard for analytic systems. While there are [well
+known] limitations as the data and queries do not well represent many real world
+use cases, the majority of analytic database papers and industrial systems still
+use TPC-H query performance benchmarks as a baseline. You will inevitably find
+multiple results for “`TPCH Performance `” in any
+search engine.
+
+The benchmark was created at a time when access to high performance analytical
+systems was not widespread, so the [Transaction Processing Performance Council]
+defined a process of formal result verification. More recently, given the broad
+availability of free and open source database systems, it is common for users to
+run and verify TPC-H performance themselves.
+
+TPC-H simulates a business environment with eight tables: `REGION`, `NATION`,
+`SUPPLIER`, `CUSTOMER`, `PART`, `PARTSUPP`, `ORDERS`, and `LINEITEM`. These
+tables are linked by foreign keys in a normalized schema representing a supply
+chain with parts, suppliers, customers and orders. The benchmark itself is 22
+SQL queries containing joins, aggregations, and sorting operations.
+
+The queries run against data created with [dbgen], a program
+written in a pre [C-99] dialect, which generates data in a format called *TBL*
+(example in Figure 2). `dbgen` creates data for each of the 8 tables for a
+certain *Scale Factor*, commonly abbreviated as SF. Example Scale Factors and
+corresponding dataset sizes are shown in Table 1. There is no theoretical upper
+bound on the Scale Factor.
+
+[TPC-H]: https://www.tpc.org/tpch/
+[OLAP]: https://en.wikipedia.org/wiki/Online_analytical_processing
+[well known]: https://www.vldb.org/pvldb/vol9/p204-leis.pdf
+[Transaction Processing Performance Council]: https
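As a quick sanity check on the throughput figures quoted in the draft above (72.23 seconds at 1.4 GB/s for tpchgen-rs, 30 minutes at 0.05 GB/s for `dbgen`), the sketch below recomputes them in shell. It assumes the SF=100 TBL dataset is roughly 100 GB, which the quoted rates imply but this excerpt does not state. The `rate` and `speedup` helper names are made up for illustration and are not part of tpchgen-rs.

```shell
#!/bin/sh
# Recompute the throughput and speedup figures quoted in the draft.
# ASSUMPTION: SF=100 in TBL format is about 100 GB (not stated in the excerpt).
DATASET_GB=100
TPCHGEN_SECS=72.23        # tpchgen-rs time quoted in the draft
DBGEN_SECS=$((30 * 60))   # "30 minutes" for dbgen, quoted in the draft

# rate GB SECONDS: print throughput in GB/s with three decimals (hypothetical helper)
rate() {
    awk -v gb="$1" -v secs="$2" 'BEGIN { printf "%.3f", gb / secs }'
}

# speedup SLOW_SECS FAST_SECS: print the ratio with one decimal (hypothetical helper)
speedup() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f", slow / fast }'
}

echo "tpchgen-rs: $(rate "$DATASET_GB" "$TPCHGEN_SECS") GB/s"
echo "dbgen:      $(rate "$DATASET_GB" "$DBGEN_SECS") GB/s"
echo "speedup:    $(speedup "$DBGEN_SECS" "$TPCHGEN_SECS")x"
```

Under that 100 GB assumption the numbers come out at about 1.38 GB/s and 0.056 GB/s, a ratio of roughly 25x, which is consistent with the draft's rounded "1.4 GB/s", "0.05GB/sec" and "over 20x".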
Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]
Adez017 commented on PR #67: URL: https://github.com/apache/datafusion-site/pull/67#issuecomment-2786236203 It looks great! 👍 @alamb
Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]
alamb merged PR #67: URL: https://github.com/apache/datafusion-site/pull/67
Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]
alamb commented on PR #67: URL: https://github.com/apache/datafusion-site/pull/67#issuecomment-2794290825 Thanks everyone! Onward!
Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]
alamb commented on PR #67: URL: https://github.com/apache/datafusion-site/pull/67#issuecomment-2794289480 BTW I made a demo video here: https://www.youtube.com/watch?v=UYIC57hlL14
Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]
alamb commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2036245247
## content/blog/2025-04-10-fastest-tpch-generator.md: ##
@@ -0,0 +1,613 @@
Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]
alamb commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2036251572
## content/blog/2025-04-10-fastest-tpch-generator.md: ##
@@ -0,0 +1,613 @@
+# What is TPC-H / dbgen?
+
+The popular [TPC-H] benchmark (often referred to as TPCH) helps evaluate the
+performance of database systems on [OLAP] queries*, *the kind used to build BI

Review Comment: Me neither -- removed in 623404b
Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]
Omega359 commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2036041501
## content/blog/2025-04-10-fastest-tpch-generator.md: ##
@@ -0,0 +1,613 @@
+# What is TPC-H / dbgen?
+
+The popular [TPC-H] benchmark (often referred to as TPCH) helps evaluate the
+performance of database systems on [OLAP] queries*, *the kind used to build BI

Review Comment: Not sure why the asterisks are here.
Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]
scsmithr commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2031336844
## content/blog/2025-04-10-fastest-tpch-generator.md: ##
@@ -0,0 +1,613 @@
+This blog explains what TPC-H is, how we ported the vintage C data generator to
+Rust (yes, [RWIR]) and optimized its performance over the course of a few weeks
+of part-time work. We began this project so we can easily generate TPC-H data in

Review Comment:
```suggestion
of part-time work. We began this project so we could easily generate TPC-H data in
```

## content/blog/2025-04-10-fastest-tpch-generator.md: ##
@@ -0,0 +1,613 @@
+th, td {
+  padding: 3px;
+}
+
+

Review Comment: Possibly add a bolded tldr to draw people in with perf.
```suggestion
**TLDR: TPC-H SF=100 in 1min using tpchgen-rs vs 30min+ with dbgen**
```
Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]
scsmithr commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2031362029
## content/blog/2025-04-10-fastest-tpch-generator.md: ##
@@ -0,0 +1,613 @@
Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]
alamb commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2030224671 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,613 @@ +--- +layout: post +title: tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust +date: 2025-04-10 +author: Andrew Lamb, Achraf B, and Sean Smith +categories: [performance] +--- + + + + +/* Table borders */ +table, th, td { + border: 1px solid black; + border-collapse: collapse; +} +th, td { + padding: 3px; +} + + +3 members of the [Apache DataFusion] community used Rust and open source +development to build [tpchgen-rs], a fully open TPC-H data generator over 2 0x Review Comment: 🤦
Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]
alamb commented on PR #67: URL: https://github.com/apache/datafusion-site/pull/67#issuecomment-2781547765 Thanks @kevinjqliu
Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]
kevinjqliu commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2030181076 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,613 @@ +--- +layout: post +title: tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust +date: 2025-04-10 +author: Andrew Lamb, Achraf B, and Sean Smith +categories: [performance] +--- + + + + +/* Table borders */ +table, th, td { + border: 1px solid black; + border-collapse: collapse; +} +th, td { + padding: 3px; +} + + +3 members of the [Apache DataFusion] community used Rust and open source +development to build [tpchgen-rs], a fully open TPC-H data generator over 2 0x Review Comment: ```suggestion development to build [tpchgen-rs], a fully open TPC-H data generator over 20x ``` ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,613 @@ +--- +layout: post +title: tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust +date: 2025-04-10 +author: Andrew Lamb, Achraf B, and Sean Smith +categories: [performance] +--- + + + + +/* Table borders */ +table, th, td { + border: 1px solid black; + border-collapse: collapse; +} +th, td { + padding: 3px; +} + + +3 members of the [Apache DataFusion] community used Rust and open source +development to build [tpchgen-rs], a fully open TPC-H data generator over 2 0x +faster than any other implementation we know of. + +It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 GB/s +😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen` +which takes 30 minutes1 (0.05GB/sec). On the same machine, it takes less than +2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format, which takes 44 minutes using [DuckDB]. +It is finally convenient and efficient to run TPC-H queries locally when testing +analytical engines such as DataFusion. 
+ + + +**Figure 1**: Time to create TPC-H dataset for Scale Factor (see below) 1, 10, +100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core GCP +VM with 88GB of memory. For Scale Factor(SF) 100 `tpchgen` takes 1 minute and 14 seconds and +[DuckDB] takes 17 minutes and 48 seconds. For SF=1000, `tpchgen` takes 10 +minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure +DuckDB’s time as it [requires 647 GB of RAM], more than the 88 GB that was +available on our test machine. The testing methodology is in the +[documentation]. + +[DuckDB]: https://duckdb.org +[requires 647 GB of RAM]: https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator +[documentation]: https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md + +This blog explains what TPC-H is, how we ported the vintage C data generator to +Rust (yes, [RWIR]) and optimized its performance over the course of a few weeks +of part-time work. We began this project so we can easily generate TPC-H data in +[Apache DataFusion] and [GlareDB]. + +[RWIR]: https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/ +[Apache DataFusion]: https://datafusion.apache.org/ +[GlareDB]: https://glaredb.com/ + +# Try it for yourself + +The tool is entirely open source under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0). Visit the [tpchgen-rs repository](https://github.com/clflushopt/tpchgen-rs) or try it for yourself by run the following commands after [installing Rust](https://www.rust-lang.org/tools/install): + +```shell +$ cargo install tpchgen-cli + +# create SF=1 in classic TBL format +$ tpchgen-cli -s 1 + +# create SF=10 in Parquet +$ tpchgen-cli -s 10 --format=parquet +``` + +# What is TPC-H / dbgen? + +The popular [TPC-H] benchmark (often referred to as TPCH) helps evaluate the +performance of database systems on [OLAP] queries*, *the kind used to build BI +dashboards. 
+ +TPC-H has become a de facto standard for analytic systems. While there are [well +known] limitations as the data and queries do not well represent many real world +use cases, the majority of analytic database papers and industrial systems still +use TPC-H query performance benchmarks as a baseline. You will inevitably find +multiple results for “`TPCH Performance `” in any +search engine. + +The benchmark was created at a time when access to high performance analytical +systems was not widespread, so the [Transaction Processing Performance Council] +defined a process of formal result verification. More recently, given the broad +availability of free and open source database systems, it is common for users to +run and verify TPC-H performance themselves. + +TPC-H simulates a business environment with eight tables: `REGION`, `NATION`, +`SUPPLIER`, `CUSTOMER`, `PART`, `PARTSUPP`, `ORDERS`, and `LINEITEM`. These +tables are linked by foreign keys in a normalized schema representing a supply +chain with parts, suppliers, customers and orders. The ben
Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]
alamb commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2030116177 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,614 @@ +--- +layout: post +title: tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust +date: 2025-04-10 +author: Andrew Lamb, Achraf B, and Sean Smith +categories: [performance] +--- + + + + +/* Table borders */ +table, th, td { + border: 1px solid black; + border-collapse: collapse; +} +th, td { + padding: 3px; +} + + +3 members of the [Apache DataFusion] community used Rust and open source +development to build [tpchgen-rs], a fully open TPC-H data generator over 10x +faster than any other implementation we know of. + +It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 GB/s +😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen` +which takes 30 minutes1 (0.05GB/sec). On the same machine, it takes less than Review Comment: That is a good point -- the 10x is a conservative estimate. In @clflushopt's measurements, 2m vs 44m is also 22x 🤔. The 10x is the smallest improvement in the GCP measurements (where the machine has more resources): https://docs.google.com/spreadsheets/d/14qTHR5zgqXq4BkhO1IUw2BPwBUIOqMXLZ2fUyOaPflI/edit?gid=0#gid=0 I'll update the text to say "over 20x faster". ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,614 @@ +--- +layout: post +title: tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust +date: 2025-04-10 +author: Andrew Lamb, Achraf B, and Sean Smith +categories: [performance] +--- + + + + +/* Table borders */ +table, th, td { + border: 1px solid black; + border-collapse: collapse; +} +th, td { + padding: 3px; +} + + +3 members of the [Apache DataFusion] community used Rust and open source +development to build [tpchgen-rs], a fully open TPC-H data generator over 10x +faster than any other implementation we know of. 
+ +It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 GB/s +😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen` +which takes 30 minutes1 (0.05GB/sec). On the same machine, it takes less than +2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format. Review Comment: Thank you -- added to the introduction ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,614 @@ +--- +layout: post +title: tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust +date: 2025-04-10 +author: Andrew Lamb, Achraf B, and Sean Smith +categories: [performance] +--- + + + + +/* Table borders */ +table, th, td { + border: 1px solid black; + border-collapse: collapse; +} +th, td { + padding: 3px; +} + + +3 members of the [Apache DataFusion] community used Rust and open source +development to build [tpchgen-rs], a fully open TPC-H data generator over 10x +faster than any other implementation we know of. + +It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 GB/s +😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen` +which takes 30 minutes1 (0.05GB/sec). On the same machine, it takes less than +2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format. +It is finally convenient and efficient to run TPC-H queries locally when testing +analytical engines such as DataFusion. + + + +**Figure 1**: Time to create TPC-H dataset for Scale Factor (see below) 1, 10, +100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core GCP +VM. For Scale Factor(SF) 100 `tpchgen` takes 1 minute and 14 seconds and +[DuckDB] takes 17 minutes and 48 seconds. For SF=1000, `tpchgen` takes 10 +minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure +DuckDB’s time as it [requires 647 GB of RAM], more than the 88 GB that was +available on our test machine. The testing methodology is in the +[documentation]. 
+ +[DuckDB]: https://duckdb.org +[requires 647 GB of RAM]: https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator +[documentation]: https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md + +This blog explains what TPC-H is, how we ported the vintage C data generator to +Rust (yes, [RWIR]) and optimized its performance over the course of a few weeks +of part-time work. We began this project so we can easily generate TPC-H data in +[Apache DataFusion] and [GlareDB]. + +[RWIR]: https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/ +[Apache DataFusion]: https://datafusion.apache.org/ +[GlareDB]: https://glaredb.com/ + +# Try it for yourself + +The tool is entirely open source under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0). Visit the [tpchgen-rs repository](https://github.com/clflushopt/tpchgen-rs) or try it for yourself by run the followi
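For readers following the 10x vs 20x discussion above, the figures can be re-derived from the timings quoted in the draft. The following is only a minimal sketch: it assumes the SF=100 TBL dataset is roughly 100 GB (an assumption, not stated in the quoted text), it treats "less than 2 minutes" as exactly 2 minutes, and the helper names are hypothetical.

```python
# Re-derive the throughput and speedup figures quoted in the draft.
# ASSUMPTION: SF=100 in TBL format is about 100 GB; the quoted text does not say so.
SF100_SIZE_GB = 100.0


def to_seconds(minutes: int, seconds: int = 0) -> int:
    """Convert a minutes/seconds timing into seconds."""
    return minutes * 60 + seconds


def throughput_gb_per_s(size_gb: float, elapsed_s: float) -> float:
    """Average generation rate in GB per second."""
    return size_gb / elapsed_s


def speedup(baseline_s: float, new_s: float) -> float:
    """How many times faster the new timing is than the baseline."""
    return baseline_s / new_s


# Laptop, TBL output: tpchgen 72.23 s vs classic dbgen 30 minutes.
tpchgen_rate = throughput_gb_per_s(SF100_SIZE_GB, 72.23)
dbgen_rate = throughput_gb_per_s(SF100_SIZE_GB, to_seconds(30))

# Laptop, Parquet output: about 2 minutes vs 44 minutes with DuckDB.
laptop_parquet_speedup = speedup(to_seconds(44), to_seconds(2))

# GCP VM, Parquet output at SF=100: 1m14s vs 17m48s with DuckDB.
gcp_parquet_speedup = speedup(to_seconds(17, 48), to_seconds(1, 14))

print(f"tpchgen: {tpchgen_rate:.2f} GB/s, dbgen: {dbgen_rate:.3f} GB/s")
print(f"laptop Parquet speedup: {laptop_parquet_speedup:.1f}x")
print(f"GCP Parquet speedup (SF=100): {gcp_parquet_speedup:.1f}x")
```

Under these assumptions the laptop Parquet ratio matches the 22x mentioned in the comment, while the GCP ratio at SF=100 is lower, which is consistent with treating the smaller GCP figures as the conservative bound.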
Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]
kevinjqliu commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029943760 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,614 @@ +--- +layout: post +title: tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust +date: 2025-04-10 +author: Andrew Lamb, Achraf B, and Sean Smith +categories: [performance] +--- + + + + +/* Table borders */ +table, th, td { + border: 1px solid black; + border-collapse: collapse; +} +th, td { + padding: 3px; +} + + +3 members of the [Apache DataFusion] community used Rust and open source +development to build [tpchgen-rs], a fully open TPC-H data generator over 10x +faster than any other implementation we know of. + +It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 GB/s +😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen` +which takes 30 minutes1 (0.05GB/sec). On the same machine, it takes less than +2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format. +It is finally convenient and efficient to run TPC-H queries locally when testing +analytical engines such as DataFusion. + + + +**Figure 1**: Time to create TPC-H dataset for Scale Factor (see below) 1, 10, +100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core GCP +VM. For Scale Factor(SF) 100 `tpchgen` takes 1 minute and 14 seconds and +[DuckDB] takes 17 minutes and 48 seconds. For SF=1000, `tpchgen` takes 10 +minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure +DuckDB’s time as it [requires 647 GB of RAM], more than the 88 GB that was +available on our test machine. The testing methodology is in the +[documentation]. 
+ +[DuckDB]: https://duckdb.org +[requires 647 GB of RAM]: https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator +[documentation]: https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md + +This blog explains what TPC-H is, how we ported the vintage C data generator to +Rust (yes, [RWIR]) and optimized its performance over the course of a few weeks +of part-time work. We began this project so we can easily generate TPC-H data in +[Apache DataFusion] and [GlareDB]. + +[RWIR]: https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/ +[Apache DataFusion]: https://datafusion.apache.org/ +[GlareDB]: https://glaredb.com/ + +# Try it for yourself + +The tool is entirely open source under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0). Visit the [tpchgen-rs repository](https://github.com/clflushopt/tpchgen-rs) or try it for yourself by run the following commands after [installing Rust](https://www.rust-lang.org/tools/install): + +```shell +$ cargo install tpchgen-cli + +# create SF=1 in classic TBL format +$ tpchgen-cli -s 1 + +# create SF=10 in Parquet +$ tpchgen-cli -s 10 --format=parquet +``` + +# What is TPC-H / dbgen? + +The popular [TPC-H] benchmark (often referred to as TPCH) helps evaluate the +performance of database systems on [OLAP] queries*, *the kind used to build BI +dashboards. + +TPC-H has become a de facto standard for analytic systems. While there are [well +known] limitations as the data and queries do not well represent many real world +use cases, the majority of analytic database papers and industrial systems still +use TPC-H query performance benchmarks as a baseline. You will inevitably find +multiple results for “`TPCH Performance`” in any +search engine. 
+ +The benchmark was created at a time when access to high performance analytical +systems was not widespread, so the [Transaction Processing Performance Council] +defined a process of formal result verification. More recently, given the broad +availability of free and open source database systems, it is common for users to +run and verify TPC-H performance themselves. + +TPC-H simulates a business environment with eight tables: `REGION`, `NATION`, +`SUPPLIER`, `CUSTOMER`, `PART`, `PARTSUPP`, `ORDERS`, and `LINEITEM`. These +tables are linked by foreign keys in a normalized schema representing a supply +chain with parts, suppliers, customers and orders. The benchmark itself is 22 +SQL queries containing joins, aggregations, and sorting operations. + +The queries run against data created with [dbgen], a program +written in a pre [C-99] dialect, which generates data in a format called *TBL* +(example in Figure 2). `dbgen` creates data for each of the 8 tables for a +certain *Scale Factor*, commonly abbreviated as SF. Example Scale Factors and +corresponding dataset sizes are shown in Table 1. There is no theoretical upper +bound on the Scale Factor. + +[TPC-H]: https://www.tpc.org/tpch/ +[OLAP]: https://en.wikipedia.org/wiki/Online_analytical_processing +[well known]: https://www.vldb.org/pvldb/vol9/p204-leis.pdf +[Transaction Processing Performance Council]: https://www.tpc.org/ +[dbgen]: https://
Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]
Xuanwo commented on PR #67: URL: https://github.com/apache/datafusion-site/pull/67#issuecomment-2781039446 Thanks a lot for building this! As a heavy user of TPC test suites, I am interested in the upcoming plans:
- Will there be Python support, so we can pip install it or even call it from Python?
- Are there plans to support TPC-DS in the future?
- What is its relationship with DF? Will there be a built-in function called tpchgen?
Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]
kevinjqliu commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029942249 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,614 @@ +--- +layout: post +title: tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust +date: 2025-04-10 +author: Andrew Lamb, Achraf B, and Sean Smith +categories: [performance] +--- + + + + +/* Table borders */ +table, th, td { + border: 1px solid black; + border-collapse: collapse; +} +th, td { + padding: 3px; +} + + +3 members of the [Apache DataFusion] community used Rust and open source +development to build [tpchgen-rs], a fully open TPC-H data generator over 10x +faster than any other implementation we know of. + +It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 GB/s +😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen` +which takes 30 minutes1 (0.05GB/sec). On the same machine, it takes less than +2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format. +It is finally convenient and efficient to run TPC-H queries locally when testing +analytical engines such as DataFusion. + + + +**Figure 1**: Time to create TPC-H dataset for Scale Factor (see below) 1, 10, +100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core GCP +VM. For Scale Factor(SF) 100 `tpchgen` takes 1 minute and 14 seconds and +[DuckDB] takes 17 minutes and 48 seconds. For SF=1000, `tpchgen` takes 10 +minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure +DuckDB’s time as it [requires 647 GB of RAM], more than the 88 GB that was +available on our test machine. The testing methodology is in the +[documentation]. 
+ +[DuckDB]: https://duckdb.org +[requires 647 GB of RAM]: https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator +[documentation]: https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md + +This blog explains what TPC-H is, how we ported the vintage C data generator to +Rust (yes, [RWIR]) and optimized its performance over the course of a few weeks +of part-time work. We began this project so we can easily generate TPC-H data in +[Apache DataFusion] and [GlareDB]. + +[RWIR]: https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/ +[Apache DataFusion]: https://datafusion.apache.org/ +[GlareDB]: https://glaredb.com/ + +# Try it for yourself + +The tool is entirely open source under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0). Visit the [tpchgen-rs repository](https://github.com/clflushopt/tpchgen-rs) or try it for yourself by run the following commands after [installing Rust](https://www.rust-lang.org/tools/install): + +```shell +$ cargo install tpchgen-cli + +# create SF=1 in classic TBL format +$ tpchgen-cli -s 1 + +# create SF=10 in Parquet +$ tpchgen-cli -s 10 --format=parquet +``` + +# What is TPC-H / dbgen? + +The popular [TPC-H] benchmark (often referred to as TPCH) helps evaluate the +performance of database systems on [OLAP] queries*, *the kind used to build BI +dashboards. + +TPC-H has become a de facto standard for analytic systems. While there are [well +known] limitations as the data and queries do not well represent many real world +use cases, the majority of analytic database papers and industrial systems still +use TPC-H query performance benchmarks as a baseline. You will inevitably find +multiple results for “`TPCH Performance`” in any +search engine. 
+ +The benchmark was created at a time when access to high performance analytical +systems was not widespread, so the [Transaction Processing Performance Council] +defined a process of formal result verification. More recently, given the broad +availability of free and open source database systems, it is common for users to +run and verify TPC-H performance themselves. + +TPC-H simulates a business environment with eight tables: `REGION`, `NATION`, +`SUPPLIER`, `CUSTOMER`, `PART`, `PARTSUPP`, `ORDERS`, and `LINEITEM`. These +tables are linked by foreign keys in a normalized schema representing a supply +chain with parts, suppliers, customers and orders. The benchmark itself is 22 +SQL queries containing joins, aggregations, and sorting operations. + +The queries run against data created with [dbgen], a program +written in a pre [C-99] dialect, which generates data in a format called *TBL* +(example in Figure 2). `dbgen` creates data for each of the 8 tables for a +certain *Scale Factor*, commonly abbreviated as SF. Example Scale Factors and +corresponding dataset sizes are shown in Table 1. There is no theoretical upper +bound on the Scale Factor. + +[TPC-H]: https://www.tpc.org/tpch/ +[OLAP]: https://en.wikipedia.org/wiki/Online_analytical_processing +[well known]: https://www.vldb.org/pvldb/vol9/p204-leis.pdf +[Transaction Processing Performance Council]: https://www.tpc.org/ +[dbgen]: https://
Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]
kevinjqliu commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029940320 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,614 @@ +--- +layout: post +title: tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust +date: 2025-04-10 +author: Andrew Lamb, Achraf B, and Sean Smith +categories: [performance] +--- + + + + +/* Table borders */ +table, th, td { + border: 1px solid black; + border-collapse: collapse; +} +th, td { + padding: 3px; +} + + +3 members of the [Apache DataFusion] community used Rust and open source +development to build [tpchgen-rs], a fully open TPC-H data generator over 10x +faster than any other implementation we know of. + +It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 GB/s +😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen` +which takes 30 minutes1 (0.05GB/sec). On the same machine, it takes less than +2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format. +It is finally convenient and efficient to run TPC-H queries locally when testing +analytical engines such as DataFusion. + + + +**Figure 1**: Time to create TPC-H dataset for Scale Factor (see below) 1, 10, +100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core GCP +VM. For Scale Factor(SF) 100 `tpchgen` takes 1 minute and 14 seconds and +[DuckDB] takes 17 minutes and 48 seconds. For SF=1000, `tpchgen` takes 10 +minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure +DuckDB’s time as it [requires 647 GB of RAM], more than the 88 GB that was +available on our test machine. The testing methodology is in the +[documentation]. 
+ +[DuckDB]: https://duckdb.org +[requires 647 GB of RAM]: https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator +[documentation]: https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md + +This blog explains what TPC-H is, how we ported the vintage C data generator to +Rust (yes, [RWIR]) and optimized its performance over the course of a few weeks +of part-time work. We began this project so we can easily generate TPC-H data in +[Apache DataFusion] and [GlareDB]. + +[RWIR]: https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/ +[Apache DataFusion]: https://datafusion.apache.org/ +[GlareDB]: https://glaredb.com/ + +# Try it for yourself + +The tool is entirely open source under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0). Visit the [tpchgen-rs repository](https://github.com/clflushopt/tpchgen-rs) or try it for yourself by run the following commands after [installing Rust](https://www.rust-lang.org/tools/install): + +```shell +$ cargo install tpchgen-cli + +# create SF=1 in classic TBL format +$ tpchgen-cli -s 1 + +# create SF=10 in Parquet +$ tpchgen-cli -s 10 --format=parquet +``` + +# What is TPC-H / dbgen? + +The popular [TPC-H] benchmark (often referred to as TPCH) helps evaluate the +performance of database systems on [OLAP] queries*, *the kind used to build BI +dashboards. + +TPC-H has become a de facto standard for analytic systems. While there are [well +known] limitations as the data and queries do not well represent many real world +use cases, the majority of analytic database papers and industrial systems still +use TPC-H query performance benchmarks as a baseline. You will inevitably find +multiple results for “`TPCH Performance`” in any Review Comment: ```suggestion multiple results for “`TPCH Performance `” in any ``` -- This is an automated message from the Apache Git Service. 
Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]
kevinjqliu commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029939333 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,614 @@ +--- +layout: post +title: tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust +date: 2025-04-10 +author: Andrew Lamb, Achraf B, and Sean Smith +categories: [performance] +--- + + + + +/* Table borders */ +table, th, td { + border: 1px solid black; + border-collapse: collapse; +} +th, td { + padding: 3px; +} + + +3 members of the [Apache DataFusion] community used Rust and open source +development to build [tpchgen-rs], a fully open TPC-H data generator over 10x +faster than any other implementation we know of. + +It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 GB/s +😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen` +which takes 30 minutes1 (0.05GB/sec). On the same machine, it takes less than +2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format. +It is finally convenient and efficient to run TPC-H queries locally when testing +analytical engines such as DataFusion. + + + +**Figure 1**: Time to create TPC-H dataset for Scale Factor (see below) 1, 10, +100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core GCP +VM. For Scale Factor(SF) 100 `tpchgen` takes 1 minute and 14 seconds and +[DuckDB] takes 17 minutes and 48 seconds. For SF=1000, `tpchgen` takes 10 +minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure +DuckDB’s time as it [requires 647 GB of RAM], more than the 88 GB that was +available on our test machine. The testing methodology is in the +[documentation]. 
+ +[DuckDB]: https://duckdb.org +[requires 647 GB of RAM]: https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator +[documentation]: https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md + +This blog explains what TPC-H is, how we ported the vintage C data generator to +Rust (yes, [RWIR]) and optimized its performance over the course of a few weeks +of part-time work. We began this project so we can easily generate TPC-H data in +[Apache DataFusion] and [GlareDB]. + +[RWIR]: https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/ +[Apache DataFusion]: https://datafusion.apache.org/ +[GlareDB]: https://glaredb.com/ + +# Try it for yourself + +The tool is entirely open source under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0). Visit the [tpchgen-rs repository](https://github.com/clflushopt/tpchgen-rs) or try it for yourself by run the following commands after [installing Rust](https://www.rust-lang.org/tools/install): + +```shell +$ cargo install tpchgen-cli + +# create SF=1 in classic TBL format +$ tpchgen-cli -s 1 + +# create SF=10 in Parquet +$ tpchgen-cli -s 10 --format=parquet +``` + +# What is TPC-H / dbgen? + +The popular [TPC-H] benchmark (often referred to as TPCH) helps evaluate the +performance of database systems on [OLAP] queries*, *the kind used to build BI +dashboards. + +TPC-H has become a de facto standard for analytic systems. While there are [well +known] limitations as the data and queries do not well represent many real world +use cases, the majority of analytic database papers and industrial systems still +use TPC-H query performance benchmarks as a baseline. You will inevitably find +multiple results for “`TPCH Performance`” in any +search engine. 
+ +The benchmark was created at a time when access to high performance analytical +systems was not widespread, so the [Transaction Processing Performance Council] +defined a process of formal result verification. More recently, given the broad +availability of free and open source database systems, it is common for users to +run and verify TPC-H performance themselves. + +TPC-H simulates a business environment with eight tables: `REGION`, `NATION`, +`SUPPLIER`, `CUSTOMER`, `PART`, `PARTSUPP`, `ORDERS`, and `LINEITEM`. These +tables are linked by foreign keys in a normalized schema representing a supply +chain with parts, suppliers, customers and orders. The benchmark itself is 22 +SQL queries containing joins, aggregations, and sorting operations. + +The queries run against data created with [dbgen], a program +written in a pre [C-99] dialect, which generates data in a format called *TBL* +(example in Figure 2). `dbgen` creates data for each of the 8 tables for a +certain *Scale Factor*, commonly abbreviated as SF. Example Scale Factors and +corresponding dataset sizes are shown in Table 1. There is no theoretical upper +bound on the Scale Factor. + +[TPC-H]: https://www.tpc.org/tpch/ +[OLAP]: https://en.wikipedia.org/wiki/Online_analytical_processing +[well known]: https://www.vldb.org/pvldb/vol9/p204-leis.pdf +[Transaction Processing Performance Council]: https://www.tpc.org/ +[dbgen]: https://
Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]
kevinjqliu commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029938609 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,614 @@ +--- +layout: post +title: tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust +date: 2025-04-10 +author: Andrew Lamb, Achraf B, and Sean Smith +categories: [performance] +--- + + + + +/* Table borders */ +table, th, td { + border: 1px solid black; + border-collapse: collapse; +} +th, td { + padding: 3px; +} + + +3 members of the [Apache DataFusion] community used Rust and open source +development to build [tpchgen-rs], a fully open TPC-H data generator over 10x +faster than any other implementation we know of. + +It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 GB/s +😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen` +which takes 30 minutes1 (0.05GB/sec). On the same machine, it takes less than +2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format. +It is finally convenient and efficient to run TPC-H queries locally when testing +analytical engines such as DataFusion. + + + +**Figure 1**: Time to create TPC-H dataset for Scale Factor (see below) 1, 10, +100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core GCP +VM. For Scale Factor(SF) 100 `tpchgen` takes 1 minute and 14 seconds and +[DuckDB] takes 17 minutes and 48 seconds. For SF=1000, `tpchgen` takes 10 +minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure +DuckDB’s time as it [requires 647 GB of RAM], more than the 88 GB that was +available on our test machine. The testing methodology is in the +[documentation]. 
+ +[DuckDB]: https://duckdb.org +[requires 647 GB of RAM]: https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator +[documentation]: https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md + +This blog explains what TPC-H is, how we ported the vintage C data generator to +Rust (yes, [RWIR]) and optimized its performance over the course of a few weeks +of part-time work. We began this project so we can easily generate TPC-H data in +[Apache DataFusion] and [GlareDB]. + +[RWIR]: https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/ +[Apache DataFusion]: https://datafusion.apache.org/ +[GlareDB]: https://glaredb.com/ + +# Try it for yourself + +The tool is entirely open source under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0). Visit the [tpchgen-rs repository](https://github.com/clflushopt/tpchgen-rs) or try it for yourself by run the following commands after [installing Rust](https://www.rust-lang.org/tools/install): + +```shell +$ cargo install tpchgen-cli + +# create SF=1 in classic TBL format +$ tpchgen-cli -s 1 + +# create SF=10 in Parquet +$ tpchgen-cli -s 10 --format=parquet +``` + +# What is TPC-H / dbgen? + +The popular [TPC-H] benchmark (often referred to as TPCH) helps evaluate the +performance of database systems on [OLAP] queries*, *the kind used to build BI +dashboards. + +TPC-H has become a de facto standard for analytic systems. While there are [well +known] limitations as the data and queries do not well represent many real world +use cases, the majority of analytic database papers and industrial systems still +use TPC-H query performance benchmarks as a baseline. You will inevitably find +multiple results for “`TPCH Performance`” in any +search engine. 
+ +The benchmark was created at a time when access to high performance analytical +systems was not widespread, so the [Transaction Processing Performance Council] +defined a process of formal result verification. More recently, given the broad +availability of free and open source database systems, it is common for users to +run and verify TPC-H performance themselves. + +TPC-H simulates a business environment with eight tables: `REGION`, `NATION`, +`SUPPLIER`, `CUSTOMER`, `PART`, `PARTSUPP`, `ORDERS`, and `LINEITEM`. These +tables are linked by foreign keys in a normalized schema representing a supply +chain with parts, suppliers, customers and orders. The benchmark itself is 22 +SQL queries containing joins, aggregations, and sorting operations. + +The queries run against data created with [dbgen], a program +written in a pre [C-99] dialect, which generates data in a format called *TBL* +(example in Figure 2). `dbgen` creates data for each of the 8 tables for a +certain *Scale Factor*, commonly abbreviated as SF. Example Scale Factors and +corresponding dataset sizes are shown in Table 1. There is no theoretical upper +bound on the Scale Factor. + +[TPC-H]: https://www.tpc.org/tpch/ +[OLAP]: https://en.wikipedia.org/wiki/Online_analytical_processing +[well known]: https://www.vldb.org/pvldb/vol9/p204-leis.pdf +[Transaction Processing Performance Council]: https://www.tpc.org/ +[dbgen]: https://
Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]
kevinjqliu commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029938258 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,614 @@ +--- +layout: post +title: tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust +date: 2025-04-10 +author: Andrew Lamb, Achraf B, and Sean Smith +categories: [performance] +--- + + + + +/* Table borders */ +table, th, td { + border: 1px solid black; + border-collapse: collapse; +} +th, td { + padding: 3px; +} + + +3 members of the [Apache DataFusion] community used Rust and open source +development to build [tpchgen-rs], a fully open TPC-H data generator over 10x +faster than any other implementation we know of. + +It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 GB/s +😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen` +which takes 30 minutes1 (0.05GB/sec). On the same machine, it takes less than +2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format. +It is finally convenient and efficient to run TPC-H queries locally when testing +analytical engines such as DataFusion. + + + +**Figure 1**: Time to create TPC-H dataset for Scale Factor (see below) 1, 10, +100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core GCP +VM. For Scale Factor(SF) 100 `tpchgen` takes 1 minute and 14 seconds and +[DuckDB] takes 17 minutes and 48 seconds. For SF=1000, `tpchgen` takes 10 +minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure +DuckDB’s time as it [requires 647 GB of RAM], more than the 88 GB that was +available on our test machine. The testing methodology is in the +[documentation]. 
+ +[DuckDB]: https://duckdb.org +[requires 647 GB of RAM]: https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator +[documentation]: https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md + +This blog explains what TPC-H is, how we ported the vintage C data generator to +Rust (yes, [RWIR]) and optimized its performance over the course of a few weeks +of part-time work. We began this project so we can easily generate TPC-H data in +[Apache DataFusion] and [GlareDB]. + +[RWIR]: https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/ +[Apache DataFusion]: https://datafusion.apache.org/ +[GlareDB]: https://glaredb.com/ + +# Try it for yourself + +The tool is entirely open source under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0). Visit the [tpchgen-rs repository](https://github.com/clflushopt/tpchgen-rs) or try it for yourself by run the following commands after [installing Rust](https://www.rust-lang.org/tools/install): + +```shell +$ cargo install tpchgen-cli + +# create SF=1 in classic TBL format +$ tpchgen-cli -s 1 + +# create SF=10 in Parquet +$ tpchgen-cli -s 10 --format=parquet +``` + +# What is TPC-H / dbgen? + +The popular [TPC-H] benchmark (often referred to as TPCH) helps evaluate the +performance of database systems on [OLAP] queries*, *the kind used to build BI +dashboards. + +TPC-H has become a de facto standard for analytic systems. While there are [well +known] limitations as the data and queries do not well represent many real world +use cases, the majority of analytic database papers and industrial systems still +use TPC-H query performance benchmarks as a baseline. You will inevitably find +multiple results for “`TPCH Performance`” in any +search engine. 
+ +The benchmark was created at a time when access to high performance analytical +systems was not widespread, so the [Transaction Processing Performance Council] +defined a process of formal result verification. More recently, given the broad +availability of free and open source database systems, it is common for users to +run and verify TPC-H performance themselves. + +TPC-H simulates a business environment with eight tables: `REGION`, `NATION`, +`SUPPLIER`, `CUSTOMER`, `PART`, `PARTSUPP`, `ORDERS`, and `LINEITEM`. These +tables are linked by foreign keys in a normalized schema representing a supply +chain with parts, suppliers, customers and orders. The benchmark itself is 22 +SQL queries containing joins, aggregations, and sorting operations. + +The queries run against data created with [dbgen], a program +written in a pre [C-99] dialect, which generates data in a format called *TBL* +(example in Figure 2). `dbgen` creates data for each of the 8 tables for a +certain *Scale Factor*, commonly abbreviated as SF. Example Scale Factors and +corresponding dataset sizes are shown in Table 1. There is no theoretical upper +bound on the Scale Factor. + +[TPC-H]: https://www.tpc.org/tpch/ +[OLAP]: https://en.wikipedia.org/wiki/Online_analytical_processing +[well known]: https://www.vldb.org/pvldb/vol9/p204-leis.pdf +[Transaction Processing Performance Council]: https://www.tpc.org/ +[dbgen]: https://
Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]
kevinjqliu commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029938078 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,614 @@ +--- +layout: post +title: tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust +date: 2025-04-10 +author: Andrew Lamb, Achraf B, and Sean Smith +categories: [performance] +--- + + + + +/* Table borders */ +table, th, td { + border: 1px solid black; + border-collapse: collapse; +} +th, td { + padding: 3px; +} + + +3 members of the [Apache DataFusion] community used Rust and open source +development to build [tpchgen-rs], a fully open TPC-H data generator over 10x +faster than any other implementation we know of. + +It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 GB/s +😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen` +which takes 30 minutes1 (0.05GB/sec). On the same machine, it takes less than +2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format. +It is finally convenient and efficient to run TPC-H queries locally when testing +analytical engines such as DataFusion. + + + +**Figure 1**: Time to create TPC-H dataset for Scale Factor (see below) 1, 10, +100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core GCP +VM. For Scale Factor(SF) 100 `tpchgen` takes 1 minute and 14 seconds and +[DuckDB] takes 17 minutes and 48 seconds. For SF=1000, `tpchgen` takes 10 +minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure +DuckDB’s time as it [requires 647 GB of RAM], more than the 88 GB that was +available on our test machine. The testing methodology is in the +[documentation]. 
+ +[DuckDB]: https://duckdb.org +[requires 647 GB of RAM]: https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator +[documentation]: https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md + +This blog explains what TPC-H is, how we ported the vintage C data generator to +Rust (yes, [RWIR]) and optimized its performance over the course of a few weeks +of part-time work. We began this project so we can easily generate TPC-H data in +[Apache DataFusion] and [GlareDB]. + +[RWIR]: https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/ +[Apache DataFusion]: https://datafusion.apache.org/ +[GlareDB]: https://glaredb.com/ + +# Try it for yourself + +The tool is entirely open source under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0). Visit the [tpchgen-rs repository](https://github.com/clflushopt/tpchgen-rs) or try it for yourself by run the following commands after [installing Rust](https://www.rust-lang.org/tools/install): + +```shell +$ cargo install tpchgen-cli + +# create SF=1 in classic TBL format +$ tpchgen-cli -s 1 + +# create SF=10 in Parquet +$ tpchgen-cli -s 10 --format=parquet +``` + +# What is TPC-H / dbgen? + +The popular [TPC-H] benchmark (often referred to as TPCH) helps evaluate the +performance of database systems on [OLAP] queries*, *the kind used to build BI +dashboards. + +TPC-H has become a de facto standard for analytic systems. While there are [well +known] limitations as the data and queries do not well represent many real world +use cases, the majority of analytic database papers and industrial systems still +use TPC-H query performance benchmarks as a baseline. You will inevitably find +multiple results for “`TPCH Performance`” in any +search engine. 
+ +The benchmark was created at a time when access to high performance analytical +systems was not widespread, so the [Transaction Processing Performance Council] +defined a process of formal result verification. More recently, given the broad +availability of free and open source database systems, it is common for users to +run and verify TPC-H performance themselves. + +TPC-H simulates a business environment with eight tables: `REGION`, `NATION`, +`SUPPLIER`, `CUSTOMER`, `PART`, `PARTSUPP`, `ORDERS`, and `LINEITEM`. These +tables are linked by foreign keys in a normalized schema representing a supply +chain with parts, suppliers, customers and orders. The benchmark itself is 22 +SQL queries containing joins, aggregations, and sorting operations. + +The queries run against data created with [dbgen], a program +written in a pre [C-99] dialect, which generates data in a format called *TBL* +(example in Figure 2). `dbgen` creates data for each of the 8 tables for a +certain *Scale Factor*, commonly abbreviated as SF. Example Scale Factors and +corresponding dataset sizes are shown in Table 1. There is no theoretical upper +bound on the Scale Factor. + +[TPC-H]: https://www.tpc.org/tpch/ +[OLAP]: https://en.wikipedia.org/wiki/Online_analytical_processing +[well known]: https://www.vldb.org/pvldb/vol9/p204-leis.pdf +[Transaction Processing Performance Council]: https://www.tpc.org/ +[dbgen]: https://
Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]
kevinjqliu commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029937599 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,614 @@ +--- +layout: post +title: tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust +date: 2025-04-10 +author: Andrew Lamb, Achraf B, and Sean Smith +categories: [performance] +--- + + + + +/* Table borders */ +table, th, td { + border: 1px solid black; + border-collapse: collapse; +} +th, td { + padding: 3px; +} + + +3 members of the [Apache DataFusion] community used Rust and open source +development to build [tpchgen-rs], a fully open TPC-H data generator over 10x +faster than any other implementation we know of. + +It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 GB/s +😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen` +which takes 30 minutes1 (0.05GB/sec). On the same machine, it takes less than
Review Comment: throughput 1.4GB/s vs 0.05GB/sec = 28x
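The 28x figure in this comment can be checked directly from the numbers quoted in the draft. The sketch below is only an arithmetic check; the variable names are illustrative, and the 30 minutes for `dbgen` is taken at face value from the quoted text.

```python
# Figures quoted in the draft for SF=100: tpchgen-rs vs. classic dbgen.
tpchgen_gb_per_s = 1.4
dbgen_gb_per_s = 0.05
tpchgen_seconds = 72.23
dbgen_seconds = 30 * 60  # "30 minutes" from the quoted draft

# Ratio of the (rounded) throughput figures, as computed by the reviewer.
throughput_ratio = tpchgen_gb_per_s / dbgen_gb_per_s

# Ratio of the wall-clock times, as a cross-check.
wall_clock_ratio = dbgen_seconds / tpchgen_seconds

print(f"throughput ratio: {throughput_ratio:.1f}x")
print(f"wall-clock ratio: {wall_clock_ratio:.1f}x")
```

The two ratios differ slightly (about 28x versus about 25x) because the throughput figures in the draft are rounded. Both are well above the "over 10x" wording in the quoted paragraph.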
Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]
clflushopt commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029924176 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,614 @@ +--- +layout: post +title: tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust +date: 2025-04-10 +author: Andrew Lamb, Achraf B, and Sean Smith +categories: [performance] +--- + + + + +/* Table borders */ +table, th, td { + border: 1px solid black; + border-collapse: collapse; +} +th, td { + padding: 3px; +} + + +3 members of the [Apache DataFusion] community used Rust and open source +development to build [tpchgen-rs], a fully open TPC-H data generator over 10x +faster than any other implementation we know of. + +It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 GB/s +😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen` +which takes 30 minutes1 (0.05GB/sec). On the same machine, it takes less than +2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format. 
Review Comment: A less detailed overview of the timing, obtained by running it as a script:

```
jmp@comet ~/G/P/tpchgen-rs (main) [1]> time duckdb -init bench.sql -no-stdin
-- Loading resources from bench.sql
100% ▕▏
┌─────────┐
│ Success │
│ boolean │
├─────────┤
│ 0 rows  │
└─────────┘
Run Time (s): real 717.838 user 1787.775069 sys 83.976228
100% ▕▏
Run Time (s): real 5.471 user 11.592328 sys 12.049458
100% ▕▏
Run Time (s): real 1015.519 user 456.350229 sys 104.793712
Run Time (s): real 0.010 user 0.002420 sys 0.002541
100% ▕▏
Run Time (s): real 921.422 user 75.619220 sys 21.138167
100% ▕▏
Run Time (s): real 2.647 user 12.597744 sys 1.699293
100% ▕▏
Run Time (s): real 17.637 user 38.532114 sys 52.235873
Run Time (s): real 0.109 user 0.000459 sys 0.000680
Run Time (s): real 0.300 user 0.610369 sys 0.809258

Executed in   44.71 mins    fish           external
   usr time   39.72 mins    0.16 millis   39.72 mins
   sys time    4.63 mins    1.44 millis    4.63 mins
```
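As a sanity check on the transcript above, the nine `real` values can be summed and compared with the total reported by the shell. This is a minimal sketch: the numbers are copied from the `Run Time` lines in the order they were printed, and the variable names are illustrative.

```python
# Wall-clock ("real") seconds copied from the Run Time lines of the
# transcript, in the order they were printed.
real_seconds = [
    717.838, 5.471, 1015.519, 0.010, 921.422,
    2.647, 17.637, 0.109, 0.300,
]

total_seconds = sum(real_seconds)
total_minutes = total_seconds / 60

print(f"total: {total_seconds:.3f} s = {total_minutes:.2f} min")
```

The sum comes to roughly 44.7 minutes, in line with the `Executed in 44.71 mins` line, so the individual statements account for essentially all of the elapsed time.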
Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]
clflushopt commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029904651 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,614 @@ +--- +layout: post +title: tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust +date: 2025-04-10 +author: Andrew Lamb, Achraf B, and Sean Smith +categories: [performance] +--- + + + + +/* Table borders */ +table, th, td { + border: 1px solid black; + border-collapse: collapse; +} +th, td { + padding: 3px; +} + + +3 members of the [Apache DataFusion] community used Rust and open source +development to build [tpchgen-rs], a fully open TPC-H data generator over 10x +faster than any other implementation we know of. + +It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 GB/s +😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen` +which takes 30 minutes1 (0.05GB/sec). On the same machine, it takes less than +2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format.
Review Comment:

```
~/G/P/tpchgen-rs (cl/feat/bump-crates-to-0.1.1)> duckdb
v1.2.0 5f5512b827
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D INSTALL tpch;
D LOAD tpch;
D
D .open test
D .timer on
D CALL dbgen(sf = 100);
-- Export each table to Parquet format
copy customer to 'customer.parquet' (FORMAT parquet);
copy lineitem to 'lineitem.parquet' (FORMAT parquet);
copy nation to 'supplier.parquet' (FORMAT parquet);
copy orders to 'supplier.parquet' (FORMAT parquet);
copy part to 'part.parquet' (FORMAT parquet);
copy partsupp to 'partsupp.parquet' (FORMAT parquet);
copy region to 'region.parquet' (FORMAT parquet);
copy supplier to 'supplier.parquet' (FORMAT parquet);
100%▕
┌─────────┐
│ Success │
│ boolean │
├─────────┤
│ 0 rows  │
└─────────┘
Run Time (s): real 687.659 user 1040.134996 sys 83.434645
D
D -- Export each table to Parquet format
D copy customer to 'customer.parquet' (FORMAT parquet);
100%▕███
Run Time (s): real 2.985 user 9.007536 sys 2.676405
D copy lineitem to 'lineitem.parquet' (FORMAT parquet);
100% ▕█
Run Time (s): real 108.328 user 395.075322 sys 90.606267
D copy nation to 'supplier.parquet' (FORMAT parquet);
Run Time (s): real 0.010 user 0.002566 sys 0.001946
D copy orders to 'supplier.parquet' (FORMAT parquet);
100%▕███
Run Time (s): real 19.461 user 78.310336 sys 17.984476
D copy part to 'part.parquet' (FORMAT parquet);
100%▕███
Run Time (s): real 10.717 user 13.437501 sys 37.076887
D copy partsupp to 'partsupp.parquet' (FORMAT parquet);
100%▕
Run Time (s): real 11.286 user 31.645645 sys 14.751533
D copy region to 'region.parquet' (FORMAT parquet);
Run Time (s): real 0.003 user 0.000454 sys 0.000350
D copy supplier to 'supplier.parquet' (FORMAT parquet);
Run Time (s): real 0.500 user 0.613920 sys 0.130899
```
Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]
alamb commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029838372 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,614 @@ +--- +layout: post +title: tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust +date: 2025-04-10 +author: Andrew Lamb, Achraf B, and Sean Smith +categories: [performance] +--- + + + + +/* Table borders */ +table, th, td { + border: 1px solid black; + border-collapse: collapse; +} +th, td { + padding: 3px; +} + + +3 members of the [Apache DataFusion] community used Rust and open source +development to build [tpchgen-rs], a fully open TPC-H data generator over 10x +faster than any other implementation we know of. + +It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 GB/s +😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen` +which takes 30 minutes1 (0.05GB/sec). On the same machine, it takes less than +2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format.
Review Comment: @clflushopt, When reading this introduction I think it would also be nice to report the time taken to create Scale Factor 100 using DuckDB (my laptop has more memory and I think the numbers with more modest specs are more compelling)

Is there any chance you can time how long it takes to run this script on your machine?

```sql
INSTALL tpch;
LOAD tpch;

.open test
.timer on
CALL dbgen(sf = 100);

-- Export each table to Parquet format
copy customer to 'customer.parquet' (FORMAT parquet);
copy lineitem to 'lineitem.parquet' (FORMAT parquet);
copy nation to 'supplier.parquet' (FORMAT parquet);
copy orders to 'supplier.parquet' (FORMAT parquet);
copy part to 'part.parquet' (FORMAT parquet);
copy partsupp to 'partsupp.parquet' (FORMAT parquet);
copy region to 'region.parquet' (FORMAT parquet);
copy supplier to 'supplier.parquet' (FORMAT parquet);
```
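One detail in the script as posted: `nation` and `orders` are both exported to `'supplier.parquet'`, the same path later used for `supplier`, so three of the eight exports target a single file. A minimal sketch, assuming the intent was one Parquet file per table, is to generate the `copy` statements from the table names so the paths cannot drift. The `copy_statements` helper is hypothetical and not part of DuckDB or tpchgen-rs.

```python
# The eight TPC-H tables named in the quoted draft, lowercased as in the script.
TABLES = [
    "customer", "lineitem", "nation", "orders",
    "part", "partsupp", "region", "supplier",
]


def copy_statements(tables):
    """Return one DuckDB COPY statement per table, writing to <table>.parquet."""
    return [f"copy {t} to '{t}.parquet' (FORMAT parquet);" for t in tables]


statements = copy_statements(TABLES)
print("\n".join(statements))
```

The printed statements can be pasted in place of the hand-written `copy` lines. Timings measured with the original script still reflect what was actually run.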
Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]
alamb commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029836024 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,617 @@ +--- +layout: post +title: `tpchgen-rs` World’s fastest open source TPCH data generator, written in Rust
Review Comment: fixed
Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]
alamb commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029835787 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,617 @@ +--- +layout: post +title: `tpchgen-rs` World’s fastest open source TPCH data generator, written in Rust +date: 2025-04-10 +author: Andrew Lamb, Achraf B, and Sean Smith +categories: [performance] +--- + + + + +/* Table borders */ +table, th, td { + border: 1px solid black; + border-collapse: collapse; +} +th, td { + padding: 3px; +} + + +3 members of the [Apache DataFusion] community used Rust and open source +development to build [tpchgen-rs], a fully open TPCH data generator over 10x +faster than any other implementation we know of. + +It is now possible to create the TPCH SF=100 dataset in 72.23 seconds (1.4 GB/s +😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen` +which takes 30 minutes[^1] (0.05GB/sec). On the same machine, it takes less than +2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format. + + + +Finally, it is convenient and efficient to run TPCH queries locally when testing +analytical engines such as DataFusion. + + + +**Figure 1**: Time to create TPCH dataset for Scale Factor (see below) 1, 10, +100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core GCP +VM. For Scale Factor(SF) 100 `tpchgen` takes 1 minute and 14 seconds and +[DuckDB] takes 17 minutes and 48 seconds. For SF=1000, `tpchgen` takes 10 +minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure +DuckDB’s time as it [requires 647 GB of RAM], more than the 88 GB that was +available on our test machine. The testing methodology is in the +[documentation]. 
+ +[DuckDB]: https://duckdb.org +[requires 647 GB of RAM]: https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator +[documentation]: https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md + +This blog explains what TPCH is, how we ported the vintage C data generator to +Rust (yes, [RWIR]) and optimized its performance over the course of a few weeks +of part-time work. We began this project so we can easily generate TPCH data in +[Apache DataFusion] and [GlareDB]. + +[RWIR]: https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/ +[Apache DataFusion]: https://datafusion.apache.org/ +[GlareDB]: https://glaredb.com/ + +# Try if for yourself + +The tool is entirely open source under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0). Visit the [tpchgen-rs repository](https://github.com/clflushopt/tpchgen-rs) or try it for yourself by run the following commands after [installing Rust](https://www.rust-lang.org/tools/install): + +```shell +$ cargo install tpchgen-cli + +# create SF=1 in classic TBL format +$ tpchgen-cli -s 1 + +# create SF=10 in Parquet +$ tpchgen-cli -s 10 --format=parquet +``` + +# What is TPCH / dbgen? + +The popular [TPC-H] benchmark (commonly referred to as TPCH) helps evaluate the +performance of database systems on [OLAP] queries*, *the kind used to build BI +dashboards. + +TPCH has become a de facto standard for analytic systems. While there are [well +known] limitations as the data and queries do not well represent many real world +use cases, the majority of analytic database papers and industrial systems still +use TPCH query performance benchmarks as a baseline. You will inevitably find +multiple results for “`TPCH Performance`” in any +search engine. 
+ +The benchmark was created at a time when access to high performance analytical +systems was not widespread, so the [Transaction Processing Performance Council] +defined a process of formal result verification. More recently, given the broad +availability of free and open source database systems, it is common for users to +run and verify TPCH performance themselves. + +TPCH simulates a business environment with eight tables: `REGION`, `NATION`, +`SUPPLIER`, `CUSTOMER`, `PART`, `PARTSUPP`, `ORDERS`, and `LINEITEM`. These +tables are linked by foreign keys in a normalized schema representing a supply +chain with parts, suppliers, customers and orders. The benchmark itself is 22 +SQL queries containing joins, aggregations, and sorting operations. + +The queries run against data created with [dbgen], a program +written in a pre [C-99] dialect, which generates data in a format called *TBL* +(example in Figure 2). `dbgen` creates data for each of the 8 tables for a +certain *Scale Factor*, commonly abbreviated as SF. Example Scale Factors and +corresponding dataset sizes are shown in Table 1. There is no theoretical upper +bound on the Scale Factor. + +[TPC-H]: https://www.tpc.org/tpch/ +[OLAP]: https://en.wikipedia.org/wiki/Online_analytical_processing +[well known]: https://www.vldb.org/pvldb/vol9/p204-leis.pdf +[Transaction Processing Performance Council]: https://www.tpc.org/ +[dbgen]: https://githu
alamb commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029835746 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,617 @@ +--- +layout: post +title: `tpchgen-rs` World’s fastest open source TPCH data generator, written in Rust +date: 2025-04-10 +author: Andrew Lamb, Achraf B, and Sean Smith +categories: [performance] +--- + + + + +/* Table borders */ +table, th, td { + border: 1px solid black; + border-collapse: collapse; +} +th, td { + padding: 3px; +} + + +3 members of the [Apache DataFusion] community used Rust and open source +development to build [tpchgen-rs], a fully open TPCH data generator over 10x +faster than any other implementation we know of. + +It is now possible to create the TPCH SF=100 dataset in 72.23 seconds (1.4 GB/s +😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen` +which takes 30 minutes[^1] (0.05GB/sec). On the same machine, it takes less than +2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format. + + + +Finally, it is convenient and efficient to run TPCH queries locally when testing +analytical engines such as DataFusion. + + + +**Figure 1**: Time to create TPCH dataset for Scale Factor (see below) 1, 10, +100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core GCP +VM. For Scale Factor(SF) 100 `tpchgen` takes 1 minute and 14 seconds and +[DuckDB] takes 17 minutes and 48 seconds. For SF=1000, `tpchgen` takes 10 +minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure +DuckDB’s time as it [requires 647 GB of RAM], more than the 88 GB that was +available on our test machine. The testing methodology is in the +[documentation]. 
+ +[DuckDB]: https://duckdb.org +[requires 647 GB of RAM]: https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator +[documentation]: https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md + +This blog explains what TPCH is, how we ported the vintage C data generator to +Rust (yes, [RWIR]) and optimized its performance over the course of a few weeks +of part-time work. We began this project so we can easily generate TPCH data in +[Apache DataFusion] and [GlareDB]. + +[RWIR]: https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/ +[Apache DataFusion]: https://datafusion.apache.org/ +[GlareDB]: https://glaredb.com/ + +# Try if for yourself + +The tool is entirely open source under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0). Visit the [tpchgen-rs repository](https://github.com/clflushopt/tpchgen-rs) or try it for yourself by run the following commands after [installing Rust](https://www.rust-lang.org/tools/install): + +```shell +$ cargo install tpchgen-cli + +# create SF=1 in classic TBL format +$ tpchgen-cli -s 1 + +# create SF=10 in Parquet +$ tpchgen-cli -s 10 --format=parquet +``` + +# What is TPCH / dbgen? + +The popular [TPC-H] benchmark (commonly referred to as TPCH) helps evaluate the +performance of database systems on [OLAP] queries*, *the kind used to build BI +dashboards. + +TPCH has become a de facto standard for analytic systems. While there are [well +known] limitations as the data and queries do not well represent many real world +use cases, the majority of analytic database papers and industrial systems still +use TPCH query performance benchmarks as a baseline. You will inevitably find +multiple results for “`TPCH Performance`” in any +search engine. 
+ +The benchmark was created at a time when access to high performance analytical +systems was not widespread, so the [Transaction Processing Performance Council] +defined a process of formal result verification. More recently, given the broad +availability of free and open source database systems, it is common for users to +run and verify TPCH performance themselves. + +TPCH simulates a business environment with eight tables: `REGION`, `NATION`, +`SUPPLIER`, `CUSTOMER`, `PART`, `PARTSUPP`, `ORDERS`, and `LINEITEM`. These +tables are linked by foreign keys in a normalized schema representing a supply +chain with parts, suppliers, customers and orders. The benchmark itself is 22 +SQL queries containing joins, aggregations, and sorting operations. + +The queries run against data created with [dbgen], a program +written in a pre [C-99] dialect, which generates data in a format called *TBL* +(example in Figure 2). `dbgen` creates data for each of the 8 tables for a +certain *Scale Factor*, commonly abbreviated as SF. Example Scale Factors and +corresponding dataset sizes are shown in Table 1. There is no theoretical upper +bound on the Scale Factor. + +[TPC-H]: https://www.tpc.org/tpch/ +[OLAP]: https://en.wikipedia.org/wiki/Online_analytical_processing +[well known]: https://www.vldb.org/pvldb/vol9/p204-leis.pdf +[Transaction Processing Performance Council]: https://www.tpc.org/ +[dbgen]: https://githu
alamb commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029835566 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,617 @@ +--- +layout: post +title: `tpchgen-rs` World’s fastest open source TPCH data generator, written in Rust +date: 2025-04-10 +author: Andrew Lamb, Achraf B, and Sean Smith +categories: [performance] +--- + + + + +/* Table borders */ +table, th, td { + border: 1px solid black; + border-collapse: collapse; +} +th, td { + padding: 3px; +} + + +3 members of the [Apache DataFusion] community used Rust and open source +development to build [tpchgen-rs], a fully open TPCH data generator over 10x +faster than any other implementation we know of. + +It is now possible to create the TPCH SF=100 dataset in 72.23 seconds (1.4 GB/s +😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen` +which takes 30 minutes[^1] (0.05GB/sec). On the same machine, it takes less than +2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format. + + + +Finally, it is convenient and efficient to run TPCH queries locally when testing +analytical engines such as DataFusion. + + + +**Figure 1**: Time to create TPCH dataset for Scale Factor (see below) 1, 10, +100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core GCP +VM. For Scale Factor(SF) 100 `tpchgen` takes 1 minute and 14 seconds and +[DuckDB] takes 17 minutes and 48 seconds. For SF=1000, `tpchgen` takes 10 +minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure +DuckDB’s time as it [requires 647 GB of RAM], more than the 88 GB that was +available on our test machine. The testing methodology is in the +[documentation]. 
+ +[DuckDB]: https://duckdb.org +[requires 647 GB of RAM]: https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator +[documentation]: https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md + +This blog explains what TPCH is, how we ported the vintage C data generator to +Rust (yes, [RWIR]) and optimized its performance over the course of a few weeks +of part-time work. We began this project so we can easily generate TPCH data in +[Apache DataFusion] and [GlareDB]. + +[RWIR]: https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/ +[Apache DataFusion]: https://datafusion.apache.org/ +[GlareDB]: https://glaredb.com/ + +# Try if for yourself + +The tool is entirely open source under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0). Visit the [tpchgen-rs repository](https://github.com/clflushopt/tpchgen-rs) or try it for yourself by run the following commands after [installing Rust](https://www.rust-lang.org/tools/install): + +```shell +$ cargo install tpchgen-cli + +# create SF=1 in classic TBL format +$ tpchgen-cli -s 1 + +# create SF=10 in Parquet +$ tpchgen-cli -s 10 --format=parquet +``` + +# What is TPCH / dbgen? + +The popular [TPC-H] benchmark (commonly referred to as TPCH) helps evaluate the +performance of database systems on [OLAP] queries*, *the kind used to build BI +dashboards. + +TPCH has become a de facto standard for analytic systems. While there are [well +known] limitations as the data and queries do not well represent many real world +use cases, the majority of analytic database papers and industrial systems still +use TPCH query performance benchmarks as a baseline. You will inevitably find +multiple results for “`TPCH Performance`” in any +search engine. 
+ +The benchmark was created at a time when access to high performance analytical +systems was not widespread, so the [Transaction Processing Performance Council] +defined a process of formal result verification. More recently, given the broad +availability of free and open source database systems, it is common for users to +run and verify TPCH performance themselves. + +TPCH simulates a business environment with eight tables: `REGION`, `NATION`, +`SUPPLIER`, `CUSTOMER`, `PART`, `PARTSUPP`, `ORDERS`, and `LINEITEM`. These +tables are linked by foreign keys in a normalized schema representing a supply +chain with parts, suppliers, customers and orders. The benchmark itself is 22 +SQL queries containing joins, aggregations, and sorting operations. + +The queries run against data created with [dbgen], a program +written in a pre [C-99] dialect, which generates data in a format called *TBL* +(example in Figure 2). `dbgen` creates data for each of the 8 tables for a +certain *Scale Factor*, commonly abbreviated as SF. Example Scale Factors and +corresponding dataset sizes are shown in Table 1. There is no theoretical upper +bound on the Scale Factor. + +[TPC-H]: https://www.tpc.org/tpch/ +[OLAP]: https://en.wikipedia.org/wiki/Online_analytical_processing +[well known]: https://www.vldb.org/pvldb/vol9/p204-leis.pdf +[Transaction Processing Performance Council]: https://www.tpc.org/ +[dbgen]: https://githu
alamb commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029834605 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,617 @@ +--- +layout: post +title: `tpchgen-rs` World’s fastest open source TPCH data generator, written in Rust +date: 2025-04-10 +author: Andrew Lamb, Achraf B, and Sean Smith +categories: [performance] +--- + + + + +/* Table borders */ +table, th, td { + border: 1px solid black; + border-collapse: collapse; +} +th, td { + padding: 3px; +} + + +3 members of the [Apache DataFusion] community used Rust and open source +development to build [tpchgen-rs], a fully open TPCH data generator over 10x +faster than any other implementation we know of. + +It is now possible to create the TPCH SF=100 dataset in 72.23 seconds (1.4 GB/s +😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen` +which takes 30 minutes[^1] (0.05GB/sec). On the same machine, it takes less than +2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format. + + + +Finally, it is convenient and efficient to run TPCH queries locally when testing Review Comment: Updated in 12df7fa
alamb commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029834323 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,617 @@ +--- +layout: post +title: `tpchgen-rs` World’s fastest open source TPCH data generator, written in Rust +date: 2025-04-10 +author: Andrew Lamb, Achraf B, and Sean Smith +categories: [performance] +--- + + + + +/* Table borders */ +table, th, td { + border: 1px solid black; + border-collapse: collapse; +} +th, td { + padding: 3px; +} + + +3 members of the [Apache DataFusion] community used Rust and open source +development to build [tpchgen-rs], a fully open TPCH data generator over 10x Review Comment: Thank you -- updated in 4eec529
XiangpengHao commented on PR #67: URL: https://github.com/apache/datafusion-site/pull/67#issuecomment-2779791930 This is a great tool; I've been hoping for it for years! Nice work! I previously relied on DuckDB to generate TPC-H (as [suggested](https://xuanwo.io/links/2025/02/duckdb-is-the-best-tpc-data-generator/) by @Xuanwo), and now we finally have a rusty version!
XiangpengHao commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029469488 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,617 @@ +--- +layout: post +title: `tpchgen-rs` World’s fastest open source TPCH data generator, written in Rust +date: 2025-04-10 +author: Andrew Lamb, Achraf B, and Sean Smith +categories: [performance] +--- + + + + +/* Table borders */ +table, th, td { + border: 1px solid black; + border-collapse: collapse; +} +th, td { + padding: 3px; +} + + +3 members of the [Apache DataFusion] community used Rust and open source +development to build [tpchgen-rs], a fully open TPCH data generator over 10x +faster than any other implementation we know of. + +It is now possible to create the TPCH SF=100 dataset in 72.23 seconds (1.4 GB/s +😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen` +which takes 30 minutes[^1] (0.05GB/sec). On the same machine, it takes less than +2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format. + + + +Finally, it is convenient and efficient to run TPCH queries locally when testing +analytical engines such as DataFusion. + + + +**Figure 1**: Time to create TPCH dataset for Scale Factor (see below) 1, 10, +100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core GCP +VM. For Scale Factor(SF) 100 `tpchgen` takes 1 minute and 14 seconds and +[DuckDB] takes 17 minutes and 48 seconds. For SF=1000, `tpchgen` takes 10 +minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure +DuckDB’s time as it [requires 647 GB of RAM], more than the 88 GB that was +available on our test machine. The testing methodology is in the +[documentation]. 
+ +[DuckDB]: https://duckdb.org +[requires 647 GB of RAM]: https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator +[documentation]: https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md + +This blog explains what TPCH is, how we ported the vintage C data generator to +Rust (yes, [RWIR]) and optimized its performance over the course of a few weeks +of part-time work. We began this project so we can easily generate TPCH data in +[Apache DataFusion] and [GlareDB]. + +[RWIR]: https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/ +[Apache DataFusion]: https://datafusion.apache.org/ +[GlareDB]: https://glaredb.com/ + +# Try if for yourself + +The tool is entirely open source under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0). Visit the [tpchgen-rs repository](https://github.com/clflushopt/tpchgen-rs) or try it for yourself by run the following commands after [installing Rust](https://www.rust-lang.org/tools/install): + +```shell +$ cargo install tpchgen-cli + +# create SF=1 in classic TBL format +$ tpchgen-cli -s 1 + +# create SF=10 in Parquet +$ tpchgen-cli -s 10 --format=parquet +``` + +# What is TPCH / dbgen? + +The popular [TPC-H] benchmark (commonly referred to as TPCH) helps evaluate the +performance of database systems on [OLAP] queries*, *the kind used to build BI +dashboards. + +TPCH has become a de facto standard for analytic systems. While there are [well +known] limitations as the data and queries do not well represent many real world +use cases, the majority of analytic database papers and industrial systems still +use TPCH query performance benchmarks as a baseline. You will inevitably find +multiple results for “`TPCH Performance`” in any +search engine. 
+ +The benchmark was created at a time when access to high performance analytical +systems was not widespread, so the [Transaction Processing Performance Council] +defined a process of formal result verification. More recently, given the broad +availability of free and open source database systems, it is common for users to +run and verify TPCH performance themselves. + +TPCH simulates a business environment with eight tables: `REGION`, `NATION`, +`SUPPLIER`, `CUSTOMER`, `PART`, `PARTSUPP`, `ORDERS`, and `LINEITEM`. These +tables are linked by foreign keys in a normalized schema representing a supply +chain with parts, suppliers, customers and orders. The benchmark itself is 22 +SQL queries containing joins, aggregations, and sorting operations. + +The queries run against data created with [dbgen], a program +written in a pre [C-99] dialect, which generates data in a format called *TBL* +(example in Figure 2). `dbgen` creates data for each of the 8 tables for a +certain *Scale Factor*, commonly abbreviated as SF. Example Scale Factors and +corresponding dataset sizes are shown in Table 1. There is no theoretical upper +bound on the Scale Factor. + +[TPC-H]: https://www.tpc.org/tpch/ +[OLAP]: https://en.wikipedia.org/wiki/Online_analytical_processing +[well known]: https://www.vldb.org/pvldb/vol9/p204-leis.pdf +[Transaction Processing Performance Council]: https://www.tpc.org/ +[dbgen]: https:
XiangpengHao commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029467839 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,617 @@ +--- +layout: post +title: `tpchgen-rs` World’s fastest open source TPCH data generator, written in Rust +date: 2025-04-10 +author: Andrew Lamb, Achraf B, and Sean Smith +categories: [performance] +--- + + + + +/* Table borders */ +table, th, td { + border: 1px solid black; + border-collapse: collapse; +} +th, td { + padding: 3px; +} + + +3 members of the [Apache DataFusion] community used Rust and open source +development to build [tpchgen-rs], a fully open TPCH data generator over 10x +faster than any other implementation we know of. + +It is now possible to create the TPCH SF=100 dataset in 72.23 seconds (1.4 GB/s +😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen` +which takes 30 minutes[^1] (0.05GB/sec). On the same machine, it takes less than +2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format. + + + +Finally, it is convenient and efficient to run TPCH queries locally when testing +analytical engines such as DataFusion. + + + +**Figure 1**: Time to create TPCH dataset for Scale Factor (see below) 1, 10, +100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core GCP +VM. For Scale Factor(SF) 100 `tpchgen` takes 1 minute and 14 seconds and +[DuckDB] takes 17 minutes and 48 seconds. For SF=1000, `tpchgen` takes 10 +minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure +DuckDB’s time as it [requires 647 GB of RAM], more than the 88 GB that was +available on our test machine. The testing methodology is in the +[documentation]. 
+ +[DuckDB]: https://duckdb.org +[requires 647 GB of RAM]: https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator +[documentation]: https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md + +This blog explains what TPCH is, how we ported the vintage C data generator to +Rust (yes, [RWIR]) and optimized its performance over the course of a few weeks +of part-time work. We began this project so we can easily generate TPCH data in +[Apache DataFusion] and [GlareDB]. + +[RWIR]: https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/ +[Apache DataFusion]: https://datafusion.apache.org/ +[GlareDB]: https://glaredb.com/ + +# Try if for yourself + +The tool is entirely open source under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0). Visit the [tpchgen-rs repository](https://github.com/clflushopt/tpchgen-rs) or try it for yourself by run the following commands after [installing Rust](https://www.rust-lang.org/tools/install): + +```shell +$ cargo install tpchgen-cli + +# create SF=1 in classic TBL format +$ tpchgen-cli -s 1 + +# create SF=10 in Parquet +$ tpchgen-cli -s 10 --format=parquet +``` + +# What is TPCH / dbgen? + +The popular [TPC-H] benchmark (commonly referred to as TPCH) helps evaluate the +performance of database systems on [OLAP] queries*, *the kind used to build BI +dashboards. + +TPCH has become a de facto standard for analytic systems. While there are [well +known] limitations as the data and queries do not well represent many real world +use cases, the majority of analytic database papers and industrial systems still +use TPCH query performance benchmarks as a baseline. You will inevitably find +multiple results for “`TPCH Performance`” in any +search engine. 
+ +The benchmark was created at a time when access to high performance analytical +systems was not widespread, so the [Transaction Processing Performance Council] +defined a process of formal result verification. More recently, given the broad +availability of free and open source database systems, it is common for users to +run and verify TPCH performance themselves. + +TPCH simulates a business environment with eight tables: `REGION`, `NATION`, +`SUPPLIER`, `CUSTOMER`, `PART`, `PARTSUPP`, `ORDERS`, and `LINEITEM`. These +tables are linked by foreign keys in a normalized schema representing a supply +chain with parts, suppliers, customers and orders. The benchmark itself is 22 +SQL queries containing joins, aggregations, and sorting operations. + +The queries run against data created with [dbgen], a program +written in a pre [C-99] dialect, which generates data in a format called *TBL* +(example in Figure 2). `dbgen` creates data for each of the 8 tables for a +certain *Scale Factor*, commonly abbreviated as SF. Example Scale Factors and +corresponding dataset sizes are shown in Table 1. There is no theoretical upper +bound on the Scale Factor. + +[TPC-H]: https://www.tpc.org/tpch/ +[OLAP]: https://en.wikipedia.org/wiki/Online_analytical_processing +[well known]: https://www.vldb.org/pvldb/vol9/p204-leis.pdf +[Transaction Processing Performance Council]: https://www.tpc.org/ +[dbgen]: https:
XiangpengHao commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029467529 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,617 @@ +--- +layout: post +title: `tpchgen-rs` World’s fastest open source TPCH data generator, written in Rust +date: 2025-04-10 +author: Andrew Lamb, Achraf B, and Sean Smith +categories: [performance] +--- + + + + +/* Table borders */ +table, th, td { + border: 1px solid black; + border-collapse: collapse; +} +th, td { + padding: 3px; +} + + +3 members of the [Apache DataFusion] community used Rust and open source +development to build [tpchgen-rs], a fully open TPCH data generator over 10x +faster than any other implementation we know of. + +It is now possible to create the TPCH SF=100 dataset in 72.23 seconds (1.4 GB/s +😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen` +which takes 30 minutes[^1] (0.05GB/sec). On the same machine, it takes less than +2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format. + + + +Finally, it is convenient and efficient to run TPCH queries locally when testing +analytical engines such as DataFusion. + + + +**Figure 1**: Time to create TPCH dataset for Scale Factor (see below) 1, 10, +100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core GCP +VM. For Scale Factor(SF) 100 `tpchgen` takes 1 minute and 14 seconds and +[DuckDB] takes 17 minutes and 48 seconds. For SF=1000, `tpchgen` takes 10 +minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure +DuckDB’s time as it [requires 647 GB of RAM], more than the 88 GB that was +available on our test machine. The testing methodology is in the +[documentation]. 
+ +[DuckDB]: https://duckdb.org +[requires 647 GB of RAM]: https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator +[documentation]: https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md + +This blog explains what TPCH is, how we ported the vintage C data generator to +Rust (yes, [RWIR]) and optimized its performance over the course of a few weeks +of part-time work. We began this project so we can easily generate TPCH data in +[Apache DataFusion] and [GlareDB]. + +[RWIR]: https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/ +[Apache DataFusion]: https://datafusion.apache.org/ +[GlareDB]: https://glaredb.com/ + +# Try if for yourself + +The tool is entirely open source under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0). Visit the [tpchgen-rs repository](https://github.com/clflushopt/tpchgen-rs) or try it for yourself by run the following commands after [installing Rust](https://www.rust-lang.org/tools/install): + +```shell +$ cargo install tpchgen-cli + +# create SF=1 in classic TBL format +$ tpchgen-cli -s 1 + +# create SF=10 in Parquet +$ tpchgen-cli -s 10 --format=parquet +``` + +# What is TPCH / dbgen? + +The popular [TPC-H] benchmark (commonly referred to as TPCH) helps evaluate the +performance of database systems on [OLAP] queries*, *the kind used to build BI +dashboards. + +TPCH has become a de facto standard for analytic systems. While there are [well +known] limitations as the data and queries do not well represent many real world +use cases, the majority of analytic database papers and industrial systems still +use TPCH query performance benchmarks as a baseline. You will inevitably find +multiple results for “`TPCH Performance`” in any +search engine. 
+ +The benchmark was created at a time when access to high performance analytical +systems was not widespread, so the [Transaction Processing Performance Council] +defined a process of formal result verification. More recently, given the broad +availability of free and open source database systems, it is common for users to +run and verify TPCH performance themselves. + +TPCH simulates a business environment with eight tables: `REGION`, `NATION`, +`SUPPLIER`, `CUSTOMER`, `PART`, `PARTSUPP`, `ORDERS`, and `LINEITEM`. These +tables are linked by foreign keys in a normalized schema representing a supply +chain with parts, suppliers, customers and orders. The benchmark itself is 22 +SQL queries containing joins, aggregations, and sorting operations. + +The queries run against data created with [dbgen], a program +written in a pre [C-99] dialect, which generates data in a format called *TBL* +(example in Figure 2). `dbgen` creates data for each of the 8 tables for a +certain *Scale Factor*, commonly abbreviated as SF. Example Scale Factors and +corresponding dataset sizes are shown in Table 1. There is no theoretical upper +bound on the Scale Factor. + +[TPC-H]: https://www.tpc.org/tpch/ +[OLAP]: https://en.wikipedia.org/wiki/Online_analytical_processing +[well known]: https://www.vldb.org/pvldb/vol9/p204-leis.pdf +[Transaction Processing Performance Council]: https://www.tpc.org/ +[dbgen]: https:
andygrove commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029454642 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,617 @@ +--- +layout: post +title: `tpchgen-rs` World’s fastest open source TPCH data generator, written in Rust +date: 2025-04-10 +author: Andrew Lamb, Achraf B, and Sean Smith +categories: [performance] +--- + + + + +/* Table borders */ +table, th, td { + border: 1px solid black; + border-collapse: collapse; +} +th, td { + padding: 3px; +} + + +3 members of the [Apache DataFusion] community used Rust and open source +development to build [tpchgen-rs], a fully open TPCH data generator over 10x Review Comment: According to the specification, it is `TPC-H` rather than `TPCH`.
timsaucer commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029415412 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,617 @@ +--- +layout: post +title: `tpchgen-rs` World’s fastest open source TPCH data generator, written in Rust Review Comment: The title doesn't render correctly. We may just need to remove the back ticks and accept the formatting. https://datafusion.staged.apache.org/blog/2025/04/10/fastest-tpch-generator/ ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,617 @@ +--- +layout: post +title: `tpchgen-rs` World’s fastest open source TPCH data generator, written in Rust +date: 2025-04-10 +author: Andrew Lamb, Achraf B, and Sean Smith +categories: [performance] +--- + + + + +/* Table borders */ +table, th, td { + border: 1px solid black; + border-collapse: collapse; +} +th, td { + padding: 3px; +} + + +3 members of the [Apache DataFusion] community used Rust and open source +development to build [tpchgen-rs], a fully open TPCH data generator over 10x +faster than any other implementation we know of. + +It is now possible to create the TPCH SF=100 dataset in 72.23 seconds (1.4 GB/s +😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen` +which takes 30 minutes[^1] (0.05GB/sec). On the same machine, it takes less than +2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format. + + + +Finally, it is convenient and efficient to run TPCH queries locally when testing Review Comment: Nit: It wasn't clear to me if this is "finally" as in "my final argument" or as in "we can finally do..." 
If the intent is the latter, maybe "It is finally convenient" instead

## content/blog/2025-04-10-fastest-tpch-generator.md: ##
@@ -0,0 +1,617 @@
+---
+layout: post
+title: `tpchgen-rs` World’s fastest open source TPCH data generator, written in Rust
+date: 2025-04-10
+author: Andrew Lamb, Achraf B, and Sean Smith
+categories: [performance]
+---
+
+
+
+
+/* Table borders */
+table, th, td {
+ border: 1px solid black;
+ border-collapse: collapse;
+}
+th, td {
+ padding: 3px;
+}
+
+
+3 members of the [Apache DataFusion] community used Rust and open source
+development to build [tpchgen-rs], a fully open TPCH data generator over 10x
+faster than any other implementation we know of.
+
+It is now possible to create the TPCH SF=100 dataset in 72.23 seconds (1.4 GB/s
+😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen`
+which takes 30 minutes[^1] (0.05GB/sec). On the same machine, it takes less than
+2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format.
+
+
+
+Finally, it is convenient and efficient to run TPCH queries locally when testing
+analytical engines such as DataFusion.
+
+
+
+**Figure 1**: Time to create TPCH dataset for Scale Factor (see below) 1, 10,
+100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core GCP
+VM. For Scale Factor(SF) 100 `tpchgen` takes 1 minute and 14 seconds and
+[DuckDB] takes 17 minutes and 48 seconds. For SF=1000, `tpchgen` takes 10
+minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure
+DuckDB’s time as it [requires 647 GB of RAM], more than the 88 GB that was
+available on our test machine. The testing methodology is in the
+[documentation].
+
+[DuckDB]: https://duckdb.org
+[requires 647 GB of RAM]: https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator
+[documentation]: https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md
+
+This blog explains what TPCH is, how we ported the vintage C data generator to
+Rust (yes, [RWIR]) and optimized its performance over the course of a few weeks
+of part-time work. We began this project so we can easily generate TPCH data in
+[Apache DataFusion] and [GlareDB].
+
+[RWIR]: https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/
+[Apache DataFusion]: https://datafusion.apache.org/
+[GlareDB]: https://glaredb.com/
+
+# Try if for yourself
+
+The tool is entirely open source under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0). Visit the [tpchgen-rs repository](https://github.com/clflushopt/tpchgen-rs) or try it for yourself by run the following commands after [installing Rust](https://www.rust-lang.org/tools/install):
+
+```shell
+$ cargo install tpchgen-cli
+
+# create SF=1 in classic TBL format
+$ tpchgen-cli -s 1
+
+# create SF=10 in Parquet
+$ tpchgen-cli -s 10 --format=parquet
+```
+
+# What is TPCH / dbgen?
+
+The popular [TPC-H] benchmark (commonly referred to as TPCH) helps evaluate the
+performance of database systems on [OLAP] queries*, *the kind used to build BI
+dashboards.
+
+TPCH has become a de facto standard for analytic systems. While there are [well
+known] limitations as the data and queries do not well represent many real world
+use cases, the majori