Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-10 Thread via GitHub


kevinjqliu commented on PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#issuecomment-2795473094

   https://datafusion.apache.org/blog/2025/04/10/fastest-tpch-generator/


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-10 Thread via GitHub


scsmithr commented on code in PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2031364896


##
content/blog/2025-04-10-fastest-tpch-generator.md:
##
@@ -0,0 +1,613 @@
+---
+layout: post
+title: tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust
+date: 2025-04-10
+author: Andrew Lamb, Achraf B, and Sean Smith
+categories: [performance]
+---
+
+
+
+
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+
+
+3 members of the [Apache DataFusion] community used Rust and open source
+development to build [tpchgen-rs], a fully open TPC-H data generator over 20x
+faster than any other implementation we know of.
+
+It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 GB/s
+😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen`
+which takes 30 minutes<sup>1</sup> (0.05GB/sec). On the same machine, it takes less than
+2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format, which takes 44 minutes using [DuckDB].
+It is finally convenient and efficient to run TPC-H queries locally when testing
+analytical engines such as DataFusion.
+
+
+
+**Figure 1**: Time to create TPC-H dataset for Scale Factor (see below) 1, 10,
+100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core GCP
+VM with 88GB of memory. For Scale Factor (SF) 100 `tpchgen` takes 1 minute and 14 seconds and
+[DuckDB] takes 17 minutes and 48 seconds. For SF=1000, `tpchgen` takes 10
+minutes and 26 seconds and uses about 5 GB of RAM at peak, and we could not measure
+DuckDB’s time as it [requires 647 GB of RAM], more than the 88 GB that was
+available on our test machine. The testing methodology is in the
+[documentation].
+
+[DuckDB]: https://duckdb.org
+[requires 647 GB of RAM]: https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator
+[documentation]: https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md
+
+This blog explains what TPC-H is, how we ported the vintage C data generator to
+Rust (yes, [RWIR]) and optimized its performance over the course of a few weeks
+of part-time work. We began this project so we can easily generate TPC-H data in
+[Apache DataFusion] and [GlareDB].
+
+[RWIR]: https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/
+[Apache DataFusion]: https://datafusion.apache.org/
+[GlareDB]: https://glaredb.com/
+
+# Try it for yourself
+
+The tool is entirely open source under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0). Visit the [tpchgen-rs repository](https://github.com/clflushopt/tpchgen-rs) or try it for yourself by running the following commands after [installing Rust](https://www.rust-lang.org/tools/install):
+
+```shell
+$ cargo install tpchgen-cli
+
+# create SF=1 in classic TBL format
+$ tpchgen-cli -s 1 
+
+# create SF=10 in Parquet
+$ tpchgen-cli -s 10 --format=parquet
+```
+
+# What is TPC-H / dbgen?
+
+The popular [TPC-H] benchmark (often referred to as TPCH) helps evaluate the
+performance of database systems on [OLAP] queries*, *the kind used to build BI
+dashboards.
+
+TPC-H has become a de facto standard for analytic systems. While there are [well
+known] limitations as the data and queries do not well represent many real world
+use cases, the majority of analytic database papers and industrial systems still
+use TPC-H query performance benchmarks as a baseline. You will inevitably find
+multiple results for  “`TPCH Performance `” in any
+search engine.
+
+The benchmark was created at a time when access to high performance analytical
+systems was not widespread, so the [Transaction Processing Performance Council]
+defined a process of formal result verification. More recently, given the broad
+availability of free and open source database systems, it is common for users to
+run and verify TPC-H performance themselves.
+
+TPC-H simulates a business environment with eight tables: `REGION`, `NATION`,
+`SUPPLIER`, `CUSTOMER`, `PART`, `PARTSUPP`, `ORDERS`, and `LINEITEM`. These
+tables are linked by foreign keys in a normalized schema representing a supply
+chain with parts, suppliers, customers and orders. The benchmark itself is 22
+SQL queries containing joins, aggregations, and sorting operations.
+
+The queries run against data created with [dbgen], a program
+written in a pre [C-99] dialect, which generates data in a format called *TBL*
+(example in Figure 2). `dbgen` creates data for each of the 8 tables for a
+certain *Scale Factor*, commonly abbreviated as SF. Example Scale Factors and
+corresponding dataset sizes are shown in Table 1. There is no theoretical upper
+bound on the Scale Factor.
+
+[TPC-H]: https://www.tpc.org/tpch/
+[OLAP]: https://en.wikipedia.org/wiki/Online_analytical_processing
+[well known]: https://www.vldb.org/pvldb/vol9/p204-leis.pdf
+[Transaction Processing Performance Council]: https

Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-10 Thread via GitHub


Adez017 commented on PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#issuecomment-2786236203

   It looks great! 👍
   @alamb 





Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-10 Thread via GitHub


alamb merged PR #67:
URL: https://github.com/apache/datafusion-site/pull/67





Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-10 Thread via GitHub


alamb commented on PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#issuecomment-2794290825

   Thanks everyone! Onward!





Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-10 Thread via GitHub


alamb commented on PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#issuecomment-2794289480

   BTW I made a demo video here: https://www.youtube.com/watch?v=UYIC57hlL14 





Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-09 Thread via GitHub


alamb commented on code in PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2036245247



Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-09 Thread via GitHub


alamb commented on code in PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2036251572


##
content/blog/2025-04-10-fastest-tpch-generator.md:
##
@@ -0,0 +1,613 @@
+# What is TPC-H / dbgen?
+
+The popular [TPC-H] benchmark (often referred to as TPCH) helps evaluate the
+performance of database systems on [OLAP] queries*, *the kind used to build BI

Review Comment:
   Me neither -- removed in 623404b






Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-09 Thread via GitHub


Omega359 commented on code in PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2036041501


##
content/blog/2025-04-10-fastest-tpch-generator.md:
##
@@ -0,0 +1,613 @@
+# What is TPC-H / dbgen?
+
+The popular [TPC-H] benchmark (often referred to as TPCH) helps evaluate the
+performance of database systems on [OLAP] queries*, *the kind used to build BI

Review Comment:
   Not sure why the asterisks are here.






Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-07 Thread via GitHub


scsmithr commented on code in PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2031336844


##
content/blog/2025-04-10-fastest-tpch-generator.md:
##
@@ -0,0 +1,613 @@
+This blog explains what TPC-H is, how we ported the vintage C data generator to
+Rust (yes, [RWIR]) and optimized its performance over the course of a few weeks
+of part-time work. We began this project so we can easily generate TPC-H data in

Review Comment:
   ```suggestion
   of part-time work. We began this project so we could easily generate TPC-H data in
   ```



##
content/blog/2025-04-10-fastest-tpch-generator.md:
##
@@ -0,0 +1,613 @@
+---
+layout: post
+title: tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust
+date: 2025-04-10
+author: Andrew Lamb, Achraf B, and Sean Smith
+categories: [performance]
+---
+
+
+
+
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+
+

Review Comment:
   Possibly add a bolded tldr to draw people in with perf.
   
   ```suggestion
   **TLDR: TPC-H SF=100 in 1min using tpchgen-rs vs 30min+ with dbgen**
   ```






Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-07 Thread via GitHub


scsmithr commented on code in PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2031362029


##
content/blog/2025-04-10-fastest-tpch-generator.md:
##
@@ -0,0 +1,613 @@
+---
+layout: post
+title: tpchgen-rs World’s fastest open source TPC-H data generator, written in 
Rust
+date: 2025-04-10
+author: Andrew Lamb, Achraf B, and Sean Smith
+categories: [performance]
+---
+
+
+
+
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+
+
+3 members of the [Apache DataFusion] community used Rust and open source
+development to build [tpchgen-rs], a fully open TPC-H data generator over 20x
+faster than any other implementation we know of.
+
+It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 
GB/s
+😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen`
+which takes 30 minutes1 (0.05GB/sec). On the same machine, it takes 
less than
+2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format, which 
takes 44 minutes using [DuckDB].
+It is finally convenient and efficient to run TPC-H queries locally when 
testing
+analytical engines such as DataFusion.
+
+
+
+**Figure 1**: Time to create TPC-H dataset for Scale Factor (see below) 1, 10,
+100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core 
GCP
+VM with 88GB of memory. For Scale Factor(SF) 100 `tpchgen` takes 1 minute and 
14 seconds and
+[DuckDB] takes 17 minutes and 48 seconds. For SF=1000, `tpchgen` takes 10
+minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure
+DuckDB’s time as it [requires 647 GB of RAM], more than the 88 GB that was
+available on our test machine. The testing methodology is in the
+[documentation].
+
+[DuckDB]: https://duckdb.org
+[requires 647 GB of RAM]: 
https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator
+[documentation]: 
https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md
+
+This blog explains what TPC-H is, how we ported the vintage C data generator to
+Rust (yes, [RWIR]) and optimized its performance over the course of a few weeks
+of part-time work. We began this project so we can easily generate TPC-H data 
in
+[Apache DataFusion] and [GlareDB].
+
+[RWIR]: https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/
+[Apache DataFusion]: https://datafusion.apache.org/
+[GlareDB]: https://glaredb.com/
+
+# Try it for yourself
+
+The tool is entirely open source under the [Apache 2.0 
license](https://www.apache.org/licenses/LICENSE-2.0). Visit the [tpchgen-rs 
repository](https://github.com/clflushopt/tpchgen-rs) or try it for yourself by 
running the following commands after [installing 
Rust](https://www.rust-lang.org/tools/install):
+
+```shell
+$ cargo install tpchgen-cli
+
+# create SF=1 in classic TBL format
+$ tpchgen-cli -s 1 
+
+# create SF=10 in Parquet
+$ tpchgen-cli -s 10 --format=parquet
+```
+
+# What is TPC-H / dbgen?
+
+The popular [TPC-H] benchmark (often referred to as TPCH) helps evaluate the
+performance of database systems on [OLAP] queries, the kind used to build BI
+dashboards.
+
+TPC-H has become a de facto standard for analytic systems. While there are 
[well
+known] limitations as the data and queries do not well represent many real 
world
+use cases, the majority of analytic database papers and industrial systems 
still
+use TPC-H query performance benchmarks as a baseline. You will inevitably find
+multiple results for  “`TPCH Performance `” in any
+search engine.
+
+The benchmark was created at a time when access to high performance analytical
+systems was not widespread, so the [Transaction Processing Performance Council]
+defined a process of formal result verification. More recently, given the broad
+availability of free and open source database systems, it is common for users 
to
+run and verify TPC-H performance themselves.
+
+TPC-H simulates a business environment with eight tables: `REGION`, `NATION`,
+`SUPPLIER`, `CUSTOMER`, `PART`, `PARTSUPP`, `ORDERS`, and `LINEITEM`. These
+tables are linked by foreign keys in a normalized schema representing a supply
+chain with parts, suppliers, customers and orders. The benchmark itself is 22
+SQL queries containing joins, aggregations, and sorting operations.
+
+The queries run against data created with [dbgen], a program
+written in a pre [C-99] dialect, which generates data in a format called *TBL*
+(example in Figure 2). `dbgen` creates data for each of the 8 tables for a
+certain *Scale Factor*, commonly abbreviated as SF. Example Scale Factors and
+corresponding dataset sizes are shown in Table 1. There is no theoretical upper
+bound on the Scale Factor.
+
+[TPC-H]: https://www.tpc.org/tpch/
+[OLAP]: https://en.wikipedia.org/wiki/Online_analytical_processing
+[well known]: https://www.vldb.org/pvldb/vol9/p204-leis.pdf
+[Transaction Processing Performance Council]: https

Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-06 Thread via GitHub


alamb commented on code in PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2030224671


##
content/blog/2025-04-10-fastest-tpch-generator.md:
##
@@ -0,0 +1,613 @@
+---
+layout: post
+title: tpchgen-rs World’s fastest open source TPC-H data generator, written in 
Rust
+date: 2025-04-10
+author: Andrew Lamb, Achraf B, and Sean Smith
+categories: [performance]
+---
+
+
+
+
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+
+
+3 members of the [Apache DataFusion] community used Rust and open source
+development to build [tpchgen-rs], a fully open TPC-H data generator over 2 0x

Review Comment:
   🤦  






Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-06 Thread via GitHub


alamb commented on PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#issuecomment-2781547765

   Thanks @kevinjqliu 





Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-06 Thread via GitHub


kevinjqliu commented on code in PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2030181076


##
content/blog/2025-04-10-fastest-tpch-generator.md:
##
@@ -0,0 +1,613 @@
+---
+layout: post
+title: tpchgen-rs World’s fastest open source TPC-H data generator, written in 
Rust
+date: 2025-04-10
+author: Andrew Lamb, Achraf B, and Sean Smith
+categories: [performance]
+---
+
+
+
+
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+
+
+3 members of the [Apache DataFusion] community used Rust and open source
+development to build [tpchgen-rs], a fully open TPC-H data generator over 2 0x

Review Comment:
   ```suggestion
   development to build [tpchgen-rs], a fully open TPC-H data generator over 20x
   ```



##
content/blog/2025-04-10-fastest-tpch-generator.md:
##
@@ -0,0 +1,613 @@
+---
+layout: post
+title: tpchgen-rs World’s fastest open source TPC-H data generator, written in 
Rust
+date: 2025-04-10
+author: Andrew Lamb, Achraf B, and Sean Smith
+categories: [performance]
+---
+
+
+
+
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+
+
+3 members of the [Apache DataFusion] community used Rust and open source
+development to build [tpchgen-rs], a fully open TPC-H data generator over 2 0x
+faster than any other implementation we know of.
+
+It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 
GB/s
+😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen`
+which takes 30 minutes1 (0.05GB/sec). On the same machine, it takes 
less than
+2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format, which 
takes 44 minutes using [DuckDB].
+It is finally convenient and efficient to run TPC-H queries locally when 
testing
+analytical engines such as DataFusion.
+
+
+
+**Figure 1**: Time to create TPC-H dataset for Scale Factor (see below) 1, 10,
+100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core 
GCP
+VM with 88GB of memory. For Scale Factor(SF) 100 `tpchgen` takes 1 minute and 
14 seconds and
+[DuckDB] takes 17 minutes and 48 seconds. For SF=1000, `tpchgen` takes 10
+minutes and 26 seconds and uses about 5 GB of RAM at peak, and we could not measure
+DuckDB’s time as it [requires 647 GB of RAM], more than the 88 GB that was
+available on our test machine. The testing methodology is in the
+[documentation].
+
+[DuckDB]: https://duckdb.org
+[requires 647 GB of RAM]: 
https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator
+[documentation]: 
https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md
+
+This blog explains what TPC-H is, how we ported the vintage C data generator to
+Rust (yes, [RWIR]) and optimized its performance over the course of a few weeks
+of part-time work. We began this project so we can easily generate TPC-H data 
in
+[Apache DataFusion] and [GlareDB].
+
+[RWIR]: https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/
+[Apache DataFusion]: https://datafusion.apache.org/
+[GlareDB]: https://glaredb.com/
+
+# Try it for yourself
+
+The tool is entirely open source under the [Apache 2.0 
license](https://www.apache.org/licenses/LICENSE-2.0). Visit the [tpchgen-rs 
repository](https://github.com/clflushopt/tpchgen-rs) or try it for yourself by 
running the following commands after [installing 
Rust](https://www.rust-lang.org/tools/install):
+
+```shell
+$ cargo install tpchgen-cli
+
+# create SF=1 in classic TBL format
+$ tpchgen-cli -s 1 
+
+# create SF=10 in Parquet
+$ tpchgen-cli -s 10 --format=parquet
+```
+
+# What is TPC-H / dbgen?
+
+The popular [TPC-H] benchmark (often referred to as TPCH) helps evaluate the
+performance of database systems on [OLAP] queries, the kind used to build BI
+dashboards.
+
+TPC-H has become a de facto standard for analytic systems. While there are 
[well
+known] limitations as the data and queries do not well represent many real 
world
+use cases, the majority of analytic database papers and industrial systems 
still
+use TPC-H query performance benchmarks as a baseline. You will inevitably find
+multiple results for  “`TPCH Performance `” in any
+search engine.
+
+The benchmark was created at a time when access to high performance analytical
+systems was not widespread, so the [Transaction Processing Performance Council]
+defined a process of formal result verification. More recently, given the broad
+availability of free and open source database systems, it is common for users 
to
+run and verify TPC-H performance themselves.
+
+TPC-H simulates a business environment with eight tables: `REGION`, `NATION`,
+`SUPPLIER`, `CUSTOMER`, `PART`, `PARTSUPP`, `ORDERS`, and `LINEITEM`. These
+tables are linked by foreign keys in a normalized schema representing a supply
+chain with parts, suppliers, customers and orders. The ben

Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-06 Thread via GitHub


alamb commented on code in PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2030116177


##
content/blog/2025-04-10-fastest-tpch-generator.md:
##
@@ -0,0 +1,614 @@
+---
+layout: post
+title: tpchgen-rs World’s fastest open source TPC-H data generator, written in 
Rust
+date: 2025-04-10
+author: Andrew Lamb, Achraf B, and Sean Smith
+categories: [performance]
+---
+
+
+
+
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+
+
+3 members of the [Apache DataFusion] community used Rust and open source
+development to build [tpchgen-rs], a fully open TPC-H data generator over 10x
+faster than any other implementation  we know of.
+
+It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 
GB/s
+😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen`
+which takes 30 minutes1 (0.05GB/sec). On the same machine, it takes 
less than

Review Comment:
   That is a good point -- the 10x is a conservative estimate.
   
   In @clflushopt's measurements, 2m vs 44m is also 22x 🤔
   
   The 10x is the smallest improvement in the GCP measurements (where the machine has more resources):
   https://docs.google.com/spreadsheets/d/14qTHR5zgqXq4BkhO1IUw2BPwBUIOqMXLZ2fUyOaPflI/edit?gid=0#gid=0
   
   I'll update the text to say over 20x faster
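
As a rough sanity check of the "over 20x" figure, the ratios can be recomputed from the timings quoted in the post. This is only a sketch: the `speedup` helper and the variable names are illustrative and not part of tpchgen-rs, and the 2-minute Parquet time is the "less than 2 minutes" upper bound, so that ratio is a lower bound.

```python
# Hypothetical helper (not from tpchgen-rs): ratio of baseline time to new time.
def speedup(baseline_s: float, new_s: float) -> float:
    """Return how many times faster new_s is than baseline_s."""
    return baseline_s / new_s

# MacBook Air M3, SF=100, TBL: dbgen ~30 min vs tpchgen 72.23 s
tbl = speedup(30 * 60, 72.23)
# MacBook Air M3, SF=100, Parquet: DuckDB 44 min vs tpchgen "less than 2 minutes"
parquet = speedup(44 * 60, 2 * 60)
# 22-core GCP VM, SF=100, Parquet: DuckDB 17m48s vs tpchgen 1m14s
gcp = speedup(17 * 60 + 48, 60 + 14)

print(f"{tbl:.1f}x {parquet:.1f}x {gcp:.1f}x")  # prints "24.9x 22.0x 14.4x"
```

Both MacBook ratios exceed 20x, while the SF=100 GCP ratio is closer to 14x, so the headline number depends on which machine and output format is being compared.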



##
content/blog/2025-04-10-fastest-tpch-generator.md:
##
@@ -0,0 +1,614 @@
+---
+layout: post
+title: tpchgen-rs World’s fastest open source TPC-H data generator, written in 
Rust
+date: 2025-04-10
+author: Andrew Lamb, Achraf B, and Sean Smith
+categories: [performance]
+---
+
+
+
+
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+
+
+3 members of the [Apache DataFusion] community used Rust and open source
+development to build [tpchgen-rs], a fully open TPC-H data generator over 10x
+faster than any other implementation  we know of.
+
+It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 
GB/s
+😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen`
+which takes 30 minutes1 (0.05GB/sec). On the same machine, it takes 
less than
+2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format.

Review Comment:
   Thank you -- added to the introduction



##
content/blog/2025-04-10-fastest-tpch-generator.md:
##
@@ -0,0 +1,614 @@
+---
+layout: post
+title: tpchgen-rs World’s fastest open source TPC-H data generator, written in 
Rust
+date: 2025-04-10
+author: Andrew Lamb, Achraf B, and Sean Smith
+categories: [performance]
+---
+
+
+
+
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+
+
+3 members of the [Apache DataFusion] community used Rust and open source
+development to build [tpchgen-rs], a fully open TPC-H data generator over 10x
+faster than any other implementation  we know of.
+
+It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 
GB/s
+😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen`
+which takes 30 minutes1 (0.05GB/sec). On the same machine, it takes 
less than
+2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format.
+It is finally convenient and efficient to run TPC-H queries locally when 
testing
+analytical engines such as DataFusion.
+
+
+
+**Figure 1**: Time to create TPC-H dataset for Scale Factor (see below) 1, 10,
+100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core 
GCP
+VM. For Scale Factor(SF) 100 `tpchgen` takes 1 minute and 14 seconds and
+[DuckDB] takes 17 minutes and 48 seconds. For SF=1000, `tpchgen` takes 10
+minutes and 26 seconds and uses about 5 GB of RAM at peak, and we could not measure
+DuckDB’s time as it [requires 647 GB of RAM], more than the 88 GB that was
+available on our test machine. The testing methodology is in the
+[documentation].
+
+[DuckDB]: https://duckdb.org
+[requires 647 GB of RAM]: 
https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator
+[documentation]: 
https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md
+
+This blog explains what TPC-H is, how we ported the vintage C data generator to
+Rust (yes, [RWIR]) and optimized its performance over the course of a few weeks
+of part-time work. We began this project so we can easily generate TPC-H data 
in
+[Apache DataFusion] and [GlareDB].
+
+[RWIR]: https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/
+[Apache DataFusion]: https://datafusion.apache.org/
+[GlareDB]: https://glaredb.com/
+
+# Try it for yourself
+
+The tool is entirely open source under the [Apache 2.0 
license](https://www.apache.org/licenses/LICENSE-2.0). Visit the [tpchgen-rs 
repository](https://github.com/clflushopt/tpchgen-rs) or try it for yourself by 
run the followi

Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-05 Thread via GitHub


kevinjqliu commented on code in PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029943760


##
content/blog/2025-04-10-fastest-tpch-generator.md:
##
@@ -0,0 +1,614 @@
+---
+layout: post
+title: tpchgen-rs World’s fastest open source TPC-H data generator, written in 
Rust
+date: 2025-04-10
+author: Andrew Lamb, Achraf B, and Sean Smith
+categories: [performance]
+---
+
+
+
+
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+
+
+3 members of the [Apache DataFusion] community used Rust and open source
+development to build [tpchgen-rs], a fully open TPC-H data generator over 10x
+faster than any other implementation  we know of.
+
+It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 
GB/s
+😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen`
+which takes 30 minutes1 (0.05GB/sec). On the same machine, it takes 
less than
+2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format.
+It is finally convenient and efficient to run TPC-H queries locally when 
testing
+analytical engines such as DataFusion.
+
+
+
+**Figure 1**: Time to create TPC-H dataset for Scale Factor (see below) 1, 10,
+100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core 
GCP
+VM. For Scale Factor(SF) 100 `tpchgen` takes 1 minute and 14 seconds and
+[DuckDB] takes 17 minutes and 48 seconds. For SF=1000, `tpchgen` takes 10
+minutes and 26 seconds and uses about 5 GB of RAM at peak, and we could not measure
+DuckDB’s time as it [requires 647 GB of RAM], more than the 88 GB that was
+available on our test machine. The testing methodology is in the
+[documentation].
+
+[DuckDB]: https://duckdb.org
+[requires 647 GB of RAM]: 
https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator
+[documentation]: 
https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md
+
+This blog explains what TPC-H is, how we ported the vintage C data generator to
+Rust (yes, [RWIR]) and optimized its performance over the course of a few weeks
+of part-time work. We began this project so we can easily generate TPC-H data 
in
+[Apache DataFusion] and [GlareDB].
+
+[RWIR]: https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/
+[Apache DataFusion]: https://datafusion.apache.org/
+[GlareDB]: https://glaredb.com/
+
+# Try it for yourself
+
+The tool is entirely open source under the [Apache 2.0 
license](https://www.apache.org/licenses/LICENSE-2.0). Visit the [tpchgen-rs 
repository](https://github.com/clflushopt/tpchgen-rs) or try it for yourself by 
running the following commands after [installing 
Rust](https://www.rust-lang.org/tools/install):
+
+```shell
+$ cargo install tpchgen-cli
+
+# create SF=1 in classic TBL format
+$ tpchgen-cli -s 1 
+
+# create SF=10 in Parquet
+$ tpchgen-cli -s 10 --format=parquet
+```
+
+# What is TPC-H / dbgen?
+
+The popular [TPC-H] benchmark (often referred to as TPCH) helps evaluate the
+performance of database systems on [OLAP] queries, the kind used to build BI
+dashboards.
+
+TPC-H has become a de facto standard for analytic systems. While there are 
[well
+known] limitations as the data and queries do not well represent many real 
world
+use cases, the majority of analytic database papers and industrial systems 
still
+use TPC-H query performance benchmarks as a baseline. You will inevitably find
+multiple results for  “`TPCH Performance `” in any
+search engine.
+
+The benchmark was created at a time when access to high performance analytical
+systems was not widespread, so the [Transaction Processing Performance Council]
+defined a process of formal result verification. More recently, given the broad
+availability of free and open source database systems, it is common for users 
to
+run and verify TPC-H performance themselves.
+
+TPC-H simulates a business environment with eight tables: `REGION`, `NATION`,
+`SUPPLIER`, `CUSTOMER`, `PART`, `PARTSUPP`, `ORDERS`, and `LINEITEM`. These
+tables are linked by foreign keys in a normalized schema representing a supply
+chain with parts, suppliers, customers and orders. The benchmark itself is 22
+SQL queries containing joins, aggregations, and sorting operations.
+
+The queries run against data created with [dbgen], a program
+written in a pre [C-99] dialect, which generates data in a format called *TBL*
+(example in Figure 2). `dbgen` creates data for each of the 8 tables for a
+certain *Scale Factor*, commonly abbreviated as SF. Example Scale Factors and
+corresponding dataset sizes are shown in Table 1. There is no theoretical upper
+bound on the Scale Factor.
+
+[TPC-H]: https://www.tpc.org/tpch/
+[OLAP]: https://en.wikipedia.org/wiki/Online_analytical_processing
+[well known]: https://www.vldb.org/pvldb/vol9/p204-leis.pdf
+[Transaction Processing Performance Council]: https://www.tpc.org/
+[dbgen]: https://

Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-05 Thread via GitHub


Xuanwo commented on PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#issuecomment-2781039446

   Thanks a lot for building this! As a heavy user of the TPC test suites, I am interested in our upcoming plans:
   
   - Will we have Python support so we can pip install it or even call it from Python?
   - Do we plan to support TPC-DS in the future?
   - What's its relationship with DF? Will we have a built-in function called tpchgen?





Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-05 Thread via GitHub


kevinjqliu commented on code in PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029942249


##
content/blog/2025-04-10-fastest-tpch-generator.md:
##
@@ -0,0 +1,614 @@
+---
+layout: post
+title: tpchgen-rs World’s fastest open source TPC-H data generator, written in 
Rust
+date: 2025-04-10
+author: Andrew Lamb, Achraf B, and Sean Smith
+categories: [performance]
+---
+
+
+
+
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+
+
+3 members of the [Apache DataFusion] community used Rust and open source
+development to build [tpchgen-rs], a fully open TPC-H data generator over 10x
+faster than any other implementation  we know of.
+
+It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 
GB/s
+😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen`
+which takes 30 minutes1 (0.05GB/sec). On the same machine, it takes 
less than
+2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format.
+It is finally convenient and efficient to run TPC-H queries locally when 
testing
+analytical engines such as DataFusion.
+
+
+
+**Figure 1**: Time to create TPC-H dataset for Scale Factor (see below) 1, 10,
+100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core 
GCP
+VM. For Scale Factor(SF) 100 `tpchgen` takes 1 minute and 14 seconds and
+[DuckDB] takes 17 minutes and 48 seconds. For SF=1000, `tpchgen` takes 10
+minutes and 26 seconds and uses about 5 GB of RAM at peak, and we could not measure
+DuckDB’s time as it [requires 647 GB of RAM], more than the 88 GB that was
+available on our test machine. The testing methodology is in the
+[documentation].
+
+[DuckDB]: https://duckdb.org
+[requires 647 GB of RAM]: 
https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator
+[documentation]: 
https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md
+
+This blog explains what TPC-H is, how we ported the vintage C data generator to
+Rust (yes, [RWIR]) and optimized its performance over the course of a few weeks
+of part-time work. We began this project so we can easily generate TPC-H data 
in
+[Apache DataFusion] and [GlareDB].
+
+[RWIR]: https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/
+[Apache DataFusion]: https://datafusion.apache.org/
+[GlareDB]: https://glaredb.com/
+
+# Try it for yourself
+
+The tool is entirely open source under the [Apache 2.0 
license](https://www.apache.org/licenses/LICENSE-2.0). Visit the [tpchgen-rs 
repository](https://github.com/clflushopt/tpchgen-rs) or try it for yourself by 
running the following commands after [installing 
Rust](https://www.rust-lang.org/tools/install):
+
+```shell
+$ cargo install tpchgen-cli
+
+# create SF=1 in classic TBL format
+$ tpchgen-cli -s 1 
+
+# create SF=10 in Parquet
+$ tpchgen-cli -s 10 --format=parquet
+```
+
+# What is TPC-H / dbgen?
+
+The popular [TPC-H] benchmark (often referred to as TPCH) helps evaluate the
+performance of database systems on [OLAP] queries, the kind used to build BI
+dashboards.
+
+TPC-H has become a de facto standard for analytic systems. While there are 
[well
+known] limitations as the data and queries do not well represent many real 
world
+use cases, the majority of analytic database papers and industrial systems 
still
+use TPC-H query performance benchmarks as a baseline. You will inevitably find
+multiple results for  “`TPCH Performance `” in any
+search engine.
+
+The benchmark was created at a time when access to high performance analytical
+systems was not widespread, so the [Transaction Processing Performance Council]
+defined a process of formal result verification. More recently, given the broad
+availability of free and open source database systems, it is common for users 
to
+run and verify TPC-H performance themselves.
+
+TPC-H simulates a business environment with eight tables: `REGION`, `NATION`,
+`SUPPLIER`, `CUSTOMER`, `PART`, `PARTSUPP`, `ORDERS`, and `LINEITEM`. These
+tables are linked by foreign keys in a normalized schema representing a supply
+chain with parts, suppliers, customers and orders. The benchmark itself is 22
+SQL queries containing joins, aggregations, and sorting operations.
+
+The queries run against data created with [dbgen], a program
+written in a pre [C-99] dialect, which generates data in a format called *TBL*
+(example in Figure 2). `dbgen` creates data for each of the 8 tables for a
+certain *Scale Factor*, commonly abbreviated as SF. Example Scale Factors and
+corresponding dataset sizes are shown in Table 1. There is no theoretical upper
+bound on the Scale Factor.
+
+[TPC-H]: https://www.tpc.org/tpch/
+[OLAP]: https://en.wikipedia.org/wiki/Online_analytical_processing
+[well known]: https://www.vldb.org/pvldb/vol9/p204-leis.pdf
+[Transaction Processing Performance Council]: https://www.tpc.org/
+[dbgen]: https://

Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-05 Thread via GitHub


kevinjqliu commented on code in PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029940320


##
content/blog/2025-04-10-fastest-tpch-generator.md:
##
@@ -0,0 +1,614 @@
+---
+layout: post
+title: tpchgen-rs World’s fastest open source TPC-H data generator, written in 
Rust
+date: 2025-04-10
+author: Andrew Lamb, Achraf B, and Sean Smith
+categories: [performance]
+---
+
+
+
+
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+
+
+3 members of the [Apache DataFusion] community used Rust and open source
+development to build [tpchgen-rs], a fully open TPC-H data generator over 10x
+faster than any other implementation  we know of.
+
+It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 
GB/s
+😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen`
+which takes 30 minutes1 (0.05GB/sec). On the same machine, it takes 
less than
+2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format.
+It is finally convenient and efficient to run TPC-H queries locally when 
testing
+analytical engines such as DataFusion.
+
+
+
+**Figure 1**: Time to create TPC-H dataset for Scale Factor (see below) 1, 10,
+100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core 
GCP
+VM. For Scale Factor(SF) 100 `tpchgen` takes 1 minute and 14 seconds and
+[DuckDB] takes 17 minutes and 48 seconds. For SF=1000, `tpchgen` takes 10
+minutes and 26 seconds and uses about 5 GB of RAM at peak, and we could not measure
+DuckDB’s time as it [requires 647 GB of RAM], more than the 88 GB that was
+available on our test machine. The testing methodology is in the
+[documentation].
+
+[DuckDB]: https://duckdb.org
+[requires 647 GB of RAM]: 
https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator
+[documentation]: 
https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md
+
+This blog explains what TPC-H is, how we ported the vintage C data generator to
+Rust (yes, [RWIR]) and optimized its performance over the course of a few weeks
+of part-time work. We began this project so we can easily generate TPC-H data 
in
+[Apache DataFusion] and [GlareDB].
+
+[RWIR]: https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/
+[Apache DataFusion]: https://datafusion.apache.org/
+[GlareDB]: https://glaredb.com/
+
+# Try it for yourself
+
+The tool is entirely open source under the [Apache 2.0 
license](https://www.apache.org/licenses/LICENSE-2.0). Visit the [tpchgen-rs 
repository](https://github.com/clflushopt/tpchgen-rs) or try it for yourself by 
running the following commands after [installing 
Rust](https://www.rust-lang.org/tools/install):
+
+```shell
+$ cargo install tpchgen-cli
+
+# create SF=1 in classic TBL format
+$ tpchgen-cli -s 1 
+
+# create SF=10 in Parquet
+$ tpchgen-cli -s 10 --format=parquet
+```
+
+# What is TPC-H / dbgen?
+
+The popular [TPC-H] benchmark (often referred to as TPCH) helps evaluate the
+performance of database systems on [OLAP] queries, the kind used to build BI
+dashboards.
+
+TPC-H has become a de facto standard for analytic systems. While there are 
[well
+known] limitations as the data and queries do not well represent many real 
world
+use cases, the majority of analytic database papers and industrial systems 
still
+use TPC-H query performance benchmarks as a baseline. You will inevitably find
+multiple results for  “`TPCH Performance `” in any

Review Comment:
   ```suggestion
   multiple results for  “`TPCH Performance `” in any
   ```






Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-05 Thread via GitHub


kevinjqliu commented on code in PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029939333


##
content/blog/2025-04-10-fastest-tpch-generator.md:
##
@@ -0,0 +1,614 @@
+---
+layout: post
+title: tpchgen-rs World’s fastest open source TPC-H data generator, written in 
Rust
+date: 2025-04-10
+author: Andrew Lamb, Achraf B, and Sean Smith
+categories: [performance]
+---
+
+
+
+
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+
+
+3 members of the [Apache DataFusion] community used Rust and open source
+development to build [tpchgen-rs], a fully open TPC-H data generator over 10x
+faster than any other implementation  we know of.
+
+It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 
GB/s
+😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen`
+which takes 30 minutes1 (0.05GB/sec). On the same machine, it takes 
less than
+2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format.
+It is finally convenient and efficient to run TPC-H queries locally when 
testing
+analytical engines such as DataFusion.
+
+
+
+**Figure 1**: Time to create TPC-H dataset for Scale Factor (see below) 1, 10,
+100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core 
GCP
+VM. For Scale Factor(SF) 100 `tpchgen` takes 1 minute and 14 seconds and
+[DuckDB] takes 17 minutes and 48 seconds. For SF=1000, `tpchgen` takes 10
+minutes and 26 seconds and uses about 5 GB of RAM at peak, and we could not measure
+DuckDB’s time as it [requires 647 GB of RAM], more than the 88 GB that was
+available on our test machine. The testing methodology is in the
+[documentation].
+
+[DuckDB]: https://duckdb.org
+[requires 647 GB of RAM]: https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator
+[documentation]: https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md
+
+This blog explains what TPC-H is, how we ported the vintage C data generator to
+Rust (yes, [RWIR]) and optimized its performance over the course of a few weeks
+of part-time work. We began this project so we can easily generate TPC-H data in
+[Apache DataFusion] and [GlareDB].
+
+[RWIR]: https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/
+[Apache DataFusion]: https://datafusion.apache.org/
+[GlareDB]: https://glaredb.com/
+
+# Try it for yourself
+
+The tool is entirely open source under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0). Visit the [tpchgen-rs repository](https://github.com/clflushopt/tpchgen-rs) or try it for yourself by running the following commands after [installing Rust](https://www.rust-lang.org/tools/install):
+
+```shell
+$ cargo install tpchgen-cli
+
+# create SF=1 in classic TBL format
+$ tpchgen-cli -s 1 
+
+# create SF=10 in Parquet
+$ tpchgen-cli -s 10 --format=parquet
+```
+
+# What is TPC-H / dbgen?
+
+The popular [TPC-H] benchmark (often referred to as TPCH) helps evaluate the
+performance of database systems on [OLAP] queries, the kind used to build BI
+dashboards.
+
+TPC-H has become a de facto standard for analytic systems. While there are [well
+known] limitations, as the data and queries do not represent many real-world
+use cases well, the majority of analytic database papers and industrial systems still
+use TPC-H query performance benchmarks as a baseline. You will inevitably find
+multiple results for “`TPCH Performance `” in any
+search engine.
+
+The benchmark was created at a time when access to high performance analytical
+systems was not widespread, so the [Transaction Processing Performance Council]
+defined a process of formal result verification. More recently, given the broad
+availability of free and open source database systems, it is common for users to
+run and verify TPC-H performance themselves.
+
+TPC-H simulates a business environment with eight tables: `REGION`, `NATION`,
+`SUPPLIER`, `CUSTOMER`, `PART`, `PARTSUPP`, `ORDERS`, and `LINEITEM`. These
+tables are linked by foreign keys in a normalized schema representing a supply
+chain with parts, suppliers, customers and orders. The benchmark itself is 22
+SQL queries containing joins, aggregations, and sorting operations.
+
+The queries run against data created with [dbgen], a program
+written in a pre [C-99] dialect, which generates data in a format called *TBL*
+(example in Figure 2). `dbgen` creates data for each of the 8 tables for a
+certain *Scale Factor*, commonly abbreviated as SF. Example Scale Factors and
+corresponding dataset sizes are shown in Table 1. There is no theoretical upper
+bound on the Scale Factor.
+
+[TPC-H]: https://www.tpc.org/tpch/
+[OLAP]: https://en.wikipedia.org/wiki/Online_analytical_processing
+[well known]: https://www.vldb.org/pvldb/vol9/p204-leis.pdf
+[Transaction Processing Performance Council]: https://www.tpc.org/
+[dbgen]: https://

Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-05 Thread via GitHub


kevinjqliu commented on code in PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029938609



Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-05 Thread via GitHub


kevinjqliu commented on code in PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029938258



Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-05 Thread via GitHub


kevinjqliu commented on code in PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029938078



Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-05 Thread via GitHub


kevinjqliu commented on code in PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029937599


##
content/blog/2025-04-10-fastest-tpch-generator.md:
##
@@ -0,0 +1,614 @@
+---
+layout: post
+title: tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust
+date: 2025-04-10
+author: Andrew Lamb, Achraf B, and Sean Smith
+categories: [performance]
+---
+
+
+
+
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+
+
+3 members of the [Apache DataFusion] community used Rust and open source
+development to build [tpchgen-rs], a fully open TPC-H data generator over 10x
+faster than any other implementation we know of.
+
+It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 GB/s
+😎) on a MacBook Air M3 with 16GB of memory, compared to the classic `dbgen`
+which takes 30 minutes<sup>1</sup> (0.05GB/sec). On the same machine, it takes less than

Review Comment:
   throughput 1.4GB/s vs 0.05GB/sec = 28x 
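   
   As a quick sanity check of that ratio, here is a minimal sketch that uses `awk` for the floating-point math. The variable names are illustrative, and the inputs are the rounded figures quoted in the draft (1.4 GB/s, 0.05 GB/s, 72.23 seconds, 30 minutes), so the results are only approximate:

   ```shell
   # Inputs copied from the draft; the variable names are hypothetical.
   tpchgen_gbps=1.4
   dbgen_gbps=0.05
   tpchgen_secs=72.23
   dbgen_secs=$((30 * 60))

   # Ratio of the quoted throughputs.
   throughput_ratio=$(awk -v a="$tpchgen_gbps" -v b="$dbgen_gbps" 'BEGIN { printf "%.1f", a / b }')
   # Ratio of the quoted wall-clock times.
   time_ratio=$(awk -v a="$dbgen_secs" -v b="$tpchgen_secs" 'BEGIN { printf "%.1f", a / b }')

   echo "throughput ratio: ${throughput_ratio}x"
   echo "time ratio: ${time_ratio}x"
   ```

   The throughput figures give roughly 28x and the wall-clock times roughly 25x; the gap comes from rounding in the quoted throughputs, and both are well above 10x.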






Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-05 Thread via GitHub


clflushopt commented on code in PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029924176


##
content/blog/2025-04-10-fastest-tpch-generator.md:
##
@@ -0,0 +1,614 @@
+---
+layout: post
+title: tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust
+date: 2025-04-10
+author: Andrew Lamb, Achraf B, and Sean Smith
+categories: [performance]
+---
+
+
+
+
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+
+
+3 members of the [Apache DataFusion] community used Rust and open source
+development to build [tpchgen-rs], a fully open TPC-H data generator over 10x
+faster than any other implementation we know of.
+
+It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 GB/s
+😎) on a MacBook Air M3 with 16GB of memory, compared to the classic `dbgen`
+which takes 30 minutes<sup>1</sup> (0.05GB/sec). On the same machine, it takes less than
+2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format.

Review Comment:
   A less detailed, higher-level overview of the timing, obtained by running it as a script:
   
   ```
   
   jmp@comet ~/G/P/tpchgen-rs (main) [1]> time duckdb -init bench.sql -no-stdin
   -- Loading resources from bench.sql
   100% ▕▏
   ┌─────────┐
   │ Success │
   │ boolean │
   ├─────────┤
   │ 0 rows  │
   └─────────┘
   Run Time (s): real 717.838 user 1787.775069 sys 83.976228
   100% ▕▏
   Run Time (s): real 5.471 user 11.592328 sys 12.049458
   100% ▕▏
   Run Time (s): real 1015.519 user 456.350229 sys 104.793712
   Run Time (s): real 0.010 user 0.002420 sys 0.002541
   100% ▕▏
   Run Time (s): real 921.422 user 75.619220 sys 21.138167
   100% ▕▏
   Run Time (s): real 2.647 user 12.597744 sys 1.699293
   100% ▕▏
   Run Time (s): real 17.637 user 38.532114 sys 52.235873
   Run Time (s): real 0.109 user 0.000459 sys 0.000680
   Run Time (s): real 0.300 user 0.610369 sys 0.809258
   
   
   Executed in   44.71 mins    fish           external
      usr time   39.72 mins    0.16 millis   39.72 mins
      sys time    4.63 mins    1.44 millis    4.63 mins
   ```
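   
   The per-statement `real` times above can be added up to cross-check the total reported by `time`. A minimal sketch, assuming the nine `Run Time` values are copied by hand from the log (the variable name is illustrative):

   ```shell
   # "real" seconds for each statement, copied from the log above.
   total_mins=$(printf '%s\n' 717.838 5.471 1015.519 0.010 921.422 2.647 17.637 0.109 0.300 |
     awk '{ sum += $1 } END { printf "%.1f", sum / 60 }')

   echo "total: ${total_mins} mins"
   ```

   The sum comes to about 44.7 minutes, in line with the 44.71 minutes reported at the end of the run.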






Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-05 Thread via GitHub


clflushopt commented on code in PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029904651


##
content/blog/2025-04-10-fastest-tpch-generator.md:
##
@@ -0,0 +1,614 @@
+---
+layout: post
+title: tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust
+date: 2025-04-10
+author: Andrew Lamb, Achraf B, and Sean Smith
+categories: [performance]
+---
+
+
+
+
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+
+
+3 members of the [Apache DataFusion] community used Rust and open source
+development to build [tpchgen-rs], a fully open TPC-H data generator over 10x
+faster than any other implementation we know of.
+
+It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 GB/s
+😎) on a MacBook Air M3 with 16GB of memory, compared to the classic `dbgen`
+which takes 30 minutes<sup>1</sup> (0.05GB/sec). On the same machine, it takes less than
+2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format.

Review Comment:
   ```
   ~/G/P/tpchgen-rs (cl/feat/bump-crates-to-0.1.1)> duckdb
   v1.2.0 5f5512b827
   Enter ".help" for usage hints.
   Connected to a transient in-memory database.
   Use ".open FILENAME" to reopen on a persistent database.
   D INSTALL tpch;
   D LOAD tpch;
   D
   D .open test
   D .timer on
   D CALL dbgen(sf = 100);
   
   -- Export each table to Parquet format
   copy customer to 'customer.parquet' (FORMAT parquet);
   copy lineitem to 'lineitem.parquet' (FORMAT parquet);
   copy nation to 'supplier.parquet' (FORMAT parquet);
   copy orders to 'supplier.parquet' (FORMAT parquet);
   copy part to 'part.parquet' (FORMAT parquet);
   copy partsupp to 'partsupp.parquet' (FORMAT parquet);
   copy region to 'region.parquet' (FORMAT parquet);
   copy supplier to 'supplier.parquet' (FORMAT parquet);
   
   100%▕
   ┌─────────┐
   │ Success │
   │ boolean │
   ├─────────┤
   │ 0 rows  │
   └─────────┘
   Run Time (s): real 687.659 user 1040.134996 sys 83.434645
   D
   D -- Export each table to Parquet format
   D copy customer to 'customer.parquet' (FORMAT parquet);
   100%▕███
   Run Time (s): real 2.985 user 9.007536 sys 2.676405
   D copy lineitem to 'lineitem.parquet' (FORMAT parquet);
   100% ▕█
   Run Time (s): real 108.328 user 395.075322 sys 90.606267
   D copy nation to 'supplier.parquet' (FORMAT parquet);
   Run Time (s): real 0.010 user 0.002566 sys 0.001946
   D copy orders to 'supplier.parquet' (FORMAT parquet);
   100%▕███
   Run Time (s): real 19.461 user 78.310336 sys 17.984476
   D copy part to 'part.parquet' (FORMAT parquet);
   100%▕███
   Run Time (s): real 10.717 user 13.437501 sys 37.076887
   D copy partsupp to 'partsupp.parquet' (FORMAT parquet);
   100%▕
   Run Time (s): real 11.286 user 31.645645 sys 14.751533
   D copy region to 'region.parquet' (FORMAT parquet);
   Run Time (s): real 0.003 user 0.000454 sys 0.000350
   D copy supplier to 'supplier.parquet' (FORMAT parquet);
   Run Time (s): real 0.500 user 0.613920 sys 0.130899
   ```






Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-05 Thread via GitHub


alamb commented on code in PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029838372


##
content/blog/2025-04-10-fastest-tpch-generator.md:
##
@@ -0,0 +1,614 @@
+---
+layout: post
+title: tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust
+date: 2025-04-10
+author: Andrew Lamb, Achraf B, and Sean Smith
+categories: [performance]
+---
+
+
+
+
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+
+
+3 members of the [Apache DataFusion] community used Rust and open source
+development to build [tpchgen-rs], a fully open TPC-H data generator over 10x
+faster than any other implementation we know of.
+
+It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 GB/s
+😎) on a MacBook Air M3 with 16GB of memory, compared to the classic `dbgen`
+which takes 30 minutes<sup>1</sup> (0.05GB/sec). On the same machine, it takes less than
+2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format.

Review Comment:
   @clflushopt,
   
   When reading this introduction I think it would also be nice to report the
   time taken to create Scale Factor 100 using DuckDB (my laptop has more memory
   and I think the numbers with more modest specs are more compelling).
   
   Is there any chance you can time how long it takes to run this script on
   your machine?
   
   ```sql
   INSTALL tpch;
   LOAD tpch;
   
   .open test
   .timer on
   CALL dbgen(sf = 100);
   
   -- Export each table to Parquet format
   copy customer to 'customer.parquet' (FORMAT parquet);
   copy lineitem to 'lineitem.parquet' (FORMAT parquet);
   copy nation to 'nation.parquet' (FORMAT parquet);
   copy orders to 'orders.parquet' (FORMAT parquet);
   copy part to 'part.parquet' (FORMAT parquet);
   copy partsupp to 'partsupp.parquet' (FORMAT parquet);
   copy region to 'region.parquet' (FORMAT parquet);
   copy supplier to 'supplier.parquet' (FORMAT parquet);
   ```






Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-05 Thread via GitHub


alamb commented on code in PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029836024


##
content/blog/2025-04-10-fastest-tpch-generator.md:
##
@@ -0,0 +1,617 @@
+---
+layout: post
+title: `tpchgen-rs` World’s fastest open source TPCH data generator, written in Rust

Review Comment:
   fixed






Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-05 Thread via GitHub


alamb commented on code in PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029835787


##
content/blog/2025-04-10-fastest-tpch-generator.md:
##
@@ -0,0 +1,617 @@
+---
+layout: post
+title: `tpchgen-rs` World’s fastest open source TPCH data generator, written in Rust
+date: 2025-04-10
+author: Andrew Lamb, Achraf B, and Sean Smith
+categories: [performance]
+---
+
+
+
+
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+
+
+3 members of the [Apache DataFusion] community used Rust and open source
+development to build [tpchgen-rs], a fully open TPCH data generator over 10x
+faster than any other implementation we know of.
+
+It is now possible to create the TPCH SF=100 dataset in 72.23 seconds (1.4 GB/s
+😎) on a MacBook Air M3 with 16GB of memory, compared to the classic `dbgen`
+which takes 30 minutes[^1] (0.05GB/sec). On the same machine, it takes less than
+2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format.
+
+
+
+Finally, it is convenient and efficient to run TPCH queries locally when testing
+analytical engines such as DataFusion.
+
+
+
+**Figure 1**: Time to create the TPCH dataset for Scale Factor (see below) 1, 10,
+100 and 1000 as 8 individual SNAPPY compressed Parquet files using a 22 core GCP
+VM. For Scale Factor (SF) 100, `tpchgen` takes 1 minute and 14 seconds and
+[DuckDB] takes 17 minutes and 48 seconds. For SF=1000, `tpchgen` takes 10
+minutes and 26 seconds and uses about 5 GB of RAM at peak, and we could not measure
+DuckDB’s time as it [requires 647 GB of RAM], more than the 88 GB that was
+available on our test machine. The testing methodology is in the
+[documentation].
+
+[DuckDB]: https://duckdb.org
+[requires 647 GB of RAM]: https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator
+[documentation]: https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md
+
+This blog post explains what TPCH is, how we ported the vintage C data generator to
+Rust (yes, [RIIR]) and optimized its performance over the course of a few weeks
+of part-time work. We began this project so that we could easily generate TPCH data in
+[Apache DataFusion] and [GlareDB].
+
+[RIIR]: https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/
+[Apache DataFusion]: https://datafusion.apache.org/
+[GlareDB]: https://glaredb.com/
+
+# Try it for yourself
+
+The tool is entirely open source under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0). Visit the [tpchgen-rs repository](https://github.com/clflushopt/tpchgen-rs) or try it for yourself by running the following commands after [installing Rust](https://www.rust-lang.org/tools/install):
+
+```shell
+$ cargo install tpchgen-cli
+
+# create SF=1 in classic TBL format
+$ tpchgen-cli -s 1 
+
+# create SF=10 in Parquet
+$ tpchgen-cli -s 10 --format=parquet
+```
+
+# What is TPCH / dbgen?
+
+The popular [TPC-H] benchmark (commonly referred to as TPCH) helps evaluate the
+performance of database systems on [OLAP] queries, the kind used to build BI
+dashboards.
+
+TPCH has become a de facto standard for analytic systems. While there are [well
+known] limitations as the data and queries do not represent many real-world
+use cases well, the majority of analytic database papers and industrial systems still
+use TPCH query performance benchmarks as a baseline. You will inevitably find
+multiple results for “`TPCH Performance `” in any
+search engine.
+
+The benchmark was created at a time when access to high-performance analytical
+systems was not widespread, so the [Transaction Processing Performance Council]
+defined a process of formal result verification. More recently, given the broad
+availability of free and open source database systems, it is common for users to
+run and verify TPCH performance themselves.
+
+TPCH simulates a business environment with eight tables: `REGION`, `NATION`,
+`SUPPLIER`, `CUSTOMER`, `PART`, `PARTSUPP`, `ORDERS`, and `LINEITEM`. These
+tables are linked by foreign keys in a normalized schema representing a supply
+chain with parts, suppliers, customers and orders. The benchmark itself is 22
+SQL queries containing joins, aggregations, and sorting operations.
+
+The queries run against data created with [dbgen], a program
+written in a pre [C-99] dialect, which generates data in a format called *TBL*
+(example in Figure 2). `dbgen` creates data for each of the 8 tables for a
+certain *Scale Factor*, commonly abbreviated as SF. Example Scale Factors and
+corresponding dataset sizes are shown in Table 1. There is no theoretical upper
+bound on the Scale Factor.
+
+[TPC-H]: https://www.tpc.org/tpch/
+[OLAP]: https://en.wikipedia.org/wiki/Online_analytical_processing
+[well known]: https://www.vldb.org/pvldb/vol9/p204-leis.pdf
+[Transaction Processing Performance Council]: https://www.tpc.org/
+[dbgen]: https://githu
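
The throughput figures in the quoted draft above can be sanity-checked with a little arithmetic. The following is a minimal sketch, not part of the PR: it assumes that SF=100 corresponds to roughly 100 GB of TBL data (the quoted text does not state the size), and the helper name `throughput_gb_per_s` is hypothetical.

```python
# Sanity check of the generation rates quoted in the draft.
# Assumption: SF=100 is treated as ~100 GB of TBL output; the quoted
# text gives the timings but not the dataset size.

def throughput_gb_per_s(size_gb: float, seconds: float) -> float:
    """Return the average generation throughput in GB/s."""
    return size_gb / seconds

SF100_SIZE_GB = 100.0      # assumed dataset size (not from the draft)
TPCHGEN_SECONDS = 72.23    # tpchgen-rs timing quoted in the draft
DBGEN_SECONDS = 30 * 60    # "30 minutes" for classic dbgen, per the draft

tpchgen_rate = throughput_gb_per_s(SF100_SIZE_GB, TPCHGEN_SECONDS)
dbgen_rate = throughput_gb_per_s(SF100_SIZE_GB, DBGEN_SECONDS)

print(f"tpchgen-rs: {tpchgen_rate:.2f} GB/s")
print(f"dbgen:      {dbgen_rate:.3f} GB/s")
print(f"speedup:    {tpchgen_rate / dbgen_rate:.1f}x")
```

Under that assumption the rates come out at about 1.38 GB/s and 0.056 GB/s, in line with the rounded 1.4 GB/s and 0.05 GB/s quoted in the draft.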

Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-05 Thread via GitHub


alamb commented on code in PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029835746



Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-05 Thread via GitHub


alamb commented on code in PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029835566



Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-05 Thread via GitHub


alamb commented on code in PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029834605


##
content/blog/2025-04-10-fastest-tpch-generator.md:
##
@@ -0,0 +1,617 @@
+---
+layout: post
+title: `tpchgen-rs` World’s fastest open source TPCH data generator, written in Rust
+date: 2025-04-10
+author: Andrew Lamb, Achraf B, and Sean Smith
+categories: [performance]
+---
+
+
+
+
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+
+
+3 members of the [Apache DataFusion] community used Rust and open source
+development to build [tpchgen-rs], a fully open TPCH data generator over 10x
+faster than any other implementation we know of.
+
+It is now possible to create the TPCH SF=100 dataset in 72.23 seconds (1.4 GB/s
+😎) on a MacBook Air M3 with 16 GB of memory, compared to the classic `dbgen`
+which takes 30 minutes[^1] (0.05 GB/s). On the same machine, it takes less than
+2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format.
+
+
+
+Finally, it is convenient and efficient to run TPCH queries locally when testing

Review Comment:
   Updated in 12df7fa






Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-05 Thread via GitHub


alamb commented on code in PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029834323


##
content/blog/2025-04-10-fastest-tpch-generator.md:
##
@@ -0,0 +1,617 @@
+---
+layout: post
+title: `tpchgen-rs` World’s fastest open source TPCH data generator, written in Rust
+date: 2025-04-10
+author: Andrew Lamb, Achraf B, and Sean Smith
+categories: [performance]
+---
+
+
+
+
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+
+
+3 members of the [Apache DataFusion] community used Rust and open source
+development to build [tpchgen-rs], a fully open TPCH data generator over 10x

Review Comment:
   Thank you -- updated in 4eec529






Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-04 Thread via GitHub


XiangpengHao commented on PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#issuecomment-2779791930

   This is a great tool; I've been hoping for it for years! Nice work!
   
   I previously relied on DuckDB to generate TPC-H (as [suggested](https://xuanwo.io/links/2025/02/duckdb-is-the-best-tpc-data-generator/) by @Xuanwo), and now we finally have a rusty version!





Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-04 Thread via GitHub


XiangpengHao commented on code in PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029469488



Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-04 Thread via GitHub


XiangpengHao commented on code in PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029467839



Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-04 Thread via GitHub


XiangpengHao commented on code in PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029467529


##
content/blog/2025-04-10-fastest-tpch-generator.md:
##
@@ -0,0 +1,617 @@
+---
+layout: post
+title: `tpchgen-rs` World’s fastest open source TPCH data generator, written 
in Rust
+date: 2025-04-10
+author: Andrew Lamb, Achraf B, and Sean Smith
+categories: [performance]
+---
+
+
+
+
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+
+
+3 members of the [Apache DataFusion] community used Rust and open source
+development to build [tpchgen-rs], a fully open TPCH data generator over 10x
+faster than any other implementation  we know of.
+
+It is now possible to create the TPCH SF=100 dataset in 72.23 seconds (1.4 GB/s
+😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen`
+which takes 30 minutes[^1] (0.05GB/sec). On the same machine, it takes less 
than
+2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format.
+
+
+
+Finally, it is convenient and efficient to run TPCH queries locally when 
testing
+analytical engines such as DataFusion.
+
+
+
+**Figure 1**: Time to create TPCH dataset for Scale Factor (see below) 1, 10,
+100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core 
GCP
+VM. For Scale Factor(SF) 100 `tpchgen` takes 1 minute and 14 seconds and
+[DuckDB] takes 17 minutes and 48 seconds. For SF=1000, `tpchgen` takes 10
+minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure
+DuckDB’s time as it [requires 647 GB of RAM], more than the 88 GB that was
+available on our test machine. The testing methodology is in the
+[documentation].
+
+[DuckDB]: https://duckdb.org
+[requires 647 GB of RAM]: 
https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator
+[documentation]: 
https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md
+
+This blog explains what TPCH is, how we ported the vintage C data generator to
+Rust (yes, [RWIR]) and optimized its performance over the course of a few weeks
+of part-time work. We began this project so we can easily generate TPCH data in
+[Apache DataFusion] and [GlareDB].
+
+[RWIR]: https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/
+[Apache DataFusion]: https://datafusion.apache.org/
+[GlareDB]: https://glaredb.com/
+
+# Try if for yourself
+
+The tool is entirely open source under the [Apache 2.0 
license](https://www.apache.org/licenses/LICENSE-2.0). Visit the [tpchgen-rs 
repository](https://github.com/clflushopt/tpchgen-rs) or try it for yourself by 
run the following commands after [installing 
Rust](https://www.rust-lang.org/tools/install):
+
+```shell
+$ cargo install tpchgen-cli
+
+# create SF=1 in classic TBL format
+$ tpchgen-cli -s 1 
+
+# create SF=10 in Parquet
+$ tpchgen-cli -s 10 --format=parquet
+```
+
+# What is TPCH / dbgen?
+
+The popular [TPC-H] benchmark (commonly referred to as TPCH) helps evaluate the
+performance of database systems on [OLAP] queries*, *the kind used to build BI
+dashboards.
+
+TPCH has become a de facto standard for analytic systems. While there are [well
+known] limitations as the data and queries do not well represent many real 
world
+use cases, the majority of analytic database papers and industrial systems 
still
+use TPCH query performance benchmarks as a baseline. You will inevitably find
+multiple results for  “`TPCH Performance `” in any
+search engine.
+
+The benchmark was created at a time when access to high performance analytical
+systems was not widespread, so the [Transaction Processing Performance Council]
+defined a process of formal result verification. More recently, given the broad
+availability of free and open source database systems, it is common for users 
to
+run and verify TPCH performance themselves.
+
+TPCH simulates a business environment with eight tables: `REGION`, `NATION`,
+`SUPPLIER`, `CUSTOMER`, `PART`, `PARTSUPP`, `ORDERS`, and `LINEITEM`. These
+tables are linked by foreign keys in a normalized schema representing a supply
+chain with parts, suppliers, customers and orders. The benchmark itself is 22
+SQL queries containing joins, aggregations, and sorting operations.
+
+The queries run against data created with [dbgen], a program
+written in a pre [C-99] dialect, which generates data in a format called *TBL*
+(example in Figure 2). `dbgen` creates data for each of the 8 tables for a
+certain *Scale Factor*, commonly abbreviated as SF. Example Scale Factors and
+corresponding dataset sizes are shown in Table 1. There is no theoretical upper
+bound on the Scale Factor.
+
+[TPC-H]: https://www.tpc.org/tpch/
+[OLAP]: https://en.wikipedia.org/wiki/Online_analytical_processing
+[well known]: https://www.vldb.org/pvldb/vol9/p204-leis.pdf
+[Transaction Processing Performance Council]: https://www.tpc.org/
+[dbgen]: https:
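
As a quick sanity check on the throughput figures quoted in the post's introduction (1.4 GB/s for `tpchgen-rs`, 0.05 GB/sec for `dbgen`), the arithmetic can be sketched in shell. This assumes the SF=100 TBL dataset is roughly 100 GB, following the usual rule of thumb of about 1 GB per unit of Scale Factor; the helper name `throughput` is hypothetical and is not part of `tpchgen-cli`.

```shell
# throughput SIZE_GB SECONDS
# Prints the average generation rate in GB/s with three decimals.
# Assumption: SF=100 in TBL format is about 100 GB.
throughput() {
  awk -v gb="$1" -v s="$2" 'BEGIN { printf "%.3f\n", gb / s }'
}

throughput 100 72.23   # tpchgen-rs: 72.23 seconds for SF=100
throughput 100 1800    # classic dbgen: 30 minutes = 1800 seconds
```

Under that size assumption both results land close to the rounded figures in the post, and the ratio between them is about 25x.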

Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-04 Thread via GitHub


andygrove commented on code in PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029454642


##
content/blog/2025-04-10-fastest-tpch-generator.md:
##
@@ -0,0 +1,617 @@
+---
+layout: post
+title: `tpchgen-rs` World’s fastest open source TPCH data generator, written in Rust
+date: 2025-04-10
+author: Andrew Lamb, Achraf B, and Sean Smith
+categories: [performance]
+---
+
+
+
+
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+
+
+3 members of the [Apache DataFusion] community used Rust and open source
+development to build [tpchgen-rs], a fully open TPCH data generator over 10x

Review Comment:
   According to the specification, it is `TPC-H` rather than `TPCH`. 






Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-04 Thread via GitHub


timsaucer commented on code in PR #67:
URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029415412


##
content/blog/2025-04-10-fastest-tpch-generator.md:
##
@@ -0,0 +1,617 @@
+---
+layout: post
+title: `tpchgen-rs` World’s fastest open source TPCH data generator, written in Rust

Review Comment:
   The title doesn't render correctly. We may just need to remove the backticks and accept the formatting.
   
   https://datafusion.staged.apache.org/blog/2025/04/10/fastest-tpch-generator/



##
content/blog/2025-04-10-fastest-tpch-generator.md:
##
@@ -0,0 +1,617 @@
+---
+layout: post
+title: `tpchgen-rs` World’s fastest open source TPCH data generator, written in Rust
+date: 2025-04-10
+author: Andrew Lamb, Achraf B, and Sean Smith
+categories: [performance]
+---
+
+
+
+
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+
+
+3 members of the [Apache DataFusion] community used Rust and open source
+development to build [tpchgen-rs], a fully open TPCH data generator over 10x
+faster than any other implementation  we know of.
+
+It is now possible to create the TPCH SF=100 dataset in 72.23 seconds (1.4 GB/s
+😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen`
+which takes 30 minutes[^1] (0.05GB/sec). On the same machine, it takes less than
+2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format.
+
+
+
+Finally, it is convenient and efficient to run TPCH queries locally when testing

Review Comment:
   Nit: It wasn't clear to me if this is "finally" as in "my final argument" or as in "we can finally do..." If the intent is the latter, maybe "It is finally convenient" instead.



##
content/blog/2025-04-10-fastest-tpch-generator.md:
##
@@ -0,0 +1,617 @@
+---
+layout: post
+title: `tpchgen-rs` World’s fastest open source TPCH data generator, written in Rust
+date: 2025-04-10
+author: Andrew Lamb, Achraf B, and Sean Smith
+categories: [performance]
+---
+
+
+
+
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+
+
+3 members of the [Apache DataFusion] community used Rust and open source
+development to build [tpchgen-rs], a fully open TPCH data generator over 10x
+faster than any other implementation  we know of.
+
+It is now possible to create the TPCH SF=100 dataset in 72.23 seconds (1.4 GB/s
+😎) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen`
+which takes 30 minutes[^1] (0.05GB/sec). On the same machine, it takes less than
+2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format.
+
+
+
+Finally, it is convenient and efficient to run TPCH queries locally when testing
+analytical engines such as DataFusion.
+
+
+
+**Figure 1**: Time to create TPCH dataset for Scale Factor (see below) 1, 10,
+100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core GCP
+VM. For Scale Factor (SF) 100 `tpchgen` takes 1 minute and 14 seconds and
+[DuckDB] takes 17 minutes and 48 seconds. For SF=1000, `tpchgen` takes 10
+minutes and 26 seconds and uses about 5 GB of RAM at peak, and we could not measure
+DuckDB’s time as it [requires 647 GB of RAM], more than the 88 GB that was
+available on our test machine. The testing methodology is in the
+[documentation].
+
+[DuckDB]: https://duckdb.org
+[requires 647 GB of RAM]: https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator
+[documentation]: https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md
+
+This blog explains what TPCH is, how we ported the vintage C data generator to
+Rust (yes, [RWIR]) and optimized its performance over the course of a few weeks
+of part-time work. We began this project so we can easily generate TPCH data in
+[Apache DataFusion] and [GlareDB].
+
+[RWIR]: https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/
+[Apache DataFusion]: https://datafusion.apache.org/
+[GlareDB]: https://glaredb.com/
+
+# Try it for yourself
+
+The tool is entirely open source under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0). Visit the [tpchgen-rs repository](https://github.com/clflushopt/tpchgen-rs) or try it for yourself by running the following commands after [installing Rust](https://www.rust-lang.org/tools/install):
+
+```shell
+$ cargo install tpchgen-cli
+
+# create SF=1 in classic TBL format
+$ tpchgen-cli -s 1 
+
+# create SF=10 in Parquet
+$ tpchgen-cli -s 10 --format=parquet
+```
+
+# What is TPCH / dbgen?
+
+The popular [TPC-H] benchmark (commonly referred to as TPCH) helps evaluate the
+performance of database systems on [OLAP] queries, the kind used to build BI
+dashboards.
+
+TPCH has become a de facto standard for analytic systems. While there are [well
+known] limitations as the data and queries do not well represent many real world
+use cases, the majori