This is an automated email from the ASF dual-hosted git repository. github-bot pushed a commit to branch asf-staging in repository https://gitbox.apache.org/repos/asf/datafusion-site.git
The following commit(s) were added to refs/heads/asf-staging by this push: new 9360158 Commit build products 9360158 is described below commit 93601582e0d915976fa71cac6454d78edb7d8312 Author: Build Pelican (action) <priv...@infra.apache.org> AuthorDate: Sat Apr 5 10:27:08 2025 +0000 Commit build products --- blog/2025/04/10/fastest-tpch-generator/index.html | 58 +++++++++++----------- .../andrew-lamb-achraf-b-and-sean-smith.html | 4 +- blog/category/blog.html | 4 +- blog/feed.xml | 4 +- blog/feeds/all-en.atom.xml | 58 +++++++++++----------- .../andrew-lamb-achraf-b-and-sean-smith.atom.xml | 58 +++++++++++----------- .../andrew-lamb-achraf-b-and-sean-smith.rss.xml | 4 +- blog/feeds/blog.atom.xml | 58 +++++++++++----------- blog/index.html | 4 +- 9 files changed, 126 insertions(+), 126 deletions(-) diff --git a/blog/2025/04/10/fastest-tpch-generator/index.html b/blog/2025/04/10/fastest-tpch-generator/index.html index d2143d7..bdc8088 100644 --- a/blog/2025/04/10/fastest-tpch-generator/index.html +++ b/blog/2025/04/10/fastest-tpch-generator/index.html @@ -4,7 +4,7 @@ <meta charset="utf-8"> <meta http-equiv="x-ua-compatible" content="ie=edge"> <meta name="viewport" content="width=device-width, initial-scale=1.0"> - <title>`tpchgen-rs` World’s fastest open source TPCH data generator, written in Rust - Apache DataFusion Blog</title> + <title>tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust - Apache DataFusion Blog</title> <link href="/blog/css/bootstrap.min.css" rel="stylesheet"> <link href="/blog/css/fontawesome.all.min.css" rel="stylesheet"> <link href="/blog/css/headerlink.css" rel="stylesheet"> @@ -40,7 +40,7 @@ <div class="bg-white p-5 rounded"> <div class="col-sm-8 mx-auto"> <h1> - `tpchgen-rs` World’s fastest open source TPCH data generator, written in Rust + tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust </h1> <p>Posted on: Thu 10 April 2025 by Andrew Lamb, Achraf B, and Sean Smith</p> <!-- @@ -70,16 +70,16 @@ th, td { } </style> <p>3 members of the <a href="https://datafusion.apache.org/">Apache DataFusion</a> community used Rust and open source -development to build <a href="https://github.com/alamb/tpchgen-rs">tpchgen-rs</a>, a fully open TPCH data generator over 10x +development to build <a href="https://github.com/alamb/tpchgen-rs">tpchgen-rs</a>, a fully open TPC-H data generator over 10x faster than any other implementation we know of.</p> -<p>It is now possible to create the TPCH SF=100 dataset in 72.23 seconds (1.4 GB/s +<p>It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 GB/s 😎) on a Macbook Air M3 with 16GB of memory, compared to the classic <code>dbgen</code> which takes 30 minutes[^1] (0.05GB/sec). On the same machine, it takes less than 2 minutes to create all 3.6 GB of SF=100 in <a href="https://parquet.apache.org/">Apache Parquet</a> format.</p> -<p>Finally, it is convenient and efficient to run TPCH queries locally when testing +<p>Finally, it is convenient and efficient to run TPC-H queries locally when testing analytical engines such as DataFusion.</p> -<p><img alt="Time to create TPCH parquet dataset for Scale Factor 1, 10, 100 and 1000" class="img-responsive" src="/blog/images/fastest-tpch-generator/parquet-performance.png" width="80%"/></p> -<p><strong>Figure 1</strong>: Time to create TPCH dataset for Scale Factor (see below) 1, 10, +<p><img alt="Time to create TPC-H parquet dataset for Scale Factor 1, 10, 100 and 1000" class="img-responsive" src="/blog/images/fastest-tpch-generator/parquet-performance.png" width="80%"/></p> +<p><strong>Figure 1</strong>: Time to create TPC-H dataset for Scale Factor (see below) 1, 10, 100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core GCP VM. For Scale Factor(SF) 100 <code>tpchgen</code> takes 1 minute and 14 seconds and <a href="https://duckdb.org">DuckDB</a> takes 17 minutes and 48 seconds. For SF=1000, <code>tpchgen</code> takes 10 @@ -87,9 +87,9 @@ minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure DuckDB’s time as it <a href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator">requires 647 GB of RAM</a>, more than the 88 GB that was available on our test machine. The testing methodology is in the <a href="https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md">documentation</a>.</p> -<p>This blog explains what TPCH is, how we ported the vintage C data generator to +<p>This blog explains what TPC-H is, how we ported the vintage C data generator to Rust (yes, <a href="https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/">RWIR</a>) and optimized its performance over the course of a few weeks -of part-time work. We began this project so we can easily generate TPCH data in +of part-time work. We began this project so we can easily generate TPC-H data in <a href="https://datafusion.apache.org/">Apache DataFusion</a> and <a href="https://glaredb.com/">GlareDB</a>.</p> <h1>Try if for yourself</h1> <p>The tool is entirely open source under the <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache 2.0 license</a>. Visit the <a href="https://github.com/clflushopt/tpchgen-rs">tpchgen-rs repository</a> or try it for yourself by run the following commands after <a href="https://www.rust-lang.org/tools/install">installing Rust</a>:</p> @@ -101,22 +101,22 @@ $ tpchgen-cli -s <span class="m">1</span> <span class="c1"># create SF=10 in Parquet</span> $ tpchgen-cli -s <span class="m">10</span> --format<span class="o">=</span>parquet </code></pre></div> -<h1>What is TPCH / dbgen?</h1> -<p>The popular <a href="https://www.tpc.org/tpch/">TPC-H</a> benchmark (commonly referred to as TPCH) helps evaluate the +<h1>What is TPC-H / dbgen?</h1> +<p>The popular <a href="https://www.tpc.org/tpch/">TPC-H</a> benchmark (often referred to as TPCH) helps evaluate the performance of database systems on <a href="https://en.wikipedia.org/wiki/Online_analytical_processing">OLAP</a> queries<em>, </em>the kind used to build BI dashboards.</p> -<p>TPCH has become a de facto standard for analytic systems. While there are <a href="https://www.vldb.org/pvldb/vol9/p204-leis.pdf">well +<p>TPC-H has become a de facto standard for analytic systems. While there are <a href="https://www.vldb.org/pvldb/vol9/p204-leis.pdf">well known</a> limitations as the data and queries do not well represent many real world use cases, the majority of analytic database papers and industrial systems still -use TPCH query performance benchmarks as a baseline. You will inevitably find +use TPC-H query performance benchmarks as a baseline. You will inevitably find multiple results for “<code>TPCH Performance &lt;your favorite database></code>” in any search engine.</p> <p>The benchmark was created at a time when access to high performance analytical systems was not widespread, so the <a href="https://www.tpc.org/">Transaction Processing Performance Council</a> defined a process of formal result verification. More recently, given the broad availability of free and open source database systems, it is common for users to -run and verify TPCH performance themselves.</p> -<p>TPCH simulates a business environment with eight tables: <code>REGION</code>, <code>NATION</code>, +run and verify TPC-H performance themselves.</p> +<p>TPC-H simulates a business environment with eight tables: <code>REGION</code>, <code>NATION</code>, <code>SUPPLIER</code>, <code>CUSTOMER</code>, <code>PART</code>, <code>PARTSUPP</code>, <code>ORDERS</code>, and <code>LINEITEM</code>. These tables are linked by foreign keys in a normalized schema representing a supply chain with parts, suppliers, customers and orders. The benchmark itself is 22 @@ -184,10 +184,10 @@ bound on the Scale Factor.</p> </td> </tr> </table> -<p><strong>Table 1</strong>: TPCH data set sizes at different scale factors for both TBL and <a href="https://parquet.apache.org/">Apache Parquet</a>.</p> -<h1>Why do we need a new TPCH Data generator?</h1> -<p>Despite the known limitations of the TPCH benchmark, it is so well known that it -is used frequently in database performance analysis. To run TPCH, you must first +<p><strong>Table 1</strong>: TPC-H data set sizes at different scale factors for both TBL and <a href="https://parquet.apache.org/">Apache Parquet</a>.</p> +<h1>Why do we need a new TPC-H Data generator?</h1> +<p>Despite the known limitations of the TPC-H benchmark, it is so well known that it +is used frequently in database performance analysis. To run TPC-H, you must first load the data, using <code>dbgen</code>, which is not ideal for several reasons:</p> <ol> <li>You must find and compile a copy of the 15+ year old C program (for example <a href="https://github.com/electrum/tpch-dbgen">electrum/tpch-dbgen</a>)</li> @@ -195,21 +195,21 @@ load the data, using <code>dbgen</code>, which is not ideal for several reasons: <li>It outputs TBL format, which typically requires loading into your database (for example, <a href="https://github.com/apache/datafusion/blob/507f6b6773deac69dd9d90dbe60831f5ea5abed1/datafusion/sqllogictest/test_files/tpch/create_tables.slt.part#L24-L124">here is how to do so</a> in Apache DataFusion) prior to query.</li> <li>The implementation makes substantial assumptions about the operating environment, making it difficult to extend or embed into other systems.[^2]</li> </ol> -<p><img alt="Time to generate TPCH data in TBL format" class="img-responsive" src="/blog/images/fastest-tpch-generator/tbl-performance.png" width="80%"/></p> -<p><strong>Figure 3</strong>: Time to generate TPCH data in TBL format. The default <code>tpchgen</code> is +<p><img alt="Time to generate TPC-H data in TBL format" class="img-responsive" src="/blog/images/fastest-tpch-generator/tbl-performance.png" width="80%"/></p> +<p><strong>Figure 3</strong>: Time to generate TPC-H data in TBL format. The default <code>tpchgen</code> is shown in blue. <code>tpchgen</code> restricted to a single core is shown in red. Unmodified <code>dbgen</code> is shown in green and <code>dbgen</code> modified to use <code>-O3</code> optimization level is shown in yellow.</p> <p><code>dbgen</code> is so inconvenient and takes so long that vendors often provide -preloaded TPCH data, for example [Snowflake Sample Data], <a href="https://docs.databricks.com/aws/en/discover/databricks-datasets">DataBricks Sample +preloaded TPC-H data, for example [Snowflake Sample Data], <a href="https://docs.databricks.com/aws/en/discover/databricks-datasets">DataBricks Sample datasets</a> and <a href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets">DuckDB Pre-Generated Data Sets</a>.</p> -<p>In addition to pre-generated datasets, DuckDB also provides a <a href="https://duckdb.org/docs/stable/extensions/tpch.html">TPCH extension</a> -for generating TPCH datasets within DuckDB. This is so much easier to use than +<p>In addition to pre-generated datasets, DuckDB also provides a [TPCH extension] +for generating TPC-H datasets within DuckDB. This is so much easier to use than the current alternatives that it leads many researchers and other thought leaders to use DuckDB to evaluate new ideas. For example, <a href="https://github.com/lmwnshn">Wan Shen -Lim</a>explicitly <a href="https://github.com/apache/datafusion/issues/14373">mentioned the ease of creating the TPCH dataset</a> as one reason +Lim</a>explicitly <a href="https://github.com/apache/datafusion/issues/14373">mentioned the ease of creating the TPC-H dataset</a> as one reason the first student project of <a href="https://15799.courses.cs.cmu.edu/spring2025/">CMU-799 Spring 2025</a> used DuckDB.</p> -<p>As beneficial as the DuckDB TPCH extension is, it is non-ideal for several reasons:</p> +<p>As beneficial as the DuckDB TPC-H extension is, it is non-ideal for several reasons:</p> <ol> <li>Creates data in a proprietary format, which requires export to use in other systems.</li> <li>Requires significant time (e.g. 17 minutes for Scale Factor 10).</li> @@ -235,7 +235,7 @@ MiDAS Fall 2024 (Data Systems Seminar)</a>, <a href="https://www.youtube.com/wat thanks to <a href="https://x.com/KurtFehlhauer">@KurtFehlhauer</a></p> <h1>How: The Journey</h1> <p>We did it together as a team in the open over the course of a few weeks. -<a href="https://github.com/lmwnshn">Wan Shen Lim</a> inspired the project by pointing out the benefits of <a href="https://github.com/apache/datafusion/issues/14373">easy TPCH +<a href="https://github.com/lmwnshn">Wan Shen Lim</a> inspired the project by pointing out the benefits of <a href="https://github.com/apache/datafusion/issues/14373">easy TPC-H dataset creation</a> and <a href="https://github.com/apache/datafusion/issues/14608#issuecomment-2651044600">suggesting we check out a Java port on February 11, 2025</a>. Achraf made <a href="https://github.com/clflushopt/tpchgen-rs/commit/53d3402680422a15349ece0a7ea3c3f001018ba0">first commit a few days later</a> on February 16, and <a href="https://github.com/clflushopt/tpchgen-rs/commit/9bb386a4c55b8cf93ffac1b98f29b5da990ee79e">Andrew and Sean started helping on March 8, 2025</a> and we <a href="https://crates.io/crates/tpchgen/0.1.0">released version 0.1</a> on @@ -406,7 +406,7 @@ However, as described above, the TBL is annoying to work with, because</p> Datalayers for finding and <a href="https://github.com/clflushopt/tpchgen-rs/issues/73">fixing</a> <a href="https://github.com/clflushopt/tpchgen-rs/issues/65">bugs</a>) which performs at the same speed as TBL. While CSV files are far more standard than TBL, they must still be parsed prior to load and automatic type inference may not deduce the types -needed for the TPCH benchmarks (e.g. floating point vs Decimal).</p> +needed for the TPC-H benchmarks (e.g. floating point vs Decimal).</p> <p>What would be far more useful is a typed, efficient columnar format such as Apache Parquet which is supported by all modern query engines. So we <a href="https://github.com/clflushopt/tpchgen-rs/pull/71">made</a> a <a href="https://crates.io/crates/tpchgen-arrow">tpchgen-arrow</a> crate to create <a href="https://arrow.apache.org/">Apache Arrow</a> arrays directly and then <a href="https://github.com/clflushopt/tpchgen-rs/pull/61">a small @@ -464,7 +464,7 @@ target/release/tpchgen-cli -s $SCALE_FACTOR --format=parquet</code></p> <h1>Conclusion 👊🎤</h1> <p>With a few days, some fellow database nerds, and the power of Rust we made something 10x better than currently exists. We hope it inspires more research -into analytical systems using the TPCH dataset and that people build awesome +into analytical systems using the TPC-H dataset and that people build awesome things with it. For example, Sean has already added <a href="https://github.com/GlareDB/glaredb/pull/3549">on-demand generation of tables to GlareDB</a>. Please consider joining us and helping out at <a href="https://github.com/clflushopt/tpchgen-rs">https://github.com/clflushopt/tpchgen-rs</a>.</p> diff --git a/blog/author/andrew-lamb-achraf-b-and-sean-smith.html b/blog/author/andrew-lamb-achraf-b-and-sean-smith.html index 5af3895..5e6410d 100644 --- a/blog/author/andrew-lamb-achraf-b-and-sean-smith.html +++ b/blog/author/andrew-lamb-achraf-b-and-sean-smith.html @@ -53,7 +53,7 @@ <article class="post"> <header> <div class="title"> - <h1><a href="/blog/2025/04/10/fastest-tpch-generator">`tpchgen-rs` World’s fastest open source TPCH data generator, written in Rust</a></h1> + <h1><a href="/blog/2025/04/10/fastest-tpch-generator">tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust</a></h1> <p>Posted on: Thu 10 April 2025 by Andrew Lamb, Achraf B, and Sean Smith</p> <p><!-- {% comment %} @@ -82,7 +82,7 @@ th, td { } </style> <p>3 members of the <a href="https://datafusion.apache.org/">Apache DataFusion</a> community used Rust and open source -development to build <a href="https://github.com/alamb/tpchgen-rs">tpchgen-rs</a>, a fully open TPCH data generator over 10x +development to build <a href="https://github.com/alamb/tpchgen-rs">tpchgen-rs</a>, a fully open TPC-H data generator over 10x faster than any other implementation we know of.</p> <p>It is now …</p></p> <footer> diff --git a/blog/category/blog.html b/blog/category/blog.html index 27bfed8..2bee2c9 100644 --- a/blog/category/blog.html +++ b/blog/category/blog.html @@ -53,7 +53,7 @@ <article class="post"> <header> <div class="title"> - <h1><a href="/blog/2025/04/10/fastest-tpch-generator">`tpchgen-rs` World’s fastest open source TPCH data generator, written in Rust</a></h1> + <h1><a href="/blog/2025/04/10/fastest-tpch-generator">tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust</a></h1> <p>Posted on: Thu 10 April 2025 by Andrew Lamb, Achraf B, and Sean Smith</p> <p><!-- {% comment %} @@ -82,7 +82,7 @@ th, td { } </style> <p>3 members of the <a href="https://datafusion.apache.org/">Apache DataFusion</a> community used Rust and open source -development to build <a href="https://github.com/alamb/tpchgen-rs">tpchgen-rs</a>, a fully open TPCH data generator over 10x +development to build <a href="https://github.com/alamb/tpchgen-rs">tpchgen-rs</a>, a fully open TPC-H data generator over 10x faster than any other implementation we know of.</p> <p>It is now …</p></p> <footer> diff --git a/blog/feed.xml b/blog/feed.xml index e383569..21826f3 100644 --- a/blog/feed.xml +++ b/blog/feed.xml @@ -1,5 +1,5 @@ <?xml version="1.0" encoding="utf-8"?> -<rss version="2.0"><channel><title>Apache DataFusion Blog</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Thu, 10 Apr 2025 00:00:00 +0000</lastBuildDate><item><title>`tpchgen-rs` World’s fastest open source TPCH data generator, written in Rust</title><link>https://datafusion.apache.org/blog/2025/04/10/fastest-tpch-generator</link><description><!-- +<rss version="2.0"><channel><title>Apache DataFusion Blog</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Thu, 10 Apr 2025 00:00:00 +0000</lastBuildDate><item><title>tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust</title><link>https://datafusion.apache.org/blog/2025/04/10/fastest-tpch-generator</link><description><!-- {% comment %} Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with @@ -26,7 +26,7 @@ th, td { } </style> <p>3 members of the <a href="https://datafusion.apache.org/">Apache DataFusion</a> community used Rust and open source -development to build <a href="https://github.com/alamb/tpchgen-rs">tpchgen-rs</a>, a fully open TPCH data generator over 10x +development to build <a href="https://github.com/alamb/tpchgen-rs">tpchgen-rs</a>, a fully open TPC-H data generator over 10x faster than any other implementation we know of.</p> <p>It is now …</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Andrew Lamb, Achraf B, and Sean Smith</dc:creator><pubDate>Thu, 10 Apr 2025 00:00:00 +0000</pubDate><guid isPermaLink="false">tag:datafusion.apache.org,2025-04-10:/blog/2025/04/10/fastest-tpch-generator</guid><category>blog</category></item><item><title>Apache DataFusion 46.0.0 Released</title><link>https://datafusion.apache.org/blog/2025/03/24/datafusion-46.0.0</link><description><!-- {% comment %} diff --git a/blog/feeds/all-en.atom.xml b/blog/feeds/all-en.atom.xml index 4b3c1ec..6a299ae 100644 --- a/blog/feeds/all-en.atom.xml +++ b/blog/feeds/all-en.atom.xml @@ -1,5 +1,5 @@ <?xml version="1.0" encoding="utf-8"?> -<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion Blog</title><link href="https://datafusion.apache.org/blog/" rel="alternate"></link><link href="https://datafusion.apache.org/blog/feeds/all-en.atom.xml" rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-04-10T00:00:00+00:00</updated><subtitle></subtitle><entry><title>`tpchgen-rs` World’s fastest open source TPCH data generator, written in Rust</title><link href="https://datafusion.apache.org/blo [...] +<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion Blog</title><link href="https://datafusion.apache.org/blog/" rel="alternate"></link><link href="https://datafusion.apache.org/blog/feeds/all-en.atom.xml" rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-04-10T00:00:00+00:00</updated><subtitle></subtitle><entry><title>tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust</title><link href="https://datafusion.apache.org/blog [...] {% comment %} Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with @@ -26,7 +26,7 @@ th, td { } </style> <p>3 members of the <a href="https://datafusion.apache.org/">Apache DataFusion</a> community used Rust and open source -development to build <a href="https://github.com/alamb/tpchgen-rs">tpchgen-rs</a>, a fully open TPCH data generator over 10x +development to build <a href="https://github.com/alamb/tpchgen-rs">tpchgen-rs</a>, a fully open TPC-H data generator over 10x faster than any other implementation we know of.</p> <p>It is now …</p></summary><content type="html"><!-- {% comment %} @@ -55,16 +55,16 @@ th, td { } </style> <p>3 members of the <a href="https://datafusion.apache.org/">Apache DataFusion</a> community used Rust and open source -development to build <a href="https://github.com/alamb/tpchgen-rs">tpchgen-rs</a>, a fully open TPCH data generator over 10x +development to build <a href="https://github.com/alamb/tpchgen-rs">tpchgen-rs</a>, a fully open TPC-H data generator over 10x faster than any other implementation we know of.</p> -<p>It is now possible to create the TPCH SF=100 dataset in 72.23 seconds (1.4 GB/s +<p>It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 GB/s 😎) on a Macbook Air M3 with 16GB of memory, compared to the classic <code>dbgen</code> which takes 30 minutes[^1] (0.05GB/sec). On the same machine, it takes less than 2 minutes to create all 3.6 GB of SF=100 in <a href="https://parquet.apache.org/">Apache Parquet</a> format.</p> -<p>Finally, it is convenient and efficient to run TPCH queries locally when testing +<p>Finally, it is convenient and efficient to run TPC-H queries locally when testing analytical engines such as DataFusion.</p> -<p><img alt="Time to create TPCH parquet dataset for Scale Factor 1, 10, 100 and 1000" class="img-responsive" src="/blog/images/fastest-tpch-generator/parquet-performance.png" width="80%"/></p> -<p><strong>Figure 1</strong>: Time to create TPCH dataset for Scale Factor (see below) 1, 10, +<p><img alt="Time to create TPC-H parquet dataset for Scale Factor 1, 10, 100 and 1000" class="img-responsive" src="/blog/images/fastest-tpch-generator/parquet-performance.png" width="80%"/></p> +<p><strong>Figure 1</strong>: Time to create TPC-H dataset for Scale Factor (see below) 1, 10, 100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core GCP VM. For Scale Factor(SF) 100 <code>tpchgen</code> takes 1 minute and 14 seconds and <a href="https://duckdb.org">DuckDB</a> takes 17 minutes and 48 seconds. For SF=1000, <code>tpchgen</code> takes 10 @@ -72,9 +72,9 @@ minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure DuckDB&rsquo;s time as it <a href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator">requires 647 GB of RAM</a>, more than the 88 GB that was available on our test machine. The testing methodology is in the <a href="https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md">documentation</a>.</p> -<p>This blog explains what TPCH is, how we ported the vintage C data generator to +<p>This blog explains what TPC-H is, how we ported the vintage C data generator to Rust (yes, <a href="https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/">RWIR</a>) and optimized its performance over the course of a few weeks -of part-time work. We began this project so we can easily generate TPCH data in +of part-time work. We began this project so we can easily generate TPC-H data in <a href="https://datafusion.apache.org/">Apache DataFusion</a> and <a href="https://glaredb.com/">GlareDB</a>.</p> <h1>Try if for yourself</h1> <p>The tool is entirely open source under the <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache 2.0 license</a>. Visit the <a href="https://github.com/clflushopt/tpchgen-rs">tpchgen-rs repository</a> or try it for yourself by run the following commands after <a href="https://www.rust-lang.org/tools/install">installing Rust</a>:</p> @@ -86,22 +86,22 @@ $ tpchgen-cli -s <span class="m">1</span> <span class="c1"># create SF=10 in Parquet</span> $ tpchgen-cli -s <span class="m">10</span> --format<span class="o">=</span>parquet </code></pre></div> -<h1>What is TPCH / dbgen?</h1> -<p>The popular <a href="https://www.tpc.org/tpch/">TPC-H</a> benchmark (commonly referred to as TPCH) helps evaluate the +<h1>What is TPC-H / dbgen?</h1> +<p>The popular <a href="https://www.tpc.org/tpch/">TPC-H</a> benchmark (often referred to as TPCH) helps evaluate the performance of database systems on <a href="https://en.wikipedia.org/wiki/Online_analytical_processing">OLAP</a> queries<em>, </em>the kind used to build BI dashboards.</p> -<p>TPCH has become a de facto standard for analytic systems. While there are <a href="https://www.vldb.org/pvldb/vol9/p204-leis.pdf">well +<p>TPC-H has become a de facto standard for analytic systems. While there are <a href="https://www.vldb.org/pvldb/vol9/p204-leis.pdf">well known</a> limitations as the data and queries do not well represent many real world use cases, the majority of analytic database papers and industrial systems still -use TPCH query performance benchmarks as a baseline. You will inevitably find +use TPC-H query performance benchmarks as a baseline. You will inevitably find multiple results for &ldquo;<code>TPCH Performance &amp;lt;your favorite database&gt;</code>&rdquo; in any search engine.</p> <p>The benchmark was created at a time when access to high performance analytical systems was not widespread, so the <a href="https://www.tpc.org/">Transaction Processing Performance Council</a> defined a process of formal result verification. More recently, given the broad availability of free and open source database systems, it is common for users to -run and verify TPCH performance themselves.</p> -<p>TPCH simulates a business environment with eight tables: <code>REGION</code>, <code>NATION</code>, +run and verify TPC-H performance themselves.</p> +<p>TPC-H simulates a business environment with eight tables: <code>REGION</code>, <code>NATION</code>, <code>SUPPLIER</code>, <code>CUSTOMER</code>, <code>PART</code>, <code>PARTSUPP</code>, <code>ORDERS</code>, and <code>LINEITEM</code>. These tables are linked by foreign keys in a normalized schema representing a supply chain with parts, suppliers, customers and orders. The benchmark itself is 22 @@ -169,10 +169,10 @@ bound on the Scale Factor.</p> </td> </tr> </table> -<p><strong>Table 1</strong>: TPCH data set sizes at different scale factors for both TBL and <a href="https://parquet.apache.org/">Apache Parquet</a>.</p> -<h1>Why do we need a new TPCH Data generator?</h1> -<p>Despite the known limitations of the TPCH benchmark, it is so well known that it -is used frequently in database performance analysis. To run TPCH, you must first +<p><strong>Table 1</strong>: TPC-H data set sizes at different scale factors for both TBL and <a href="https://parquet.apache.org/">Apache Parquet</a>.</p> +<h1>Why do we need a new TPC-H Data generator?</h1> +<p>Despite the known limitations of the TPC-H benchmark, it is so well known that it +is used frequently in database performance analysis. To run TPC-H, you must first load the data, using <code>dbgen</code>, which is not ideal for several reasons:</p> <ol> <li>You must find and compile a copy of the 15+ year old C program (for example <a href="https://github.com/electrum/tpch-dbgen">electrum/tpch-dbgen</a>)</li> @@ -180,21 +180,21 @@ load the data, using <code>dbgen</code>, which is not ideal for seve <li>It outputs TBL format, which typically requires loading into your database (for example, <a href="https://github.com/apache/datafusion/blob/507f6b6773deac69dd9d90dbe60831f5ea5abed1/datafusion/sqllogictest/test_files/tpch/create_tables.slt.part#L24-L124">here is how to do so</a> in Apache DataFusion) prior to query.</li> <li>The implementation makes substantial assumptions about the operating environment, making it difficult to extend or embed into other systems.[^2]</li> </ol> -<p><img alt="Time to generate TPCH data in TBL format" class="img-responsive" src="/blog/images/fastest-tpch-generator/tbl-performance.png" width="80%"/></p> -<p><strong>Figure 3</strong>: Time to generate TPCH data in TBL format. The default <code>tpchgen</code> is +<p><img alt="Time to generate TPC-H data in TBL format" class="img-responsive" src="/blog/images/fastest-tpch-generator/tbl-performance.png" width="80%"/></p> +<p><strong>Figure 3</strong>: Time to generate TPC-H data in TBL format. The default <code>tpchgen</code> is shown in blue. <code>tpchgen</code> restricted to a single core is shown in red. Unmodified <code>dbgen</code> is shown in green and <code>dbgen</code> modified to use <code>-O3</code> optimization level is shown in yellow.</p> <p><code>dbgen</code> is so inconvenient and takes so long that vendors often provide -preloaded TPCH data, for example [Snowflake Sample Data], <a href="https://docs.databricks.com/aws/en/discover/databricks-datasets">DataBricks Sample +preloaded TPC-H data, for example [Snowflake Sample Data], <a href="https://docs.databricks.com/aws/en/discover/databricks-datasets">DataBricks Sample datasets</a> and <a href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets">DuckDB Pre-Generated Data Sets</a>.</p> -<p>In addition to pre-generated datasets, DuckDB also provides a <a href="https://duckdb.org/docs/stable/extensions/tpch.html">TPCH extension</a> -for generating TPCH datasets within DuckDB. This is so much easier to use than +<p>In addition to pre-generated datasets, DuckDB also provides a [TPCH extension] +for generating TPC-H datasets within DuckDB. This is so much easier to use than the current alternatives that it leads many researchers and other thought leaders to use DuckDB to evaluate new ideas. For example, <a href="https://github.com/lmwnshn">Wan Shen -Lim</a>explicitly <a href="https://github.com/apache/datafusion/issues/14373">mentioned the ease of creating the TPCH dataset</a> as one reason +Lim</a>explicitly <a href="https://github.com/apache/datafusion/issues/14373">mentioned the ease of creating the TPC-H dataset</a> as one reason the first student project of <a href="https://15799.courses.cs.cmu.edu/spring2025/">CMU-799 Spring 2025</a> used DuckDB.</p> -<p>As beneficial as the DuckDB TPCH extension is, it is non-ideal for several reasons:</p> +<p>As beneficial as the DuckDB TPC-H extension is, it is non-ideal for several reasons:</p> <ol> <li>Creates data in a proprietary format, which requires export to use in other systems.</li> <li>Requires significant time (e.g. 17 minutes for Scale Factor 10).</li> @@ -220,7 +220,7 @@ MiDAS Fall 2024 (Data Systems Seminar)</a>, <a href="https://www.youtub thanks to <a href="https://x.com/KurtFehlhauer">@KurtFehlhauer</a></p> <h1>How: The Journey</h1> <p>We did it together as a team in the open over the course of a few weeks. -<a href="https://github.com/lmwnshn">Wan Shen Lim</a> inspired the project by pointing out the benefits of <a href="https://github.com/apache/datafusion/issues/14373">easy TPCH +<a href="https://github.com/lmwnshn">Wan Shen Lim</a> inspired the project by pointing out the benefits of <a href="https://github.com/apache/datafusion/issues/14373">easy TPC-H dataset creation</a> and <a href="https://github.com/apache/datafusion/issues/14608#issuecomment-2651044600">suggesting we check out a Java port on February 11, 2025</a>. Achraf made <a href="https://github.com/clflushopt/tpchgen-rs/commit/53d3402680422a15349ece0a7ea3c3f001018ba0">first commit a few days later</a> on February 16, and <a href="https://github.com/clflushopt/tpchgen-rs/commit/9bb386a4c55b8cf93ffac1b98f29b5da990ee79e">Andrew and Sean started helping on March 8, 2025</a> and we <a href="https://crates.io/crates/tpchgen/0.1.0">released version 0.1</a> on @@ -391,7 +391,7 @@ However, as described above, the TBL is annoying to work with, because</p> Datalayers for finding and <a href="https://github.com/clflushopt/tpchgen-rs/issues/73">fixing</a> <a href="https://github.com/clflushopt/tpchgen-rs/issues/65">bugs</a>) which performs at the same speed as TBL. While CSV files are far more standard than TBL, they must still be parsed prior to load and automatic type inference may not deduce the types -needed for the TPCH benchmarks (e.g. floating point vs Decimal).</p> +needed for the TPC-H benchmarks (e.g. floating point vs Decimal).</p> <p>What would be far more useful is a typed, efficient columnar format such as Apache Parquet which is supported by all modern query engines. So we <a href="https://github.com/clflushopt/tpchgen-rs/pull/71">made</a> a <a href="https://crates.io/crates/tpchgen-arrow">tpchgen-arrow</a> crate to create <a href="https://arrow.apache.org/">Apache Arrow</a> arrays directly and then <a href="https://github.com/clflushopt/tpchgen-rs/pull/61">a small @@ -449,7 +449,7 @@ target/release/tpchgen-cli -s $SCALE_FACTOR --format=parquet</code></p& <h1>Conclusion 👊🎤</h1> <p>With a few days, some fellow database nerds, and the power of Rust we made something 10x better than currently exists. We hope it inspires more research -into analytical systems using the TPCH dataset and that people build awesome +into analytical systems using the TPC-H dataset and that people build awesome things with it. For example, Sean has already added <a href="https://github.com/GlareDB/glaredb/pull/3549">on-demand generation of tables to GlareDB</a>. Please consider joining us and helping out at <a href="https://github.com/clflushopt/tpchgen-rs">https://github.com/clflushopt/tpchgen-rs</a>.</p> diff --git a/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.atom.xml b/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.atom.xml index 44d2a9a..e706b2e 100644 --- a/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.atom.xml +++ b/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.atom.xml @@ -1,5 +1,5 @@ <?xml version="1.0" encoding="utf-8"?> -<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion Blog - Andrew Lamb, Achraf B, and Sean Smith</title><link href="https://datafusion.apache.org/blog/" rel="alternate"></link><link href="https://datafusion.apache.org/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.atom.xml" rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-04-10T00:00:00+00:00</updated><subtitle></subtitle><entry><title>`tpchgen-rs` World’s fastest open source TPCH data generator, [...] +<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion Blog - Andrew Lamb, Achraf B, and Sean Smith</title><link href="https://datafusion.apache.org/blog/" rel="alternate"></link><link href="https://datafusion.apache.org/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.atom.xml" rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-04-10T00:00:00+00:00</updated><subtitle></subtitle><entry><title>tpchgen-rs World’s fastest open source TPC-H data generator, [...] {% comment %} Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with @@ -26,7 +26,7 @@ th, td { } </style> <p>3 members of the <a href="https://datafusion.apache.org/">Apache DataFusion</a> community used Rust and open source -development to build <a href="https://github.com/alamb/tpchgen-rs">tpchgen-rs</a>, a fully open TPCH data generator over 10x +development to build <a href="https://github.com/alamb/tpchgen-rs">tpchgen-rs</a>, a fully open TPC-H data generator over 10x faster than any other implementation we know of.</p> <p>It is now …</p></summary><content type="html"><!-- {% comment %} @@ -55,16 +55,16 @@ th, td { } </style> <p>3 members of the <a href="https://datafusion.apache.org/">Apache DataFusion</a> community used Rust and open source -development to build <a href="https://github.com/alamb/tpchgen-rs">tpchgen-rs</a>, a fully open TPCH data generator over 10x +development to build <a href="https://github.com/alamb/tpchgen-rs">tpchgen-rs</a>, a fully open TPC-H data generator over 10x faster than any other implementation we know of.</p> -<p>It is now possible to create the TPCH SF=100 dataset in 72.23 seconds (1.4 GB/s +<p>It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 GB/s 😎) on a Macbook Air M3 with 16GB of memory, compared to the classic <code>dbgen</code> which takes 30 minutes[^1] (0.05GB/sec). On the same machine, it takes less than 2 minutes to create all 3.6 GB of SF=100 in <a href="https://parquet.apache.org/">Apache Parquet</a> format.</p> -<p>Finally, it is convenient and efficient to run TPCH queries locally when testing +<p>Finally, it is convenient and efficient to run TPC-H queries locally when testing analytical engines such as DataFusion.</p> -<p><img alt="Time to create TPCH parquet dataset for Scale Factor 1, 10, 100 and 1000" class="img-responsive" src="/blog/images/fastest-tpch-generator/parquet-performance.png" width="80%"/></p> -<p><strong>Figure 1</strong>: Time to create TPCH dataset for Scale Factor (see below) 1, 10, +<p><img alt="Time to create TPC-H parquet dataset for Scale Factor 1, 10, 100 and 1000" class="img-responsive" src="/blog/images/fastest-tpch-generator/parquet-performance.png" width="80%"/></p> +<p><strong>Figure 1</strong>: Time to create TPC-H dataset for Scale Factor (see below) 1, 10, 100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core GCP VM. For Scale Factor(SF) 100 <code>tpchgen</code> takes 1 minute and 14 seconds and <a href="https://duckdb.org">DuckDB</a> takes 17 minutes and 48 seconds. For SF=1000, <code>tpchgen</code> takes 10 @@ -72,9 +72,9 @@ minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure DuckDB&rsquo;s time as it <a href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator">requires 647 GB of RAM</a>, more than the 88 GB that was available on our test machine. The testing methodology is in the <a href="https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md">documentation</a>.</p> -<p>This blog explains what TPCH is, how we ported the vintage C data generator to +<p>This blog explains what TPC-H is, how we ported the vintage C data generator to Rust (yes, <a href="https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/">RWIR</a>) and optimized its performance over the course of a few weeks -of part-time work. We began this project so we can easily generate TPCH data in +of part-time work. We began this project so we can easily generate TPC-H data in <a href="https://datafusion.apache.org/">Apache DataFusion</a> and <a href="https://glaredb.com/">GlareDB</a>.</p> <h1>Try if for yourself</h1> <p>The tool is entirely open source under the <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache 2.0 license</a>. Visit the <a href="https://github.com/clflushopt/tpchgen-rs">tpchgen-rs repository</a> or try it for yourself by run the following commands after <a href="https://www.rust-lang.org/tools/install">installing Rust</a>:</p> @@ -86,22 +86,22 @@ $ tpchgen-cli -s <span class="m">1</span> <span class="c1"># create SF=10 in Parquet</span> $ tpchgen-cli -s <span class="m">10</span> --format<span class="o">=</span>parquet </code></pre></div> -<h1>What is TPCH / dbgen?</h1> -<p>The popular <a href="https://www.tpc.org/tpch/">TPC-H</a> benchmark (commonly referred to as TPCH) helps evaluate the +<h1>What is TPC-H / dbgen?</h1> +<p>The popular <a href="https://www.tpc.org/tpch/">TPC-H</a> benchmark (often referred to as TPCH) helps evaluate the performance of database systems on <a href="https://en.wikipedia.org/wiki/Online_analytical_processing">OLAP</a> queries<em>, </em>the kind used to build BI dashboards.</p> -<p>TPCH has become a de facto standard for analytic systems. While there are <a href="https://www.vldb.org/pvldb/vol9/p204-leis.pdf">well +<p>TPC-H has become a de facto standard for analytic systems. While there are <a href="https://www.vldb.org/pvldb/vol9/p204-leis.pdf">well known</a> limitations as the data and queries do not well represent many real world use cases, the majority of analytic database papers and industrial systems still -use TPCH query performance benchmarks as a baseline. You will inevitably find +use TPC-H query performance benchmarks as a baseline. You will inevitably find multiple results for &ldquo;<code>TPCH Performance &amp;lt;your favorite database&gt;</code>&rdquo; in any search engine.</p> <p>The benchmark was created at a time when access to high performance analytical systems was not widespread, so the <a href="https://www.tpc.org/">Transaction Processing Performance Council</a> defined a process of formal result verification. More recently, given the broad availability of free and open source database systems, it is common for users to -run and verify TPCH performance themselves.</p> -<p>TPCH simulates a business environment with eight tables: <code>REGION</code>, <code>NATION</code>, +run and verify TPC-H performance themselves.</p> +<p>TPC-H simulates a business environment with eight tables: <code>REGION</code>, <code>NATION</code>, <code>SUPPLIER</code>, <code>CUSTOMER</code>, <code>PART</code>, <code>PARTSUPP</code>, <code>ORDERS</code>, and <code>LINEITEM</code>. These tables are linked by foreign keys in a normalized schema representing a supply chain with parts, suppliers, customers and orders. The benchmark itself is 22 @@ -169,10 +169,10 @@ bound on the Scale Factor.</p> </td> </tr> </table> -<p><strong>Table 1</strong>: TPCH data set sizes at different scale factors for both TBL and <a href="https://parquet.apache.org/">Apache Parquet</a>.</p> -<h1>Why do we need a new TPCH Data generator?</h1> -<p>Despite the known limitations of the TPCH benchmark, it is so well known that it -is used frequently in database performance analysis. To run TPCH, you must first +<p><strong>Table 1</strong>: TPC-H data set sizes at different scale factors for both TBL and <a href="https://parquet.apache.org/">Apache Parquet</a>.</p> +<h1>Why do we need a new TPC-H Data generator?</h1> +<p>Despite the known limitations of the TPC-H benchmark, it is so well known that it +is used frequently in database performance analysis. To run TPC-H, you must first load the data, using <code>dbgen</code>, which is not ideal for several reasons:</p> <ol> <li>You must find and compile a copy of the 15+ year old C program (for example <a href="https://github.com/electrum/tpch-dbgen">electrum/tpch-dbgen</a>)</li> @@ -180,21 +180,21 @@ load the data, using <code>dbgen</code>, which is not ideal for seve <li>It outputs TBL format, which typically requires loading into your database (for example, <a href="https://github.com/apache/datafusion/blob/507f6b6773deac69dd9d90dbe60831f5ea5abed1/datafusion/sqllogictest/test_files/tpch/create_tables.slt.part#L24-L124">here is how to do so</a> in Apache DataFusion) prior to query.</li> <li>The implementation makes substantial assumptions about the operating environment, making it difficult to extend or embed into other systems.[^2]</li> </ol> -<p><img alt="Time to generate TPCH data in TBL format" class="img-responsive" src="/blog/images/fastest-tpch-generator/tbl-performance.png" width="80%"/></p> -<p><strong>Figure 3</strong>: Time to generate TPCH data in TBL format. The default <code>tpchgen</code> is +<p><img alt="Time to generate TPC-H data in TBL format" class="img-responsive" src="/blog/images/fastest-tpch-generator/tbl-performance.png" width="80%"/></p> +<p><strong>Figure 3</strong>: Time to generate TPC-H data in TBL format. The default <code>tpchgen</code> is shown in blue. <code>tpchgen</code> restricted to a single core is shown in red. Unmodified <code>dbgen</code> is shown in green and <code>dbgen</code> modified to use <code>-O3</code> optimization level is shown in yellow.</p> <p><code>dbgen</code> is so inconvenient and takes so long that vendors often provide -preloaded TPCH data, for example [Snowflake Sample Data], <a href="https://docs.databricks.com/aws/en/discover/databricks-datasets">DataBricks Sample +preloaded TPC-H data, for example [Snowflake Sample Data], <a href="https://docs.databricks.com/aws/en/discover/databricks-datasets">DataBricks Sample datasets</a> and <a href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets">DuckDB Pre-Generated Data Sets</a>.</p> -<p>In addition to pre-generated datasets, DuckDB also provides a <a href="https://duckdb.org/docs/stable/extensions/tpch.html">TPCH extension</a> -for generating TPCH datasets within DuckDB. This is so much easier to use than +<p>In addition to pre-generated datasets, DuckDB also provides a [TPCH extension] +for generating TPC-H datasets within DuckDB. This is so much easier to use than the current alternatives that it leads many researchers and other thought leaders to use DuckDB to evaluate new ideas. For example, <a href="https://github.com/lmwnshn">Wan Shen -Lim</a>explicitly <a href="https://github.com/apache/datafusion/issues/14373">mentioned the ease of creating the TPCH dataset</a> as one reason +Lim</a>explicitly <a href="https://github.com/apache/datafusion/issues/14373">mentioned the ease of creating the TPC-H dataset</a> as one reason the first student project of <a href="https://15799.courses.cs.cmu.edu/spring2025/">CMU-799 Spring 2025</a> used DuckDB.</p> -<p>As beneficial as the DuckDB TPCH extension is, it is non-ideal for several reasons:</p> +<p>As beneficial as the DuckDB TPC-H extension is, it is non-ideal for several reasons:</p> <ol> <li>Creates data in a proprietary format, which requires export to use in other systems.</li> <li>Requires significant time (e.g. 17 minutes for Scale Factor 10).</li> @@ -220,7 +220,7 @@ MiDAS Fall 2024 (Data Systems Seminar)</a>, <a href="https://www.youtub thanks to <a href="https://x.com/KurtFehlhauer">@KurtFehlhauer</a></p> <h1>How: The Journey</h1> <p>We did it together as a team in the open over the course of a few weeks. -<a href="https://github.com/lmwnshn">Wan Shen Lim</a> inspired the project by pointing out the benefits of <a href="https://github.com/apache/datafusion/issues/14373">easy TPCH +<a href="https://github.com/lmwnshn">Wan Shen Lim</a> inspired the project by pointing out the benefits of <a href="https://github.com/apache/datafusion/issues/14373">easy TPC-H dataset creation</a> and <a href="https://github.com/apache/datafusion/issues/14608#issuecomment-2651044600">suggesting we check out a Java port on February 11, 2025</a>. Achraf made <a href="https://github.com/clflushopt/tpchgen-rs/commit/53d3402680422a15349ece0a7ea3c3f001018ba0">first commit a few days later</a> on February 16, and <a href="https://github.com/clflushopt/tpchgen-rs/commit/9bb386a4c55b8cf93ffac1b98f29b5da990ee79e">Andrew and Sean started helping on March 8, 2025</a> and we <a href="https://crates.io/crates/tpchgen/0.1.0">released version 0.1</a> on @@ -391,7 +391,7 @@ However, as described above, the TBL is annoying to work with, because</p> Datalayers for finding and <a href="https://github.com/clflushopt/tpchgen-rs/issues/73">fixing</a> <a href="https://github.com/clflushopt/tpchgen-rs/issues/65">bugs</a>) which performs at the same speed as TBL. While CSV files are far more standard than TBL, they must still be parsed prior to load and automatic type inference may not deduce the types -needed for the TPCH benchmarks (e.g. floating point vs Decimal).</p> +needed for the TPC-H benchmarks (e.g. floating point vs Decimal).</p> <p>What would be far more useful is a typed, efficient columnar format such as Apache Parquet which is supported by all modern query engines. So we <a href="https://github.com/clflushopt/tpchgen-rs/pull/71">made</a> a <a href="https://crates.io/crates/tpchgen-arrow">tpchgen-arrow</a> crate to create <a href="https://arrow.apache.org/">Apache Arrow</a> arrays directly and then <a href="https://github.com/clflushopt/tpchgen-rs/pull/61">a small @@ -449,7 +449,7 @@ target/release/tpchgen-cli -s $SCALE_FACTOR --format=parquet</code></p& <h1>Conclusion 👊🎤</h1> <p>With a few days, some fellow database nerds, and the power of Rust we made something 10x better than currently exists. We hope it inspires more research -into analytical systems using the TPCH dataset and that people build awesome +into analytical systems using the TPC-H dataset and that people build awesome things with it. For example, Sean has already added <a href="https://github.com/GlareDB/glaredb/pull/3549">on-demand generation of tables to GlareDB</a>. Please consider joining us and helping out at <a href="https://github.com/clflushopt/tpchgen-rs">https://github.com/clflushopt/tpchgen-rs</a>.</p> diff --git a/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.rss.xml b/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.rss.xml index 1f08dc0..7e99eb9 100644 --- a/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.rss.xml +++ b/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.rss.xml @@ -1,5 +1,5 @@ <?xml version="1.0" encoding="utf-8"?> -<rss version="2.0"><channel><title>Apache DataFusion Blog - Andrew Lamb, Achraf B, and Sean Smith</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Thu, 10 Apr 2025 00:00:00 +0000</lastBuildDate><item><title>`tpchgen-rs` World’s fastest open source TPCH data generator, written in Rust</title><link>https://datafusion.apache.org/blog/2025/04/10/fastest-tpch-generator</link><description><!-- +<rss version="2.0"><channel><title>Apache DataFusion Blog - Andrew Lamb, Achraf B, and Sean Smith</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Thu, 10 Apr 2025 00:00:00 +0000</lastBuildDate><item><title>tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust</title><link>https://datafusion.apache.org/blog/2025/04/10/fastest-tpch-generator</link><description><!-- {% comment %} Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with @@ -26,6 +26,6 @@ th, td { } </style> <p>3 members of the <a href="https://datafusion.apache.org/">Apache DataFusion</a> community used Rust and open source -development to build <a href="https://github.com/alamb/tpchgen-rs">tpchgen-rs</a>, a fully open TPCH data generator over 10x +development to build <a href="https://github.com/alamb/tpchgen-rs">tpchgen-rs</a>, a fully open TPC-H data generator over 10x faster than any other implementation we know of.</p> <p>It is now …</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Andrew Lamb, Achraf B, and Sean Smith</dc:creator><pubDate>Thu, 10 Apr 2025 00:00:00 +0000</pubDate><guid isPermaLink="false">tag:datafusion.apache.org,2025-04-10:/blog/2025/04/10/fastest-tpch-generator</guid><category>blog</category></item></channel></rss> \ No newline at end of file diff --git a/blog/feeds/blog.atom.xml b/blog/feeds/blog.atom.xml index 81852fe..3fbdc65 100644 --- a/blog/feeds/blog.atom.xml +++ b/blog/feeds/blog.atom.xml @@ -1,5 +1,5 @@ <?xml version="1.0" encoding="utf-8"?> -<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion Blog - blog</title><link href="https://datafusion.apache.org/blog/" rel="alternate"></link><link href="https://datafusion.apache.org/blog/feeds/blog.atom.xml" rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-04-10T00:00:00+00:00</updated><subtitle></subtitle><entry><title>`tpchgen-rs` World’s fastest open source TPCH data generator, written in Rust</title><link href="https://datafusion.apache.or [...] +<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion Blog - blog</title><link href="https://datafusion.apache.org/blog/" rel="alternate"></link><link href="https://datafusion.apache.org/blog/feeds/blog.atom.xml" rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-04-10T00:00:00+00:00</updated><subtitle></subtitle><entry><title>tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust</title><link href="https://datafusion.apache.org [...] {% comment %} Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with @@ -26,7 +26,7 @@ th, td { } </style> <p>3 members of the <a href="https://datafusion.apache.org/">Apache DataFusion</a> community used Rust and open source -development to build <a href="https://github.com/alamb/tpchgen-rs">tpchgen-rs</a>, a fully open TPCH data generator over 10x +development to build <a href="https://github.com/alamb/tpchgen-rs">tpchgen-rs</a>, a fully open TPC-H data generator over 10x faster than any other implementation we know of.</p> <p>It is now …</p></summary><content type="html"><!-- {% comment %} @@ -55,16 +55,16 @@ th, td { } </style> <p>3 members of the <a href="https://datafusion.apache.org/">Apache DataFusion</a> community used Rust and open source -development to build <a href="https://github.com/alamb/tpchgen-rs">tpchgen-rs</a>, a fully open TPCH data generator over 10x +development to build <a href="https://github.com/alamb/tpchgen-rs">tpchgen-rs</a>, a fully open TPC-H data generator over 10x faster than any other implementation we know of.</p> -<p>It is now possible to create the TPCH SF=100 dataset in 72.23 seconds (1.4 GB/s +<p>It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 GB/s 😎) on a Macbook Air M3 with 16GB of memory, compared to the classic <code>dbgen</code> which takes 30 minutes[^1] (0.05GB/sec). On the same machine, it takes less than 2 minutes to create all 3.6 GB of SF=100 in <a href="https://parquet.apache.org/">Apache Parquet</a> format.</p> -<p>Finally, it is convenient and efficient to run TPCH queries locally when testing +<p>Finally, it is convenient and efficient to run TPC-H queries locally when testing analytical engines such as DataFusion.</p> -<p><img alt="Time to create TPCH parquet dataset for Scale Factor 1, 10, 100 and 1000" class="img-responsive" src="/blog/images/fastest-tpch-generator/parquet-performance.png" width="80%"/></p> -<p><strong>Figure 1</strong>: Time to create TPCH dataset for Scale Factor (see below) 1, 10, +<p><img alt="Time to create TPC-H parquet dataset for Scale Factor 1, 10, 100 and 1000" class="img-responsive" src="/blog/images/fastest-tpch-generator/parquet-performance.png" width="80%"/></p> +<p><strong>Figure 1</strong>: Time to create TPC-H dataset for Scale Factor (see below) 1, 10, 100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core GCP VM. For Scale Factor(SF) 100 <code>tpchgen</code> takes 1 minute and 14 seconds and <a href="https://duckdb.org">DuckDB</a> takes 17 minutes and 48 seconds. For SF=1000, <code>tpchgen</code> takes 10 @@ -72,9 +72,9 @@ minutes and 26 and uses about 5 GB of RAM at peak, and we could not measure DuckDB&rsquo;s time as it <a href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator">requires 647 GB of RAM</a>, more than the 88 GB that was available on our test machine. The testing methodology is in the <a href="https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md">documentation</a>.</p> -<p>This blog explains what TPCH is, how we ported the vintage C data generator to +<p>This blog explains what TPC-H is, how we ported the vintage C data generator to Rust (yes, <a href="https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/">RWIR</a>) and optimized its performance over the course of a few weeks -of part-time work. We began this project so we can easily generate TPCH data in +of part-time work. We began this project so we can easily generate TPC-H data in <a href="https://datafusion.apache.org/">Apache DataFusion</a> and <a href="https://glaredb.com/">GlareDB</a>.</p> <h1>Try if for yourself</h1> <p>The tool is entirely open source under the <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache 2.0 license</a>. Visit the <a href="https://github.com/clflushopt/tpchgen-rs">tpchgen-rs repository</a> or try it for yourself by run the following commands after <a href="https://www.rust-lang.org/tools/install">installing Rust</a>:</p> @@ -86,22 +86,22 @@ $ tpchgen-cli -s <span class="m">1</span> <span class="c1"># create SF=10 in Parquet</span> $ tpchgen-cli -s <span class="m">10</span> --format<span class="o">=</span>parquet </code></pre></div> -<h1>What is TPCH / dbgen?</h1> -<p>The popular <a href="https://www.tpc.org/tpch/">TPC-H</a> benchmark (commonly referred to as TPCH) helps evaluate the +<h1>What is TPC-H / dbgen?</h1> +<p>The popular <a href="https://www.tpc.org/tpch/">TPC-H</a> benchmark (often referred to as TPCH) helps evaluate the performance of database systems on <a href="https://en.wikipedia.org/wiki/Online_analytical_processing">OLAP</a> queries<em>, </em>the kind used to build BI dashboards.</p> -<p>TPCH has become a de facto standard for analytic systems. While there are <a href="https://www.vldb.org/pvldb/vol9/p204-leis.pdf">well +<p>TPC-H has become a de facto standard for analytic systems. While there are <a href="https://www.vldb.org/pvldb/vol9/p204-leis.pdf">well known</a> limitations as the data and queries do not well represent many real world use cases, the majority of analytic database papers and industrial systems still -use TPCH query performance benchmarks as a baseline. You will inevitably find +use TPC-H query performance benchmarks as a baseline. You will inevitably find multiple results for &ldquo;<code>TPCH Performance &amp;lt;your favorite database&gt;</code>&rdquo; in any search engine.</p> <p>The benchmark was created at a time when access to high performance analytical systems was not widespread, so the <a href="https://www.tpc.org/">Transaction Processing Performance Council</a> defined a process of formal result verification. More recently, given the broad availability of free and open source database systems, it is common for users to -run and verify TPCH performance themselves.</p> -<p>TPCH simulates a business environment with eight tables: <code>REGION</code>, <code>NATION</code>, +run and verify TPC-H performance themselves.</p> +<p>TPC-H simulates a business environment with eight tables: <code>REGION</code>, <code>NATION</code>, <code>SUPPLIER</code>, <code>CUSTOMER</code>, <code>PART</code>, <code>PARTSUPP</code>, <code>ORDERS</code>, and <code>LINEITEM</code>. These tables are linked by foreign keys in a normalized schema representing a supply chain with parts, suppliers, customers and orders. The benchmark itself is 22 @@ -169,10 +169,10 @@ bound on the Scale Factor.</p> </td> </tr> </table> -<p><strong>Table 1</strong>: TPCH data set sizes at different scale factors for both TBL and <a href="https://parquet.apache.org/">Apache Parquet</a>.</p> -<h1>Why do we need a new TPCH Data generator?</h1> -<p>Despite the known limitations of the TPCH benchmark, it is so well known that it -is used frequently in database performance analysis. To run TPCH, you must first +<p><strong>Table 1</strong>: TPC-H data set sizes at different scale factors for both TBL and <a href="https://parquet.apache.org/">Apache Parquet</a>.</p> +<h1>Why do we need a new TPC-H Data generator?</h1> +<p>Despite the known limitations of the TPC-H benchmark, it is so well known that it +is used frequently in database performance analysis. To run TPC-H, you must first load the data, using <code>dbgen</code>, which is not ideal for several reasons:</p> <ol> <li>You must find and compile a copy of the 15+ year old C program (for example <a href="https://github.com/electrum/tpch-dbgen">electrum/tpch-dbgen</a>)</li> @@ -180,21 +180,21 @@ load the data, using <code>dbgen</code>, which is not ideal for seve <li>It outputs TBL format, which typically requires loading into your database (for example, <a href="https://github.com/apache/datafusion/blob/507f6b6773deac69dd9d90dbe60831f5ea5abed1/datafusion/sqllogictest/test_files/tpch/create_tables.slt.part#L24-L124">here is how to do so</a> in Apache DataFusion) prior to query.</li> <li>The implementation makes substantial assumptions about the operating environment, making it difficult to extend or embed into other systems.[^2]</li> </ol> -<p><img alt="Time to generate TPCH data in TBL format" class="img-responsive" src="/blog/images/fastest-tpch-generator/tbl-performance.png" width="80%"/></p> -<p><strong>Figure 3</strong>: Time to generate TPCH data in TBL format. The default <code>tpchgen</code> is +<p><img alt="Time to generate TPC-H data in TBL format" class="img-responsive" src="/blog/images/fastest-tpch-generator/tbl-performance.png" width="80%"/></p> +<p><strong>Figure 3</strong>: Time to generate TPC-H data in TBL format. The default <code>tpchgen</code> is shown in blue. <code>tpchgen</code> restricted to a single core is shown in red. Unmodified <code>dbgen</code> is shown in green and <code>dbgen</code> modified to use <code>-O3</code> optimization level is shown in yellow.</p> <p><code>dbgen</code> is so inconvenient and takes so long that vendors often provide -preloaded TPCH data, for example [Snowflake Sample Data], <a href="https://docs.databricks.com/aws/en/discover/databricks-datasets">DataBricks Sample +preloaded TPC-H data, for example [Snowflake Sample Data], <a href="https://docs.databricks.com/aws/en/discover/databricks-datasets">DataBricks Sample datasets</a> and <a href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets">DuckDB Pre-Generated Data Sets</a>.</p> -<p>In addition to pre-generated datasets, DuckDB also provides a <a href="https://duckdb.org/docs/stable/extensions/tpch.html">TPCH extension</a> -for generating TPCH datasets within DuckDB. This is so much easier to use than +<p>In addition to pre-generated datasets, DuckDB also provides a [TPCH extension] +for generating TPC-H datasets within DuckDB. This is so much easier to use than the current alternatives that it leads many researchers and other thought leaders to use DuckDB to evaluate new ideas. For example, <a href="https://github.com/lmwnshn">Wan Shen -Lim</a>explicitly <a href="https://github.com/apache/datafusion/issues/14373">mentioned the ease of creating the TPCH dataset</a> as one reason +Lim</a>explicitly <a href="https://github.com/apache/datafusion/issues/14373">mentioned the ease of creating the TPC-H dataset</a> as one reason the first student project of <a href="https://15799.courses.cs.cmu.edu/spring2025/">CMU-799 Spring 2025</a> used DuckDB.</p> -<p>As beneficial as the DuckDB TPCH extension is, it is non-ideal for several reasons:</p> +<p>As beneficial as the DuckDB TPC-H extension is, it is non-ideal for several reasons:</p> <ol> <li>Creates data in a proprietary format, which requires export to use in other systems.</li> <li>Requires significant time (e.g. 17 minutes for Scale Factor 10).</li> @@ -220,7 +220,7 @@ MiDAS Fall 2024 (Data Systems Seminar)</a>, <a href="https://www.youtub thanks to <a href="https://x.com/KurtFehlhauer">@KurtFehlhauer</a></p> <h1>How: The Journey</h1> <p>We did it together as a team in the open over the course of a few weeks. -<a href="https://github.com/lmwnshn">Wan Shen Lim</a> inspired the project by pointing out the benefits of <a href="https://github.com/apache/datafusion/issues/14373">easy TPCH +<a href="https://github.com/lmwnshn">Wan Shen Lim</a> inspired the project by pointing out the benefits of <a href="https://github.com/apache/datafusion/issues/14373">easy TPC-H dataset creation</a> and <a href="https://github.com/apache/datafusion/issues/14608#issuecomment-2651044600">suggesting we check out a Java port on February 11, 2025</a>. Achraf made <a href="https://github.com/clflushopt/tpchgen-rs/commit/53d3402680422a15349ece0a7ea3c3f001018ba0">first commit a few days later</a> on February 16, and <a href="https://github.com/clflushopt/tpchgen-rs/commit/9bb386a4c55b8cf93ffac1b98f29b5da990ee79e">Andrew and Sean started helping on March 8, 2025</a> and we <a href="https://crates.io/crates/tpchgen/0.1.0">released version 0.1</a> on @@ -391,7 +391,7 @@ However, as described above, the TBL is annoying to work with, because</p> Datalayers for finding and <a href="https://github.com/clflushopt/tpchgen-rs/issues/73">fixing</a> <a href="https://github.com/clflushopt/tpchgen-rs/issues/65">bugs</a>) which performs at the same speed as TBL. While CSV files are far more standard than TBL, they must still be parsed prior to load and automatic type inference may not deduce the types -needed for the TPCH benchmarks (e.g. floating point vs Decimal).</p> +needed for the TPC-H benchmarks (e.g. floating point vs Decimal).</p> <p>What would be far more useful is a typed, efficient columnar format such as Apache Parquet which is supported by all modern query engines. So we <a href="https://github.com/clflushopt/tpchgen-rs/pull/71">made</a> a <a href="https://crates.io/crates/tpchgen-arrow">tpchgen-arrow</a> crate to create <a href="https://arrow.apache.org/">Apache Arrow</a> arrays directly and then <a href="https://github.com/clflushopt/tpchgen-rs/pull/61">a small @@ -449,7 +449,7 @@ target/release/tpchgen-cli -s $SCALE_FACTOR --format=parquet</code></p& <h1>Conclusion 👊🎤</h1> <p>With a few days, some fellow database nerds, and the power of Rust we made something 10x better than currently exists. We hope it inspires more research -into analytical systems using the TPCH dataset and that people build awesome +into analytical systems using the TPC-H dataset and that people build awesome things with it. For example, Sean has already added <a href="https://github.com/GlareDB/glaredb/pull/3549">on-demand generation of tables to GlareDB</a>. Please consider joining us and helping out at <a href="https://github.com/clflushopt/tpchgen-rs">https://github.com/clflushopt/tpchgen-rs</a>.</p> diff --git a/blog/index.html b/blog/index.html index 0c8d237..a6abc75 100644 --- a/blog/index.html +++ b/blog/index.html @@ -50,7 +50,7 @@ <article class="post"> <header> <div class="title"> - <h1><a href="/blog/2025/04/10/fastest-tpch-generator">`tpchgen-rs` World’s fastest open source TPCH data generator, written in Rust</a></h1> + <h1><a href="/blog/2025/04/10/fastest-tpch-generator">tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust</a></h1> <p>Posted on: Thu 10 April 2025 by Andrew Lamb, Achraf B, and Sean Smith</p> <p><!-- {% comment %} @@ -79,7 +79,7 @@ th, td { } </style> <p>3 members of the <a href="https://datafusion.apache.org/">Apache DataFusion</a> community used Rust and open source -development to build <a href="https://github.com/alamb/tpchgen-rs">tpchgen-rs</a>, a fully open TPCH data generator over 10x +development to build <a href="https://github.com/alamb/tpchgen-rs">tpchgen-rs</a>, a fully open TPC-H data generator over 10x faster than any other implementation we know of.</p> <p>It is now …</p></p> <footer> --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@datafusion.apache.org For additional commands, e-mail: commits-h...@datafusion.apache.org