(datafusion-site) branch asf-staging updated: Commit build products

github-bot Sat, 05 Apr 2025 03:27:48 -0700

This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a commit to branch asf-staging
in repository https://gitbox.apache.org/repos/asf/datafusion-site.git



The following commit(s) were added to refs/heads/asf-staging by this push:
     new 9360158  Commit build products
9360158 is described below

commit 93601582e0d915976fa71cac6454d78edb7d8312
Author: Build Pelican (action) <priv...@infra.apache.org>
AuthorDate: Sat Apr 5 10:27:08 2025 +0000

    Commit build products
---
 blog/2025/04/10/fastest-tpch-generator/index.html  | 58 +++++++++++-----------
 .../andrew-lamb-achraf-b-and-sean-smith.html       |  4 +-
 blog/category/blog.html                            |  4 +-
 blog/feed.xml                                      |  4 +-
 blog/feeds/all-en.atom.xml                         | 58 +++++++++++-----------
 .../andrew-lamb-achraf-b-and-sean-smith.atom.xml   | 58 +++++++++++-----------
 .../andrew-lamb-achraf-b-and-sean-smith.rss.xml    |  4 +-
 blog/feeds/blog.atom.xml                           | 58 +++++++++++-----------
 blog/index.html                                    |  4 +-
 9 files changed, 126 insertions(+), 126 deletions(-)

diff --git a/blog/2025/04/10/fastest-tpch-generator/index.html 
b/blog/2025/04/10/fastest-tpch-generator/index.html
index d2143d7..bdc8088 100644
--- a/blog/2025/04/10/fastest-tpch-generator/index.html
+++ b/blog/2025/04/10/fastest-tpch-generator/index.html
@@ -4,7 +4,7 @@
     <meta charset="utf-8">
     <meta http-equiv="x-ua-compatible" content="ie=edge">
     <meta name="viewport" content="width=device-width, initial-scale=1.0">
-    <title>`tpchgen-rs` World’s fastest open source TPCH data generator, 
written in Rust - Apache DataFusion Blog</title>
+    <title>tpchgen-rs World’s fastest open source TPC-H data generator, 
written in Rust - Apache DataFusion Blog</title>
 <link href="/blog/css/bootstrap.min.css" rel="stylesheet">
 <link href="/blog/css/fontawesome.all.min.css" rel="stylesheet">
 <link href="/blog/css/headerlink.css" rel="stylesheet">
@@ -40,7 +40,7 @@
     <div class="bg-white p-5 rounded">
         <div class="col-sm-8 mx-auto">
           <h1>
-              `tpchgen-rs` World’s fastest open source TPCH data generator, 
written in Rust
+              tpchgen-rs World’s fastest open source TPC-H data generator, 
written in Rust
           </h1>
               <p>Posted on: Thu 10 April 2025 by Andrew Lamb, Achraf B, and 
Sean Smith</p>
               <!--
@@ -70,16 +70,16 @@ th, td {
 }
 </style>
 <p>3 members of the <a href="https://datafusion.apache.org/";>Apache 
DataFusion</a> community used Rust and open source
-development to build <a 
href="https://github.com/alamb/tpchgen-rs";>tpchgen-rs</a>, a fully open TPCH 
data generator over 10x
+development to build <a 
href="https://github.com/alamb/tpchgen-rs";>tpchgen-rs</a>, a fully open TPC-H 
data generator over 10x
 faster than any other implementation  we know of.</p>
-<p>It is now possible to create the TPCH SF=100 dataset in 72.23 seconds (1.4 
GB/s
+<p>It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 
GB/s
 😎) on a Macbook Air M3 with 16GB of memory, compared to the classic 
<code>dbgen</code>
 which takes 30 minutes[^1] (0.05GB/sec). On the same machine, it takes less 
than
 2 minutes to create all 3.6 GB of SF=100 in <a 
href="https://parquet.apache.org/";>Apache Parquet</a> format.</p>
-<p>Finally, it is convenient and efficient to run TPCH queries locally when 
testing
+<p>Finally, it is convenient and efficient to run TPC-H queries locally when 
testing
 analytical engines such as DataFusion.</p>
-<p><img alt="Time to create TPCH parquet dataset for Scale Factor  1, 10, 100 
and 1000" class="img-responsive" 
src="/blog/images/fastest-tpch-generator/parquet-performance.png" 
width="80%"/></p>
-<p><strong>Figure 1</strong>: Time to create TPCH dataset for Scale Factor 
(see below) 1, 10,
+<p><img alt="Time to create TPC-H parquet dataset for Scale Factor  1, 10, 100 
and 1000" class="img-responsive" 
src="/blog/images/fastest-tpch-generator/parquet-performance.png" 
width="80%"/></p>
+<p><strong>Figure 1</strong>: Time to create TPC-H dataset for Scale Factor 
(see below) 1, 10,
 100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core 
GCP
 VM. For Scale Factor(SF) 100 <code>tpchgen</code> takes 1 minute and 14 
seconds and
 <a href="https://duckdb.org";>DuckDB</a> takes 17 minutes and 48 seconds. For 
SF=1000, <code>tpchgen</code> takes 10
@@ -87,9 +87,9 @@ minutes and 26 and uses about 5 GB of RAM at peak, and we 
could not measure
 DuckDB&rsquo;s time as it <a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator";>requires
 647 GB of RAM</a>, more than the 88 GB that was
 available on our test machine. The testing methodology is in the
 <a 
href="https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md";>documentation</a>.</p>
-<p>This blog explains what TPCH is, how we ported the vintage C data generator 
to
+<p>This blog explains what TPC-H is, how we ported the vintage C data 
generator to
 Rust (yes, <a 
href="https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/";>RWIR</a>)
 and optimized its performance over the course of a few weeks
-of part-time work. We began this project so we can easily generate TPCH data in
+of part-time work. We began this project so we can easily generate TPC-H data 
in
 <a href="https://datafusion.apache.org/";>Apache DataFusion</a> and <a 
href="https://glaredb.com/";>GlareDB</a>.</p>
 <h1>Try if for yourself</h1>
 <p>The tool is entirely open source under the <a 
href="https://www.apache.org/licenses/LICENSE-2.0";>Apache 2.0 license</a>. 
Visit the <a href="https://github.com/clflushopt/tpchgen-rs";>tpchgen-rs 
repository</a> or try it for yourself by run the following commands after <a 
href="https://www.rust-lang.org/tools/install";>installing Rust</a>:</p>
@@ -101,22 +101,22 @@ $ tpchgen-cli -s <span class="m">1</span>
 <span class="c1"># create SF=10 in Parquet</span>
 $ tpchgen-cli -s <span class="m">10</span> --format<span 
class="o">=</span>parquet
 </code></pre></div>
-<h1>What is TPCH / dbgen?</h1>
-<p>The popular <a href="https://www.tpc.org/tpch/";>TPC-H</a> benchmark 
(commonly referred to as TPCH) helps evaluate the
+<h1>What is TPC-H / dbgen?</h1>
+<p>The popular <a href="https://www.tpc.org/tpch/";>TPC-H</a> benchmark (often 
referred to as TPCH) helps evaluate the
 performance of database systems on <a 
href="https://en.wikipedia.org/wiki/Online_analytical_processing";>OLAP</a> 
queries<em>, </em>the kind used to build BI
 dashboards.</p>
-<p>TPCH has become a de facto standard for analytic systems. While there are 
<a href="https://www.vldb.org/pvldb/vol9/p204-leis.pdf";>well
+<p>TPC-H has become a de facto standard for analytic systems. While there are 
<a href="https://www.vldb.org/pvldb/vol9/p204-leis.pdf";>well
 known</a> limitations as the data and queries do not well represent many real 
world
 use cases, the majority of analytic database papers and industrial systems 
still
-use TPCH query performance benchmarks as a baseline. You will inevitably find
+use TPC-H query performance benchmarks as a baseline. You will inevitably find
 multiple results for  &ldquo;<code>TPCH Performance &amp;lt;your favorite 
database&gt;</code>&rdquo; in any
 search engine.</p>
 <p>The benchmark was created at a time when access to high performance 
analytical
 systems was not widespread, so the <a href="https://www.tpc.org/";>Transaction 
Processing Performance Council</a>
 defined a process of formal result verification. More recently, given the broad
 availability of free and open source database systems, it is common for users 
to
-run and verify TPCH performance themselves.</p>
-<p>TPCH simulates a business environment with eight tables: 
<code>REGION</code>, <code>NATION</code>,
+run and verify TPC-H performance themselves.</p>
+<p>TPC-H simulates a business environment with eight tables: 
<code>REGION</code>, <code>NATION</code>,
 <code>SUPPLIER</code>, <code>CUSTOMER</code>, <code>PART</code>, 
<code>PARTSUPP</code>, <code>ORDERS</code>, and <code>LINEITEM</code>. These
 tables are linked by foreign keys in a normalized schema representing a supply
 chain with parts, suppliers, customers and orders. The benchmark itself is 22
@@ -184,10 +184,10 @@ bound on the Scale Factor.</p>
    </td>
 </tr>
 </table>
-<p><strong>Table 1</strong>: TPCH data set sizes at different scale factors 
for both TBL and <a href="https://parquet.apache.org/";>Apache Parquet</a>.</p>
-<h1>Why do we need a new TPCH Data generator?</h1>
-<p>Despite the known limitations of the TPCH benchmark, it is so well known 
that it
-is used frequently in database performance analysis. To run TPCH, you must 
first
+<p><strong>Table 1</strong>: TPC-H data set sizes at different scale factors 
for both TBL and <a href="https://parquet.apache.org/";>Apache Parquet</a>.</p>
+<h1>Why do we need a new TPC-H Data generator?</h1>
+<p>Despite the known limitations of the TPC-H benchmark, it is so well known 
that it
+is used frequently in database performance analysis. To run TPC-H, you must 
first
 load the data, using <code>dbgen</code>, which is not ideal for several 
reasons:</p>
 <ol>
 <li>You must find and compile a copy of the 15+ year old C program (for 
example <a 
href="https://github.com/electrum/tpch-dbgen";>electrum/tpch-dbgen</a>)</li>
@@ -195,21 +195,21 @@ load the data, using <code>dbgen</code>, which is not 
ideal for several reasons:
 <li>It outputs TBL format, which typically requires loading into your database 
(for example, <a 
href="https://github.com/apache/datafusion/blob/507f6b6773deac69dd9d90dbe60831f5ea5abed1/datafusion/sqllogictest/test_files/tpch/create_tables.slt.part#L24-L124";>here
 is how to do so</a> in Apache DataFusion) prior to query.</li>
 <li>The implementation makes substantial assumptions about the operating 
environment, making it difficult to extend or embed into other systems.[^2]</li>
 </ol>
-<p><img alt="Time to generate TPCH data in TBL format" class="img-responsive" 
src="/blog/images/fastest-tpch-generator/tbl-performance.png" width="80%"/></p>
-<p><strong>Figure 3</strong>: Time to generate TPCH data in TBL format. The 
default <code>tpchgen</code> is
+<p><img alt="Time to generate TPC-H data in TBL format" class="img-responsive" 
src="/blog/images/fastest-tpch-generator/tbl-performance.png" width="80%"/></p>
+<p><strong>Figure 3</strong>: Time to generate TPC-H data in TBL format. The 
default <code>tpchgen</code> is
 shown in blue. <code>tpchgen</code> restricted to a single core is shown in 
red. Unmodified
 <code>dbgen</code> is shown in green and <code>dbgen</code> modified to use 
<code>-O3</code> optimization level
 is shown in yellow.</p>
 <p><code>dbgen</code> is so inconvenient and takes so long that vendors often 
provide
-preloaded TPCH data, for example [Snowflake Sample Data], <a 
href="https://docs.databricks.com/aws/en/discover/databricks-datasets";>DataBricks
 Sample
+preloaded TPC-H data, for example [Snowflake Sample Data], <a 
href="https://docs.databricks.com/aws/en/discover/databricks-datasets";>DataBricks
 Sample
 datasets</a> and <a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets";>DuckDB
 Pre-Generated Data Sets</a>.</p>
-<p>In addition to pre-generated datasets, DuckDB also provides a <a 
href="https://duckdb.org/docs/stable/extensions/tpch.html";>TPCH extension</a> 
-for generating TPCH datasets within DuckDB. This is so much easier to use than
+<p>In addition to pre-generated datasets, DuckDB also provides a [TPCH 
extension] 
+for generating TPC-H datasets within DuckDB. This is so much easier to use than
 the current alternatives that it leads many researchers and other thought
 leaders to use DuckDB to evaluate new ideas. For example, <a 
href="https://github.com/lmwnshn";>Wan Shen
-Lim</a>explicitly <a 
href="https://github.com/apache/datafusion/issues/14373";>mentioned the ease of 
creating the TPCH dataset</a> as one reason
+Lim</a>explicitly <a 
href="https://github.com/apache/datafusion/issues/14373";>mentioned the ease of 
creating the TPC-H dataset</a> as one reason
 the first student project of <a 
href="https://15799.courses.cs.cmu.edu/spring2025/";>CMU-799 Spring 2025</a> 
used DuckDB.</p>
-<p>As beneficial as the DuckDB TPCH extension is, it is non-ideal for several 
reasons:</p>
+<p>As beneficial as the DuckDB TPC-H extension is, it is non-ideal for several 
reasons:</p>
 <ol>
 <li>Creates data in a proprietary format, which requires export to use in 
other systems.</li>
 <li>Requires significant time (e.g. 17 minutes for Scale Factor 10).</li>
@@ -235,7 +235,7 @@ MiDAS Fall 2024 (Data Systems Seminar)</a>, <a 
href="https://www.youtube.com/wat
 thanks to <a href="https://x.com/KurtFehlhauer";>@KurtFehlhauer</a></p>
 <h1>How: The Journey</h1>
 <p>We did it together as a team in the open over the course of a few weeks.
-<a href="https://github.com/lmwnshn";>Wan Shen Lim</a> inspired the project by 
pointing out the benefits of <a 
href="https://github.com/apache/datafusion/issues/14373";>easy TPCH
+<a href="https://github.com/lmwnshn";>Wan Shen Lim</a> inspired the project by 
pointing out the benefits of <a 
href="https://github.com/apache/datafusion/issues/14373";>easy TPC-H
 dataset creation</a>  and <a 
href="https://github.com/apache/datafusion/issues/14608#issuecomment-2651044600";>suggesting
 we check out a Java port on February 11,
 2025</a>. Achraf made <a 
href="https://github.com/clflushopt/tpchgen-rs/commit/53d3402680422a15349ece0a7ea3c3f001018ba0";>first
 commit a few days later</a> on February 16, and <a 
href="https://github.com/clflushopt/tpchgen-rs/commit/9bb386a4c55b8cf93ffac1b98f29b5da990ee79e";>Andrew
 and Sean started helping on March 8, 2025</a> and we <a 
href="https://crates.io/crates/tpchgen/0.1.0";>released version 0.1</a> on
@@ -406,7 +406,7 @@ However, as described above, the TBL is annoying to work 
with, because</p>
 Datalayers for finding and <a 
href="https://github.com/clflushopt/tpchgen-rs/issues/73";>fixing</a> <a 
href="https://github.com/clflushopt/tpchgen-rs/issues/65";>bugs</a>) which 
performs at the same
 speed as TBL. While CSV files are far more standard than TBL, they must still 
be
 parsed prior to load and automatic type inference may not deduce the types
-needed for the TPCH benchmarks (e.g. floating point vs Decimal).</p>
+needed for the TPC-H benchmarks (e.g. floating point vs Decimal).</p>
 <p>What would be far more useful is a typed, efficient columnar format such as
 Apache Parquet which is supported by all modern query engines. So we <a 
href="https://github.com/clflushopt/tpchgen-rs/pull/71";>made</a> a
 <a href="https://crates.io/crates/tpchgen-arrow";>tpchgen-arrow</a> crate to 
create <a href="https://arrow.apache.org/";>Apache Arrow</a> arrays directly and 
then <a href="https://github.com/clflushopt/tpchgen-rs/pull/61";>a small
@@ -464,7 +464,7 @@ target/release/tpchgen-cli -s $SCALE_FACTOR 
--format=parquet</code></p>
 <h1>Conclusion 👊🎤</h1>
 <p>With a few days, some fellow database nerds, and the power of Rust we made
 something 10x better than currently exists. We hope it inspires more research
-into analytical systems using the TPCH dataset and that people build awesome
+into analytical systems using the TPC-H dataset and that people build awesome
 things with it. For example, Sean has already added <a 
href="https://github.com/GlareDB/glaredb/pull/3549";>on-demand generation of
 tables to GlareDB</a>. Please consider joining us and helping out at
 <a 
href="https://github.com/clflushopt/tpchgen-rs";>https://github.com/clflushopt/tpchgen-rs</a>.</p>
diff --git a/blog/author/andrew-lamb-achraf-b-and-sean-smith.html 
b/blog/author/andrew-lamb-achraf-b-and-sean-smith.html
index 5af3895..5e6410d 100644
--- a/blog/author/andrew-lamb-achraf-b-and-sean-smith.html
+++ b/blog/author/andrew-lamb-achraf-b-and-sean-smith.html
@@ -53,7 +53,7 @@
             <article class="post">
                 <header>
                     <div class="title">
-                        <h1><a 
href="/blog/2025/04/10/fastest-tpch-generator">`tpchgen-rs` World’s fastest 
open source TPCH data generator, written in Rust</a></h1>
+                        <h1><a 
href="/blog/2025/04/10/fastest-tpch-generator">tpchgen-rs World’s fastest open 
source TPC-H data generator, written in Rust</a></h1>
                         <p>Posted on: Thu 10 April 2025 by Andrew Lamb, Achraf 
B, and Sean Smith</p>
                         <p><!--
 {% comment %}
@@ -82,7 +82,7 @@ th, td {
 }
 </style>
 <p>3 members of the <a href="https://datafusion.apache.org/";>Apache 
DataFusion</a> community used Rust and open source
-development to build <a 
href="https://github.com/alamb/tpchgen-rs";>tpchgen-rs</a>, a fully open TPCH 
data generator over 10x
+development to build <a 
href="https://github.com/alamb/tpchgen-rs";>tpchgen-rs</a>, a fully open TPC-H 
data generator over 10x
 faster than any other implementation  we know of.</p>
 <p>It is now …</p></p>
                         <footer>
diff --git a/blog/category/blog.html b/blog/category/blog.html
index 27bfed8..2bee2c9 100644
--- a/blog/category/blog.html
+++ b/blog/category/blog.html
@@ -53,7 +53,7 @@
             <article class="post">
                 <header>
                     <div class="title">
-                        <h1><a 
href="/blog/2025/04/10/fastest-tpch-generator">`tpchgen-rs` World’s fastest 
open source TPCH data generator, written in Rust</a></h1>
+                        <h1><a 
href="/blog/2025/04/10/fastest-tpch-generator">tpchgen-rs World’s fastest open 
source TPC-H data generator, written in Rust</a></h1>
                         <p>Posted on: Thu 10 April 2025 by Andrew Lamb, Achraf 
B, and Sean Smith</p>
                         <p><!--
 {% comment %}
@@ -82,7 +82,7 @@ th, td {
 }
 </style>
 <p>3 members of the <a href="https://datafusion.apache.org/";>Apache 
DataFusion</a> community used Rust and open source
-development to build <a 
href="https://github.com/alamb/tpchgen-rs";>tpchgen-rs</a>, a fully open TPCH 
data generator over 10x
+development to build <a 
href="https://github.com/alamb/tpchgen-rs";>tpchgen-rs</a>, a fully open TPC-H 
data generator over 10x
 faster than any other implementation  we know of.</p>
 <p>It is now …</p></p>
                         <footer>
diff --git a/blog/feed.xml b/blog/feed.xml
index e383569..21826f3 100644
--- a/blog/feed.xml
+++ b/blog/feed.xml
@@ -1,5 +1,5 @@
 <?xml version="1.0" encoding="utf-8"?>
-<rss version="2.0"><channel><title>Apache DataFusion 
Blog</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Thu,
 10 Apr 2025 00:00:00 +0000</lastBuildDate><item><title>`tpchgen-rs` World’s 
fastest open source TPCH data generator, written in 
Rust</title><link>https://datafusion.apache.org/blog/2025/04/10/fastest-tpch-generator</link><description>&lt;!--
+<rss version="2.0"><channel><title>Apache DataFusion 
Blog</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Thu,
 10 Apr 2025 00:00:00 +0000</lastBuildDate><item><title>tpchgen-rs World’s 
fastest open source TPC-H data generator, written in 
Rust</title><link>https://datafusion.apache.org/blog/2025/04/10/fastest-tpch-generator</link><description>&lt;!--
 {% comment %}
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
@@ -26,7 +26,7 @@ th, td {
 }
 &lt;/style&gt;
 &lt;p&gt;3 members of the &lt;a 
href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; community 
used Rust and open source
-development to build &lt;a 
href="https://github.com/alamb/tpchgen-rs"&gt;tpchgen-rs&lt;/a&gt;, a fully 
open TPCH data generator over 10x
+development to build &lt;a 
href="https://github.com/alamb/tpchgen-rs"&gt;tpchgen-rs&lt;/a&gt;, a fully 
open TPC-H data generator over 10x
 faster than any other implementation  we know of.&lt;/p&gt;
 &lt;p&gt;It is now …&lt;/p&gt;</description><dc:creator 
xmlns:dc="http://purl.org/dc/elements/1.1/";>Andrew Lamb, Achraf B, and Sean 
Smith</dc:creator><pubDate>Thu, 10 Apr 2025 00:00:00 +0000</pubDate><guid 
isPermaLink="false">tag:datafusion.apache.org,2025-04-10:/blog/2025/04/10/fastest-tpch-generator</guid><category>blog</category></item><item><title>Apache
 DataFusion 46.0.0 
Released</title><link>https://datafusion.apache.org/blog/2025/03/24/datafusion-46.0.0</link><description>&lt;!--
 {% comment %}
diff --git a/blog/feeds/all-en.atom.xml b/blog/feeds/all-en.atom.xml
index 4b3c1ec..6a299ae 100644
--- a/blog/feeds/all-en.atom.xml
+++ b/blog/feeds/all-en.atom.xml
@@ -1,5 +1,5 @@
 <?xml version="1.0" encoding="utf-8"?>
-<feed xmlns="http://www.w3.org/2005/Atom";><title>Apache DataFusion 
Blog</title><link href="https://datafusion.apache.org/blog/"; 
rel="alternate"></link><link 
href="https://datafusion.apache.org/blog/feeds/all-en.atom.xml"; 
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-04-10T00:00:00+00:00</updated><subtitle></subtitle><entry><title>`tpchgen-rs`
 World’s fastest open source TPCH data generator, written in Rust</title><link 
href="https://datafusion.apache.org/blo [...]
+<feed xmlns="http://www.w3.org/2005/Atom";><title>Apache DataFusion 
Blog</title><link href="https://datafusion.apache.org/blog/"; 
rel="alternate"></link><link 
href="https://datafusion.apache.org/blog/feeds/all-en.atom.xml"; 
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-04-10T00:00:00+00:00</updated><subtitle></subtitle><entry><title>tpchgen-rs
 World’s fastest open source TPC-H data generator, written in Rust</title><link 
href="https://datafusion.apache.org/blog [...]
 {% comment %}
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
@@ -26,7 +26,7 @@ th, td {
 }
 &lt;/style&gt;
 &lt;p&gt;3 members of the &lt;a 
href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; community 
used Rust and open source
-development to build &lt;a 
href="https://github.com/alamb/tpchgen-rs"&gt;tpchgen-rs&lt;/a&gt;, a fully 
open TPCH data generator over 10x
+development to build &lt;a 
href="https://github.com/alamb/tpchgen-rs"&gt;tpchgen-rs&lt;/a&gt;, a fully 
open TPC-H data generator over 10x
 faster than any other implementation  we know of.&lt;/p&gt;
 &lt;p&gt;It is now …&lt;/p&gt;</summary><content type="html">&lt;!--
 {% comment %}
@@ -55,16 +55,16 @@ th, td {
 }
 &lt;/style&gt;
 &lt;p&gt;3 members of the &lt;a 
href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; community 
used Rust and open source
-development to build &lt;a 
href="https://github.com/alamb/tpchgen-rs"&gt;tpchgen-rs&lt;/a&gt;, a fully 
open TPCH data generator over 10x
+development to build &lt;a 
href="https://github.com/alamb/tpchgen-rs"&gt;tpchgen-rs&lt;/a&gt;, a fully 
open TPC-H data generator over 10x
 faster than any other implementation  we know of.&lt;/p&gt;
-&lt;p&gt;It is now possible to create the TPCH SF=100 dataset in 72.23 seconds 
(1.4 GB/s
+&lt;p&gt;It is now possible to create the TPC-H SF=100 dataset in 72.23 
seconds (1.4 GB/s
 😎) on a Macbook Air M3 with 16GB of memory, compared to the classic 
&lt;code&gt;dbgen&lt;/code&gt;
 which takes 30 minutes[^1] (0.05GB/sec). On the same machine, it takes less 
than
 2 minutes to create all 3.6 GB of SF=100 in &lt;a 
href="https://parquet.apache.org/"&gt;Apache Parquet&lt;/a&gt; format.&lt;/p&gt;
-&lt;p&gt;Finally, it is convenient and efficient to run TPCH queries locally 
when testing
+&lt;p&gt;Finally, it is convenient and efficient to run TPC-H queries locally 
when testing
 analytical engines such as DataFusion.&lt;/p&gt;
-&lt;p&gt;&lt;img alt="Time to create TPCH parquet dataset for Scale Factor  1, 
10, 100 and 1000" class="img-responsive" 
src="/blog/images/fastest-tpch-generator/parquet-performance.png" 
width="80%"/&gt;&lt;/p&gt;
-&lt;p&gt;&lt;strong&gt;Figure 1&lt;/strong&gt;: Time to create TPCH dataset 
for Scale Factor (see below) 1, 10,
+&lt;p&gt;&lt;img alt="Time to create TPC-H parquet dataset for Scale Factor  
1, 10, 100 and 1000" class="img-responsive" 
src="/blog/images/fastest-tpch-generator/parquet-performance.png" 
width="80%"/&gt;&lt;/p&gt;
+&lt;p&gt;&lt;strong&gt;Figure 1&lt;/strong&gt;: Time to create TPC-H dataset 
for Scale Factor (see below) 1, 10,
 100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core 
GCP
 VM. For Scale Factor(SF) 100 &lt;code&gt;tpchgen&lt;/code&gt; takes 1 minute 
and 14 seconds and
 &lt;a href="https://duckdb.org"&gt;DuckDB&lt;/a&gt; takes 17 minutes and 48 
seconds. For SF=1000, &lt;code&gt;tpchgen&lt;/code&gt; takes 10
@@ -72,9 +72,9 @@ minutes and 26 and uses about 5 GB of RAM at peak, and we 
could not measure
 DuckDB&amp;rsquo;s time as it &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator"&gt;requires
 647 GB of RAM&lt;/a&gt;, more than the 88 GB that was
 available on our test machine. The testing methodology is in the
 &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md"&gt;documentation&lt;/a&gt;.&lt;/p&gt;
-&lt;p&gt;This blog explains what TPCH is, how we ported the vintage C data 
generator to
+&lt;p&gt;This blog explains what TPC-H is, how we ported the vintage C data 
generator to
 Rust (yes, &lt;a 
href="https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/"&gt;RWIR&lt;/a&gt;)
 and optimized its performance over the course of a few weeks
-of part-time work. We began this project so we can easily generate TPCH data in
+of part-time work. We began this project so we can easily generate TPC-H data 
in
 &lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; and 
&lt;a href="https://glaredb.com/"&gt;GlareDB&lt;/a&gt;.&lt;/p&gt;
 &lt;h1&gt;Try if for yourself&lt;/h1&gt;
 &lt;p&gt;The tool is entirely open source under the &lt;a 
href="https://www.apache.org/licenses/LICENSE-2.0"&gt;Apache 2.0 
license&lt;/a&gt;. Visit the &lt;a 
href="https://github.com/clflushopt/tpchgen-rs"&gt;tpchgen-rs 
repository&lt;/a&gt; or try it for yourself by run the following commands after 
&lt;a href="https://www.rust-lang.org/tools/install"&gt;installing 
Rust&lt;/a&gt;:&lt;/p&gt;
@@ -86,22 +86,22 @@ $ tpchgen-cli -s &lt;span class="m"&gt;1&lt;/span&gt;
 &lt;span class="c1"&gt;# create SF=10 in Parquet&lt;/span&gt;
 $ tpchgen-cli -s &lt;span class="m"&gt;10&lt;/span&gt; --format&lt;span 
class="o"&gt;=&lt;/span&gt;parquet
 &lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
-&lt;h1&gt;What is TPCH / dbgen?&lt;/h1&gt;
-&lt;p&gt;The popular &lt;a href="https://www.tpc.org/tpch/"&gt;TPC-H&lt;/a&gt; 
benchmark (commonly referred to as TPCH) helps evaluate the
+&lt;h1&gt;What is TPC-H / dbgen?&lt;/h1&gt;
+&lt;p&gt;The popular &lt;a href="https://www.tpc.org/tpch/"&gt;TPC-H&lt;/a&gt; 
benchmark (often referred to as TPCH) helps evaluate the
 performance of database systems on &lt;a 
href="https://en.wikipedia.org/wiki/Online_analytical_processing"&gt;OLAP&lt;/a&gt;
 queries&lt;em&gt;, &lt;/em&gt;the kind used to build BI
 dashboards.&lt;/p&gt;
-&lt;p&gt;TPCH has become a de facto standard for analytic systems. While there 
are &lt;a href="https://www.vldb.org/pvldb/vol9/p204-leis.pdf"&gt;well
+&lt;p&gt;TPC-H has become a de facto standard for analytic systems. While 
there are &lt;a href="https://www.vldb.org/pvldb/vol9/p204-leis.pdf"&gt;well
 known&lt;/a&gt; limitations as the data and queries do not well represent many 
real world
 use cases, the majority of analytic database papers and industrial systems 
still
-use TPCH query performance benchmarks as a baseline. You will inevitably find
+use TPC-H query performance benchmarks as a baseline. You will inevitably find
 multiple results for  &amp;ldquo;&lt;code&gt;TPCH Performance &amp;amp;lt;your 
favorite database&amp;gt;&lt;/code&gt;&amp;rdquo; in any
 search engine.&lt;/p&gt;
 &lt;p&gt;The benchmark was created at a time when access to high performance 
analytical
 systems was not widespread, so the &lt;a 
href="https://www.tpc.org/"&gt;Transaction Processing Performance 
Council&lt;/a&gt;
 defined a process of formal result verification. More recently, given the broad
 availability of free and open source database systems, it is common for users 
to
-run and verify TPCH performance themselves.&lt;/p&gt;
-&lt;p&gt;TPCH simulates a business environment with eight tables: 
&lt;code&gt;REGION&lt;/code&gt;, &lt;code&gt;NATION&lt;/code&gt;,
+run and verify TPC-H performance themselves.&lt;/p&gt;
+&lt;p&gt;TPC-H simulates a business environment with eight tables: 
&lt;code&gt;REGION&lt;/code&gt;, &lt;code&gt;NATION&lt;/code&gt;,
 &lt;code&gt;SUPPLIER&lt;/code&gt;, &lt;code&gt;CUSTOMER&lt;/code&gt;, 
&lt;code&gt;PART&lt;/code&gt;, &lt;code&gt;PARTSUPP&lt;/code&gt;, 
&lt;code&gt;ORDERS&lt;/code&gt;, and &lt;code&gt;LINEITEM&lt;/code&gt;. These
 tables are linked by foreign keys in a normalized schema representing a supply
 chain with parts, suppliers, customers and orders. The benchmark itself is 22
@@ -169,10 +169,10 @@ bound on the Scale Factor.&lt;/p&gt;
    &lt;/td&gt;
 &lt;/tr&gt;
 &lt;/table&gt;
-&lt;p&gt;&lt;strong&gt;Table 1&lt;/strong&gt;: TPCH data set sizes at 
different scale factors for both TBL and &lt;a 
href="https://parquet.apache.org/"&gt;Apache Parquet&lt;/a&gt;.&lt;/p&gt;
-&lt;h1&gt;Why do we need a new TPCH Data generator?&lt;/h1&gt;
-&lt;p&gt;Despite the known limitations of the TPCH benchmark, it is so well 
known that it
-is used frequently in database performance analysis. To run TPCH, you must 
first
+&lt;p&gt;&lt;strong&gt;Table 1&lt;/strong&gt;: TPC-H data set sizes at 
different scale factors for both TBL and &lt;a 
href="https://parquet.apache.org/"&gt;Apache Parquet&lt;/a&gt;.&lt;/p&gt;
+&lt;h1&gt;Why do we need a new TPC-H Data generator?&lt;/h1&gt;
+&lt;p&gt;Despite the known limitations of the TPC-H benchmark, it is so well 
known that it
+is used frequently in database performance analysis. To run TPC-H, you must 
first
 load the data, using &lt;code&gt;dbgen&lt;/code&gt;, which is not ideal for 
several reasons:&lt;/p&gt;
 &lt;ol&gt;
 &lt;li&gt;You must find and compile a copy of the 15+ year old C program (for 
example &lt;a 
href="https://github.com/electrum/tpch-dbgen"&gt;electrum/tpch-dbgen&lt;/a&gt;)&lt;/li&gt;
@@ -180,21 +180,21 @@ load the data, using &lt;code&gt;dbgen&lt;/code&gt;, 
which is not ideal for seve
 &lt;li&gt;It outputs TBL format, which typically requires loading into your 
database (for example, &lt;a 
href="https://github.com/apache/datafusion/blob/507f6b6773deac69dd9d90dbe60831f5ea5abed1/datafusion/sqllogictest/test_files/tpch/create_tables.slt.part#L24-L124"&gt;here
 is how to do so&lt;/a&gt; in Apache DataFusion) prior to query.&lt;/li&gt;
 &lt;li&gt;The implementation makes substantial assumptions about the operating 
environment, making it difficult to extend or embed into other 
systems.[^2]&lt;/li&gt;
 &lt;/ol&gt;
-&lt;p&gt;&lt;img alt="Time to generate TPCH data in TBL format" 
class="img-responsive" 
src="/blog/images/fastest-tpch-generator/tbl-performance.png" 
width="80%"/&gt;&lt;/p&gt;
-&lt;p&gt;&lt;strong&gt;Figure 3&lt;/strong&gt;: Time to generate TPCH data in 
TBL format. The default &lt;code&gt;tpchgen&lt;/code&gt; is
+&lt;p&gt;&lt;img alt="Time to generate TPC-H data in TBL format" 
class="img-responsive" 
src="/blog/images/fastest-tpch-generator/tbl-performance.png" 
width="80%"/&gt;&lt;/p&gt;
+&lt;p&gt;&lt;strong&gt;Figure 3&lt;/strong&gt;: Time to generate TPC-H data in 
TBL format. The default &lt;code&gt;tpchgen&lt;/code&gt; is
 shown in blue. &lt;code&gt;tpchgen&lt;/code&gt; restricted to a single core is 
shown in red. Unmodified
 &lt;code&gt;dbgen&lt;/code&gt; is shown in green and 
&lt;code&gt;dbgen&lt;/code&gt; modified to use &lt;code&gt;-O3&lt;/code&gt; 
optimization level
 is shown in yellow.&lt;/p&gt;
 &lt;p&gt;&lt;code&gt;dbgen&lt;/code&gt; is so inconvenient and takes so long 
that vendors often provide
-preloaded TPCH data, for example [Snowflake Sample Data], &lt;a 
href="https://docs.databricks.com/aws/en/discover/databricks-datasets"&gt;DataBricks
 Sample
+preloaded TPC-H data, for example [Snowflake Sample Data], &lt;a 
href="https://docs.databricks.com/aws/en/discover/databricks-datasets"&gt;DataBricks
 Sample
 datasets&lt;/a&gt; and &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets"&gt;DuckDB
 Pre-Generated Data Sets&lt;/a&gt;.&lt;/p&gt;
-&lt;p&gt;In addition to pre-generated datasets, DuckDB also provides a &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html"&gt;TPCH 
extension&lt;/a&gt; 
-for generating TPCH datasets within DuckDB. This is so much easier to use than
+&lt;p&gt;In addition to pre-generated datasets, DuckDB also provides a [TPCH 
extension] 
+for generating TPC-H datasets within DuckDB. This is so much easier to use than
 the current alternatives that it leads many researchers and other thought
 leaders to use DuckDB to evaluate new ideas. For example, &lt;a 
href="https://github.com/lmwnshn"&gt;Wan Shen
-Lim&lt;/a&gt;explicitly &lt;a 
href="https://github.com/apache/datafusion/issues/14373"&gt;mentioned the ease 
of creating the TPCH dataset&lt;/a&gt; as one reason
+Lim&lt;/a&gt;explicitly &lt;a 
href="https://github.com/apache/datafusion/issues/14373"&gt;mentioned the ease 
of creating the TPC-H dataset&lt;/a&gt; as one reason
 the first student project of &lt;a 
href="https://15799.courses.cs.cmu.edu/spring2025/"&gt;CMU-799 Spring 
2025&lt;/a&gt; used DuckDB.&lt;/p&gt;
-&lt;p&gt;As beneficial as the DuckDB TPCH extension is, it is non-ideal for 
several reasons:&lt;/p&gt;
+&lt;p&gt;As beneficial as the DuckDB TPC-H extension is, it is non-ideal for 
several reasons:&lt;/p&gt;
 &lt;ol&gt;
 &lt;li&gt;Creates data in a proprietary format, which requires export to use 
in other systems.&lt;/li&gt;
 &lt;li&gt;Requires significant time (e.g. 17 minutes for Scale Factor 
10).&lt;/li&gt;
@@ -220,7 +220,7 @@ MiDAS Fall 2024 (Data Systems Seminar)&lt;/a&gt;, &lt;a 
href="https://www.youtub
 thanks to &lt;a 
href="https://x.com/KurtFehlhauer"&gt;@KurtFehlhauer&lt;/a&gt;&lt;/p&gt;
 &lt;h1&gt;How: The Journey&lt;/h1&gt;
 &lt;p&gt;We did it together as a team in the open over the course of a few 
weeks.
-&lt;a href="https://github.com/lmwnshn"&gt;Wan Shen Lim&lt;/a&gt; inspired the 
project by pointing out the benefits of &lt;a 
href="https://github.com/apache/datafusion/issues/14373"&gt;easy TPCH
+&lt;a href="https://github.com/lmwnshn"&gt;Wan Shen Lim&lt;/a&gt; inspired the 
project by pointing out the benefits of &lt;a 
href="https://github.com/apache/datafusion/issues/14373"&gt;easy TPC-H
 dataset creation&lt;/a&gt;  and &lt;a 
href="https://github.com/apache/datafusion/issues/14608#issuecomment-2651044600"&gt;suggesting
 we check out a Java port on February 11,
 2025&lt;/a&gt;. Achraf made &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/53d3402680422a15349ece0a7ea3c3f001018ba0"&gt;first
 commit a few days later&lt;/a&gt; on February 16, and &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/9bb386a4c55b8cf93ffac1b98f29b5da990ee79e"&gt;Andrew
 and Sean started helping on March 8, 2025&lt;/a&gt; and we &lt;a 
href="https://crates.io/crates/tpchgen/0.1.0"&gt;released version 0.1&lt;/a&gt; 
on
@@ -391,7 +391,7 @@ However, as described above, the TBL is annoying to work 
with, because&lt;/p&gt;
 Datalayers for finding and &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/issues/73"&gt;fixing&lt;/a&gt; 
&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/issues/65"&gt;bugs&lt;/a&gt;) 
which performs at the same
 speed as TBL. While CSV files are far more standard than TBL, they must still 
be
 parsed prior to load and automatic type inference may not deduce the types
-needed for the TPCH benchmarks (e.g. floating point vs Decimal).&lt;/p&gt;
+needed for the TPC-H benchmarks (e.g. floating point vs Decimal).&lt;/p&gt;
 &lt;p&gt;What would be far more useful is a typed, efficient columnar format 
such as
 Apache Parquet which is supported by all modern query engines. So we &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/71"&gt;made&lt;/a&gt; a
 &lt;a href="https://crates.io/crates/tpchgen-arrow"&gt;tpchgen-arrow&lt;/a&gt; 
crate to create &lt;a href="https://arrow.apache.org/"&gt;Apache 
Arrow&lt;/a&gt; arrays directly and then &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/61"&gt;a small
@@ -449,7 +449,7 @@ target/release/tpchgen-cli -s $SCALE_FACTOR 
--format=parquet&lt;/code&gt;&lt;/p&
 &lt;h1&gt;Conclusion 👊🎤&lt;/h1&gt;
 &lt;p&gt;With a few days, some fellow database nerds, and the power of Rust we 
made
 something 10x better than currently exists. We hope it inspires more research
-into analytical systems using the TPCH dataset and that people build awesome
+into analytical systems using the TPC-H dataset and that people build awesome
 things with it. For example, Sean has already added &lt;a 
href="https://github.com/GlareDB/glaredb/pull/3549"&gt;on-demand generation of
 tables to GlareDB&lt;/a&gt;. Please consider joining us and helping out at
 &lt;a 
href="https://github.com/clflushopt/tpchgen-rs"&gt;https://github.com/clflushopt/tpchgen-rs&lt;/a&gt;.&lt;/p&gt;
diff --git a/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.atom.xml 
b/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.atom.xml
index 44d2a9a..e706b2e 100644
--- a/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.atom.xml
+++ b/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.atom.xml
@@ -1,5 +1,5 @@
 <?xml version="1.0" encoding="utf-8"?>
-<feed xmlns="http://www.w3.org/2005/Atom";><title>Apache DataFusion Blog - 
Andrew Lamb, Achraf B, and Sean Smith</title><link 
href="https://datafusion.apache.org/blog/"; rel="alternate"></link><link 
href="https://datafusion.apache.org/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.atom.xml";
 
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-04-10T00:00:00+00:00</updated><subtitle></subtitle><entry><title>`tpchgen-rs`
 World’s fastest open source TPCH data generator, [...]
+<feed xmlns="http://www.w3.org/2005/Atom";><title>Apache DataFusion Blog - 
Andrew Lamb, Achraf B, and Sean Smith</title><link 
href="https://datafusion.apache.org/blog/"; rel="alternate"></link><link 
href="https://datafusion.apache.org/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.atom.xml";
 
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-04-10T00:00:00+00:00</updated><subtitle></subtitle><entry><title>tpchgen-rs
 World’s fastest open source TPC-H data generator,  [...]
 {% comment %}
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
@@ -26,7 +26,7 @@ th, td {
 }
 &lt;/style&gt;
 &lt;p&gt;3 members of the &lt;a 
href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; community 
used Rust and open source
-development to build &lt;a 
href="https://github.com/alamb/tpchgen-rs"&gt;tpchgen-rs&lt;/a&gt;, a fully 
open TPCH data generator over 10x
+development to build &lt;a 
href="https://github.com/alamb/tpchgen-rs"&gt;tpchgen-rs&lt;/a&gt;, a fully 
open TPC-H data generator over 10x
 faster than any other implementation  we know of.&lt;/p&gt;
 &lt;p&gt;It is now …&lt;/p&gt;</summary><content type="html">&lt;!--
 {% comment %}
@@ -55,16 +55,16 @@ th, td {
 }
 &lt;/style&gt;
 &lt;p&gt;3 members of the &lt;a 
href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; community 
used Rust and open source
-development to build &lt;a 
href="https://github.com/alamb/tpchgen-rs"&gt;tpchgen-rs&lt;/a&gt;, a fully 
open TPCH data generator over 10x
+development to build &lt;a 
href="https://github.com/alamb/tpchgen-rs"&gt;tpchgen-rs&lt;/a&gt;, a fully 
open TPC-H data generator over 10x
 faster than any other implementation  we know of.&lt;/p&gt;
-&lt;p&gt;It is now possible to create the TPCH SF=100 dataset in 72.23 seconds 
(1.4 GB/s
+&lt;p&gt;It is now possible to create the TPC-H SF=100 dataset in 72.23 
seconds (1.4 GB/s
 😎) on a Macbook Air M3 with 16GB of memory, compared to the classic 
&lt;code&gt;dbgen&lt;/code&gt;
 which takes 30 minutes[^1] (0.05GB/sec). On the same machine, it takes less 
than
 2 minutes to create all 3.6 GB of SF=100 in &lt;a 
href="https://parquet.apache.org/"&gt;Apache Parquet&lt;/a&gt; format.&lt;/p&gt;
-&lt;p&gt;Finally, it is convenient and efficient to run TPCH queries locally 
when testing
+&lt;p&gt;Finally, it is convenient and efficient to run TPC-H queries locally 
when testing
 analytical engines such as DataFusion.&lt;/p&gt;
-&lt;p&gt;&lt;img alt="Time to create TPCH parquet dataset for Scale Factor  1, 
10, 100 and 1000" class="img-responsive" 
src="/blog/images/fastest-tpch-generator/parquet-performance.png" 
width="80%"/&gt;&lt;/p&gt;
-&lt;p&gt;&lt;strong&gt;Figure 1&lt;/strong&gt;: Time to create TPCH dataset 
for Scale Factor (see below) 1, 10,
+&lt;p&gt;&lt;img alt="Time to create TPC-H parquet dataset for Scale Factor  
1, 10, 100 and 1000" class="img-responsive" 
src="/blog/images/fastest-tpch-generator/parquet-performance.png" 
width="80%"/&gt;&lt;/p&gt;
+&lt;p&gt;&lt;strong&gt;Figure 1&lt;/strong&gt;: Time to create TPC-H dataset 
for Scale Factor (see below) 1, 10,
 100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core 
GCP
 VM. For Scale Factor(SF) 100 &lt;code&gt;tpchgen&lt;/code&gt; takes 1 minute 
and 14 seconds and
 &lt;a href="https://duckdb.org"&gt;DuckDB&lt;/a&gt; takes 17 minutes and 48 
seconds. For SF=1000, &lt;code&gt;tpchgen&lt;/code&gt; takes 10
@@ -72,9 +72,9 @@ minutes and 26 and uses about 5 GB of RAM at peak, and we 
could not measure
 DuckDB&amp;rsquo;s time as it &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator"&gt;requires
 647 GB of RAM&lt;/a&gt;, more than the 88 GB that was
 available on our test machine. The testing methodology is in the
 &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md"&gt;documentation&lt;/a&gt;.&lt;/p&gt;
-&lt;p&gt;This blog explains what TPCH is, how we ported the vintage C data 
generator to
+&lt;p&gt;This blog explains what TPC-H is, how we ported the vintage C data 
generator to
 Rust (yes, &lt;a 
href="https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/"&gt;RWIR&lt;/a&gt;)
 and optimized its performance over the course of a few weeks
-of part-time work. We began this project so we can easily generate TPCH data in
+of part-time work. We began this project so we can easily generate TPC-H data 
in
 &lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; and 
&lt;a href="https://glaredb.com/"&gt;GlareDB&lt;/a&gt;.&lt;/p&gt;
 &lt;h1&gt;Try if for yourself&lt;/h1&gt;
 &lt;p&gt;The tool is entirely open source under the &lt;a 
href="https://www.apache.org/licenses/LICENSE-2.0"&gt;Apache 2.0 
license&lt;/a&gt;. Visit the &lt;a 
href="https://github.com/clflushopt/tpchgen-rs"&gt;tpchgen-rs 
repository&lt;/a&gt; or try it for yourself by run the following commands after 
&lt;a href="https://www.rust-lang.org/tools/install"&gt;installing 
Rust&lt;/a&gt;:&lt;/p&gt;
@@ -86,22 +86,22 @@ $ tpchgen-cli -s &lt;span class="m"&gt;1&lt;/span&gt;
 &lt;span class="c1"&gt;# create SF=10 in Parquet&lt;/span&gt;
 $ tpchgen-cli -s &lt;span class="m"&gt;10&lt;/span&gt; --format&lt;span 
class="o"&gt;=&lt;/span&gt;parquet
 &lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
-&lt;h1&gt;What is TPCH / dbgen?&lt;/h1&gt;
-&lt;p&gt;The popular &lt;a href="https://www.tpc.org/tpch/"&gt;TPC-H&lt;/a&gt; 
benchmark (commonly referred to as TPCH) helps evaluate the
+&lt;h1&gt;What is TPC-H / dbgen?&lt;/h1&gt;
+&lt;p&gt;The popular &lt;a href="https://www.tpc.org/tpch/"&gt;TPC-H&lt;/a&gt; 
benchmark (often referred to as TPCH) helps evaluate the
 performance of database systems on &lt;a 
href="https://en.wikipedia.org/wiki/Online_analytical_processing"&gt;OLAP&lt;/a&gt;
 queries&lt;em&gt;, &lt;/em&gt;the kind used to build BI
 dashboards.&lt;/p&gt;
-&lt;p&gt;TPCH has become a de facto standard for analytic systems. While there 
are &lt;a href="https://www.vldb.org/pvldb/vol9/p204-leis.pdf"&gt;well
+&lt;p&gt;TPC-H has become a de facto standard for analytic systems. While 
there are &lt;a href="https://www.vldb.org/pvldb/vol9/p204-leis.pdf"&gt;well
 known&lt;/a&gt; limitations as the data and queries do not well represent many 
real world
 use cases, the majority of analytic database papers and industrial systems 
still
-use TPCH query performance benchmarks as a baseline. You will inevitably find
+use TPC-H query performance benchmarks as a baseline. You will inevitably find
 multiple results for  &amp;ldquo;&lt;code&gt;TPCH Performance &amp;amp;lt;your 
favorite database&amp;gt;&lt;/code&gt;&amp;rdquo; in any
 search engine.&lt;/p&gt;
 &lt;p&gt;The benchmark was created at a time when access to high performance 
analytical
 systems was not widespread, so the &lt;a 
href="https://www.tpc.org/"&gt;Transaction Processing Performance 
Council&lt;/a&gt;
 defined a process of formal result verification. More recently, given the broad
 availability of free and open source database systems, it is common for users 
to
-run and verify TPCH performance themselves.&lt;/p&gt;
-&lt;p&gt;TPCH simulates a business environment with eight tables: 
&lt;code&gt;REGION&lt;/code&gt;, &lt;code&gt;NATION&lt;/code&gt;,
+run and verify TPC-H performance themselves.&lt;/p&gt;
+&lt;p&gt;TPC-H simulates a business environment with eight tables: 
&lt;code&gt;REGION&lt;/code&gt;, &lt;code&gt;NATION&lt;/code&gt;,
 &lt;code&gt;SUPPLIER&lt;/code&gt;, &lt;code&gt;CUSTOMER&lt;/code&gt;, 
&lt;code&gt;PART&lt;/code&gt;, &lt;code&gt;PARTSUPP&lt;/code&gt;, 
&lt;code&gt;ORDERS&lt;/code&gt;, and &lt;code&gt;LINEITEM&lt;/code&gt;. These
 tables are linked by foreign keys in a normalized schema representing a supply
 chain with parts, suppliers, customers and orders. The benchmark itself is 22
@@ -169,10 +169,10 @@ bound on the Scale Factor.&lt;/p&gt;
    &lt;/td&gt;
 &lt;/tr&gt;
 &lt;/table&gt;
-&lt;p&gt;&lt;strong&gt;Table 1&lt;/strong&gt;: TPCH data set sizes at 
different scale factors for both TBL and &lt;a 
href="https://parquet.apache.org/"&gt;Apache Parquet&lt;/a&gt;.&lt;/p&gt;
-&lt;h1&gt;Why do we need a new TPCH Data generator?&lt;/h1&gt;
-&lt;p&gt;Despite the known limitations of the TPCH benchmark, it is so well 
known that it
-is used frequently in database performance analysis. To run TPCH, you must 
first
+&lt;p&gt;&lt;strong&gt;Table 1&lt;/strong&gt;: TPC-H data set sizes at 
different scale factors for both TBL and &lt;a 
href="https://parquet.apache.org/"&gt;Apache Parquet&lt;/a&gt;.&lt;/p&gt;
+&lt;h1&gt;Why do we need a new TPC-H Data generator?&lt;/h1&gt;
+&lt;p&gt;Despite the known limitations of the TPC-H benchmark, it is so well 
known that it
+is used frequently in database performance analysis. To run TPC-H, you must 
first
 load the data, using &lt;code&gt;dbgen&lt;/code&gt;, which is not ideal for 
several reasons:&lt;/p&gt;
 &lt;ol&gt;
 &lt;li&gt;You must find and compile a copy of the 15+ year old C program (for 
example &lt;a 
href="https://github.com/electrum/tpch-dbgen"&gt;electrum/tpch-dbgen&lt;/a&gt;)&lt;/li&gt;
@@ -180,21 +180,21 @@ load the data, using &lt;code&gt;dbgen&lt;/code&gt;, 
which is not ideal for seve
 &lt;li&gt;It outputs TBL format, which typically requires loading into your 
database (for example, &lt;a 
href="https://github.com/apache/datafusion/blob/507f6b6773deac69dd9d90dbe60831f5ea5abed1/datafusion/sqllogictest/test_files/tpch/create_tables.slt.part#L24-L124"&gt;here
 is how to do so&lt;/a&gt; in Apache DataFusion) prior to query.&lt;/li&gt;
 &lt;li&gt;The implementation makes substantial assumptions about the operating 
environment, making it difficult to extend or embed into other 
systems.[^2]&lt;/li&gt;
 &lt;/ol&gt;
-&lt;p&gt;&lt;img alt="Time to generate TPCH data in TBL format" 
class="img-responsive" 
src="/blog/images/fastest-tpch-generator/tbl-performance.png" 
width="80%"/&gt;&lt;/p&gt;
-&lt;p&gt;&lt;strong&gt;Figure 3&lt;/strong&gt;: Time to generate TPCH data in 
TBL format. The default &lt;code&gt;tpchgen&lt;/code&gt; is
+&lt;p&gt;&lt;img alt="Time to generate TPC-H data in TBL format" 
class="img-responsive" 
src="/blog/images/fastest-tpch-generator/tbl-performance.png" 
width="80%"/&gt;&lt;/p&gt;
+&lt;p&gt;&lt;strong&gt;Figure 3&lt;/strong&gt;: Time to generate TPC-H data in 
TBL format. The default &lt;code&gt;tpchgen&lt;/code&gt; is
 shown in blue. &lt;code&gt;tpchgen&lt;/code&gt; restricted to a single core is 
shown in red. Unmodified
 &lt;code&gt;dbgen&lt;/code&gt; is shown in green and 
&lt;code&gt;dbgen&lt;/code&gt; modified to use &lt;code&gt;-O3&lt;/code&gt; 
optimization level
 is shown in yellow.&lt;/p&gt;
 &lt;p&gt;&lt;code&gt;dbgen&lt;/code&gt; is so inconvenient and takes so long 
that vendors often provide
-preloaded TPCH data, for example [Snowflake Sample Data], &lt;a 
href="https://docs.databricks.com/aws/en/discover/databricks-datasets"&gt;DataBricks
 Sample
+preloaded TPC-H data, for example [Snowflake Sample Data], &lt;a 
href="https://docs.databricks.com/aws/en/discover/databricks-datasets"&gt;DataBricks
 Sample
 datasets&lt;/a&gt; and &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets"&gt;DuckDB
 Pre-Generated Data Sets&lt;/a&gt;.&lt;/p&gt;
-&lt;p&gt;In addition to pre-generated datasets, DuckDB also provides a &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html"&gt;TPCH 
extension&lt;/a&gt; 
-for generating TPCH datasets within DuckDB. This is so much easier to use than
+&lt;p&gt;In addition to pre-generated datasets, DuckDB also provides a [TPCH 
extension] 
+for generating TPC-H datasets within DuckDB. This is so much easier to use than
 the current alternatives that it leads many researchers and other thought
 leaders to use DuckDB to evaluate new ideas. For example, &lt;a 
href="https://github.com/lmwnshn"&gt;Wan Shen
-Lim&lt;/a&gt;explicitly &lt;a 
href="https://github.com/apache/datafusion/issues/14373"&gt;mentioned the ease 
of creating the TPCH dataset&lt;/a&gt; as one reason
+Lim&lt;/a&gt;explicitly &lt;a 
href="https://github.com/apache/datafusion/issues/14373"&gt;mentioned the ease 
of creating the TPC-H dataset&lt;/a&gt; as one reason
 the first student project of &lt;a 
href="https://15799.courses.cs.cmu.edu/spring2025/"&gt;CMU-799 Spring 
2025&lt;/a&gt; used DuckDB.&lt;/p&gt;
-&lt;p&gt;As beneficial as the DuckDB TPCH extension is, it is non-ideal for 
several reasons:&lt;/p&gt;
+&lt;p&gt;As beneficial as the DuckDB TPC-H extension is, it is non-ideal for 
several reasons:&lt;/p&gt;
 &lt;ol&gt;
 &lt;li&gt;Creates data in a proprietary format, which requires export to use 
in other systems.&lt;/li&gt;
 &lt;li&gt;Requires significant time (e.g. 17 minutes for Scale Factor 
10).&lt;/li&gt;
@@ -220,7 +220,7 @@ MiDAS Fall 2024 (Data Systems Seminar)&lt;/a&gt;, &lt;a 
href="https://www.youtub
 thanks to &lt;a 
href="https://x.com/KurtFehlhauer"&gt;@KurtFehlhauer&lt;/a&gt;&lt;/p&gt;
 &lt;h1&gt;How: The Journey&lt;/h1&gt;
 &lt;p&gt;We did it together as a team in the open over the course of a few 
weeks.
-&lt;a href="https://github.com/lmwnshn"&gt;Wan Shen Lim&lt;/a&gt; inspired the 
project by pointing out the benefits of &lt;a 
href="https://github.com/apache/datafusion/issues/14373"&gt;easy TPCH
+&lt;a href="https://github.com/lmwnshn"&gt;Wan Shen Lim&lt;/a&gt; inspired the 
project by pointing out the benefits of &lt;a 
href="https://github.com/apache/datafusion/issues/14373"&gt;easy TPC-H
 dataset creation&lt;/a&gt;  and &lt;a 
href="https://github.com/apache/datafusion/issues/14608#issuecomment-2651044600"&gt;suggesting
 we check out a Java port on February 11,
 2025&lt;/a&gt;. Achraf made &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/53d3402680422a15349ece0a7ea3c3f001018ba0"&gt;first
 commit a few days later&lt;/a&gt; on February 16, and &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/9bb386a4c55b8cf93ffac1b98f29b5da990ee79e"&gt;Andrew
 and Sean started helping on March 8, 2025&lt;/a&gt; and we &lt;a 
href="https://crates.io/crates/tpchgen/0.1.0"&gt;released version 0.1&lt;/a&gt; 
on
@@ -391,7 +391,7 @@ However, as described above, the TBL is annoying to work 
with, because&lt;/p&gt;
 Datalayers for finding and &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/issues/73"&gt;fixing&lt;/a&gt; 
&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/issues/65"&gt;bugs&lt;/a&gt;) 
which performs at the same
 speed as TBL. While CSV files are far more standard than TBL, they must still 
be
 parsed prior to load and automatic type inference may not deduce the types
-needed for the TPCH benchmarks (e.g. floating point vs Decimal).&lt;/p&gt;
+needed for the TPC-H benchmarks (e.g. floating point vs Decimal).&lt;/p&gt;
 &lt;p&gt;What would be far more useful is a typed, efficient columnar format 
such as
 Apache Parquet which is supported by all modern query engines. So we &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/71"&gt;made&lt;/a&gt; a
 &lt;a href="https://crates.io/crates/tpchgen-arrow"&gt;tpchgen-arrow&lt;/a&gt; 
crate to create &lt;a href="https://arrow.apache.org/"&gt;Apache 
Arrow&lt;/a&gt; arrays directly and then &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/61"&gt;a small
@@ -449,7 +449,7 @@ target/release/tpchgen-cli -s $SCALE_FACTOR 
--format=parquet&lt;/code&gt;&lt;/p&
 &lt;h1&gt;Conclusion 👊🎤&lt;/h1&gt;
 &lt;p&gt;With a few days, some fellow database nerds, and the power of Rust we 
made
 something 10x better than currently exists. We hope it inspires more research
-into analytical systems using the TPCH dataset and that people build awesome
+into analytical systems using the TPC-H dataset and that people build awesome
 things with it. For example, Sean has already added &lt;a 
href="https://github.com/GlareDB/glaredb/pull/3549"&gt;on-demand generation of
 tables to GlareDB&lt;/a&gt;. Please consider joining us and helping out at
 &lt;a 
href="https://github.com/clflushopt/tpchgen-rs"&gt;https://github.com/clflushopt/tpchgen-rs&lt;/a&gt;.&lt;/p&gt;
diff --git a/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.rss.xml 
b/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.rss.xml
index 1f08dc0..7e99eb9 100644
--- a/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.rss.xml
+++ b/blog/feeds/andrew-lamb-achraf-b-and-sean-smith.rss.xml
@@ -1,5 +1,5 @@
 <?xml version="1.0" encoding="utf-8"?>
-<rss version="2.0"><channel><title>Apache DataFusion Blog - Andrew Lamb, 
Achraf B, and Sean 
Smith</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Thu,
 10 Apr 2025 00:00:00 +0000</lastBuildDate><item><title>`tpchgen-rs` World’s 
fastest open source TPCH data generator, written in 
Rust</title><link>https://datafusion.apache.org/blog/2025/04/10/fastest-tpch-generator</link><description>&lt;!--
+<rss version="2.0"><channel><title>Apache DataFusion Blog - Andrew Lamb, 
Achraf B, and Sean 
Smith</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Thu,
 10 Apr 2025 00:00:00 +0000</lastBuildDate><item><title>tpchgen-rs World’s 
fastest open source TPC-H data generator, written in 
Rust</title><link>https://datafusion.apache.org/blog/2025/04/10/fastest-tpch-generator</link><description>&lt;!--
 {% comment %}
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
@@ -26,6 +26,6 @@ th, td {
 }
 &lt;/style&gt;
 &lt;p&gt;3 members of the &lt;a 
href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; community 
used Rust and open source
-development to build &lt;a 
href="https://github.com/alamb/tpchgen-rs"&gt;tpchgen-rs&lt;/a&gt;, a fully 
open TPCH data generator over 10x
+development to build &lt;a 
href="https://github.com/alamb/tpchgen-rs"&gt;tpchgen-rs&lt;/a&gt;, a fully 
open TPC-H data generator over 10x
 faster than any other implementation  we know of.&lt;/p&gt;
 &lt;p&gt;It is now …&lt;/p&gt;</description><dc:creator 
xmlns:dc="http://purl.org/dc/elements/1.1/";>Andrew Lamb, Achraf B, and Sean 
Smith</dc:creator><pubDate>Thu, 10 Apr 2025 00:00:00 +0000</pubDate><guid 
isPermaLink="false">tag:datafusion.apache.org,2025-04-10:/blog/2025/04/10/fastest-tpch-generator</guid><category>blog</category></item></channel></rss>
\ No newline at end of file
diff --git a/blog/feeds/blog.atom.xml b/blog/feeds/blog.atom.xml
index 81852fe..3fbdc65 100644
--- a/blog/feeds/blog.atom.xml
+++ b/blog/feeds/blog.atom.xml
@@ -1,5 +1,5 @@
 <?xml version="1.0" encoding="utf-8"?>
-<feed xmlns="http://www.w3.org/2005/Atom";><title>Apache DataFusion Blog - 
blog</title><link href="https://datafusion.apache.org/blog/"; 
rel="alternate"></link><link 
href="https://datafusion.apache.org/blog/feeds/blog.atom.xml"; 
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-04-10T00:00:00+00:00</updated><subtitle></subtitle><entry><title>`tpchgen-rs`
 World’s fastest open source TPCH data generator, written in Rust</title><link 
href="https://datafusion.apache.or [...]
+<feed xmlns="http://www.w3.org/2005/Atom";><title>Apache DataFusion Blog - 
blog</title><link href="https://datafusion.apache.org/blog/"; 
rel="alternate"></link><link 
href="https://datafusion.apache.org/blog/feeds/blog.atom.xml"; 
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-04-10T00:00:00+00:00</updated><subtitle></subtitle><entry><title>tpchgen-rs
 World’s fastest open source TPC-H data generator, written in Rust</title><link 
href="https://datafusion.apache.org [...]
 {% comment %}
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
@@ -26,7 +26,7 @@ th, td {
 }
 &lt;/style&gt;
 &lt;p&gt;3 members of the &lt;a 
href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; community 
used Rust and open source
-development to build &lt;a 
href="https://github.com/alamb/tpchgen-rs"&gt;tpchgen-rs&lt;/a&gt;, a fully 
open TPCH data generator over 10x
+development to build &lt;a 
href="https://github.com/alamb/tpchgen-rs"&gt;tpchgen-rs&lt;/a&gt;, a fully 
open TPC-H data generator over 10x
 faster than any other implementation  we know of.&lt;/p&gt;
 &lt;p&gt;It is now …&lt;/p&gt;</summary><content type="html">&lt;!--
 {% comment %}
@@ -55,16 +55,16 @@ th, td {
 }
 &lt;/style&gt;
 &lt;p&gt;3 members of the &lt;a 
href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; community 
used Rust and open source
-development to build &lt;a 
href="https://github.com/alamb/tpchgen-rs"&gt;tpchgen-rs&lt;/a&gt;, a fully 
open TPCH data generator over 10x
+development to build &lt;a 
href="https://github.com/alamb/tpchgen-rs"&gt;tpchgen-rs&lt;/a&gt;, a fully 
open TPC-H data generator over 10x
 faster than any other implementation  we know of.&lt;/p&gt;
-&lt;p&gt;It is now possible to create the TPCH SF=100 dataset in 72.23 seconds 
(1.4 GB/s
+&lt;p&gt;It is now possible to create the TPC-H SF=100 dataset in 72.23 
seconds (1.4 GB/s
 😎) on a Macbook Air M3 with 16GB of memory, compared to the classic 
&lt;code&gt;dbgen&lt;/code&gt;
 which takes 30 minutes[^1] (0.05GB/sec). On the same machine, it takes less 
than
 2 minutes to create all 3.6 GB of SF=100 in &lt;a 
href="https://parquet.apache.org/"&gt;Apache Parquet&lt;/a&gt; format.&lt;/p&gt;
-&lt;p&gt;Finally, it is convenient and efficient to run TPCH queries locally 
when testing
+&lt;p&gt;Finally, it is convenient and efficient to run TPC-H queries locally 
when testing
 analytical engines such as DataFusion.&lt;/p&gt;
-&lt;p&gt;&lt;img alt="Time to create TPCH parquet dataset for Scale Factor  1, 
10, 100 and 1000" class="img-responsive" 
src="/blog/images/fastest-tpch-generator/parquet-performance.png" 
width="80%"/&gt;&lt;/p&gt;
-&lt;p&gt;&lt;strong&gt;Figure 1&lt;/strong&gt;: Time to create TPCH dataset 
for Scale Factor (see below) 1, 10,
+&lt;p&gt;&lt;img alt="Time to create TPC-H parquet dataset for Scale Factor  
1, 10, 100 and 1000" class="img-responsive" 
src="/blog/images/fastest-tpch-generator/parquet-performance.png" 
width="80%"/&gt;&lt;/p&gt;
+&lt;p&gt;&lt;strong&gt;Figure 1&lt;/strong&gt;: Time to create TPC-H dataset 
for Scale Factor (see below) 1, 10,
 100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core 
GCP
 VM. For Scale Factor(SF) 100 &lt;code&gt;tpchgen&lt;/code&gt; takes 1 minute 
and 14 seconds and
 &lt;a href="https://duckdb.org"&gt;DuckDB&lt;/a&gt; takes 17 minutes and 48 
seconds. For SF=1000, &lt;code&gt;tpchgen&lt;/code&gt; takes 10
@@ -72,9 +72,9 @@ minutes and 26 and uses about 5 GB of RAM at peak, and we 
could not measure
 DuckDB&amp;rsquo;s time as it &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator"&gt;requires
 647 GB of RAM&lt;/a&gt;, more than the 88 GB that was
 available on our test machine. The testing methodology is in the
 &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md"&gt;documentation&lt;/a&gt;.&lt;/p&gt;
-&lt;p&gt;This blog explains what TPCH is, how we ported the vintage C data 
generator to
+&lt;p&gt;This blog explains what TPC-H is, how we ported the vintage C data 
generator to
 Rust (yes, &lt;a 
href="https://www.reddit.com/r/rust/comments/4ri2gn/riir_rewrite_it_in_rust/"&gt;RWIR&lt;/a&gt;)
 and optimized its performance over the course of a few weeks
-of part-time work. We began this project so we can easily generate TPCH data in
+of part-time work. We began this project so we can easily generate TPC-H data 
in
 &lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; and 
&lt;a href="https://glaredb.com/"&gt;GlareDB&lt;/a&gt;.&lt;/p&gt;
 &lt;h1&gt;Try if for yourself&lt;/h1&gt;
 &lt;p&gt;The tool is entirely open source under the &lt;a 
href="https://www.apache.org/licenses/LICENSE-2.0"&gt;Apache 2.0 
license&lt;/a&gt;. Visit the &lt;a 
href="https://github.com/clflushopt/tpchgen-rs"&gt;tpchgen-rs 
repository&lt;/a&gt; or try it for yourself by run the following commands after 
&lt;a href="https://www.rust-lang.org/tools/install"&gt;installing 
Rust&lt;/a&gt;:&lt;/p&gt;
@@ -86,22 +86,22 @@ $ tpchgen-cli -s &lt;span class="m"&gt;1&lt;/span&gt;
 &lt;span class="c1"&gt;# create SF=10 in Parquet&lt;/span&gt;
 $ tpchgen-cli -s &lt;span class="m"&gt;10&lt;/span&gt; --format&lt;span 
class="o"&gt;=&lt;/span&gt;parquet
 &lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
-&lt;h1&gt;What is TPCH / dbgen?&lt;/h1&gt;
-&lt;p&gt;The popular &lt;a href="https://www.tpc.org/tpch/"&gt;TPC-H&lt;/a&gt; 
benchmark (commonly referred to as TPCH) helps evaluate the
+&lt;h1&gt;What is TPC-H / dbgen?&lt;/h1&gt;
+&lt;p&gt;The popular &lt;a href="https://www.tpc.org/tpch/"&gt;TPC-H&lt;/a&gt; 
benchmark (often referred to as TPCH) helps evaluate the
 performance of database systems on &lt;a 
href="https://en.wikipedia.org/wiki/Online_analytical_processing"&gt;OLAP&lt;/a&gt;
 queries&lt;em&gt;, &lt;/em&gt;the kind used to build BI
 dashboards.&lt;/p&gt;
-&lt;p&gt;TPCH has become a de facto standard for analytic systems. While there 
are &lt;a href="https://www.vldb.org/pvldb/vol9/p204-leis.pdf"&gt;well
+&lt;p&gt;TPC-H has become a de facto standard for analytic systems. While 
there are &lt;a href="https://www.vldb.org/pvldb/vol9/p204-leis.pdf"&gt;well
 known&lt;/a&gt; limitations as the data and queries do not well represent many 
real world
 use cases, the majority of analytic database papers and industrial systems 
still
-use TPCH query performance benchmarks as a baseline. You will inevitably find
+use TPC-H query performance benchmarks as a baseline. You will inevitably find
 multiple results for  &amp;ldquo;&lt;code&gt;TPCH Performance &amp;amp;lt;your 
favorite database&amp;gt;&lt;/code&gt;&amp;rdquo; in any
 search engine.&lt;/p&gt;
 &lt;p&gt;The benchmark was created at a time when access to high performance 
analytical
 systems was not widespread, so the &lt;a 
href="https://www.tpc.org/"&gt;Transaction Processing Performance 
Council&lt;/a&gt;
 defined a process of formal result verification. More recently, given the broad
 availability of free and open source database systems, it is common for users 
to
-run and verify TPCH performance themselves.&lt;/p&gt;
-&lt;p&gt;TPCH simulates a business environment with eight tables: 
&lt;code&gt;REGION&lt;/code&gt;, &lt;code&gt;NATION&lt;/code&gt;,
+run and verify TPC-H performance themselves.&lt;/p&gt;
+&lt;p&gt;TPC-H simulates a business environment with eight tables: 
&lt;code&gt;REGION&lt;/code&gt;, &lt;code&gt;NATION&lt;/code&gt;,
 &lt;code&gt;SUPPLIER&lt;/code&gt;, &lt;code&gt;CUSTOMER&lt;/code&gt;, 
&lt;code&gt;PART&lt;/code&gt;, &lt;code&gt;PARTSUPP&lt;/code&gt;, 
&lt;code&gt;ORDERS&lt;/code&gt;, and &lt;code&gt;LINEITEM&lt;/code&gt;. These
 tables are linked by foreign keys in a normalized schema representing a supply
 chain with parts, suppliers, customers and orders. The benchmark itself is 22
@@ -169,10 +169,10 @@ bound on the Scale Factor.&lt;/p&gt;
    &lt;/td&gt;
 &lt;/tr&gt;
 &lt;/table&gt;
-&lt;p&gt;&lt;strong&gt;Table 1&lt;/strong&gt;: TPCH data set sizes at 
different scale factors for both TBL and &lt;a 
href="https://parquet.apache.org/"&gt;Apache Parquet&lt;/a&gt;.&lt;/p&gt;
-&lt;h1&gt;Why do we need a new TPCH Data generator?&lt;/h1&gt;
-&lt;p&gt;Despite the known limitations of the TPCH benchmark, it is so well 
known that it
-is used frequently in database performance analysis. To run TPCH, you must 
first
+&lt;p&gt;&lt;strong&gt;Table 1&lt;/strong&gt;: TPC-H data set sizes at 
different scale factors for both TBL and &lt;a 
href="https://parquet.apache.org/"&gt;Apache Parquet&lt;/a&gt;.&lt;/p&gt;
+&lt;h1&gt;Why do we need a new TPC-H Data generator?&lt;/h1&gt;
+&lt;p&gt;Despite the known limitations of the TPC-H benchmark, it is so well 
known that it
+is used frequently in database performance analysis. To run TPC-H, you must 
first
 load the data, using &lt;code&gt;dbgen&lt;/code&gt;, which is not ideal for 
several reasons:&lt;/p&gt;
 &lt;ol&gt;
 &lt;li&gt;You must find and compile a copy of the 15+ year old C program (for 
example &lt;a 
href="https://github.com/electrum/tpch-dbgen"&gt;electrum/tpch-dbgen&lt;/a&gt;)&lt;/li&gt;
@@ -180,21 +180,21 @@ load the data, using &lt;code&gt;dbgen&lt;/code&gt;, 
which is not ideal for seve
 &lt;li&gt;It outputs TBL format, which typically requires loading into your 
database (for example, &lt;a 
href="https://github.com/apache/datafusion/blob/507f6b6773deac69dd9d90dbe60831f5ea5abed1/datafusion/sqllogictest/test_files/tpch/create_tables.slt.part#L24-L124"&gt;here
 is how to do so&lt;/a&gt; in Apache DataFusion) prior to query.&lt;/li&gt;
 &lt;li&gt;The implementation makes substantial assumptions about the operating 
environment, making it difficult to extend or embed into other 
systems.[^2]&lt;/li&gt;
 &lt;/ol&gt;
-&lt;p&gt;&lt;img alt="Time to generate TPCH data in TBL format" 
class="img-responsive" 
src="/blog/images/fastest-tpch-generator/tbl-performance.png" 
width="80%"/&gt;&lt;/p&gt;
-&lt;p&gt;&lt;strong&gt;Figure 3&lt;/strong&gt;: Time to generate TPCH data in 
TBL format. The default &lt;code&gt;tpchgen&lt;/code&gt; is
+&lt;p&gt;&lt;img alt="Time to generate TPC-H data in TBL format" 
class="img-responsive" 
src="/blog/images/fastest-tpch-generator/tbl-performance.png" 
width="80%"/&gt;&lt;/p&gt;
+&lt;p&gt;&lt;strong&gt;Figure 3&lt;/strong&gt;: Time to generate TPC-H data in 
TBL format. The default &lt;code&gt;tpchgen&lt;/code&gt; is
 shown in blue. &lt;code&gt;tpchgen&lt;/code&gt; restricted to a single core is 
shown in red. Unmodified
 &lt;code&gt;dbgen&lt;/code&gt; is shown in green and 
&lt;code&gt;dbgen&lt;/code&gt; modified to use &lt;code&gt;-O3&lt;/code&gt; 
optimization level
 is shown in yellow.&lt;/p&gt;
 &lt;p&gt;&lt;code&gt;dbgen&lt;/code&gt; is so inconvenient and takes so long 
that vendors often provide
-preloaded TPCH data, for example [Snowflake Sample Data], &lt;a 
href="https://docs.databricks.com/aws/en/discover/databricks-datasets"&gt;DataBricks
 Sample
+preloaded TPC-H data, for example [Snowflake Sample Data], &lt;a 
href="https://docs.databricks.com/aws/en/discover/databricks-datasets"&gt;DataBricks
 Sample
 datasets&lt;/a&gt; and &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets"&gt;DuckDB
 Pre-Generated Data Sets&lt;/a&gt;.&lt;/p&gt;
-&lt;p&gt;In addition to pre-generated datasets, DuckDB also provides a &lt;a 
href="https://duckdb.org/docs/stable/extensions/tpch.html"&gt;TPCH 
extension&lt;/a&gt; 
-for generating TPCH datasets within DuckDB. This is so much easier to use than
+&lt;p&gt;In addition to pre-generated datasets, DuckDB also provides a [TPCH 
extension] 
+for generating TPC-H datasets within DuckDB. This is so much easier to use than
 the current alternatives that it leads many researchers and other thought
 leaders to use DuckDB to evaluate new ideas. For example, &lt;a 
href="https://github.com/lmwnshn"&gt;Wan Shen
-Lim&lt;/a&gt;explicitly &lt;a 
href="https://github.com/apache/datafusion/issues/14373"&gt;mentioned the ease 
of creating the TPCH dataset&lt;/a&gt; as one reason
+Lim&lt;/a&gt;explicitly &lt;a 
href="https://github.com/apache/datafusion/issues/14373"&gt;mentioned the ease 
of creating the TPC-H dataset&lt;/a&gt; as one reason
 the first student project of &lt;a 
href="https://15799.courses.cs.cmu.edu/spring2025/"&gt;CMU-799 Spring 
2025&lt;/a&gt; used DuckDB.&lt;/p&gt;
-&lt;p&gt;As beneficial as the DuckDB TPCH extension is, it is non-ideal for 
several reasons:&lt;/p&gt;
+&lt;p&gt;As beneficial as the DuckDB TPC-H extension is, it is non-ideal for 
several reasons:&lt;/p&gt;
 &lt;ol&gt;
 &lt;li&gt;Creates data in a proprietary format, which requires export to use 
in other systems.&lt;/li&gt;
 &lt;li&gt;Requires significant time (e.g. 17 minutes for Scale Factor 
10).&lt;/li&gt;
@@ -220,7 +220,7 @@ MiDAS Fall 2024 (Data Systems Seminar)&lt;/a&gt;, &lt;a 
href="https://www.youtub
 thanks to &lt;a 
href="https://x.com/KurtFehlhauer"&gt;@KurtFehlhauer&lt;/a&gt;&lt;/p&gt;
 &lt;h1&gt;How: The Journey&lt;/h1&gt;
 &lt;p&gt;We did it together as a team in the open over the course of a few 
weeks.
-&lt;a href="https://github.com/lmwnshn"&gt;Wan Shen Lim&lt;/a&gt; inspired the 
project by pointing out the benefits of &lt;a 
href="https://github.com/apache/datafusion/issues/14373"&gt;easy TPCH
+&lt;a href="https://github.com/lmwnshn"&gt;Wan Shen Lim&lt;/a&gt; inspired the 
project by pointing out the benefits of &lt;a 
href="https://github.com/apache/datafusion/issues/14373"&gt;easy TPC-H
 dataset creation&lt;/a&gt;  and &lt;a 
href="https://github.com/apache/datafusion/issues/14608#issuecomment-2651044600"&gt;suggesting
 we check out a Java port on February 11,
 2025&lt;/a&gt;. Achraf made &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/53d3402680422a15349ece0a7ea3c3f001018ba0"&gt;first
 commit a few days later&lt;/a&gt; on February 16, and &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/commit/9bb386a4c55b8cf93ffac1b98f29b5da990ee79e"&gt;Andrew
 and Sean started helping on March 8, 2025&lt;/a&gt; and we &lt;a 
href="https://crates.io/crates/tpchgen/0.1.0"&gt;released version 0.1&lt;/a&gt; 
on
@@ -391,7 +391,7 @@ However, as described above, the TBL is annoying to work 
with, because&lt;/p&gt;
 Datalayers for finding and &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/issues/73"&gt;fixing&lt;/a&gt; 
&lt;a 
href="https://github.com/clflushopt/tpchgen-rs/issues/65"&gt;bugs&lt;/a&gt;) 
which performs at the same
 speed as TBL. While CSV files are far more standard than TBL, they must still 
be
 parsed prior to load and automatic type inference may not deduce the types
-needed for the TPCH benchmarks (e.g. floating point vs Decimal).&lt;/p&gt;
+needed for the TPC-H benchmarks (e.g. floating point vs Decimal).&lt;/p&gt;
 &lt;p&gt;What would be far more useful is a typed, efficient columnar format 
such as
 Apache Parquet which is supported by all modern query engines. So we &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/71"&gt;made&lt;/a&gt; a
 &lt;a href="https://crates.io/crates/tpchgen-arrow"&gt;tpchgen-arrow&lt;/a&gt; 
crate to create &lt;a href="https://arrow.apache.org/"&gt;Apache 
Arrow&lt;/a&gt; arrays directly and then &lt;a 
href="https://github.com/clflushopt/tpchgen-rs/pull/61"&gt;a small
@@ -449,7 +449,7 @@ target/release/tpchgen-cli -s $SCALE_FACTOR 
--format=parquet&lt;/code&gt;&lt;/p&
 &lt;h1&gt;Conclusion 👊🎤&lt;/h1&gt;
 &lt;p&gt;With a few days, some fellow database nerds, and the power of Rust we 
made
 something 10x better than currently exists. We hope it inspires more research
-into analytical systems using the TPCH dataset and that people build awesome
+into analytical systems using the TPC-H dataset and that people build awesome
 things with it. For example, Sean has already added &lt;a 
href="https://github.com/GlareDB/glaredb/pull/3549"&gt;on-demand generation of
 tables to GlareDB&lt;/a&gt;. Please consider joining us and helping out at
 &lt;a 
href="https://github.com/clflushopt/tpchgen-rs"&gt;https://github.com/clflushopt/tpchgen-rs&lt;/a&gt;.&lt;/p&gt;
diff --git a/blog/index.html b/blog/index.html
index 0c8d237..a6abc75 100644
--- a/blog/index.html
+++ b/blog/index.html
@@ -50,7 +50,7 @@
             <article class="post">
                 <header>
                     <div class="title">
-                        <h1><a 
href="/blog/2025/04/10/fastest-tpch-generator">`tpchgen-rs` World’s fastest 
open source TPCH data generator, written in Rust</a></h1>
+                        <h1><a 
href="/blog/2025/04/10/fastest-tpch-generator">tpchgen-rs World’s fastest open 
source TPC-H data generator, written in Rust</a></h1>
                         <p>Posted on: Thu 10 April 2025 by Andrew Lamb, Achraf 
B, and Sean Smith</p>
                         <p><!--
 {% comment %}
@@ -79,7 +79,7 @@ th, td {
 }
 </style>
 <p>3 members of the <a href="https://datafusion.apache.org/";>Apache 
DataFusion</a> community used Rust and open source
-development to build <a 
href="https://github.com/alamb/tpchgen-rs";>tpchgen-rs</a>, a fully open TPCH 
data generator over 10x
+development to build <a 
href="https://github.com/alamb/tpchgen-rs";>tpchgen-rs</a>, a fully open TPC-H 
data generator over 10x
 faster than any other implementation  we know of.</p>
 <p>It is now …</p></p>
                         <footer>


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@datafusion.apache.org
For additional commands, e-mail: commits-h...@datafusion.apache.org

(datafusion-site) branch asf-staging updated: Commit build products

Reply via email to