This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a commit to branch asf-staging
in repository https://gitbox.apache.org/repos/asf/datafusion-site.git
The following commit(s) were added to refs/heads/asf-staging by this push:
     new 866d083  Commit build products
866d083 is described below

commit 866d083cb14d9b342405be5ceb90a8e6aab6c792
Author: Build Pelican (action) <priv...@infra.apache.org>
AuthorDate: Sun Apr 6 10:51:25 2025 +0000

    Commit build products
---
 blog/2025/04/10/fastest-tpch-generator/index.html       | 10 +++++-----
 blog/feeds/all-en.atom.xml                              | 10 +++++-----
 blog/feeds/andrew-lamb-achraf-b-and-sean-smith.atom.xml | 10 +++++-----
 blog/feeds/blog.atom.xml                                | 10 +++++-----
 4 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/blog/2025/04/10/fastest-tpch-generator/index.html b/blog/2025/04/10/fastest-tpch-generator/index.html
index e8b1c16..c4f8c22 100644
--- a/blog/2025/04/10/fastest-tpch-generator/index.html
+++ b/blog/2025/04/10/fastest-tpch-generator/index.html
@@ -75,13 +75,13 @@ faster than any other implementation we know of.</p>
 <p>It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4
 GB/s 😎) on a Macbook Air M3 with 16GB of memory, compared to the classic
 <code>dbgen</code> which takes 30 minutes<sup>1</sup> (0.05GB/sec). On the same machine, it takes less than
-2 minutes to create all 3.6 GB of SF=100 in <a href="https://parquet.apache.org/">Apache Parquet</a> format.
+2 minutes to create all 3.6 GB of SF=100 in <a href="https://parquet.apache.org/">Apache Parquet</a> format, which takes 44 minutes using <a href="https://duckdb.org">DuckDB</a>.
 It is finally convenient and efficient to run TPC-H queries locally when
 testing analytical engines such as DataFusion.</p>
 <p><img alt="Time to create TPC-H parquet dataset for Scale Factor 1, 10, 100 and 1000" class="img-responsive" src="/blog/images/fastest-tpch-generator/parquet-performance.png" width="80%"/></p>
 <p><strong>Figure 1</strong>: Time to create TPC-H dataset for Scale Factor (see below) 1, 10,
 100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core GCP
-VM. For Scale Factor(SF) 100 <code>tpchgen</code> takes 1 minute and 14 seconds and
+VM with 88GB of memory. For Scale Factor(SF) 100 <code>tpchgen</code> takes 1 minute and 14 seconds and
 <a href="https://duckdb.org">DuckDB</a> takes 17 minutes and 48 seconds. For SF=1000, <code>tpchgen</code>
 takes 10 minutes and 26 and uses about 5 GB of RAM at peak, and we could not
 measure DuckDB’s time as it <a href="https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator">requires 647 GB of RAM</a>, more than the 88 GB that was
@@ -196,14 +196,14 @@ load the data, using <code>dbgen</code>, which is not ideal for several reasons:
 <li>The implementation makes substantial assumptions about the operating environment, making it difficult to extend or embed into other systems.<sup>2</sup></li>
 </ol>
 <p><img alt="Time to generate TPC-H data in TBL format" class="img-responsive" src="/blog/images/fastest-tpch-generator/tbl-performance.png" width="80%"/></p>
-<p><strong>Figure 3</strong>: Time to generate TPC-H data in TBL format. The default <code>tpchgen</code> is
+<p><strong>Figure 3</strong>: Time to generate TPC-H data in TBL format. <code>tpchgen</code> is
 shown in blue. <code>tpchgen</code> restricted to a single core is shown in red.
 Unmodified <code>dbgen</code> is shown in green and <code>dbgen</code> modified to use <code>-O3</code>
 optimization level is shown in yellow.</p>
 <p><code>dbgen</code> is so inconvenient and takes so long that vendors often provide
-preloaded TPC-H data, for example <a href="https://docs.snowflake.com/en/user-guide/sample-data-tpch">Snowflake Sample Data</a>, <a href="https://docs.databricks.com/aws/en/discover/databricks-datasets">DataBricks Sample
+preloaded TPC-H data, for example <a href="https://docs.snowflake.com/en/user-guide/sample-data-tpch">Snowflake Sample Data</a>, <a href="https://docs.databricks.com/aws/en/discover/databricks-datasets">Databricks Sample
 datasets</a> and <a href="https://duckdb.org/docs/stable/extensions/tpch.html#pre-generated-data-sets">DuckDB Pre-Generated Data Sets</a>.</p>
-<p>In addition to pre-generated datasets, DuckDB also provides a [TPCH extension]
+<p>In addition to pre-generated datasets, DuckDB also provides a <a href="https://duckdb.org/docs/stable/extensions/tpch.html">TPC-H extension</a>
 for generating TPC-H datasets within DuckDB. This is so much easier to use than
 the current alternatives that it leads many researchers and other thought
 leaders to use DuckDB to evaluate new ideas. For example, <a href="https://github.com/lmwnshn">Wan Shen
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@datafusion.apache.org
For additional commands, e-mail: commits-h...@datafusion.apache.org