[hudi] branch asf-site updated: Travis CI build asf-site

vinoth Tue, 01 Sep 2020 12:29:57 -0700

This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git



The following commit(s) were added to refs/heads/asf-site by this push:
     new e109837  Travis CI build asf-site
e109837 is described below

commit e109837f4312ed191cadbbfb5006a4e0f0a92cca
Author: CI <ci...@hudi.apache.org>
AuthorDate: Tue Sep 1 19:29:27 2020 +0000

    Travis CI build asf-site
---
 content/activity.html                              |  48 +++
 .../assets/images/blog/2020-08-20-per-record.png   | Bin 0 -> 5762 bytes
 content/assets/images/blog/2020-08-20-skeleton.png | Bin 0 -> 25778 bytes
 content/assets/js/lunr/lunr-store.js               |  10 +
 content/blog.html                                  |  48 +++
 .../async-compaction-deployment-model/index.html   | 337 +++++++++++++++
 .../index.html                                     | 464 +++++++++++++++++++++
 content/cn/activity.html                           |  48 +++
 content/sitemap.xml                                |   8 +
 9 files changed, 963 insertions(+)

diff --git a/content/activity.html b/content/activity.html
index 46e0160..9a3470c 100644
--- a/content/activity.html
+++ b/content/activity.html
@@ -191,6 +191,54 @@
     
     <h2 class="archive__item-title" itemprop="headline">
       
+        <a href="/blog/async-compaction-deployment-model/" 
rel="permalink">Async Compaction Deployment Models
+</a>
+      
+    </h2>
+    <!-- Look the author details up from the site config. -->
+    
+    <!-- Output author details if some exist. -->
+    <div class="archive__item-meta"><a 
href="https://cwiki.apache.org/confluence/display/~vbalaji">Balaji 
Varadarajan</a> posted on <time datetime="2020-08-21">August 21, 
2020</time></div>
+ 
+    <p class="archive__item-excerpt" itemprop="description">Mechanisms for 
executing compaction jobs in Hudi asynchronously
+</p>
+  </article>
+</div>
+
+        
+        
+
+
+
+<div class="list__item">
+  <article class="archive__item" itemscope 
itemtype="https://schema.org/CreativeWork">
+    
+    <h2 class="archive__item-title" itemprop="headline">
+      
+        <a href="/blog/efficient-migration-of-large-parquet-tables/" 
rel="permalink">Efficient Migration of Large Parquet Tables to Apache Hudi
+</a>
+      
+    </h2>
+    <!-- Look the author details up from the site config. -->
+    
+    <!-- Output author details if some exist. -->
+    <div class="archive__item-meta"><a 
href="https://cwiki.apache.org/confluence/display/~vbalaji">Balaji 
Varadarajan</a> posted on <time datetime="2020-08-20">August 20, 
2020</time></div>
+ 
+    <p class="archive__item-excerpt" itemprop="description">Migrating a large 
parquet table to Apache Hudi without having to rewrite the entire dataset.
+</p>
+  </article>
+</div>
+
+        
+        
+
+
+
+<div class="list__item">
+  <article class="archive__item" itemscope 
itemtype="https://schema.org/CreativeWork">
+    
+    <h2 class="archive__item-title" itemprop="headline">
+      
         <a href="/blog/hudi-incremental-processing-on-data-lakes/" 
rel="permalink">Incremental Processing on the Data Lake
 </a>
       
diff --git a/content/assets/images/blog/2020-08-20-per-record.png 
b/content/assets/images/blog/2020-08-20-per-record.png
new file mode 100644
index 0000000..aeb56fc
Binary files /dev/null and 
b/content/assets/images/blog/2020-08-20-per-record.png differ
diff --git a/content/assets/images/blog/2020-08-20-skeleton.png 
b/content/assets/images/blog/2020-08-20-skeleton.png
new file mode 100644
index 0000000..66eb052
Binary files /dev/null and b/content/assets/images/blog/2020-08-20-skeleton.png 
differ
diff --git a/content/assets/js/lunr/lunr-store.js 
b/content/assets/js/lunr/lunr-store.js
index 0a94f11..33a9edc 100644
--- a/content/assets/js/lunr/lunr-store.js
+++ b/content/assets/js/lunr/lunr-store.js
@@ -1163,4 +1163,14 @@ var store = [{
         "excerpt":"NOTE: This article is a translation of the infoq.cn 
article, found here, with minor edits Apache Hudi is a data lake framework 
which provides the ability to ingest, manage and query large analytical data 
sets on a distributed file system/cloud stores. Hudi joined the Apache 
incubator for incubation in January...","categories": ["blog"],
         "tags": [],
         "url": 
"https://hudi.apache.org/blog/hudi-incremental-processing-on-data-lakes/",
+        "teaser":"https://hudi.apache.org/assets/images/500x300.png"},{
+        "title": "Efficient Migration of Large Parquet Tables to Apache Hudi",
+        "excerpt":"We will look at how to migrate a large parquet table to 
Hudi without having to rewrite the entire dataset. Motivation: Apache Hudi 
maintains per record metadata to perform core operations such as upserts and 
incremental pull. To take advantage of Hudi’s upsert and incremental processing 
support, users would need...","categories": ["blog"],
+        "tags": [],
+        "url": 
"https://hudi.apache.org/blog/efficient-migration-of-large-parquet-tables/",
+        "teaser":"https://hudi.apache.org/assets/images/500x300.png"},{
+        "title": "Async Compaction Deployment Models",
+        "excerpt":"We will look at different deployment models for executing 
compactions asynchronously. Compaction For Merge-On-Read table, data is stored 
using a combination of columnar (e.g parquet) + row based (e.g avro) file 
formats. Updates are logged to delta files &amp; later compacted to produce new 
versions of columnar files synchronously or...","categories": ["blog"],
+        "tags": [],
+        "url": 
"https://hudi.apache.org/blog/async-compaction-deployment-model/",
         "teaser":"https://hudi.apache.org/assets/images/500x300.png"},]
diff --git a/content/blog.html b/content/blog.html
index db59c19..c9b1eec 100644
--- a/content/blog.html
+++ b/content/blog.html
@@ -189,6 +189,54 @@
     
     <h2 class="archive__item-title" itemprop="headline">
       
+        <a href="/blog/async-compaction-deployment-model/" 
rel="permalink">Async Compaction Deployment Models
+</a>
+      
+    </h2>
+    <!-- Look the author details up from the site config. -->
+    
+    <!-- Output author details if some exist. -->
+    <div class="archive__item-meta"><a 
href="https://cwiki.apache.org/confluence/display/~vbalaji">Balaji 
Varadarajan</a> posted on <time datetime="2020-08-21">August 21, 
2020</time></div>
+ 
+    <p class="archive__item-excerpt" itemprop="description">Mechanisms for 
executing compaction jobs in Hudi asynchronously
+</p>
+  </article>
+</div>
+
+        
+        
+
+
+
+<div class="list__item">
+  <article class="archive__item" itemscope 
itemtype="https://schema.org/CreativeWork">
+    
+    <h2 class="archive__item-title" itemprop="headline">
+      
+        <a href="/blog/efficient-migration-of-large-parquet-tables/" 
rel="permalink">Efficient Migration of Large Parquet Tables to Apache Hudi
+</a>
+      
+    </h2>
+    <!-- Look the author details up from the site config. -->
+    
+    <!-- Output author details if some exist. -->
+    <div class="archive__item-meta"><a 
href="https://cwiki.apache.org/confluence/display/~vbalaji">Balaji 
Varadarajan</a> posted on <time datetime="2020-08-20">August 20, 
2020</time></div>
+ 
+    <p class="archive__item-excerpt" itemprop="description">Migrating a large 
parquet table to Apache Hudi without having to rewrite the entire dataset.
+</p>
+  </article>
+</div>
+
+        
+        
+
+
+
+<div class="list__item">
+  <article class="archive__item" itemscope 
itemtype="https://schema.org/CreativeWork">
+    
+    <h2 class="archive__item-title" itemprop="headline">
+      
         <a href="/blog/hudi-incremental-processing-on-data-lakes/" 
rel="permalink">Incremental Processing on the Data Lake
 </a>
       
diff --git a/content/blog/async-compaction-deployment-model/index.html 
b/content/blog/async-compaction-deployment-model/index.html
new file mode 100644
index 0000000..7e98038
--- /dev/null
+++ b/content/blog/async-compaction-deployment-model/index.html
@@ -0,0 +1,337 @@
+<!doctype html>
+<html lang="en" class="no-js">
+  <head>
+    <meta charset="utf-8">
+
+<!-- begin _includes/seo.html --><title>Async Compaction Deployment Models - 
Apache Hudi</title>
+<meta name="description" content="Mechanisms for executing compaction jobs in 
Hudi asynchronously">
+
+<meta property="og:type" content="article">
+<meta property="og:locale" content="en_US">
+<meta property="og:site_name" content="">
+<meta property="og:title" content="Async Compaction Deployment Models">
+<meta property="og:url" 
content="https://hudi.apache.org/blog/async-compaction-deployment-model/">
+
+
+  <meta property="og:description" content="Mechanisms for executing compaction 
jobs in Hudi asynchronously">
+
+
+
+
+
+
+
+
+
+
+
+<!-- end _includes/seo.html -->
+
+
+<!--<link href="/feed.xml" type="application/atom+xml" rel="alternate" title=" 
Feed">-->
+
+<!-- https://t.co/dKP3o1e -->
+<meta name="viewport" content="width=device-width, initial-scale=1.0">
+
+<script>
+  document.documentElement.className = 
document.documentElement.className.replace(/\bno-js\b/g, '') + ' js ';
+</script>
+
+<!-- For all browsers -->
+<link rel="stylesheet" href="/assets/css/main.css">
+
+<!--[if IE]>
+  <style>
+    /* old IE unsupported flexbox fixes */
+    .greedy-nav .site-title {
+      padding-right: 3em;
+    }
+    .greedy-nav button {
+      position: absolute;
+      top: 0;
+      right: 0;
+      height: 100%;
+    }
+  </style>
+<![endif]-->
+
+
+
+<link rel="icon" type="image/x-icon" href="/assets/images/favicon.ico">
+<link rel="stylesheet" href="/assets/css/font-awesome.min.css">
+<script src="/assets/js/jquery.min.js"></script>
+
+    
+<script src="/assets/js/main.min.js"></script>
+
+  </head>
+
+  <body class="layout--single">
+    <!--[if lt IE 9]>
+<div class="notice--danger align-center" style="margin: 0;">You are using an 
<strong>outdated</strong> browser. Please <a 
href="https://browsehappy.com/">upgrade your browser</a> to improve your 
experience.</div>
+<![endif]-->
+
+    <div class="masthead">
+  <div class="masthead__inner-wrap" id="masthead__inner-wrap">
+    <div class="masthead__menu">
+      <nav id="site-nav" class="greedy-nav">
+        
+          <a class="site-logo" href="/">
+              <div style="width: 150px; height: 40px">
+              </div>
+          </a>
+        
+        <a class="site-title" href="/">
+          
+        </a>
+        <ul class="visible-links"><li class="masthead__menu-item">
+              <a href="/docs/quick-start-guide.html" target="_self" 
>Documentation</a>
+            </li><li class="masthead__menu-item">
+              <a href="/community.html" target="_self" >Community</a>
+            </li><li class="masthead__menu-item">
+              <a href="/blog.html" target="_self" >Blog</a>
+            </li><li class="masthead__menu-item">
+              <a href="https://cwiki.apache.org/confluence/display/HUDI/FAQ" 
target="_blank" >FAQ</a>
+            </li><li class="masthead__menu-item">
+              <a href="/releases.html" target="_self" >Releases</a>
+            </li></ul>
+        <button class="greedy-nav__toggle hidden" type="button">
+          <span class="visually-hidden">Toggle menu</span>
+          <div class="navicon"></div>
+        </button>
+        <ul class="hidden-links hidden"></ul>
+      </nav>
+    </div>
+  </div>
+</div>
+<!--
+<p class="notice--warning" style="margin: 0 !important; text-align: center 
!important;"><strong>Note:</strong> This site is work in progress, if you 
notice any issues, please <a target="_blank" 
href="https://github.com/apache/hudi/issues">Report on Issue</a>.
+  Click <a href="/"> here</a> back to old site.</p>
+-->
+
+    <div class="initial-content">
+      <div id="main" role="main">
+  
+
+  <div class="sidebar sticky">
+
+  
+    <div itemscope itemtype="https://schema.org/Person">
+
+  <div class="author__content">
+    
+      <h3 class="author__name" itemprop="name">Quick Links</h3>
+    
+    
+      <div class="author__bio" itemprop="description">
+        <p>Hudi <em>ingests</em> &amp; <em>manages</em> storage of large 
analytical datasets over DFS.</p>
+
+      </div>
+    
+  </div>
+
+  <div class="author__urls-wrapper">
+    <ul class="author__urls social-icons">
+      
+        
+          <li><a href="/docs/quick-start-guide" target="_self" rel="nofollow 
noopener noreferrer"><i class="fa fa-book" aria-hidden="true"></i> 
Documentation</a></li>
+
+          
+        
+          <li><a href="https://cwiki.apache.org/confluence/display/HUDI" 
target="_blank" rel="nofollow noopener noreferrer"><i class="fa fa-wikipedia-w" 
aria-hidden="true"></i> Technical Wiki</a></li>
+
+          
+        
+          <li><a href="/contributing" target="_self" rel="nofollow noopener 
noreferrer"><i class="fa fa-thumbs-o-up" aria-hidden="true"></i> Contribution 
Guide</a></li>
+
+          
+        
+          <li><a 
href="https://join.slack.com/t/apache-hudi/shared_invite/enQtODYyNDAxNzc5MTg2LTE5OTBlYmVhYjM0N2ZhOTJjOWM4YzBmMWU2MjZjMGE4NDc5ZDFiOGQ2N2VkYTVkNzU3ZDQ4OTI1NmFmYWQ0NzE"
 target="_blank" rel="nofollow noopener noreferrer"><i class="fa fa-slack" 
aria-hidden="true"></i> Join on Slack</a></li>
+
+          
+        
+          <li><a href="https://github.com/apache/hudi" target="_blank" 
rel="nofollow noopener noreferrer"><i class="fa fa-github" 
aria-hidden="true"></i> Fork on GitHub</a></li>
+
+          
+        
+          <li><a href="https://issues.apache.org/jira/projects/HUDI/summary" 
target="_blank" rel="nofollow noopener noreferrer"><i class="fa fa-navicon" 
aria-hidden="true"></i> Report Issues</a></li>
+
+          
+        
+          <li><a href="/security" target="_self" rel="nofollow noopener 
noreferrer"><i class="fa fa-navicon" aria-hidden="true"></i> Report Security 
Issues</a></li>
+
+          
+        
+      
+    </ul>
+  </div>
+</div>
+
+  
+
+  
+  </div>
+
+
+  <article class="page" itemscope itemtype="https://schema.org/CreativeWork">
+    <!-- Look the author details up from the site config. -->
+    
+
+    <div class="page__inner-wrap">
+      
+        <header>
+          <h1 id="page-title" class="page__title" itemprop="headline">Async 
Compaction Deployment Models
+</h1>
+          <!-- Output author details if some exist. -->
+          <div class="page__author"><a 
href="https://cwiki.apache.org/confluence/display/~vbalaji">Balaji 
Varadarajan</a> posted on <time datetime="2020-08-21">August 21, 
2020</time></div>
+        </header>
+      
+
+      <section class="page__content" itemprop="text">
+        
+          <style>
+            .page {
+              padding-right: 0 !important;
+            }
+          </style>
+        
+        <p>We will look at different deployment models for executing 
compactions asynchronously.</p>
+
+<h1 id="compaction">Compaction</h1>
+
+<p>For Merge-On-Read tables, data is stored using a combination of columnar 
(e.g. parquet) and row-based (e.g. avro) file formats. 
+Updates are logged to delta files &amp; later compacted to produce new 
versions of columnar files synchronously or 
+asynchronously. One of the main motivations behind Merge-On-Read is to reduce 
data latency when ingesting records.
+Hence, it makes sense to run compaction asynchronously without blocking 
ingestion.</p>
+
+<h1 id="async-compaction">Async Compaction</h1>
+
+<p>Async Compaction is performed in 2 steps:</p>
+
+<ol>
+  <li><strong><em>Compaction Scheduling</em></strong>: This is done by the 
ingestion job. In this step, Hudi scans the partitions and selects <strong>file 
+slices</strong> to be compacted. Finally, a compaction plan is written to the Hudi 
timeline.</li>
+  <li><strong><em>Compaction Execution</em></strong>: A separate process reads 
the compaction plan and performs compaction of file slices.</li>
+</ol>
+
+<h1 id="deployment-models">Deployment Models</h1>
+
+<p>There are a few ways to execute compactions asynchronously.</p>
+
+<h2 id="spark-structured-streaming">Spark Structured Streaming</h2>
+
+<p>With 0.6.0, we now have support for running async compactions in Spark 
+Structured Streaming jobs. Compactions are scheduled and executed 
asynchronously inside the 
+streaming job. Async compactions are enabled by default for structured 
streaming jobs
+on Merge-On-Read tables.</p>
+
+<p>Here is an example snippet in Java:</p>
+
+<div class="language-properties highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="err">import</span> <span 
class="err">org.apache.hudi.DataSourceWriteOptions;</span>
+<span class="err">import</span> <span 
class="err">org.apache.hudi.HoodieDataSourceHelpers;</span>
+<span class="err">import</span> <span 
class="err">org.apache.hudi.config.HoodieCompactionConfig;</span>
+<span class="err">import</span> <span 
class="err">org.apache.hudi.config.HoodieWriteConfig;</span>
+
+<span class="err">import</span> <span 
class="err">org.apache.spark.sql.streaming.OutputMode;</span>
+<span class="err">import</span> <span 
class="err">org.apache.spark.sql.streaming.ProcessingTime;</span>
+
+
+ <span class="err">DataStreamWriter&lt;Row&gt;</span> <span 
class="py">writer</span> <span class="p">=</span> <span 
class="s">streamingInput.writeStream().format("org.apache.hudi")</span>
+        <span 
class="err">.option(DataSourceWriteOptions.OPERATION_OPT_KEY(),</span> <span 
class="err">operationType)</span>
+        <span 
class="err">.option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY(),</span> <span 
class="err">tableType)</span>
+        <span 
class="err">.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(),</span> 
<span class="err">"_row_key")</span>
+        <span 
class="err">.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(),</span>
 <span class="err">"partition")</span>
+        <span 
class="err">.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(),</span> 
<span class="err">"timestamp")</span>
+        <span 
class="err">.option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP,</span>
 <span class="err">"10")</span>
+        <span 
class="err">.option(DataSourceWriteOptions.ASYNC_COMPACT_ENABLE_OPT_KEY(),</span>
 <span class="err">"true")</span>
+        <span class="err">.option(HoodieWriteConfig.TABLE_NAME,</span> <span 
class="err">tableName).option("checkpointLocation",</span> <span 
class="err">checkpointLocation)</span>
+        <span class="err">.outputMode(OutputMode.Append());</span>
+ <span class="err">writer.trigger(new</span> <span 
class="err">ProcessingTime(30000)).start(tablePath);</span>
+</code></pre></div></div>
+
+<h2 id="deltastreamer-continuous-mode">DeltaStreamer Continuous Mode</h2>
+<p>Hudi DeltaStreamer provides a continuous ingestion mode in which a single 
long-running Spark application
+ingests data into a Hudi table continuously from upstream sources. In this mode, 
Hudi supports managing asynchronous 
+compactions. Here is an example snippet for running in continuous mode with 
async compactions:</p>
+
+<div class="language-properties highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="err">spark-submit</span> <span 
class="err">--packages</span> <span class="py">org.apache.hudi</span><span 
class="p">:</span><span class="s">hudi-utilities-bundle_2.11:0.6.0 </span><span 
class="se">\
+</span><span class="s">--class 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer </span><span 
class="se">\
+</span><span class="s">--table-type MERGE_ON_READ </span><span class="se">\
+</span><span class="s">--target-base-path &lt;hudi_base_path&gt; </span><span 
class="se">\
+</span><span class="s">--target-table &lt;hudi_table&gt; </span><span 
class="se">\
+</span><span class="s">--source-class 
org.apache.hudi.utilities.sources.JsonDFSSource </span><span class="se">\
+</span><span class="s">--source-ordering-field ts </span><span class="se">\
+</span><span class="s">--schemaprovider-class 
org.apache.hudi.utilities.schema.FilebasedSchemaProvider </span><span 
class="se">\
+</span><span class="s">--props /path/to/source.properties </span><span 
class="se">\
+</span><span class="s">--continuous</span>
+</code></pre></div></div>
+
+<h2 id="hudi-cli">Hudi CLI</h2>
+<p>Hudi CLI is yet another way to execute specific compactions asynchronously. 
Here is an example:</p>
+
+<div class="language-properties highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="py">hudi</span><span 
class="p">:</span><span class="s">trips-&gt;compaction run --tableName 
&lt;table_name&gt; --parallelism &lt;parallelism&gt; --compactionInstant 
&lt;InstantTime&gt;</span>
+<span class="err">...</span>
+</code></pre></div></div>
+
+<h2 id="hudi-compactor-script">Hudi Compactor Script</h2>
+<p>Hudi also provides a standalone tool to execute specific compactions 
asynchronously. Here is an example:</p>
+
+<div class="language-properties highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="err">spark-submit</span> <span 
class="err">--packages</span> <span class="py">org.apache.hudi</span><span 
class="p">:</span><span class="s">hudi-utilities-bundle_2.11:0.6.0 </span><span 
class="se">\
+</span><span class="s">--class org.apache.hudi.utilities.HoodieCompactor 
</span><span class="se">\
+</span><span class="s">--base-path &lt;base_path&gt; </span><span class="se">\
+</span><span class="s">--table-name &lt;table_name&gt; </span><span 
class="se">\
+</span><span class="s">--instant-time &lt;compaction_instant&gt; </span><span 
class="se">\
+</span><span class="s">--schema-file &lt;schema_file&gt;</span>
+</code></pre></div></div>
+
+      </section>
+
+      <a href="#masthead__inner-wrap" class="back-to-top">Back to top 
&uarr;</a>
+
+
+      
+
+    </div>
+
+  </article>
+
+</div>
+
+    </div>
+
+    <div class="page__footer">
+      <footer>
+        
+<div class="row">
+  <div class="col-lg-12 footer">
+    <p>
+      <table class="table-apache-info">
+        <tr>
+          <td>
+            <a class="footer-link-img" href="https://apache.org">
+              <img width="250px" src="/assets/images/asf_logo.svg" alt="The 
Apache Software Foundation">
+            </a>
+          </td>
+          <td>
+            <a style="float: right" 
href="https://www.apache.org/events/current-event.html">
+              <img 
src="https://www.apache.org/events/current-event-234x60.png" />
+            </a>
+          </td>
+        </tr>
+      </table>
+    </p>
+    <p>
+      <a href="https://www.apache.org/licenses/">License</a> | <a 
href="https://www.apache.org/security/">Security</a> | <a 
href="https://www.apache.org/foundation/thanks.html">Thanks</a> | <a 
href="https://www.apache.org/foundation/sponsorship.html">Sponsorship</a>
+    </p>
+    <p>
+      Copyright &copy; <span id="copyright-year">2019</span> <a 
href="https://apache.org">The Apache Software Foundation</a>, Licensed under 
the <a href="https://www.apache.org/licenses/LICENSE-2.0"> Apache License, 
Version 2.0</a>.
+      Hudi, Apache and the Apache feather logo are trademarks of The Apache 
Software Foundation. <a href="/docs/privacy">Privacy Policy</a>
+    </p>
+  </div>
+</div>
+      </footer>
+    </div>
+
+
+  </body>
+</html>
\ No newline at end of file
diff --git 
a/content/blog/efficient-migration-of-large-parquet-tables/index.html 
b/content/blog/efficient-migration-of-large-parquet-tables/index.html
new file mode 100644
index 0000000..0c564e9
--- /dev/null
+++ b/content/blog/efficient-migration-of-large-parquet-tables/index.html
@@ -0,0 +1,464 @@
+<!doctype html>
+<html lang="en" class="no-js">
+  <head>
+    <meta charset="utf-8">
+
+<!-- begin _includes/seo.html --><title>Efficient Migration of Large Parquet 
Tables to Apache Hudi - Apache Hudi</title>
+<meta name="description" content="Migrating a large parquet table to Apache 
Hudi without having to rewrite the entire dataset.">
+
+<meta property="og:type" content="article">
+<meta property="og:locale" content="en_US">
+<meta property="og:site_name" content="">
+<meta property="og:title" content="Efficient Migration of Large Parquet Tables 
to Apache Hudi">
+<meta property="og:url" 
content="https://hudi.apache.org/blog/efficient-migration-of-large-parquet-tables/">
+
+
+  <meta property="og:description" content="Migrating a large parquet table to 
Apache Hudi without having to rewrite the entire dataset.">
+
+
+
+
+
+
+
+
+
+
+
+<!-- end _includes/seo.html -->
+
+
+<!--<link href="/feed.xml" type="application/atom+xml" rel="alternate" title=" 
Feed">-->
+
+<!-- https://t.co/dKP3o1e -->
+<meta name="viewport" content="width=device-width, initial-scale=1.0">
+
+<script>
+  document.documentElement.className = 
document.documentElement.className.replace(/\bno-js\b/g, '') + ' js ';
+</script>
+
+<!-- For all browsers -->
+<link rel="stylesheet" href="/assets/css/main.css">
+
+<!--[if IE]>
+  <style>
+    /* old IE unsupported flexbox fixes */
+    .greedy-nav .site-title {
+      padding-right: 3em;
+    }
+    .greedy-nav button {
+      position: absolute;
+      top: 0;
+      right: 0;
+      height: 100%;
+    }
+  </style>
+<![endif]-->
+
+
+
+<link rel="icon" type="image/x-icon" href="/assets/images/favicon.ico">
+<link rel="stylesheet" href="/assets/css/font-awesome.min.css">
+<script src="/assets/js/jquery.min.js"></script>
+
+    
+<script src="/assets/js/main.min.js"></script>
+
+  </head>
+
+  <body class="layout--single">
+    <!--[if lt IE 9]>
+<div class="notice--danger align-center" style="margin: 0;">You are using an 
<strong>outdated</strong> browser. Please <a 
href="https://browsehappy.com/">upgrade your browser</a> to improve your 
experience.</div>
+<![endif]-->
+
+    <div class="masthead">
+  <div class="masthead__inner-wrap" id="masthead__inner-wrap">
+    <div class="masthead__menu">
+      <nav id="site-nav" class="greedy-nav">
+        
+          <a class="site-logo" href="/">
+              <div style="width: 150px; height: 40px">
+              </div>
+          </a>
+        
+        <a class="site-title" href="/">
+          
+        </a>
+        <ul class="visible-links"><li class="masthead__menu-item">
+              <a href="/docs/quick-start-guide.html" target="_self" 
>Documentation</a>
+            </li><li class="masthead__menu-item">
+              <a href="/community.html" target="_self" >Community</a>
+            </li><li class="masthead__menu-item">
+              <a href="/blog.html" target="_self" >Blog</a>
+            </li><li class="masthead__menu-item">
+              <a href="https://cwiki.apache.org/confluence/display/HUDI/FAQ" 
target="_blank" >FAQ</a>
+            </li><li class="masthead__menu-item">
+              <a href="/releases.html" target="_self" >Releases</a>
+            </li></ul>
+        <button class="greedy-nav__toggle hidden" type="button">
+          <span class="visually-hidden">Toggle menu</span>
+          <div class="navicon"></div>
+        </button>
+        <ul class="hidden-links hidden"></ul>
+      </nav>
+    </div>
+  </div>
+</div>
+<!--
+<p class="notice--warning" style="margin: 0 !important; text-align: center 
!important;"><strong>Note:</strong> This site is work in progress, if you 
notice any issues, please <a target="_blank" 
href="https://github.com/apache/hudi/issues">Report on Issue</a>.
+  Click <a href="/"> here</a> back to old site.</p>
+-->
+
+    <div class="initial-content">
+      <div id="main" role="main">
+  
+
+  <div class="sidebar sticky">
+
+  
+    <div itemscope itemtype="https://schema.org/Person">
+
+  <div class="author__content">
+    
+      <h3 class="author__name" itemprop="name">Quick Links</h3>
+    
+    
+      <div class="author__bio" itemprop="description">
+        <p>Hudi <em>ingests</em> &amp; <em>manages</em> storage of large 
analytical datasets over DFS.</p>
+
+      </div>
+    
+  </div>
+
+  <div class="author__urls-wrapper">
+    <ul class="author__urls social-icons">
+      
+        
+          <li><a href="/docs/quick-start-guide" target="_self" rel="nofollow 
noopener noreferrer"><i class="fa fa-book" aria-hidden="true"></i> 
Documentation</a></li>
+
+          
+        
+          <li><a href="https://cwiki.apache.org/confluence/display/HUDI" 
target="_blank" rel="nofollow noopener noreferrer"><i class="fa fa-wikipedia-w" 
aria-hidden="true"></i> Technical Wiki</a></li>
+
+          
+        
+          <li><a href="/contributing" target="_self" rel="nofollow noopener 
noreferrer"><i class="fa fa-thumbs-o-up" aria-hidden="true"></i> Contribution 
Guide</a></li>
+
+          
+        
+          <li><a 
href="https://join.slack.com/t/apache-hudi/shared_invite/enQtODYyNDAxNzc5MTg2LTE5OTBlYmVhYjM0N2ZhOTJjOWM4YzBmMWU2MjZjMGE4NDc5ZDFiOGQ2N2VkYTVkNzU3ZDQ4OTI1NmFmYWQ0NzE"
 target="_blank" rel="nofollow noopener noreferrer"><i class="fa fa-slack" 
aria-hidden="true"></i> Join on Slack</a></li>
+
+          
+        
+          <li><a href="https://github.com/apache/hudi" target="_blank" 
rel="nofollow noopener noreferrer"><i class="fa fa-github" 
aria-hidden="true"></i> Fork on GitHub</a></li>
+
+          
+        
+          <li><a href="https://issues.apache.org/jira/projects/HUDI/summary" 
target="_blank" rel="nofollow noopener noreferrer"><i class="fa fa-navicon" 
aria-hidden="true"></i> Report Issues</a></li>
+
+          
+        
+          <li><a href="/security" target="_self" rel="nofollow noopener 
noreferrer"><i class="fa fa-navicon" aria-hidden="true"></i> Report Security 
Issues</a></li>
+
+          
+        
+      
+    </ul>
+  </div>
+</div>
+
+  
+
+  
+  </div>
+
+
+  <article class="page" itemscope itemtype="https://schema.org/CreativeWork">
+    <!-- Look the author details up from the site config. -->
+    
+
+    <div class="page__inner-wrap">
+      
+        <header>
+          <h1 id="page-title" class="page__title" 
itemprop="headline">Efficient Migration of Large Parquet Tables to Apache Hudi
+</h1>
+          <!-- Output author details if some exist. -->
+          <div class="page__author"><a 
href="https://cwiki.apache.org/confluence/display/~vbalaji">Balaji 
Varadarajan</a> posted on <time datetime="2020-08-20">August 20, 
2020</time></div>
+        </header>
+      
+
+      <section class="page__content" itemprop="text">
+        
+          <style>
+            .page {
+              padding-right: 0 !important;
+            }
+          </style>
+        
+        <p>We will look at how to migrate a large parquet table to Hudi 
without having to rewrite the entire dataset.</p>
+
+<h1 id="motivation">Motivation:</h1>
+
+<p>Apache Hudi maintains per record metadata to perform core operations such 
as upserts and incremental pull. To take advantage of Hudi’s upsert and 
incremental processing support, users would need to rewrite their whole dataset 
to make it an Apache Hudi table.  Hudi 0.6.0 comes with an 
<strong><em>experimental feature</em></strong> to support efficient migration 
of large Parquet tables to Hudi without the need to rewrite the entire 
dataset.</p>
+
+<h1 id="high-level-idea">High Level Idea:</h1>
+
+<h2 id="per-record-metadata">Per Record Metadata:</h2>
+
+<p>Apache Hudi maintains record level metadata to perform efficient upserts 
and incremental pull.</p>
+
+<p><img src="/assets/images/blog/2020-08-20-per-record.png" alt="Per Record 
Metadata" /></p>
+
+<p>An Apache Hudi physical file contains 3 parts:</p>
+
+<ol>
+  <li>For each record, 5 Hudi metadata fields with column indices 0 to 4</li>
+  <li>For each record, the original data columns that comprise the record 
(Original Data)</li>
+  <li>Additional Hudi Metadata at file footer for index lookup</li>
+</ol>
+
+<p>Parts (1) and (3) constitute what we term the “Hudi skeleton”. The Hudi 
skeleton contains additional metadata that it maintains in each physical 
parquet file for supporting Hudi primitives. The conceptual idea is to decouple 
Hudi skeleton data from original data (2). Hudi skeleton can be stored in a 
Hudi file while the original data is stored in an external non-Hudi file. A 
migration of large parquet would result in creating only Hudi skeleton files 
without having to rewrite original  [...]
+
+<p><img src="/assets/images/blog/2020-08-20-skeleton.png" alt="skeleton" /></p>
+
+<h1 id="design-deep-dive">Design Deep Dive:</h1>
+
+<p>For a deep dive on the internals, please take a look at the <a 
href="https://cwiki.apache.org/confluence/display/HUDI/RFC+-+12+%3A+Efficient+Migration+of+Large+Parquet+Tables+to+Apache+Hudi">RFC
 document</a>.</p>
+
+<h1 id="migration">Migration:</h1>
+
+<p>Hudi supports 2 modes when migrating parquet tables. We will use the terms 
bootstrap and migration interchangeably in this document.</p>
+
+<ul>
+  <li>METADATA_ONLY : In this mode, only record-level metadata is generated 
for each source record and stored in the new bootstrap location.</li>
+  <li>FULL_RECORD : In this mode, record-level metadata is generated for each 
source record, and both the original record and its metadata are copied.</li>
+</ul>
+
+<p>You can choose the mode at partition level. A common 
strategy is to use FULL_RECORD mode for a small set of “hot” partitions 
that are accessed more frequently and METADATA_ONLY for a larger set of “warm” 
partitions.</p>
+
+<h2 id="query-engine-support">Query Engine Support:</h2>
+<p>For a METADATA_ONLY bootstrapped table, the Spark datasource, Spark-Hive and 
native Hive query engines are supported. Presto support is in the works.</p>
+
+<h2 id="ways-to-migrate-">Ways To Migrate :</h2>
+
+<p>There are 2 ways to migrate a large parquet table to Hudi.</p>
+
+<ul>
+  <li>Spark Datasource Write</li>
+  <li>Hudi DeltaStreamer</li>
+</ul>
+
+<p>We will look at how to migrate using both these approaches.</p>
+
+<h2 id="configurations">Configurations:</h2>
+
+<p>These are bootstrap-specific configurations that need to be set in 
addition to the regular Hudi write configurations.</p>
+
+<table>
+  <thead>
+    <tr>
+      <th>Configuration Name</th>
+      <th>Default</th>
+      <th>Mandatory ?</th>
+      <th>Description</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>hoodie.bootstrap.base.path</td>
+      <td> </td>
+      <td>Yes</td>
+      <td>Base Path of  source parquet table.</td>
+    </tr>
+    <tr>
+      <td>hoodie.bootstrap.parallelism</td>
+      <td>1500</td>
+      <td>Yes</td>
+      <td>Spark Parallelism used when running bootstrap</td>
+    </tr>
+    <tr>
+      <td>hoodie.bootstrap.keygen.class</td>
+      <td> </td>
+      <td>Yes</td>
+      <td>Key generator class used to generate keys for the records of the 
bootstrapped table.</td>
+    </tr>
+    <tr>
+      <td>hoodie.bootstrap.mode.selector</td>
+      
<td>org.apache.hudi.client.bootstrap.selector.MetadataOnlyBootstrapModeSelector</td>
+      <td>Yes</td>
+      <td>Bootstrap Mode Selector class. By default, Hudi employs METADATA_ONLY 
bootstrap for all partitions.</td>
+    </tr>
+    <tr>
+      <td>hoodie.bootstrap.partitionpath.translator.class</td>
+      <td>org.apache.hudi.client.bootstrap.translator.IdentityBootstrapPartitionPathTranslator</td>
+      <td>No</td>
+      <td>For METADATA_ONLY bootstrap, this class allows customization of 
partition paths used in the Hudi target dataset. By default, no customization is 
done and the partition paths reflect what is available in the source parquet 
table.</td>
+    </tr>
+    <tr>
+      <td>hoodie.bootstrap.full.input.provider</td>
+      <td>org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider</td>
+      <td>No</td>
+      <td>For FULL_RECORD bootstrap, this class provides the input RDD of Hudi 
records to write.</td>
+    </tr>
+    <tr>
+      <td>hoodie.bootstrap.mode.selector.regex.mode</td>
+      <td>METADATA_ONLY</td>
+      <td>No</td>
+      <td>Bootstrap Mode used when the partition matches the regex pattern in 
hoodie.bootstrap.mode.selector.regex. Used only when 
hoodie.bootstrap.mode.selector is set to BootstrapRegexModeSelector.</td>
+    </tr>
+    <tr>
+      <td>hoodie.bootstrap.mode.selector.regex</td>
+      <td>.*</td>
+      <td>No</td>
+      <td>Partition Regex used when hoodie.bootstrap.mode.selector is set to 
BootstrapRegexModeSelector.</td>
+    </tr>
+  </tbody>
+</table>
+
+<h2 id="spark-data-source">Spark Data Source:</h2>
+
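+<p>Here, we use a Spark Datasource Write to perform bootstrap. 
+Here is an example code snippet to perform METADATA_ONLY bootstrap. In the 
snippets, srcPath is the base path of the source parquet table and basePath is 
the base path of the target Hudi table.</p>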
+
+<div class="language-properties highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="err">import</span> <span 
class="err">org.apache.hudi.{DataSourceWriteOptions,</span> <span 
class="err">HoodieDataSourceHelpers}</span>
+<span class="err">import</span> <span 
class="err">org.apache.hudi.config.{HoodieBootstrapConfig,</span> <span 
class="err">HoodieWriteConfig}</span>
+<span class="err">import</span> <span 
class="err">org.apache.hudi.keygen.SimpleKeyGenerator</span>
+<span class="err">import</span> <span 
class="err">org.apache.spark.sql.SaveMode</span>
+ 
+<span class="err">val</span> <span class="py">bootstrapDF</span> <span 
class="p">=</span> <span class="s">spark.emptyDataFrame</span>
+<span class="err">bootstrapDF.write</span>
+      <span class="err">.format("hudi")</span>
+      <span class="err">.option(HoodieWriteConfig.TABLE_NAME,</span> <span 
class="err">"hoodie_test")</span>
+      <span 
class="err">.option(DataSourceWriteOptions.OPERATION_OPT_KEY,</span> <span 
class="err">DataSourceWriteOptions.BOOTSTRAP_OPERATION_OPT_VAL)</span>
+      <span 
class="err">.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY,</span> 
<span class="err">"_row_key")</span>
+      <span 
class="err">.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY,</span> 
<span class="err">"datestr")</span>
+      <span 
class="err">.option(HoodieBootstrapConfig.BOOTSTRAP_BASE_PATH_PROP,</span> 
<span class="err">srcPath)</span>
+      <span 
class="err">.option(HoodieBootstrapConfig.BOOTSTRAP_KEYGEN_CLASS,</span> <span 
class="err">classOf[SimpleKeyGenerator].getName)</span>
+      <span class="err">.mode(SaveMode.Overwrite)</span>
+      <span class="err">.save(basePath)</span>
+</code></pre></div></div>
+
+<p>Here is an example code snippet to perform METADATA_ONLY bootstrap for 
August 20 2020 - August 29 2020 partitions and FULL_RECORD bootstrap for other 
partitions.</p>
+
+<div class="language-properties highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="err">import</span> <span 
class="err">org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider</span>
+<span class="err">import</span> <span 
class="err">org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector</span>
+<span class="err">import</span> <span 
class="err">org.apache.hudi.{DataSourceWriteOptions,</span> <span 
class="err">HoodieDataSourceHelpers}</span>
+<span class="err">import</span> <span 
class="err">org.apache.hudi.config.{HoodieBootstrapConfig,</span> <span 
class="err">HoodieWriteConfig}</span>
+<span class="err">import</span> <span 
class="err">org.apache.hudi.keygen.SimpleKeyGenerator</span>
+<span class="err">import</span> <span 
class="err">org.apache.spark.sql.SaveMode</span>
+ 
+<span class="err">val</span> <span class="py">bootstrapDF</span> <span 
class="p">=</span> <span class="s">spark.emptyDataFrame</span>
+<span class="err">bootstrapDF.write</span>
+      <span class="err">.format("hudi")</span>
+      <span class="err">.option(HoodieWriteConfig.TABLE_NAME,</span> <span 
class="err">"hoodie_test")</span>
+      <span 
class="err">.option(DataSourceWriteOptions.OPERATION_OPT_KEY,</span> <span 
class="err">DataSourceWriteOptions.BOOTSTRAP_OPERATION_OPT_VAL)</span>
+      <span 
class="err">.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY,</span> 
<span class="err">"_row_key")</span>
+      <span 
class="err">.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY,</span> 
<span class="err">"datestr")</span>
+      <span 
class="err">.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY,</span> 
<span class="err">"timestamp")</span>
+      <span 
class="err">.option(HoodieBootstrapConfig.BOOTSTRAP_BASE_PATH_PROP,</span> 
<span class="err">srcPath)</span>
+      <span 
class="err">.option(HoodieBootstrapConfig.BOOTSTRAP_KEYGEN_CLASS,</span> <span 
class="err">classOf[SimpleKeyGenerator].getName)</span>
+      <span 
class="err">.option(HoodieBootstrapConfig.BOOTSTRAP_MODE_SELECTOR,</span> <span 
class="err">classOf[BootstrapRegexModeSelector].getName)</span>
+      <span 
class="err">.option(HoodieBootstrapConfig.BOOTSTRAP_MODE_SELECTOR_REGEX,</span> 
<span class="err">"2020/08/2[0-9]")</span>
+      <span 
class="err">.option(HoodieBootstrapConfig.BOOTSTRAP_MODE_SELECTOR_REGEX_MODE,</span>
 <span class="err">"METADATA_ONLY")</span>
+      <span 
class="err">.option(HoodieBootstrapConfig.FULL_BOOTSTRAP_INPUT_PROVIDER,</span> 
<span class="err">classOf[SparkParquetBootstrapDataProvider].getName)</span>
+      <span class="err">.mode(SaveMode.Overwrite)</span>
+      <span class="err">.save(basePath)</span>
+</code></pre></div></div>
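+<p>To make the effect of the regex in the snippet above concrete, here is a 
small shell sketch that classifies a few partition paths. It is an illustration 
only and not Hudi's implementation: the function name select_mode is 
hypothetical, and it assumes the whole partition path must match the regex, 
with non-matching partitions falling back to the other mode.</p>

```shell
#!/usr/bin/env bash
# Illustration only: mimics how a regex-based mode selector could assign a
# bootstrap mode to each partition. This is NOT Hudi code.

REGEX='2020/08/2[0-9]'        # value of hoodie.bootstrap.mode.selector.regex
REGEX_MODE='METADATA_ONLY'    # value of hoodie.bootstrap.mode.selector.regex.mode

# select_mode is a hypothetical helper: prints the mode for one partition path.
select_mode() {
  local partition="$1"
  # Assumption: the entire partition path has to match the regex.
  local pattern="^(${REGEX})\$"
  if [[ "$partition" =~ $pattern ]]; then
    echo "$REGEX_MODE"
  elif [[ "$REGEX_MODE" == "METADATA_ONLY" ]]; then
    # Non-matching partitions get the other mode.
    echo "FULL_RECORD"
  else
    echo "METADATA_ONLY"
  fi
}

for p in 2020/08/19 2020/08/20 2020/08/29 2020/08/30; do
  printf '%s -> %s\n' "$p" "$(select_mode "$p")"
done
```

+<p>Under these assumptions, the August 20 and August 29 partitions are 
reported as METADATA_ONLY, while August 19 and August 30 are reported as 
FULL_RECORD.</p>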
+
+<h2 id="hoodie-deltastreamer">Hoodie DeltaStreamer:</h2>
+
+<p>Hoodie DeltaStreamer allows bootstrap to be performed using the --run-bootstrap 
command line option.</p>
+
+<p>If you are planning to use DeltaStreamer after the initial bootstrap to 
incrementally ingest data to the new Hudi dataset, you need to pass either 
--checkpoint or --initial-checkpoint-provider to set the initial checkpoint for 
DeltaStreamer.</p>
+
+<p>Here is an example for running METADATA_ONLY bootstrap using Delta 
Streamer.</p>
+
+<div class="language-properties highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="err">spark-submit</span> <span 
class="err">--packages</span> <span class="py">org.apache.hudi</span><span 
class="p">:</span><span class="s">hudi-spark-bundle_2.11:0.6.0 </span><span class="se">\
+</span><span class="err">--conf</span> <span class="err">'</span><span 
class="py">spark.serializer</span><span class="p">=</span><span 
class="s">org.apache.spark.serializer.KryoSerializer' </span><span class="se">\
+</span><span class="s">--class 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer  </span><span 
class="se">\
+</span><span class="s">--run-bootstrap </span><span class="se">\
+</span><span class="s">--target-base-path &lt;Hudi_Base_Path&gt; </span><span 
class="se">\
+</span><span class="s">--target-table &lt;Hudi_Table_Name&gt; </span><span 
class="se">\
+</span><span class="s">--props &lt;props_file&gt; </span><span class="se">\
+</span><span class="s">--checkpoint 
&lt;initial_checkpoint_if_you_are_going_to_use_deltastreamer_to_incrementally_ingest&gt;
 </span><span class="se">\
+</span><span class="s">--hoodie-conf 
hoodie.bootstrap.base.path=&lt;Parquet_Source_base_Path&gt; </span><span 
class="se">\
+</span><span class="s">--hoodie-conf 
hoodie.datasource.write.recordkey.field=_row_key </span><span class="se">\
+</span><span class="s">--hoodie-conf 
hoodie.datasource.write.partitionpath.field=datestr </span><span class="se">\
+</span><span class="s">--hoodie-conf 
hoodie.bootstrap.keygen.class=org.apache.hudi.keygen.SimpleKeyGenerator</span>
+</code></pre></div></div>
+
+<p>Here is an example for running METADATA_ONLY bootstrap for August 20 2020 - 
August 29 2020 partitions and FULL_RECORD bootstrap for other partitions using 
Delta Streamer.</p>
+
+<div class="language-properties highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="err">spark-submit</span> <span 
class="err">--packages</span> <span class="py">org.apache.hudi</span><span 
class="p">:</span><span class="s">hudi-spark-bundle_2.11:0.6.0 </span><span class="se">\
+</span><span class="err">--conf</span> <span class="err">'</span><span 
class="py">spark.serializer</span><span class="p">=</span><span 
class="s">org.apache.spark.serializer.KryoSerializer' </span><span class="se">\
+</span><span class="s">--class 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer  </span><span 
class="se">\
+</span><span class="s">--run-bootstrap </span><span class="se">\
+</span><span class="s">--target-base-path &lt;Hudi_Base_Path&gt; </span><span 
class="se">\
+</span><span class="s">--target-table &lt;Hudi_Table_Name&gt; </span><span 
class="se">\
+</span><span class="s">--props &lt;props_file&gt; </span><span class="se">\
+</span><span class="s">--checkpoint 
&lt;initial_checkpoint_if_you_are_going_to_use_deltastreamer_to_incrementally_ingest&gt;
 </span><span class="se">\
+</span><span class="s">--hoodie-conf 
hoodie.bootstrap.base.path=&lt;Parquet_Source_base_Path&gt; </span><span 
class="se">\
+</span><span class="s">--hoodie-conf 
hoodie.datasource.write.recordkey.field=_row_key </span><span class="se">\
+</span><span class="s">--hoodie-conf 
hoodie.datasource.write.partitionpath.field=datestr </span><span class="se">\
+</span><span class="s">--hoodie-conf 
hoodie.bootstrap.keygen.class=org.apache.hudi.keygen.SimpleKeyGenerator 
</span><span class="se">\
+</span><span class="s">--hoodie-conf 
hoodie.bootstrap.full.input.provider=org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider
 </span><span class="se">\
+</span><span class="s">--hoodie-conf 
hoodie.bootstrap.mode.selector=org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector
 </span><span class="se">\
+</span><span class="s">--hoodie-conf 
hoodie.bootstrap.mode.selector.regex="2020/08/2[0-9]" </span><span class="se">\
+</span><span class="s">--hoodie-conf 
hoodie.bootstrap.mode.selector.regex.mode=METADATA_ONLY</span>
+</code></pre></div></div>
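+<p>The --hoodie-conf settings in the command above can also be kept in the file 
passed via --props. The following is a minimal sketch that writes such a file. 
The file name bootstrap.properties and the placeholder source path are 
assumptions made for illustration; the keys and values are the ones used in the 
example above.</p>

```shell
#!/usr/bin/env bash
# Sketch: write the bootstrap settings from the example above into a
# properties file. File name and source path are placeholders (assumptions).

PROPS_FILE="${PROPS_FILE:-./bootstrap.properties}"
SRC_PATH="${SRC_PATH:-/path/to/source/parquet/table}"

# Unquoted heredoc so that ${SRC_PATH} is expanded; the regex is written
# without the shell quoting that the command line version needs.
cat > "$PROPS_FILE" <<EOF
hoodie.bootstrap.base.path=${SRC_PATH}
hoodie.datasource.write.recordkey.field=_row_key
hoodie.datasource.write.partitionpath.field=datestr
hoodie.bootstrap.keygen.class=org.apache.hudi.keygen.SimpleKeyGenerator
hoodie.bootstrap.full.input.provider=org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider
hoodie.bootstrap.mode.selector=org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector
hoodie.bootstrap.mode.selector.regex=2020/08/2[0-9]
hoodie.bootstrap.mode.selector.regex.mode=METADATA_ONLY
EOF

# Show what was written so it can be reviewed before running the bootstrap.
cat "$PROPS_FILE"
```

+<p>The resulting file can then be given to --props, leaving only options such as 
--run-bootstrap, --target-base-path, --target-table and --checkpoint on the 
command line. Depending on how the path is resolved in your environment, a 
scheme prefix such as file:// may be needed.</p>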
+
+<h2 id="known-caveats">Known Caveats</h2>
+<ol>
+  <li>Proper defaults are needed for the bootstrap config 
hoodie.bootstrap.full.input.provider. Here is the <a 
href="https://issues.apache.org/jira/browse/HUDI-1213">ticket</a>.</li>
+  <li>DeltaStreamer manages checkpoints inside hoodie commit files and expects 
checkpoints in previously committed metadata. Users are expected to pass 
a checkpoint or an initial checkpoint provider when performing bootstrap through 
DeltaStreamer. Such support is not present when doing bootstrap using the Spark 
Datasource. Here is the <a 
href="https://issues.apache.org/jira/browse/HUDI-1214">ticket</a>.</li>
+</ol>
+
+      </section>
+
+      <a href="#masthead__inner-wrap" class="back-to-top">Back to top 
&uarr;</a>
+
+
+      
+
+    </div>
+
+  </article>
+
+</div>
+
+    </div>
+
+    <div class="page__footer">
+      <footer>
+        
+<div class="row">
+  <div class="col-lg-12 footer">
+    <p>
+      <table class="table-apache-info">
+        <tr>
+          <td>
+            <a class="footer-link-img" href="https://apache.org">
+              <img width="250px" src="/assets/images/asf_logo.svg" alt="The 
Apache Software Foundation">
+            </a>
+          </td>
+          <td>
+            <a style="float: right" 
href="https://www.apache.org/events/current-event.html">
+              <img 
src="https://www.apache.org/events/current-event-234x60.png" />
+            </a>
+          </td>
+        </tr>
+      </table>
+    </p>
+    <p>
+      <a href="https://www.apache.org/licenses/">License</a> | <a 
href="https://www.apache.org/security/">Security</a> | <a 
href="https://www.apache.org/foundation/thanks.html">Thanks</a> | <a 
href="https://www.apache.org/foundation/sponsorship.html">Sponsorship</a>
+    </p>
+    <p>
+      Copyright &copy; <span id="copyright-year">2019</span> <a 
href="https://apache.org">The Apache Software Foundation</a>, Licensed under 
the <a href="https://www.apache.org/licenses/LICENSE-2.0"> Apache License, 
Version 2.0</a>.
+      Hudi, Apache and the Apache feather logo are trademarks of The Apache 
Software Foundation. <a href="/docs/privacy">Privacy Policy</a>
+    </p>
+  </div>
+</div>
+      </footer>
+    </div>
+
+
+  </body>
+</html>
\ No newline at end of file
diff --git a/content/cn/activity.html b/content/cn/activity.html
index 69cfba7..2b3a5f0 100644
--- a/content/cn/activity.html
+++ b/content/cn/activity.html
@@ -191,6 +191,54 @@
     
     <h2 class="archive__item-title" itemprop="headline">
       
+        <a href="/blog/async-compaction-deployment-model/" 
rel="permalink">Async Compaction Deployment Models
+</a>
+      
+    </h2>
+    <!-- Look the author details up from the site config. -->
+    
+    <!-- Output author details if some exist. -->
+    <div class="archive__item-meta"><a 
href="https://cwiki.apache.org/confluence/display/~vbalaji">Balaji 
Varadarajan</a> posted on <time datetime="2020-08-21">August 21, 
2020</time></div>
+ 
+    <p class="archive__item-excerpt" itemprop="description">Mechanisms for 
executing compaction jobs in Hudi asynchronously
+</p>
+  </article>
+</div>
+
+        
+        
+
+
+
+<div class="list__item">
+  <article class="archive__item" itemscope 
itemtype="https://schema.org/CreativeWork">
+    
+    <h2 class="archive__item-title" itemprop="headline">
+      
+        <a href="/blog/efficient-migration-of-large-parquet-tables/" 
rel="permalink">Efficient Migration of Large Parquet Tables to Apache Hudi
+</a>
+      
+    </h2>
+    <!-- Look the author details up from the site config. -->
+    
+    <!-- Output author details if some exist. -->
+    <div class="archive__item-meta"><a 
href="https://cwiki.apache.org/confluence/display/~vbalaji">Balaji 
Varadarajan</a> posted on <time datetime="2020-08-20">August 20, 
2020</time></div>
+ 
+    <p class="archive__item-excerpt" itemprop="description">Migrating a large 
parquet table to Apache Hudi without having to rewrite the entire dataset.
+</p>
+  </article>
+</div>
+
+        
+        
+
+
+
+<div class="list__item">
+  <article class="archive__item" itemscope 
itemtype="https://schema.org/CreativeWork">
+    
+    <h2 class="archive__item-title" itemprop="headline">
+      
         <a href="/blog/hudi-incremental-processing-on-data-lakes/" 
rel="permalink">Incremental Processing on the Data Lake
 </a>
       
diff --git a/content/sitemap.xml b/content/sitemap.xml
index b3e6d0e..6be6803 100644
--- a/content/sitemap.xml
+++ b/content/sitemap.xml
@@ -933,6 +933,14 @@
 <lastmod>2020-08-18T00:00:00-04:00</lastmod>
 </url>
 <url>
+<loc>https://hudi.apache.org/blog/efficient-migration-of-large-parquet-tables/</loc>
+<lastmod>2020-08-20T00:00:00-04:00</lastmod>
+</url>
+<url>
+<loc>https://hudi.apache.org/blog/async-compaction-deployment-model/</loc>
+<lastmod>2020-08-21T00:00:00-04:00</lastmod>
+</url>
+<url>
 <loc>https://hudi.apache.org/cn/activity</loc>
 <lastmod>2019-12-30T14:59:57-05:00</lastmod>
 </url>
