This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new e2d4a2b9a08 [HUDI-7403][DOCS] Support Filter/Transformer to Hudi 
Exporter Utility (#11549)
e2d4a2b9a08 is described below

commit e2d4a2b9a0822dec205cf6de407e40c646745f96
Author: Vova Kolmakov <wombatu...@gmail.com>
AuthorDate: Tue Jul 2 07:28:01 2024 +0700

    [HUDI-7403][DOCS] Support Filter/Transformer to Hudi Exporter Utility 
(#11549)
    
    Co-authored-by: Vova Kolmakov <kolmakov.vladi...@huawei-partners.com>
---
 website/docs/snapshot_exporter.md | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/website/docs/snapshot_exporter.md 
b/website/docs/snapshot_exporter.md
index aee29e3c1cc..07986d0bb8b 100644
--- a/website/docs/snapshot_exporter.md
+++ b/website/docs/snapshot_exporter.md
@@ -20,6 +20,9 @@ query, perform any repartitioning if required and will write 
the data as Hudi, p
 |--output-format|Output format for the exported dataset; accept these values: 
json,parquet,hudi|required||
|--output-partition-field|A field to be used by Spark repartitioning|optional|Ignored when using output-format "Hudi" or when --output-partitioner is specified. The output dataset's default partition field will inherit from the source Hudi dataset.|
 |--output-partitioner|A class to facilitate custom 
repartitioning|optional|Ignored when using output-format "Hudi"|
+|--transformer-class|A subclass of 
org.apache.hudi.utilities.transform.Transformer. Allows transforming raw source 
Dataset to a target Dataset (conforming to target schema) before 
writing.|optional|Ignored when using output-format "Hudi". Available 
transformers: org.apache.hudi.utilities.transform.SqlQueryBasedTransformer, 
org.apache.hudi.utilities.transform.SqlFileBasedTransformer, 
org.apache.hudi.utilities.transform.FlatteningTransformer, 
org.apache.hudi.utilities.transform.AWSDmsTrans [...]
+|--transformer-sql|SQL query template to be used to transform the source before writing. The query should reference the source as a table named "\<SRC\>".|optional|Required when using SqlQueryBasedTransformer as the transformer class; ignored otherwise|
+|--transformer-sql-file|File containing a SQL query to be executed during write. The query should reference the source as a table named "\<SRC\>".|optional|Required when using SqlFileBasedTransformer; ignored otherwise|
 
 ## Examples
 
@@ -51,6 +54,23 @@ spark-submit \
   --output-format "json"  # or "parquet"
 ```
 
+### Export to json or parquet dataset with transformation/filtering
+The Exporter supports custom transformation/filtering on records before writing to a json or parquet dataset. This is done by supplying an
+implementation of `org.apache.hudi.utilities.transform.Transformer` via the `--transformer-class` option.
+
+```bash
+spark-submit \
+  --jars 
"packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.15.0.jar" \
+  --deploy-mode "client" \
+  --class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
+      
packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.15.0.jar \
+  --source-base-path "/tmp/" \
+  --target-output-path "/tmp/exported/json/" \
+  --transformer-class 
"org.apache.hudi.utilities.transform.SqlQueryBasedTransformer" \
+  --transformer-sql "SELECT substr(rider,1,10) as rider, trip_type as tripType 
FROM <SRC> WHERE trip_type = 'BLACK' LIMIT 10" \
+  --output-format "json"  # or "parquet"
+```
+
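The same transformation can also be driven from a file: per the option table above, `SqlFileBasedTransformer` reads the query from a file passed via `--transformer-sql-file`. A sketch of such an invocation (the file path and bundle versions are illustrative, not prescriptive):

```bash
spark-submit \
  --jars "packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.15.0.jar" \
  --deploy-mode "client" \
  --class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
      packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.15.0.jar \
  --source-base-path "/tmp/" \
  --target-output-path "/tmp/exported/json/" \
  --transformer-class "org.apache.hudi.utilities.transform.SqlFileBasedTransformer" \
  --transformer-sql-file "/tmp/transform.sql" \
  --output-format "json"  # or "parquet"
```

The file referenced here (`/tmp/transform.sql` is a hypothetical path) should contain a query referencing `<SRC>`, just like the inline `--transformer-sql` variant.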
 ### Re-partitioning
 When exporting to a different format, the Exporter takes the 
`--output-partition-field` parameter to do some custom re-partitioning.
 Note: All `_hoodie_*` metadata fields will be stripped during export, so make sure to use an existing non-metadata field as the output partition field.
