This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
     new e2d4a2b9a08 [HUDI-7403][DOCS] Support Filter/Transformer to Hudi Exporter Utility (#11549)
e2d4a2b9a08 is described below

commit e2d4a2b9a0822dec205cf6de407e40c646745f96
Author: Vova Kolmakov <wombatu...@gmail.com>
AuthorDate: Tue Jul 2 07:28:01 2024 +0700

    [HUDI-7403][DOCS] Support Filter/Transformer to Hudi Exporter Utility (#11549)

    Co-authored-by: Vova Kolmakov <kolmakov.vladi...@huawei-partners.com>
---
 website/docs/snapshot_exporter.md | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/website/docs/snapshot_exporter.md b/website/docs/snapshot_exporter.md
index aee29e3c1cc..07986d0bb8b 100644
--- a/website/docs/snapshot_exporter.md
+++ b/website/docs/snapshot_exporter.md
@@ -20,6 +20,9 @@ query, perform any repartitioning if required and will write the data as Hudi, p
 |--output-format|Output format for the exported dataset; accepts these values: json,parquet,hudi|required||
 |--output-partition-field|A field to be used by Spark repartitioning|optional|Ignored when "Hudi" or when --output-partitioner is specified. The output dataset's default partition field will inherit from the source Hudi dataset.|
 |--output-partitioner|A class to facilitate custom repartitioning|optional|Ignored when using output-format "Hudi"|
+|--transformer-class|A subclass of org.apache.hudi.utilities.transform.Transformer. Allows transforming the raw source Dataset to a target Dataset (conforming to the target schema) before writing.|optional|Ignored when using output-format "Hudi". Available transformers: org.apache.hudi.utilities.transform.SqlQueryBasedTransformer, org.apache.hudi.utilities.transform.SqlFileBasedTransformer, org.apache.hudi.utilities.transform.FlatteningTransformer, org.apache.hudi.utilities.transform.AWSDmsTrans [...]
+|--transformer-sql|SQL query template to be used to transform the source before writing. The query should reference the source as a table named "\<SRC\>".|optional|Required for the SqlQueryBasedTransformer transformer class; ignored in other cases|
+|--transformer-sql-file|File with a SQL query to be executed during write. The query should reference the source as a table named "\<SRC\>".|optional|Required for SqlFileBasedTransformer; ignored in other cases|
 
 ## Examples
 
@@ -51,6 +54,23 @@ spark-submit \
   --output-format "json" # or "parquet"
 ```
 
+### Export to json or parquet dataset with transformation/filtering
+The Exporter supports custom transformation/filtering on records before writing them to a json or parquet dataset. This is done by supplying
+an implementation of `org.apache.hudi.utilities.transform.Transformer` via the `--transformer-class` option.
+
+```bash
+spark-submit \
+  --jars "packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.15.0.jar" \
+  --deploy-mode "client" \
+  --class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
+  packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.15.0.jar \
+  --source-base-path "/tmp/" \
+  --target-output-path "/tmp/exported/json/" \
+  --transformer-class "org.apache.hudi.utilities.transform.SqlQueryBasedTransformer" \
+  --transformer-sql "SELECT substr(rider,1,10) as rider, trip_type as tripType FROM <SRC> WHERE trip_type = 'BLACK' LIMIT 10" \
+  --output-format "json" # or "parquet"
+```
+
 ### Re-partitioning
 When exporting to a different format, the Exporter takes the `--output-partition-field` parameter to do some custom re-partitioning.
 Note: All `_hoodie_*` metadata fields will be stripped during export, so make sure to use an existing non-metadata field as the output partitions.
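Editor's note: the docs change above demonstrates only the inline `SqlQueryBasedTransformer`. For the file-based variant mentioned in the option table, a companion invocation might look like the sketch below. This is an illustration, not part of the commit: the SQL file path and output paths are placeholders, and it assumes the `--transformer-sql-file` flag and `SqlFileBasedTransformer` behave as the option table describes (the file's query references the source as a table named `<SRC>`).

```bash
# Sketch only: write the transformation query to a file, then pass it
# via --transformer-sql-file instead of an inline --transformer-sql.
cat > /tmp/transformer.sql <<'EOF'
SELECT substr(rider,1,10) AS rider, trip_type AS tripType
FROM <SRC>
WHERE trip_type = 'BLACK'
EOF

spark-submit \
  --jars "packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.15.0.jar" \
  --deploy-mode "client" \
  --class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
  packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.15.0.jar \
  --source-base-path "/tmp/" \
  --target-output-path "/tmp/exported/json/" \
  --transformer-class "org.apache.hudi.utilities.transform.SqlFileBasedTransformer" \
  --transformer-sql-file "/tmp/transformer.sql" \
  --output-format "json" # or "parquet"
```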