This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
     new bdad1bf  [HUDI-766]: added section for HoodieMultiTableDeltaStreamer (#1822)
bdad1bf is described below

commit bdad1bf38190d8f21efde30e549c173b5b9bf115
Author: Pratyaksh Sharma <pratyaks...@gmail.com>
AuthorDate: Thu Aug 13 11:59:38 2020 +0530

    [HUDI-766]: added section for HoodieMultiTableDeltaStreamer (#1822)

    * [HUDI-766]: added section for HoodieMultiTableDeltaStreamer
    * [HUDI-766]: small changes
    * [HUDI-766]: addressed code review comments
---
 docs/_docs/2_2_writing_data.md | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/docs/_docs/2_2_writing_data.md b/docs/_docs/2_2_writing_data.md
index 6962563..43fc046 100644
--- a/docs/_docs/2_2_writing_data.md
+++ b/docs/_docs/2_2_writing_data.md
@@ -174,6 +174,42 @@ and then ingest it as follows.

In some cases, you may want to migrate your existing table into Hudi beforehand. Please refer to [migration guide](/docs/migration_guide.html).

## MultiTableDeltaStreamer

`HoodieMultiTableDeltaStreamer`, a wrapper on top of `HoodieDeltaStreamer`, enables you to ingest multiple tables into Hudi datasets in a single go. Currently it supports only sequential processing of the tables to be ingested, and only the COPY_ON_WRITE storage type. The command-line options for `HoodieMultiTableDeltaStreamer` are largely similar to those for `HoodieDeltaStreamer`, with the only exception that you are required to provide table-wise configs in separate files in a dedicated config folder. The [...]

```java
  * --config-folder
    the path to the folder which contains all the table-wise config files
  * --base-path-prefix
    this is added to enable users to create all the Hudi datasets for related tables under one path in the filesystem. The datasets are then created under the path <base_path_prefix>/<database>/<table_to_be_ingested>. However, you can override the path for any table by setting the property hoodie.deltastreamer.ingestion.targetBasePath
```

The following properties need to be set properly to ingest data using `HoodieMultiTableDeltaStreamer`:

```java
hoodie.deltastreamer.ingestion.tablesToBeIngested
  comma-separated names of the tables to be ingested, in the format <database>.<table>, for example db1.table1,db1.table2
hoodie.deltastreamer.ingestion.targetBasePath
  if you wish to ingest a particular table into a separate path, you can specify that path here
hoodie.deltastreamer.ingestion.<database>.<table>.configFile
  path to the config file in the dedicated config folder which contains the overridden properties for the particular table to be ingested
```

Sample config files for table-wise overridden properties can be found under `hudi-utilities/src/test/resources/delta-streamer-config`.
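To illustrate how these properties fit together, a minimal common properties file might look like the following sketch. The database name, table names, and file paths below are hypothetical; only the property keys come from the documentation above.

```java
# Hypothetical common properties file, e.g. /tmp/hudi-ingestion-config/kafka-source.properties
# (all paths and table names here are made up for illustration)

# tables to ingest, in <database>.<table> format
hoodie.deltastreamer.ingestion.tablesToBeIngested=db1.table1,db1.table2

# per-table config files inside the dedicated config folder
hoodie.deltastreamer.ingestion.db1.table1.configFile=file:///tmp/hudi-ingestion-config/db1_table1_config.properties
hoodie.deltastreamer.ingestion.db1.table2.configFile=file:///tmp/hudi-ingestion-config/db1_table2_config.properties

# optional: set inside a table's own config file to override where that table's dataset is created
# hoodie.deltastreamer.ingestion.targetBasePath=file:///tmp/custom-path/db1/table2
```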
The command to run `HoodieMultiTableDeltaStreamer` is also similar to how you run `HoodieDeltaStreamer`:

```java
[hoodie]$ spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
  --props file://${PWD}/hudi-utilities/src/test/resources/delta-streamer-config/kafka-source.properties \
  --config-folder file:///tmp/hudi-ingestion-config \
  --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
  --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
  --source-ordering-field impressiontime \
  --base-path-prefix file:///tmp/hudi-deltastreamer-op \
  --target-table uber.impressions \
  --op BULK_INSERT
```

## Datasource Writer

The `hudi-spark` module offers the DataSource API to write (and read) a Spark DataFrame into a Hudi table. There are a number of options available:
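The options themselves are listed in the full document. As a hedged illustration of the API shape only, a write might look like the sketch below; the DataFrame `inputDF`, the field names, the table name, and the base path are all hypothetical, and the option keys are from `org.apache.hudi.DataSourceWriteOptions` as of the Hudi 0.5.x/0.6.x line that this page documents.

```java
import org.apache.hudi.DataSourceWriteOptions;
import org.apache.hudi.config.HoodieWriteConfig;
import org.apache.spark.sql.SaveMode;

// Sketch: assumes an existing DataFrame `inputDF`; names below are hypothetical.
inputDF.write()
    .format("org.apache.hudi")
    .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")      // record key field (hypothetical)
    .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition") // partition path field (hypothetical)
    .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")    // pre-combine field (hypothetical)
    .option(HoodieWriteConfig.TABLE_NAME, "my_hudi_table")                     // table name (hypothetical)
    .mode(SaveMode.Append)                                                     // append == upsert into an existing table
    .save("/tmp/path/to/hudi-table");                                          // base path (hypothetical)
```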