soumilshah1995 commented on issue #10165: URL: https://github.com/apache/hudi/issues/10165#issuecomment-1828865064
# Test passed: ran Delta Streamer in continuous mode

```
spark-submit \
    --class org.apache.hudi.utilities.streamer.HoodieStreamer \
    --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0 \
    --properties-file spark-config.properties \
    --master 'local[*]' \
    --executor-memory 1g \
    jar/hudi-utilities-slim-bundle_2.12-0.14.0.jar \
    --continuous \
    --source-limit 4000000 \
    --min-sync-interval-seconds 20 \
    --table-type COPY_ON_WRITE \
    --op UPSERT \
    --source-ordering-field ts \
    --source-class org.apache.hudi.utilities.sources.CsvDFSSource \
    --target-base-path file:///Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E5/hudidb/orders \
    --target-table orders \
    --props hudi_tbl.props
```

# CONF

```
hoodie.datasource.write.recordkey.field=order_id
hoodie.datasource.write.partitionpath.field=order_date
hoodie.streamer.source.dfs.root=file:///Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E5/sampledata/orders
hoodie.datasource.write.precombine.field=ts

hoodie.clustering.inline=false
hoodie.clustering.async.enabled=true
hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider

hoodie.deltastreamer.csv.header=true
hoodie.deltastreamer.csv.sep=\t
```

# Let's see the commits and partition paths

```
+----------------------+-------------------+
|_hoodie_partition_path|_hoodie_commit_time|
+----------------------+-------------------+
|            2023-10-30|  20231127192056631|
|            2023-11-17|  20231127192036634|
|            2023-11-13|  20231127192036634|
|            2023-10-30|  20231127192036634|
|            2023-11-13|  20231127192056631|
|            2023-11-17|  20231127192056631|
|            2023-11-01|  20231127192036634|
|            2023-11-10|  20231127192036634|
|            2023-11-10|  20231127192056631|
|            2023-11-05|  20231127192056631|
+----------------------+-------------------+
only showing top 10 rows

+---------+----------------+-----+-------------------+
|timestamp|input_group_size|state|involved_partitions|
+---------+----------------+-----+-------------------+
+---------+----------------+-----+-------------------+
```

# Let's run async clustering to cluster only the 2023-10 pattern

<img width="1649" alt="Screenshot 2023-11-27 at 7 24 58 PM" src="https://github.com/apache/hudi/assets/39345855/c87b87e1-3511-4ddf-b040-aa70654270d3">

HoodieClusteringJob: Clustering with basePath: file:///Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E5/hudidb/orders, tableName: orders, runningMode: scheduleAndExecute success

# Great

```
+----------------------+-------------------+
|_hoodie_partition_path|_hoodie_commit_time|
+----------------------+-------------------+
|            2023-10-30|  20231127192056631|
|            2023-11-17|  20231127192036634|
|            2023-11-13|  20231127192036634|
|            2023-10-30|  20231127192036634|
|            2023-11-13|  20231127192056631|
|            2023-11-17|  20231127192056631|
|            2023-11-01|  20231127192036634|
|            2023-11-10|  20231127192036634|
|            2023-11-10|  20231127192056631|
|            2023-11-05|  20231127192056631|
+----------------------+-------------------+
only showing top 10 rows

+-----------------+----------------+---------+-------------------------------------------+
|timestamp        |input_group_size|state    |involved_partitions                        |
+-----------------+----------------+---------+-------------------------------------------+
|20231127192505457|4               |COMPLETED|2023-10-28,2023-10-29,2023-10-30,2023-10-31|
+-----------------+----------------+---------+-------------------------------------------+
```

# Running Delta Streamer again and then running async clustering

INFO HoodieClusteringJob: Clustering with basePath: file:///Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E5/hudidb/orders, tableName: orders, runningMode: scheduleAndExecute success

# Great

```
+----------------------+-------------------+
|_hoodie_partition_path|_hoodie_commit_time|
+----------------------+-------------------+
|            2023-10-30|  20231127192056631|
|            2023-11-17|  20231127192036634|
|            2023-11-17|  20231127192650158|
|            2023-11-17|  20231127192630161|
|            2023-11-13|  20231127192036634|
|            2023-10-30|  20231127192036634|
|            2023-10-30|  20231127192630161|
|            2023-11-13|  20231127192056631|
|            2023-11-13|  20231127192610153|
|            2023-10-30|  20231127192610153|
+----------------------+-------------------+
only showing top 10 rows

+-----------------+----------------+---------+-------------------------------------------+
|timestamp        |input_group_size|state    |involved_partitions                        |
+-----------------+----------------+---------+-------------------------------------------+
|20231127192719681|4               |COMPLETED|2023-10-28,2023-10-29,2023-10-30,2023-10-31|
|20231127192505457|4               |COMPLETED|2023-10-28,2023-10-29,2023-10-30,2023-10-31|
+-----------------+----------------+---------+-------------------------------------------+
```

# Running Delta Streamer a third time and then clustering

HoodieClusteringJob: Clustering with basePath: file:///Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E5/hudidb/orders, tableName: orders, runningMode: scheduleAndExecute success

```
+----------------------+-------------------+
|_hoodie_partition_path|_hoodie_commit_time|
+----------------------+-------------------+
|            2023-11-17|  20231127192036634|
|            2023-11-17|  20231127192850185|
|            2023-11-17|  20231127192650158|
|            2023-11-17|  20231127192630161|
|            2023-11-13|  20231127192036634|
|            2023-11-13|  20231127192056631|
|            2023-11-13|  20231127192610153|
|            2023-11-13|  20231127192850185|
|            2023-11-05|  20231127192850185|
|            2023-11-13|  20231127192910187|
+----------------------+-------------------+
only showing top 10 rows

+-----------------+----------------+---------+-------------------------------------------+
|timestamp        |input_group_size|state    |involved_partitions                        |
+-----------------+----------------+---------+-------------------------------------------+
|20231127192929734|4               |COMPLETED|2023-10-28,2023-10-29,2023-10-30,2023-10-31|
|20231127192719681|4               |COMPLETED|2023-10-28,2023-10-29,2023-10-30,2023-10-31|
|20231127192505457|4               |COMPLETED|2023-10-28,2023-10-29,2023-10-30,2023-10-31|
+-----------------+----------------+---------+-------------------------------------------+
```

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
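
As a quick cross-check on the commit times reported above, the sketch below parses `_hoodie_commit_time` values and prints the gap between consecutive Delta Streamer commits, which should sit close to the `--min-sync-interval-seconds 20` setting. It assumes the instants follow the 17-digit `yyyyMMddHHmmssSSS` layout; the helper names `parse_instant` and `gap_seconds` are hypothetical and not part of Hudi.

```python
from datetime import datetime


def parse_instant(instant: str) -> datetime:
    """Parse a 17-digit instant (assumed yyyyMMddHHmmssSSS) into a datetime."""
    if len(instant) != 17 or not instant.isdigit():
        raise ValueError(f"expected a 17-digit instant, got {instant!r}")
    return datetime(
        int(instant[0:4]),    # year
        int(instant[4:6]),    # month
        int(instant[6:8]),    # day
        int(instant[8:10]),   # hour
        int(instant[10:12]),  # minute
        int(instant[12:14]),  # second
        int(instant[14:17]) * 1000,  # milliseconds converted to microseconds
    )


def gap_seconds(earlier: str, later: str) -> float:
    """Return the number of seconds between two instants."""
    return (parse_instant(later) - parse_instant(earlier)).total_seconds()


# Consecutive commit times copied from the tables in this comment.
commit_pairs = [
    ("20231127192036634", "20231127192056631"),
    ("20231127192610153", "20231127192630161"),
    ("20231127192630161", "20231127192650158"),
    ("20231127192850185", "20231127192910187"),
]

for earlier, later in commit_pairs:
    print(f"{earlier} -> {later}: {gap_seconds(earlier, later):.3f} s")
```

Each pair comes from the same continuous run, so a gap near 20 seconds suggests the streamer kept committing on schedule while the clustering instants were completing.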