soumilshah1995 commented on issue #10165:
URL: https://github.com/apache/hudi/issues/10165#issuecomment-1828865064

   # Test Passed
   
   Ran DeltaStreamer (HoodieStreamer) in continuous mode:
   ```
   
   spark-submit \
       --class org.apache.hudi.utilities.streamer.HoodieStreamer \
       --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0 \
       --properties-file spark-config.properties \
       --master 'local[*]' \
       --executor-memory 1g \
       jar/hudi-utilities-slim-bundle_2.12-0.14.0.jar \
       --continuous \
       --source-limit 4000000 \
       --min-sync-interval-seconds 20 \
       --table-type COPY_ON_WRITE \
       --op UPSERT \
       --source-ordering-field ts \
       --source-class org.apache.hudi.utilities.sources.CsvDFSSource \
       --target-base-path file:///Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E5/hudidb/orders \
       --target-table orders \
       --props hudi_tbl.props
   ```
   
   # CONF
   ```
   hoodie.datasource.write.recordkey.field=order_id
   hoodie.datasource.write.partitionpath.field=order_date
   hoodie.streamer.source.dfs.root=file:///Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E5/sampledata/orders
   hoodie.datasource.write.precombine.field=ts
   hoodie.clustering.inline=false
   hoodie.clustering.async.enabled=true
   hoodie.write.concurrency.mode=optimistic_concurrency_control
   hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider
   hoodie.deltastreamer.csv.header=true
   hoodie.deltastreamer.csv.sep=\t
   
   ```
   
   
   # Let's see the commits and partition paths
   
   ```
   +----------------------+-------------------+
   |_hoodie_partition_path|_hoodie_commit_time|
   +----------------------+-------------------+
   |            2023-10-30|  20231127192056631|
   |            2023-11-17|  20231127192036634|
   |            2023-11-13|  20231127192036634|
   |            2023-10-30|  20231127192036634|
   |            2023-11-13|  20231127192056631|
   |            2023-11-17|  20231127192056631|
   |            2023-11-01|  20231127192036634|
   |            2023-11-10|  20231127192036634|
   |            2023-11-10|  20231127192056631|
   |            2023-11-05|  20231127192056631|
   +----------------------+-------------------+
   only showing top 10 rows
   
   +---------+----------------+-----+-------------------+
   |timestamp|input_group_size|state|involved_partitions|
   +---------+----------------+-----+-------------------+
   +---------+----------------+-----+-------------------+
   ```
   
   # Let's run async clustering on only the partitions matching the 2023-10 pattern
   
   <img width="1649" alt="Screenshot 2023-11-27 at 7 24 58 PM" src="https://github.com/apache/hudi/assets/39345855/c87b87e1-3511-4ddf-b040-aa70654270d3">
   
   HoodieClusteringJob: Clustering with basePath: file:///Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E5/hudidb/orders, tableName: orders, runningMode: scheduleAndExecute success
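
The exact clustering command is only visible in the screenshot, so the following is a rough sketch rather than the command that was actually run. It assumes the partition filter was passed through `hoodie.clustering.plan.strategy.partition.regex.pattern`, that the pattern was something like `2023-10-.*`, and that the job was launched with the same bundle jars as the streamer. The config key, the pattern, and the helper names `clustering_command` and `matching_partitions` are all assumptions. The regex check uses Python's `re` with full-match semantics, which may differ from how Hudi evaluates the pattern.

```python
import re
import shlex

BASE_PATH = (
    "file:///Users/soumilshah/IdeaProjects/SparkProject/"
    "apache-hudi-delta-streamer-labs/E5/hudidb/orders"
)

# Assumption: the pattern used in the test is not shown in the text.
PARTITION_REGEX = r"2023-10-.*"

# Assumption: config key used to restrict clustering to matching partitions.
REGEX_KEY = "hoodie.clustering.plan.strategy.partition.regex.pattern"


def clustering_command(base_path, table, regex):
    """Build a spark-submit argument list for a one-off HoodieClusteringJob run."""
    return [
        "spark-submit",
        "--class", "org.apache.hudi.utilities.HoodieClusteringJob",
        "--packages", "org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0",
        "--master", "local[*]",
        "jar/hudi-utilities-slim-bundle_2.12-0.14.0.jar",
        "--base-path", base_path,
        "--table-name", table,
        "--mode", "scheduleAndExecute",  # matches runningMode in the log line
        "--spark-memory", "1g",
        "--hoodie-conf", f"{REGEX_KEY}={regex}",
    ]


def matching_partitions(partitions, regex):
    """Return the partitions that fully match the regex (Python `re`, not Java)."""
    pattern = re.compile(regex)
    return [p for p in partitions if pattern.fullmatch(p)]


# Partition values that appear in the tables in this comment.
partitions = [
    "2023-10-28", "2023-10-29", "2023-10-30", "2023-10-31",
    "2023-11-01", "2023-11-05", "2023-11-10", "2023-11-13", "2023-11-17",
]

selected = matching_partitions(partitions, PARTITION_REGEX)
print(selected)
print(shlex.join(clustering_command(BASE_PATH, "orders", PARTITION_REGEX)))
```

With this pattern, only the four `2023-10-*` partitions are selected, which is consistent with the `involved_partitions` column reported for each clustering run.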
   
   # Great 
   ```
   +----------------------+-------------------+
   |_hoodie_partition_path|_hoodie_commit_time|
   +----------------------+-------------------+
   |            2023-10-30|  20231127192056631|
   |            2023-11-17|  20231127192036634|
   |            2023-11-13|  20231127192036634|
   |            2023-10-30|  20231127192036634|
   |            2023-11-13|  20231127192056631|
   |            2023-11-17|  20231127192056631|
   |            2023-11-01|  20231127192036634|
   |            2023-11-10|  20231127192036634|
   |            2023-11-10|  20231127192056631|
   |            2023-11-05|  20231127192056631|
   +----------------------+-------------------+
   only showing top 10 rows

   +-----------------+----------------+---------+-------------------------------------------+
   |timestamp        |input_group_size|state    |involved_partitions                        |
   +-----------------+----------------+---------+-------------------------------------------+
   |20231127192505457|4               |COMPLETED|2023-10-28,2023-10-29,2023-10-30,2023-10-31|
   +-----------------+----------------+---------+-------------------------------------------+
   ```
   
   # Running DeltaStreamer Again and then running Async Clustering 
   
   INFO HoodieClusteringJob: Clustering with basePath: file:///Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E5/hudidb/orders, tableName: orders, runningMode: scheduleAndExecute success
   
   # Great 
   ```
   +----------------------+-------------------+
   |_hoodie_partition_path|_hoodie_commit_time|
   +----------------------+-------------------+
   |            2023-10-30|  20231127192056631|
   |            2023-11-17|  20231127192036634|
   |            2023-11-17|  20231127192650158|
   |            2023-11-17|  20231127192630161|
   |            2023-11-13|  20231127192036634|
   |            2023-10-30|  20231127192036634|
   |            2023-10-30|  20231127192630161|
   |            2023-11-13|  20231127192056631|
   |            2023-11-13|  20231127192610153|
   |            2023-10-30|  20231127192610153|
   +----------------------+-------------------+
   only showing top 10 rows

   +-----------------+----------------+---------+-------------------------------------------+
   |timestamp        |input_group_size|state    |involved_partitions                        |
   +-----------------+----------------+---------+-------------------------------------------+
   |20231127192719681|4               |COMPLETED|2023-10-28,2023-10-29,2023-10-30,2023-10-31|
   |20231127192505457|4               |COMPLETED|2023-10-28,2023-10-29,2023-10-30,2023-10-31|
   +-----------------+----------------+---------+-------------------------------------------+
   ```
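
The commit times in these tables can be read as wall-clock timestamps, which makes it easy to check that the streamer commits line up with `--min-sync-interval-seconds 20`. The sketch below assumes the 17-digit instants follow a `yyyyMMddHHmmssSSS` layout. The helper names `parse_instant` and `gaps_seconds` are made up for illustration.

```python
from datetime import datetime, timedelta


def parse_instant(instant):
    """Parse a 17-digit instant (assumed yyyyMMddHHmmssSSS) into a datetime."""
    base = datetime.strptime(instant[:14], "%Y%m%d%H%M%S")
    return base + timedelta(milliseconds=int(instant[14:17]))


def gaps_seconds(instants):
    """Seconds between consecutive instants, after sorting them."""
    times = sorted(parse_instant(i) for i in instants)
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


# Streamer commit times copied from the tables in this comment.
first_run = ["20231127192036634", "20231127192056631"]
second_run = ["20231127192610153", "20231127192630161", "20231127192650158"]

for name, run in (("first run", first_run), ("second run", second_run)):
    print(name, gaps_seconds(run))
```

Within each streamer run the gaps come out at roughly 20 seconds, while the clustering instants (`20231127192505457`, `20231127192719681`) fall between the runs.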
   
   # Running DeltaStreamer a third time and then clustering
   
   HoodieClusteringJob: Clustering with basePath: file:///Users/soumilshah/IdeaProjects/SparkProject/apache-hudi-delta-streamer-labs/E5/hudidb/orders, tableName: orders, runningMode: scheduleAndExecute success
   
   ```
   +----------------------+-------------------+
   |_hoodie_partition_path|_hoodie_commit_time|
   +----------------------+-------------------+
   |            2023-11-17|  20231127192036634|
   |            2023-11-17|  20231127192850185|
   |            2023-11-17|  20231127192650158|
   |            2023-11-17|  20231127192630161|
   |            2023-11-13|  20231127192036634|
   |            2023-11-13|  20231127192056631|
   |            2023-11-13|  20231127192610153|
   |            2023-11-13|  20231127192850185|
   |            2023-11-05|  20231127192850185|
   |            2023-11-13|  20231127192910187|
   +----------------------+-------------------+
   only showing top 10 rows

   +-----------------+----------------+---------+-------------------------------------------+
   |timestamp        |input_group_size|state    |involved_partitions                        |
   +-----------------+----------------+---------+-------------------------------------------+
   |20231127192929734|4               |COMPLETED|2023-10-28,2023-10-29,2023-10-30,2023-10-31|
   |20231127192719681|4               |COMPLETED|2023-10-28,2023-10-29,2023-10-30,2023-10-31|
   |20231127192505457|4               |COMPLETED|2023-10-28,2023-10-29,2023-10-30,2023-10-31|
   +-----------------+----------------+---------+-------------------------------------------+
   ```
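
To make the "test passed" claim checkable without eyeballing the table, here is a small sketch that parses the clustering table above and asserts that every run completed and touched only `2023-10-*` partitions. The parser `parse_show_output` is a hypothetical helper written for this tabular text layout, not part of Hudi or Spark.

```python
SHOW_OUTPUT = """
+-----------------+----------------+---------+-------------------------------------------+
|timestamp        |input_group_size|state    |involved_partitions                        |
+-----------------+----------------+---------+-------------------------------------------+
|20231127192929734|4               |COMPLETED|2023-10-28,2023-10-29,2023-10-30,2023-10-31|
|20231127192719681|4               |COMPLETED|2023-10-28,2023-10-29,2023-10-30,2023-10-31|
|20231127192505457|4               |COMPLETED|2023-10-28,2023-10-29,2023-10-30,2023-10-31|
+-----------------+----------------+---------+-------------------------------------------+
"""


def parse_show_output(text):
    """Turn a pipe-delimited text table into a list of dicts keyed by header."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Data and header rows start and end with '|'; border rows start with '+'.
        if line.startswith("|") and line.endswith("|"):
            rows.append([cell.strip() for cell in line[1:-1].split("|")])
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, cells)) for cells in body]


runs = parse_show_output(SHOW_OUTPUT)
for run in runs:
    involved = run["involved_partitions"].split(",")
    assert run["state"] == "COMPLETED", run
    assert all(p.startswith("2023-10-") for p in involved), involved

print(f"{len(runs)} clustering runs, all limited to 2023-10 partitions")
```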
   

