[jira] [Commented] (HUDI-2164) Build cluster plan and execute this plan at once for HoodieClusteringJob

ASF GitHub Bot (Jira) Wed, 14 Jul 2021 20:49:06 -0700


    [ https://issues.apache.org/jira/browse/HUDI-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380998#comment-17380998 ]


ASF GitHub Bot commented on HUDI-2164:
--------------------------------------

zhangyue19921010 commented on a change in pull request #3259:
URL: https://github.com/apache/hudi/pull/3259#discussion_r670111402



##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieClusteringJob.java
##########
@@ -171,4 +200,38 @@ private int doCluster(JavaSparkContext jsc) throws Exception {
       return client.scheduleClustering(Option.empty());
     }
   }
+
+  @TestOnly
+  public int doScheduleAndCluster() throws Exception {
+    return this.doScheduleAndCluster(jsc);
+  }
+
+  public int doScheduleAndCluster(JavaSparkContext jsc) throws Exception {
+    LOG.info("Step 1: Do schedule");
+    String schemaStr = getSchemaFromLatestInstant();
+    try (SparkRDDWriteClient client = UtilHelpers.createHoodieClient(jsc, cfg.basePath, schemaStr, cfg.parallelism, Option.empty(), props)) {
+
+      Option<String> instantTime;
+      if (cfg.clusteringInstantTime != null) {
+        client.scheduleClusteringAtInstant(cfg.clusteringInstantTime, Option.empty());
+        instantTime = Option.of(cfg.clusteringInstantTime);
+      } else {
+        instantTime = client.scheduleClustering(Option.empty());
+      }
+
+      int result = instantTime.isPresent() ? 0 : -1;

Review comment:
       Actually, doSchedule() and doCluster() already exist. But if 
doScheduleAndCluster() called doSchedule() and doCluster() directly, it would 
start and stop SparkRDDWriteClient twice, which is expensive and unnecessary.
   
   It may be better to let the schedule action and the cluster action share a 
single SparkRDDWriteClient.
   
   For example, calling them separately would start and stop the Timeline 
service twice. Each start logs the following:
   ```
   21/07/15 11:05:11 INFO EmbeddedTimelineService: Starting Timeline service !!
   21/07/15 11:05:11 INFO EmbeddedTimelineService: Overriding hostIp to (localhost) found in spark-conf. It was null
   21/07/15 11:05:11 INFO FileSystemViewManager: Creating View Manager with storage type :MEMORY
   21/07/15 11:05:11 INFO FileSystemViewManager: Creating in-memory based Table View
   21/07/15 11:05:11 INFO log: Logging initialized @4500ms to org.apache.hudi.org.eclipse.jetty.util.log.Slf4jLog
   21/07/15 11:05:11 INFO Javalin: 
              __                      __ _
             / /____ _ _   __ ____ _ / /(_)____
        __  / // __ `/| | / // __ `// // // __ \
       / /_/ // /_/ / | |/ // /_/ // // // / / /
       \____/ \__,_/  |___/ \__,_//_//_//_/ /_/
   
           https://javalin.io/documentation
   
   21/07/15 11:05:11 INFO Javalin: Starting Javalin ...
   ```
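
The cost argument above can be illustrated with a minimal, self-contained sketch. It does not use the Hudi API: `CountingClient` is a hypothetical stand-in for an expensive client such as SparkRDDWriteClient, the instant time it returns is made up, and the method names are illustrative only. It just counts how often the client is started and stopped in each approach.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an expensive client (NOT the Hudi API).
// It only counts how many times it is started and stopped.
final class CountingClient implements AutoCloseable {
    static final AtomicInteger STARTS = new AtomicInteger();
    static final AtomicInteger STOPS = new AtomicInteger();

    CountingClient() {
        STARTS.incrementAndGet(); // stands in for starting the timeline service
    }

    // Pretend to schedule a plan; the instant time is a made-up value.
    Optional<String> schedule() {
        return Optional.of("20210715110511");
    }

    // Pretend to execute the plan; 0 means success, -1 means failure.
    int cluster(String instantTime) {
        return instantTime.isEmpty() ? -1 : 0;
    }

    @Override
    public void close() {
        STOPS.incrementAndGet(); // stands in for stopping the timeline service
    }
}

final class SharedClientSketch {
    // Approach 1: schedule and cluster each open and close their own client.
    static int scheduleThenClusterSeparately() {
        Optional<String> instant;
        try (CountingClient client = new CountingClient()) {
            instant = client.schedule();
        }
        if (!instant.isPresent()) {
            return -1;
        }
        try (CountingClient client = new CountingClient()) {
            return client.cluster(instant.get());
        }
    }

    // Approach 2: both steps share one client inside a single try-with-resources.
    static int scheduleAndClusterShared() {
        try (CountingClient client = new CountingClient()) {
            Optional<String> instant = client.schedule();
            return instant.map(client::cluster).orElse(-1);
        }
    }

    public static void main(String[] args) {
        CountingClient.STARTS.set(0);
        scheduleThenClusterSeparately();
        System.out.println("separate clients started: " + CountingClient.STARTS.get());

        CountingClient.STARTS.set(0);
        scheduleAndClusterShared();
        System.out.println("shared clients started: " + CountingClient.STARTS.get());
    }
}
```

In this sketch the separate path constructs the client twice and the shared path constructs it once, which is the saving the comment argues for.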





> Build cluster plan and execute this plan at once for HoodieClusteringJob
> ------------------------------------------------------------------------
>
>                 Key: HUDI-2164
>                 URL: https://issues.apache.org/jira/browse/HUDI-2164
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Yue Zhang
>            Priority: Major
>              Labels: pull-request-available
>
> Currently, Hudi lets users submit a HoodieClusteringJob either to build a 
> clustering plan or to execute a clustering plan, through the --schedule or 
> --instant-time config.
> To trigger a clustering job, a user has to:
>  # Submit a HoodieClusteringJob to build a clustering plan through the 
> --schedule config.
>  # Copy the created clustering instant time from the log output.
>  # Submit the HoodieClusteringJob again to execute the created clustering 
> plan through the --instant-time config.
> The pain point is that triggering a clustering takes too many steps, and the 
> instant time has to be copied from the log file manually, so the process 
> cannot be automated.
>  
> I have raised a PR that adds a new config named --mode (-m for short):
> ||--mode||remarks||
> |execute|Execute a clustering plan at the given instant, which means --instant-time is required. This is the default value.|
> |schedule|Make a clustering plan.|
> |*scheduleAndExecute*|Make a clustering plan first and execute that plan immediately.|
> Now users can use --mode scheduleAndExecute to build a clustering plan and 
> execute it at once using HoodieClusteringJob.
>  
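
The mode table above could be handled roughly as follows. This is only a sketch under stated assumptions, not the code in the PR: the class, enum, and method names are hypothetical, and matching the mode string case-insensitively is an assumption. The three mode values, the default of execute, and the requirement that execute needs --instant-time come from the table.

```java
import java.util.Locale;

// Hypothetical sketch of parsing and validating the --mode option.
// None of these names come from HoodieClusteringJob itself.
final class ClusteringModeSketch {
    enum Mode { EXECUTE, SCHEDULE, SCHEDULE_AND_EXECUTE }

    // Map the raw --mode string to a Mode; a missing value falls back to
    // execute, which the table describes as the default.
    static Mode parseMode(String raw) {
        if (raw == null) {
            return Mode.EXECUTE;
        }
        // Case-insensitive matching is an assumption of this sketch.
        switch (raw.trim().toLowerCase(Locale.ROOT)) {
            case "execute":
                return Mode.EXECUTE;
            case "schedule":
                return Mode.SCHEDULE;
            case "scheduleandexecute":
                return Mode.SCHEDULE_AND_EXECUTE;
            default:
                throw new IllegalArgumentException("Unsupported --mode: " + raw);
        }
    }

    // Per the table, execute needs an instant time; the other modes do not.
    static void validate(Mode mode, String instantTime) {
        if (mode == Mode.EXECUTE && instantTime == null) {
            throw new IllegalArgumentException("--instant-time is required for execute mode");
        }
    }

    public static void main(String[] args) {
        Mode mode = parseMode("scheduleAndExecute");
        validate(mode, null); // no instant time needed in this mode
        System.out.println("Selected mode: " + mode);
    }
}
```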



