[ https://issues.apache.org/jira/browse/HUDI-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380998#comment-17380998 ]
ASF GitHub Bot commented on HUDI-2164:
--------------------------------------

zhangyue19921010 commented on a change in pull request #3259:
URL: https://github.com/apache/hudi/pull/3259#discussion_r670111402

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieClusteringJob.java
##########
@@ -171,4 +200,38 @@ private int doCluster(JavaSparkContext jsc) throws Exception {
       return client.scheduleClustering(Option.empty());
     }
   }
+
+  @TestOnly
+  public int doScheduleAndCluster() throws Exception {
+    return this.doScheduleAndCluster(jsc);
+  }
+
+  public int doScheduleAndCluster(JavaSparkContext jsc) throws Exception {
+    LOG.info("Step 1: Do schedule");
+    String schemaStr = getSchemaFromLatestInstant();
+    try (SparkRDDWriteClient client = UtilHelpers.createHoodieClient(jsc, cfg.basePath, schemaStr, cfg.parallelism, Option.empty(), props)) {
+
+      Option<String> instantTime;
+      if (cfg.clusteringInstantTime != null) {
+        client.scheduleClusteringAtInstant(cfg.clusteringInstantTime, Option.empty());
+        instantTime = Option.of(cfg.clusteringInstantTime);
+      } else {
+        instantTime = client.scheduleClustering(Option.empty());
+      }
+
+      int result = instantTime.isPresent() ? 0 : -1;

Review comment:
Emmmm, actually, there are already doSchedule() and doCluster() functions. But if doScheduleAndCluster() calls doSchedule() and doCluster() directly, it will start and stop SparkRDDWriteClient twice, which is expensive and unnecessary. It is probably better to let the schedule action and the cluster action share one SparkRDDWriteClient. Otherwise, for example, the Timeline service is started and stopped twice:
```
21/07/15 11:05:11 INFO EmbeddedTimelineService: Starting Timeline service !!
21/07/15 11:05:11 INFO EmbeddedTimelineService: Overriding hostIp to (localhost) found in spark-conf. It was null
21/07/15 11:05:11 INFO FileSystemViewManager: Creating View Manager with storage type :MEMORY
21/07/15 11:05:11 INFO FileSystemViewManager: Creating in-memory based Table View
21/07/15 11:05:11 INFO log: Logging initialized @4500ms to org.apache.hudi.org.eclipse.jetty.util.log.Slf4jLog
21/07/15 11:05:11 INFO Javalin:
           __                      __ _
          / /____ _ _   __ ____ _ / /(_)____
     __  / // __ `/| | / // __ `// // // __ \
    / /_/ // /_/ / | |/ // /_/ // // // / / /
    \____/ \__,_/  |___/ \__,_//_//_//_/ /_/

        https://javalin.io/documentation

21/07/15 11:05:11 INFO Javalin: Starting Javalin ...
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Build cluster plan and execute this plan at once for HoodieClusteringJob
> ------------------------------------------------------------------------
>
>                 Key: HUDI-2164
>                 URL: https://issues.apache.org/jira/browse/HUDI-2164
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Yue Zhang
>            Priority: Major
>              Labels: pull-request-available
>
> For now, Hudi lets users submit a HoodieClusteringJob either to build a
> clustering plan or to execute a clustering plan, through the --schedule or
> --instant-time config.
> If users want to trigger a clustering job, they have to
> # Submit a HoodieClusteringJob to build a clustering plan through the --schedule
> config.
> # Copy the created clustering instant time from the log info.
> # Submit the HoodieClusteringJob again to execute the created clustering
> plan through the --instant-time config.
> The pain point is that triggering a clustering takes too many steps, and the
> instant time has to be copied from the log file manually, so the process
> cannot be automated.
>
> I have raised a PR that adds a new config named --mode, or -m for short:
> ||--mode||remarks||
> |execute|Execute a clustering plan at the given instant, which means --instant-time is required here. This is the default value.|
> |schedule|Make a clustering plan.|
> |*scheduleAndExecute*|Make a clustering plan first and execute that plan immediately.|
> Now users can use --mode scheduleAndExecute to build a clustering plan and execute
> it at once using HoodieClusteringJob.
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
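
The three --mode values from the issue, together with the review point about sharing one client between the schedule and cluster steps, could be sketched roughly as below. This is a minimal, self-contained illustration and not the code from the pull request: `FakeClient`, `run`, and the instant time `20210715110511` are hypothetical stand-ins rather than Hudi classes or real output, and the counters exist only to make visible how often a client is started and stopped.

```java
import java.util.Optional;

/** Illustrative sketch only; none of these types are Hudi classes. */
public class ClusteringModeSketch {

    /** Hypothetical stand-in for an expensive client such as SparkRDDWriteClient. */
    static class FakeClient implements AutoCloseable {
        static int starts = 0; // number of clients created so far
        static int stops = 0;  // number of clients closed so far

        FakeClient() {
            starts++;
        }

        /** Pretends to build a clustering plan and returns a made-up instant time. */
        Optional<String> scheduleClustering() {
            return Optional.of("20210715110511");
        }

        /** Pretends to execute the plan at the given instant; 0 means success. */
        int cluster(String instantTime) {
            return (instantTime == null || instantTime.isEmpty()) ? -1 : 0;
        }

        @Override
        public void close() {
            stops++;
        }
    }

    /** Dispatches on the --mode value; returns 0 on success and -1 on failure. */
    static int run(String mode, String instantTime) {
        switch (mode) {
            case "schedule":
                try (FakeClient client = new FakeClient()) {
                    return client.scheduleClustering().isPresent() ? 0 : -1;
                }
            case "execute":
                if (instantTime == null) {
                    throw new IllegalArgumentException("--instant-time is required in execute mode");
                }
                try (FakeClient client = new FakeClient()) {
                    return client.cluster(instantTime);
                }
            case "scheduleAndExecute":
                // One client serves both steps, so it is started and stopped only once.
                try (FakeClient client = new FakeClient()) {
                    Optional<String> scheduled = client.scheduleClustering();
                    if (!scheduled.isPresent()) {
                        return -1;
                    }
                    return client.cluster(scheduled.get());
                }
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }

    public static void main(String[] args) {
        int exitCode = run("scheduleAndExecute", null);
        System.out.println("exit=" + exitCode
                + " starts=" + FakeClient.starts
                + " stops=" + FakeClient.stops);
    }
}
```

In this sketch the scheduleAndExecute path creates a single client for both steps, so each counter ends at 1 after one run, whereas running the schedule path and then the execute path separately would create and close two clients.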