homatthew commented on code in PR #3561:
URL: https://github.com/apache/gobblin/pull/3561#discussion_r972323434
##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -339,6 +342,15 @@ protected void startUp() throws Exception {
LOGGER.info("ApplicationMaster registration response: " + response);
this.maxResourceCapacity =
Optional.of(response.getMaximumResourceCapability());
+ // All previous helix instances should be purged on startup. Gobblin task
runners are stateless from helix
+ // perspective because all important state is persisted separately in
Workunit State Store or Watermark store.
+ // Offline duration of 0 means any offline instance should be purged
(Note: there aren't any online instances
+ // when this code runs, this is during startup before any containers are
allocated).
+ LOGGER.info("Purging offline helix instances before allocating containers
for helixClusterName={}, connectionString={}", helixManager.getClusterName(),
helixManager.getMetadataStoreConnectionString());
+ long offlineDuration = 0;
+ this.helixAdmin.purgeOfflineInstances(this.helixManager.getClusterName(),
offlineDuration);
Review Comment:
Well kind of. As mentioned
> But adding a timeout adds a potential bug / risk where we timeout and then
helix starts purging instances while we are allocating new instances (bad bug
with nondeterministic behavior)
We put ourselves at risk of having a live instances without an INSTANCE
config. That would essentially bork the live instance and I don't know that the
system would auto recover. Although I do think we should send a GTE if this
call is taking too long
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]