[ https://issues.apache.org/jira/browse/YARN-6136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Billie Rinaldi resolved YARN-6136.
----------------------------------
    Resolution: Invalid

This issue is caused by YARN-2571, which has not been committed and is resolved as Won't Fix.

> YARN registry service should avoid scanning whole ZK tree for every
> container/application finish
> ------------------------------------------------------------------------------------------------
>
>                 Key: YARN-6136
>                 URL: https://issues.apache.org/jira/browse/YARN-6136
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: api, resourcemanager
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>            Priority: Critical
>
> In the existing registry service implementation, the purge operation is
> triggered by every container finish event:
> {code}
>   public void onContainerFinished(ContainerId id) throws IOException {
>     LOG.info("Container {} finished, purging container-level records",
>         id);
>     purgeRecordsAsync("/",
>         id.toString(),
>         PersistencePolicies.CONTAINER);
>   }
> {code}
> Since this happens on every container finish, it essentially scans all (or
> almost all) ZK nodes from the root.
> We have a cluster with hundreds of ZK nodes for the service registry and
> 20K+ ZK nodes for other purposes. The existing implementation can generate
> massive numbers of ZK operations and internal Java objects
> (RegistryPathStatus). The RM becomes very unstable when there are batches
> of container finish events, because of full GC pauses and ZK connection
> failures.
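To make the cost concrete, here is a minimal sketch (not part of any patch) that walks a ZK tree with Apache Curator and counts the znodes a root-anchored traversal must touch per finished container. The class name, the connect string, and the {{/registry}} root path are illustrative assumptions; only the Curator calls are real API.

{code}
import java.util.List;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Hypothetical illustration class, not from the Hadoop source tree.
public class FullTreeScanCost {

  // Recursively counts znodes under 'path'. Each znode costs one
  // getChildren() round trip, which is the traversal a purge anchored
  // at "/" performs for every finished container.
  static int countNodes(CuratorFramework client, String path) throws Exception {
    int visited = 1; // this znode
    List<String> children = client.getChildren().forPath(path);
    for (String child : children) {
      String childPath = "/".equals(path) ? "/" + child : path + "/" + child;
      visited += countNodes(client, childPath);
    }
    return visited;
  }

  public static void main(String[] args) throws Exception {
    // Placeholder connect string for a local test ensemble.
    try (CuratorFramework client = CuratorFrameworkFactory.newClient(
        "localhost:2181", new ExponentialBackoffRetry(1000, 3))) {
      client.start();
      client.blockUntilConnected();
      // Anchored at "/": visits every znode in the ensemble (20K+ in the
      // cluster described above), once per container finish.
      System.out.println("scanned from /:         " + countNodes(client, "/"));
      // Anchored at the registry root (assumed "/registry" here): visits
      // only the few hundred registry records.
      System.out.println("scanned from /registry: " + countNodes(client, "/registry"));
    }
  }
}
{code}

Anchoring the purge at the registry's own root instead of "/" would keep the per-container cost proportional to the registry subtree rather than the whole ensemble, which is what the title of this JIRA asks for.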