[ https://issues.apache.org/jira/browse/YARN-5547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15438921#comment-15438921 ]
Jason Lowe commented on YARN-5547:
----------------------------------

Thanks for the patch! What I meant about the leak is a scenario like this:
# NM is running version V, which introduced a new key K that is associated with containers.
# A container is running, which causes K to be written to the state store.
# User does a rolling downgrade to V-1. The code ignores unrecognized key K.
# The container completes and the container is removed from the state store. This only removes the container keys version V-1 knows about, and K is not one of those keys.
# At this point K has been leaked in the state store.
# That leak will be permanent until a rolling upgrade to >= V. Even then K might not be cleaned up, since all the other container state has been removed, which probably interferes with the typical recovery flow for that key type.

There are a couple of risks when cleaning up unrecognized keys. One is that
the old version may remove the key too early in the lifecycle of that state,
so that if we do a rolling upgrade back to the version that works with those
keys, we have incorrectly destroyed the state. We probably need to think more
about the ramifications of cleaning unrecognized keys and when we should or
shouldn't do so. Appreciate any thoughts on this.

The other risk is that doing this cleaning will add a place where the NM
reads the state store as it scans for keys to remove, whereas previously it
only ever wrote to the store after the initial recovery on startup. Writes to
leveldb are typically very fast, whereas reads could be much slower depending
upon how much the database needs to be compacted and how many blocks are
involved in the scan. This is likely a minor concern, especially with the
recently added periodic full compaction of the store, but it will impact
state store performance to some degree.

As for the patch, the changes will make the NM more tolerant of new container
keys, but there are other places where unexpected keys will break the state
store recovery.
loadResourceTrackerState and loadUserLocalizedResources are some other places
that should be updated, and there are similar questions there as to what
should be done about cleanup of unexpected keys.

> NMLeveldbStateStore should be more tolerant of unknown keys
> -----------------------------------------------------------
>
>                 Key: YARN-5547
>                 URL: https://issues.apache.org/jira/browse/YARN-5547
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>    Affects Versions: 2.6.0
>            Reporter: Jason Lowe
>            Assignee: Ajith S
>         Attachments: YARN-5547.01.patch
>
>
> Whenever new keys are added to the NM state store it will break rolling
> downgrades because the code will throw if it encounters an unrecognized key.
> If instead it skipped unrecognized keys it could be simpler to continue
> supporting rolling downgrades. We need to define the semantics of
> unrecognized keys when containers and apps are cleaned up, e.g.: we may want
> to delete all keys underneath an app or container directory when it is being
> removed from the state store to prevent leaking unrecognized keys.
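
To make the leak scenario and the "delete all keys underneath a container" idea concrete, here is a minimal Java sketch. It is not the NM code: a sorted in-memory TreeMap stands in for leveldb, and the key layout (`containers/<id>/...`), the suffix list, the `/newkey` name for K, and all class and method names are hypothetical, chosen only for illustration. It compares removing only the keys V-1 knows about with removing everything under the container's prefix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy model of the leak. A sorted in-memory map stands in for leveldb. */
final class UnknownKeyLeakSketch {
  // Hypothetical key layout; not the real NM state store key names.
  static final String CONTAINERS_PREFIX = "containers/";

  // Per-container key suffixes that the older version (V-1) knows about.
  static final List<String> KNOWN_SUFFIXES =
      Arrays.asList("/request", "/launched", "/exitcode");

  /** State as written by version V: the known keys plus the new key K. */
  static NavigableMap<String, String> storeWrittenByV(String containerId) {
    NavigableMap<String, String> store = new TreeMap<>();
    for (String suffix : KNOWN_SUFFIXES) {
      store.put(CONTAINERS_PREFIX + containerId + suffix, "value");
    }
    store.put(CONTAINERS_PREFIX + containerId + "/newkey", "value"); // K
    return store;
  }

  /** Removal as V-1 would do it: write-only, deletes only known keys. */
  static void removeKnownKeys(NavigableMap<String, String> store,
                              String containerId) {
    for (String suffix : KNOWN_SUFFIXES) {
      store.remove(CONTAINERS_PREFIX + containerId + suffix);
    }
  }

  /**
   * Alternative: delete everything under the container's prefix.
   * This needs a range scan, i.e. a read of the store.
   * Returns the number of keys removed.
   */
  static int removeAllKeysUnder(NavigableMap<String, String> store,
                                String containerId) {
    // Trailing slash keeps "container_1" from matching "container_10".
    String prefix = CONTAINERS_PREFIX + containerId + "/";
    List<String> doomed = new ArrayList<>();
    for (String key : store.tailMap(prefix, true).keySet()) {
      if (!key.startsWith(prefix)) {
        break; // keys are sorted, so we are past this container's range
      }
      doomed.add(key);
    }
    for (String key : doomed) {
      store.remove(key);
    }
    return doomed.size();
  }

  public static void main(String[] args) {
    NavigableMap<String, String> leaky = storeWrittenByV("container_1");
    removeKnownKeys(leaky, "container_1");
    System.out.println("after known-key removal: " + leaky.keySet());

    NavigableMap<String, String> clean = storeWrittenByV("container_1");
    removeAllKeysUnder(clean, "container_1");
    System.out.println("after prefix removal:    " + clean.keySet());
  }
}
```

In this toy model the known-key removal leaves K behind, while the prefix removal clears it. The prefix variant is also where both risks above show up: it deletes state the running version does not understand, and its range scan is the extra read of the store that the write-only path avoids.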