[ 
https://issues.apache.org/jira/browse/YARN-5547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15438921#comment-15438921
 ] 

Jason Lowe commented on YARN-5547:
----------------------------------

Thanks for the patch!

What I meant about the leak is a scenario like this:
# NM is running version V which introduced a new key K that is associated with 
containers.
# A container is running which causes K to be written to the state store
# User does a rolling downgrade to V-1.  The code ignores unrecognized key K.
# The container completes and the container is removed from the state store.  
This only removes the container keys version V knows about, and K is not one of 
those keys.
# At this point K has been leaked in the state store.
# That leak will be permanent until a rolling upgrade to >= V.  Even then K 
might not be cleaned up since all the other container state has been removed, 
probably interfering with the typical recovery flow for that key type.

There are a couple of risks when cleaning up unrecognized keys.  The old 
version may be removing the key too early in the lifecycle of that state such 
that if we do a rolling upgrade back to the version that works with those keys 
we've incorrectly destroyed the state.  We probably need to think more about 
the ramifications of cleaning unrecognized keys and when we should or shouldn't 
do so.  Appreciate any thoughts on this.

The other risk is that doing this cleaning will add a place where the NM will 
read the state store as it scans for keys to remove, and previously it only 
ever wrote to the store after the initial recover on startup.  Writes to 
leveldb are typically very fast, whereas reads could be much slower depending 
upon how much the database needs to be compacted and how many blocks are 
involved in the scan.  This is likely a minor concern especially with the 
recent periodic full compaction to the store, but it will impact state store 
performance to some degree.

As for the patch the changes will make the NM more tolerant of new container 
keys, but there are other places where unexpected keys will break the state 
store recovery.  loadResourceTrackerState and loadUserLocalizedResources are 
some other places that should be updated and there are similar questions there 
as to what should be done about cleanup of unexpected keys.


> NMLeveldbStateStore should be more tolerant of unknown keys
> -----------------------------------------------------------
>
>                 Key: YARN-5547
>                 URL: https://issues.apache.org/jira/browse/YARN-5547
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>    Affects Versions: 2.6.0
>            Reporter: Jason Lowe
>            Assignee: Ajith S
>         Attachments: YARN-5547.01.patch
>
>
> Whenever new keys are added to the NM state store it will break rolling 
> downgrades because the code will throw if it encounters an unrecognized key.  
> If instead it skipped unrecognized keys it could be simpler to continue 
> supporting rolling downgrades.  We need to define the semantics of 
> unrecognized keys when containers and apps are cleaned up, e.g.: we may want 
> to delete all keys underneath an app or container directory when it is being 
> removed from the state store to prevent leaking unrecognized keys.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to