[ https://issues.apache.org/jira/browse/GIRAPH-730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13755667#comment-13755667 ]

Eli Reisman commented on GIRAPH-730:
------------------------------------

My read is this: the AM runs as a single thread and makes one lump request for 
all the containers it needs. In my tests, the callback (in which the RM hands 
the local AM all the containers it asked for) returned the whole bunch of 
containers at once, even though the call itself is made asynchronously.

However, once the callback produced all the requested containers (always in a 
single asynchronous callback), the same single AM thread is what iterated 
through the collection of containers, one at a time, populating them with 
metadata and the resource map in buildContainerLaunchContext. So there was no 
concurrency issue.

BUT, I think now that you are running on a larger cluster and asking for more 
containers, they are being returned in smaller groups. Perhaps you ask for 500 
workers and instead get two asynchronous callbacks from the RM, one with 200 
containers and one with 300, and both of _those_ asynchronous callbacks, each 
returning its group of containers, are now racing into 
buildContainerLaunchContext (etc.), and this is where the concurrency issue 
arises?

YARN certainly does not guarantee that you get back all the containers you ask 
for at once, although in my tests I never saw any other behavior. If at your 
scale you are seeing this problem, we need to address it. Good catch!

If this is what is happening (you have logged one AM ask for X containers 
resulting in more than one asynchronous callback, returning A, B, and C 
containers respectively, where A + B + C = X) then we need to fix this.
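To make the A + B + C = X scenario concrete, here is a small self-contained simulation (plain Java, no YARN dependencies; the class and method names are illustrative stand-ins, not Giraph's actual code). Each batch arrives on its own "callback" thread, the way the RM might answer one ask with several onContainersAllocated-style callbacks, and a thread-safe collector keeps the total intact:

```java
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ContainerBatchDemo {
    // Thread-safe collector standing in for the AM's container bookkeeping.
    static final ConcurrentLinkedQueue<Integer> allocated =
        new ConcurrentLinkedQueue<>();

    // Stand-in for an async "containers allocated" callback: each batch
    // may arrive on its own callback thread.
    static void onContainersAllocated(List<Integer> batch) {
        allocated.addAll(batch);
    }

    public static void main(String[] args) throws InterruptedException {
        final int requested = 500;
        // One ask for 500 containers answered by two async callbacks (200 + 300).
        int[][] batches = {{0, 200}, {200, 500}};
        CountDownLatch done = new CountDownLatch(batches.length);
        for (int[] range : batches) {
            final int lo = range[0], hi = range[1];
            new Thread(() -> {
                onContainersAllocated(
                    IntStream.range(lo, hi).boxed()
                        .collect(Collectors.toList()));
                done.countDown();
            }).start();
        }
        done.await();
        // A + B = X: no containers lost even though the callbacks raced.
        System.out.println(allocated.size() == requested);
    }
}
```

With an unsynchronized HashMap in place of the concurrent queue, the same two racing callbacks could silently drop entries, which is the failure mode described above.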

But I do think we should not risk going with a partial solution. If what 
you're describing and what I am describing above match up, we really should 
just eliminate this risk now by protecting buildContainerLaunchContext; 
otherwise, concurrent attempts to populate the container launch contexts with 
IDs etc. could overwrite each other, and containers could be lost on the AM 
side this way.

That doesn't mean we have to just slap a "synchronized" block onto 
buildContainerLaunchContext; maybe something more subtle would work. But we 
probably should address the problem so that all the risk is gone.
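One subtler option than synchronizing all of buildContainerLaunchContext would be to make just the lazily built resource map safe, e.g. with double-checked locking on a volatile field. This is only a sketch of the idea, assuming the race is in lazy initialization; the real getTaskResourceMap lives in GiraphApplicationMaster, and the types and entries below are simplified placeholders:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class LazyResourceMap {
    // Counts how many times the map is actually built; should stay at 1.
    static final AtomicInteger builds = new AtomicInteger();
    private static volatile Map<String, String> taskResourceMap;

    // Hedged sketch of a thread-safe lazy getter: double-checked locking
    // on a volatile field so racing callback threads all see one map.
    static Map<String, String> getTaskResourceMap() {
        Map<String, String> local = taskResourceMap;
        if (local == null) {
            synchronized (LazyResourceMap.class) {
                local = taskResourceMap;
                if (local == null) {
                    builds.incrementAndGet();
                    Map<String, String> m = new HashMap<>();
                    // Placeholder entries standing in for jar/conf resources.
                    m.put("giraph.jar", "placeholder-path");
                    m.put("giraph-conf.xml", "placeholder-path");
                    local = m;
                    taskResourceMap = m;  // publish only the fully built map
                }
            }
        }
        return local;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(LazyResourceMap::getTaskResourceMap);
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println(builds.get());  // prints 1: built exactly once
    }
}
```

The volatile write after the map is fully populated gives the happens-before edge that makes this safe; without it, a second callback thread could observe a half-built map, which is exactly the kind of overwrite/loss risk described above.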

What do you think? Maybe try another patch that addresses all possible race 
risks here? Also, if the race you're seeing is not as I have described it, 
please let me know what the real concern is; maybe I missed your point.

Thanks, great work!
                
> GiraphApplicationMaster race condition in resource loading
> ----------------------------------------------------------
>
>                 Key: GIRAPH-730
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-730
>             Project: Giraph
>          Issue Type: Bug
>    Affects Versions: 1.0.0
>         Environment: Giraph with Yarn
>            Reporter: Chuan Lei
>            Assignee: Chuan Lei
>         Attachments: GIRAPH-730.v1.patch
>
>
> In GiraphApplicationMaster.java, the getTaskResourceMap function is not 
> thread-safe, which causes the application master to fail to distribute the 
> resources (jar, configuration file, etc.) to each container.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
