[
https://issues.apache.org/jira/browse/GIRAPH-730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13755667#comment-13755667
]
Eli Reisman commented on GIRAPH-730:
------------------------------------
My question is this: the AM runs as a single thread and requests all the
containers it needs in one lump. In my tests, the callback (the RM handing the
local AM all the containers it asked for) delivered the whole batch of
containers at once, even though the call itself is asynchronous.
However, once that single asynchronous callback produced all the requested
containers, the same single AM thread iterated through the collection of
containers, one at a time, populating them with metadata and the resource map
in buildContainerLaunchContext. So there was no concurrency issue.
BUT, I think that now that you are running on a larger cluster and asking for
more containers, they are being returned in smaller groups. Perhaps you ask for
500 workers and instead get two asynchronous callbacks from the RM, one with
200 containers and one with 300, and both of _those_ asynchronous calls
returning groups of containers are now racing into buildContainerLaunchContext
(etc.), and this is where the concurrency issue arises?
YARN certainly does not guarantee that you get back all the containers you ask
for at once, although in my tests this was the only behavior I saw. If at your
scale you are seeing this problem, we need to address it. Good catch!
If this is what is happening (you have logged one AM request for X containers
resulting in more than one asynchronous callback returning A, B, and C
containers, where A+B+C = X), then we need to fix this.
But I do think we should not risk going with a partial solution. If what
you're describing and what I am describing above match up, we really should
just eliminate this risk now by protecting buildContainerLaunchContext, or
concurrent attempts to populate the container launch contexts with IDs etc.
could overwrite each other, and containers could be lost on the AM side this
way. That doesn't mean we have to just slap a "synchronized" block onto
buildContainerLaunchContext; maybe something more subtle would work. But we
probably should address the problem so that all the risk is gone.
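To illustrate the kind of race I mean, here is a minimal standalone sketch in
plain Java (no YARN types on the classpath, so the class, method, and map names
are made up stand-ins, not the real Giraph code): two "RM callbacks" populate a
shared map of per-container launch data concurrently, and a shared lock around
the critical section plays the role of the proposed protection.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;

// Sketch of the suspected race: two asynchronous RM callbacks populate a
// shared map of container launch data at the same time. Without the lock,
// concurrent HashMap updates can silently lose entries; with it, every
// container's context survives.
public class LaunchContextRaceSketch {
    private final Map<Integer, String> launchContexts = new HashMap<>();
    private final Object lock = new Object();

    // Stand-in for buildContainerLaunchContext: record metadata for one container.
    void buildContainerLaunchContext(int containerId) {
        synchronized (lock) { // the protection being proposed
            launchContexts.put(containerId, "resources-for-" + containerId);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        LaunchContextRaceSketch am = new LaunchContextRaceSketch();
        CountDownLatch start = new CountDownLatch(1);

        // Simulate the RM returning 500 requested containers in two
        // asynchronous callbacks: one batch of 200, one batch of 300.
        Thread batchA = new Thread(() -> {
            awaitQuietly(start);
            for (int id = 0; id < 200; id++) am.buildContainerLaunchContext(id);
        });
        Thread batchB = new Thread(() -> {
            awaitQuietly(start);
            for (int id = 200; id < 500; id++) am.buildContainerLaunchContext(id);
        });
        batchA.start();
        batchB.start();
        start.countDown(); // release both "callbacks" at once
        batchA.join();
        batchB.join();

        // With the lock in place, no container's context is lost.
        System.out.println(am.launchContexts.size()); // 500
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

A java.util.concurrent.ConcurrentHashMap (or per-batch local buffers merged
under one lock) would be one of the "more subtle" alternatives to a blanket
synchronized method.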
What do you think? Maybe try another patch that addresses all possible race
risks here? Also, if the race you're seeing is not as I have described here,
please let me know what the real concern is, maybe I missed your point.
Thanks, great work!
> GiraphApplicationMaster race condition in resource loading
> ----------------------------------------------------------
>
> Key: GIRAPH-730
> URL: https://issues.apache.org/jira/browse/GIRAPH-730
> Project: Giraph
> Issue Type: Bug
> Affects Versions: 1.0.0
> Environment: Giraph with Yarn
> Reporter: Chuan Lei
> Assignee: Chuan Lei
> Attachments: GIRAPH-730.v1.patch
>
>
> In GiraphApplicationMaster.java, the getTaskResourceMap function is not
> thread safe, which causes the application master to fail to distribute the
> resources (jar, configuration file, etc.) to each container.