[ https://issues.apache.org/jira/browse/YARN-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15819248#comment-15819248 ]
Jason Lowe commented on YARN-574:
---------------------------------

Thanks for picking this up [~ajithshetty]. I took a quick look at the patch. It looks OK at a high level, but there is a race condition in how we're dealing with the thread pool. The code makes the assumption that work submitted to the queue will be picked up instantly by an idle thread in the thread pool. If it's not picked up fast enough then we can end up doing one or more super-quick heartbeats and accidentally queue up more work for the thread pool than we have active threads. That could actually make the localization _slower_ when there are multiple containers for the same job on the same node, since one of the other container localizers that has idle threads cannot work on a resource already handed to another localizer.

IMHO we can trivially track the outstanding count ourselves. We simply need to increment an AtomicInteger when we submit the work to the executor, then wrap FSDownload in another Callable that decrements the AtomicInteger when FSDownload returns/throws. Then we can track how many resources are either pending or actively being downloaded without getting bitten by race conditions in the executor implementation.

Alternatively the createStatus method already walks the Future objects returned from the executor and we could calculate how many resources are in-progress (i.e.: either pending or actively being downloaded) there. Once there are as many in-progress resources as the configured parallelism then we should avoid making quick heartbeats.
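To make the AtomicInteger suggestion concrete, here is a minimal sketch of the counting wrapper. It is not code from the patch: the class name OutstandingDownloadTracker and its methods are hypothetical, and a plain Callable stands in for FSDownload. The counter is incremented at submission time, so the count does not depend on how quickly a pool thread picks the work up.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical helper (not part of the YARN-574 patch): counts downloads that
 * are either queued in the executor or actively running.
 */
class OutstandingDownloadTracker {
  private final int parallelism;  // assumed to be the configured max downloads
  private final AtomicInteger outstanding = new AtomicInteger();

  OutstandingDownloadTracker(int parallelism) {
    this.parallelism = parallelism;
  }

  /**
   * Counts the task as outstanding immediately and returns a wrapper that
   * uncounts it when the task returns or throws.
   */
  <T> Callable<T> track(Callable<T> download) {
    outstanding.incrementAndGet();
    return () -> {
      try {
        return download.call();
      } finally {
        outstanding.decrementAndGet();
      }
    };
  }

  /** Submits a tracked task; undoes the count if the executor rejects it. */
  <T> Future<T> submit(ExecutorService pool, Callable<T> download) {
    Callable<T> wrapped = track(download);
    try {
      return pool.submit(wrapped);
    } catch (RuntimeException e) {
      // e.g. RejectedExecutionException: the wrapper will never run.
      outstanding.decrementAndGet();
      throw e;
    }
  }

  /** Number of downloads pending or in progress. */
  int outstanding() {
    return outstanding.get();
  }

  /** True while fewer downloads are outstanding than the configured limit. */
  boolean hasCapacity() {
    return outstanding.get() < parallelism;
  }
}
```

The heartbeat logic could then do a quick heartbeat only while hasCapacity() is true. One caveat of this sketch: a Future cancelled before its task starts never runs the wrapper, so the count would not be decremented in that case and would need separate handling.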
> PrivateLocalizer does not support parallel resource download via ContainerLocalizer
> -----------------------------------------------------------------------------------
>
>                 Key: YARN-574
>                 URL: https://issues.apache.org/jira/browse/YARN-574
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>    Affects Versions: 2.6.0, 2.8.0, 2.7.1
>            Reporter: Omkar Vinit Joshi
>            Assignee: Ajith S
>         Attachments: YARN-574.03.patch, YARN-574.04.patch, YARN-574.1.patch, YARN-574.2.patch
>
>
> At present private resources will be downloaded in parallel only if multiple
> containers request the same resource. However otherwise it will be serial.
> The protocol between PrivateLocalizer and ContainerLocalizer supports
> multiple downloads however it is not used and only one resource is sent for
> downloading at a time.
> I think we can increase / assure parallelism (even for single container
> requesting resource) for private/application resources by making multiple
> downloads per ContainerLocalizer.
> Total Parallelism before
> = number of threads allotted for PublicLocalizer [public resource] + number
> of containers [private and application resource]
> Total Parallelism after
> = number of threads allotted for PublicLocalizer [public resource] + number
> of containers * max downloads per container [private and application resource]