[ https://issues.apache.org/jira/browse/PIG-5071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15731184#comment-15731184 ]
Daniel Dai commented on PIG-5071:
---------------------------------

Yes, that's the best solution if it works for you.

> MapReduce concurrency Could Be Better
> -------------------------------------
>
>                 Key: PIG-5071
>                 URL: https://issues.apache.org/jira/browse/PIG-5071
>             Project: Pig
>          Issue Type: Wish
>            Reporter: William Watson
>
> We have a job that launches, after optimization, about 20 MapReduce jobs.
> Some of these are quite long running, and while Pig does an okay job of
> running jobs concurrently, it could do better, at least in this very
> specific case.
> The Pig job can be divided into 4 major sections like so:
> A1 -> A2 -> A3 -> A4 -> A
> B1 -> B2 -> B
> C1 -> C2 -> C3 -> C
> D1 -> D2 -> D3 -> D4 -> D
> and the sections are joined at the end:
> A + B -> AB
> AB + C -> ABC
> ABC + D -> ABCD
> In short, if C2 finishes very quickly, C3 won't be started until A2, B2, and
> D2 are all also complete. This is a problem if, say, D2 takes an hour and
> there are unused cluster resources that could be made available to C3 (and by
> extension A3 and B3 if their prerequisites also finish before D2).
> One possible workaround is to scale D2 better, but that's beside the point.
> I think Pig is capable of knowing that the prerequisites are done for certain
> jobs, but since it only kicks off jobs in "phases", it won't kick off jobs as
> soon as possible.
> I've taken a look at the code and I'm having a hard time working out where
> the issue is, or else I would be glad to contribute a patch.
> Is this a desirable feature, and is this directly controlled by Pig? If so,
> could someone help point me in the right direction so I can contribute a
> patch?
> Note: We can change this from a "wish" to an "improvement" if this feature is
> desired...

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
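
The difference between launching jobs in "phases" and launching each job as soon as its own prerequisites finish, as described in the quoted issue, can be illustrated with a small simulation. This is only a sketch of the scheduling idea, not Pig's actual code: the job graph is taken from the issue, while the durations (10 minutes per job, 60 for D2), the function names, and the assumption of unlimited cluster capacity are all hypothetical.

```python
# Job graph from the issue: each job maps to the jobs it depends on.
deps = {
    "A1": [], "A2": ["A1"], "A3": ["A2"], "A4": ["A3"], "A": ["A4"],
    "B1": [], "B2": ["B1"], "B": ["B2"],
    "C1": [], "C2": ["C1"], "C3": ["C2"], "C": ["C3"],
    "D1": [], "D2": ["D1"], "D3": ["D2"], "D4": ["D3"], "D": ["D4"],
    "AB": ["A", "B"], "ABC": ["AB", "C"], "ABCD": ["ABC", "D"],
}

# Hypothetical durations in minutes; only D2 is slow, as in the issue's example.
duration = {job: 10 for job in deps}
duration["D2"] = 60


def phased_finish_times(deps, duration):
    """Launch every ready job together, then wait for the whole phase to end."""
    finish = {}
    clock = 0
    while len(finish) < len(deps):
        ready = [job for job in deps
                 if job not in finish and all(p in finish for p in deps[job])]
        if not ready:
            raise ValueError("dependency cycle detected")
        for job in ready:
            finish[job] = clock + duration[job]
        # The next phase cannot start until the slowest job in this one is done.
        clock = max(finish[job] for job in ready)
    return finish


def eager_finish_times(deps, duration):
    """Launch each job as soon as its own prerequisites have finished."""
    finish = {}

    def finish_time(job):
        if job not in finish:
            start = max((finish_time(p) for p in deps[job]), default=0)
            finish[job] = start + duration[job]
        return finish[job]

    for job in deps:
        finish_time(job)
    return finish


phased = phased_finish_times(deps, duration)
eager = eager_finish_times(deps, duration)
for job in ("C3", "ABCD"):
    print(f"{job}: phased={phased[job]} min, eager={eager[job]} min")
```

Under these assumed durations, C3 is held back by D2 in the phased model even though C2 finished long before, which is the behavior the issue describes. With limited cluster capacity the gap would differ, since the eager model here never queues a job for resources.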