William Watson created PIG-5071:
-----------------------------------

             Summary: MapReduce concurrency Could Be Better
                 Key: PIG-5071
                 URL: https://issues.apache.org/jira/browse/PIG-5071
             Project: Pig
          Issue Type: Wish
            Reporter: William Watson


We have a Pig job that, after optimization, launches about 20 MapReduce jobs.
Some of these are quite long-running, and while Pig does an okay job of running
jobs concurrently, it could do better, at least in this very specific case.

The Pig job can be divided into 4 major sections, like so:

A1 -> A2 -> A3 -> A4 -> A
B1 -> B2 -> B
C1 -> C2 -> C3 -> C
D1 -> D2 -> D3 -> D4 -> D

and the sections are joined at the end:
A + B -> AB
AB + C -> ABC
ABC + D -> ABCD

In short, if C2 finishes very quickly, C3 won't be started until A2, B2, and D2
are all also complete. This is a problem if, say, D2 takes an hour and there are
unused cluster resources that could be made available to C3 (and, by extension,
to A3 and B if their prerequisites also finish before D2).

One possible workaround is to scale D2 better, but that's beside the point. I
think Pig is capable of knowing when the prerequisites for a given job are
done, but since it only kicks off jobs in "phases", it won't kick off each job
as soon as possible.
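
To make the difference concrete, here is a minimal sketch in Python that
simulates the two launch strategies on the job graph above. It does not use any
Pig code. The durations are made up (every job takes 1 minute except D2, which
takes 60), the cluster is assumed to have room for every runnable job, and the
names DEPS, duration, phased_start_times and eager_start_times are hypothetical
helpers written only for this illustration.

```python
# Job graph from the description above: each job maps to its prerequisites.
DEPS = {
    "A1": [], "A2": ["A1"], "A3": ["A2"], "A4": ["A3"], "A": ["A4"],
    "B1": [], "B2": ["B1"], "B": ["B2"],
    "C1": [], "C2": ["C1"], "C3": ["C2"], "C": ["C3"],
    "D1": [], "D2": ["D1"], "D3": ["D2"], "D4": ["D3"], "D": ["D4"],
    "AB": ["A", "B"], "ABC": ["AB", "C"], "ABCD": ["ABC", "D"],
}


def duration(job):
    """Hypothetical run time in minutes: D2 is the slow job."""
    return 60 if job == "D2" else 1


def phased_start_times(deps):
    """Launch jobs in waves; the next wave waits for the whole previous wave."""
    start, done, clock = {}, set(), 0
    while len(done) < len(deps):
        # Every job whose prerequisites are all complete joins this wave.
        wave = [j for j in deps
                if j not in done and all(d in done for d in deps[j])]
        for job in wave:
            start[job] = clock
        # The wave ends only when its slowest job ends.
        clock += max(duration(j) for j in wave)
        done.update(wave)
    return start


def eager_start_times(deps):
    """Launch each job as soon as its own prerequisites are complete."""
    start, finish = {}, {}

    def visit(job):
        if job not in finish:
            start[job] = max((visit(d) for d in deps[job]), default=0)
            finish[job] = start[job] + duration(job)
        return finish[job]

    for job in deps:
        visit(job)
    return start


if __name__ == "__main__":
    phased = phased_start_times(DEPS)
    eager = eager_start_times(DEPS)
    for job in ("C3", "A3", "ABCD"):
        print(job, "phased start:", phased[job], "eager start:", eager[job])
```

With these made-up numbers, C3 starts at minute 61 under the phased strategy
and at minute 2 under the eager one. The overall finish time barely changes
here because D is the critical path; the gap would matter more when cluster
resources are limited or the slow job is not on the longest chain.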

I've taken a look at the code, but I'm having a hard time working out where the
issue is; otherwise I would be glad to contribute a patch.

Is this a desirable feature, and is this directly controlled by Pig? If so,
could someone help point me in the right direction so I can contribute a patch?

Note: We can change this from a "wish" to an "improvement" if this feature is 
desired...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
