Improve Pig's progress indicator by keeping track of jobs more precisely 
-------------------------------------------------------------------------

                 Key: PIG-2091
                 URL: https://issues.apache.org/jira/browse/PIG-2091
             Project: Pig
          Issue Type: Improvement
            Reporter: Laukik Chitnis


Parallax/Paratimer seeks to improve the progress estimation of Pig by 
identifying the processing speed at different steps in the processing pipeline 
for each of the jobs. For that, it considers the following factors:
1. (Estimated) Per tuple processing cost (a)
2. (Estimated) Total Number of tuples to be processed (N)
3. The number of tuples which are processed till now (K) 

It also accounts for the dynamic changes to runtime environment by considering:
4. The (calculated) slowdown factor (s) to the per-tuple processing cost, and
5. The current width (w) of the pipeline (number of running mappers/reducers)

Given these parameters, the time remaining for a particular stage in the 
pipeline can be computed as:

s*a*(N-K)/w

Of these, 'a' and 'N' are either estimated from a sample, or learned from a 
previous "debug" run; while 's' and 'w' are dynamically read or calculated.

Paratimer also breaks down each MR job into the following (groups of) stages:
(1) Record reader - Map - Combine
(2) Copy
(3) Sort, and
(4) Reduce

'K' is observed while the job is in progress for each of these stages by 
monitoring the following counters in hadoop:
(1) MAP_INPUT_RECORDS (available in Hadoop)
(2) REDUCE_INPUT_GROUPS (available in Hadoop)
(3) REDUCE_INPUT_RECORDS (available in Hadoop)
(4) REDUCE_COPY_OUTPUT_RECORDS (new counter to be added in Hadoop)

The sum of such estimate of time remaining for each of the stages for all the 
jobs along the critical path of the execution plan, along with a overhead time 
for as yet uninitialized MR jobs, gives us a more precise estimate of the time 
remaining, and thus a better overall progress estimate.

The critical path calculation is targeted as part of PIG-1883; I also propose 
that the estimation of parameters such as 'N' (cardinality estimate) and 'a' be 
handled separately (and tracked in a different jira).

Assuming that the estimates are available, the following action items emerge:
1. The estimated values need to be propagated to the specific operators in the 
pipeline. This can be accomplished by piggy-backing (pun unintended ;) ) on the 
mechanism used for keeping track of line numbers for error reporting.
2. Using these and other observed counters and values, estimate the time 
remaining for each stage, and
3. Calculate the pig script execution percentage complete by estimating the 
progress of jobs along the critical path


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to