[ 
https://issues.apache.org/jira/browse/PIG-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029677#comment-13029677
 ] 

Laukik Chitnis commented on PIG-1883:
-------------------------------------

> This doesn't lend itself well to automated testing. Any thoughts on how to 
> test how the new progress indicator does versus the existing one? Have you 
> run any initial tests to measure this?

Thats correct; it is difficult to test in an automated fashion. One metric for 
defining the performance of the progress estimator would be similar to what is 
used in the paratimer paper, may be a RMS of the difference from the linear 
time (assuming "ideal" estimates are 0.0 to 1.0 from start time to finish time) 

I manually tested it with various pig scripts that generated different kinds of 
physical plans. In most cases, I observed that the progress report was the same 
for both old and new methods. One simple case where the new method does better 
is when a very small and a very large job are executed in parallel. In this 
case, the old estimate shoots up to 50% very early, and then moves slowly to 
100%, whereas the new estimate grows more gradually from 0-100 as the bigger 
job execution progresses. I haven't yet automated capturing these and analyzing 
the metric yet.


> I don't understand the logic here. Why is it 0% done if ANY job is waiting, 
> etc.? Some of the jobs may be done and some partially done and some not even 
> started.

The 0% is only for those set of jobs that are executing in parallel. For the 
set of jobs that have finished execution in the previous rounds of parallel 
execution, their contribution to the total estimate is 1/#rounds per round of 
execution i.e. per JobControl object (so, #rounds is the length of the critical 
path along the operator plan tree)


> This code shouldn't be in OperatorPlan. We want to keep that as clean as 
> possible. Instead you should build a new Walker type that can do this 
> calculation.

Ah, ok; Will do that.


> You have tabs here and some other spots. Please make sure you use 4 spaces 
> rather than tabs.

I need to change my editor's auto-indentation formatting :)

> Why is a separate method needed here? When users turn on the new progress 
> indicator I assume they don't get the old one too. Given that the interfaces 
> are the same it seems one method should suffice here.

Initially, I put in a separate method assuming that users could have listeners 
for either of them. For example, we could use these separate listeners for the 
performance comparison between the old and new methods. Later on, however, when 
I added the command-line option to choose, I made the old and new methods as an 
either-or choice. Perhaps I should make it possible to have both indicators 
turned on at the same time?

> It looks like this comment got attached to the run method. Also, the method 
> has only one parameter, but two are listed in the comment.

I will fix this.


> Pig's progress estimation should account for parallel job executions
> --------------------------------------------------------------------
>
>                 Key: PIG-1883
>                 URL: https://issues.apache.org/jira/browse/PIG-1883
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Laukik Chitnis
>            Assignee: Laukik Chitnis
>         Attachments: PIG-1883-2.patch
>
>
> Currently, Pig's progress estimation is based on the percentage of jobs 
> completed out of the total number of MR jobs. However, since the MR operators 
> are arranged in a DAG (and hence more than 1 job might be submitted for 
> execution in parallel), the progress estimation can be improved by 
> considering the number of jobs in the critical path, instead of just the 
> total number of jobs.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to