> On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java,
> >  line 491
> > <https://reviews.apache.org/r/39226/diff/1/?file=1095351#file1095351line491>
> >
> >     Even if you skip deleting intermediate files here, they will be 
> > deleted in the finally block of Main.java

When jobCheckpoint is enabled, temporary output is written inside the staging 
directory. The necessary transformation happens in the mr.transform() call. In 
the finally block of Main, only the temporary staging directory is deleted.
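
To make the mechanics concrete, here is a minimal sketch of the promotion step 
I have in mind (the class, method, and path handling are illustrative 
assumptions, not the actual patch code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative sketch only: with checkpointing on, a job writes under a
    // staging directory, and the output is promoted to its final location
    // once the plan succeeds. The finally block in Main then only ever
    // deletes the staging directory, never committed output.
    public class StagingPromote {
        static void promote(Path staging, Path finalOut, Configuration conf)
                throws java.io.IOException {
            FileSystem fs = staging.getFileSystem(conf);
            if (!fs.rename(staging, finalOut)) {
                throw new java.io.IOException(
                        "Failed to promote " + staging + " to " + finalOut);
            }
        }
    }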


> On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java,
> >  line 507
> > <https://reviews.apache.org/r/39226/diff/1/?file=1095351#file1095351line507>
> >
> >     Just storing the current plan? How about what part of it has succeeded?

Success or failure is inferred from the commit marker in the output path of an 
intermediate job. If the _SUCCESS file is present in the output path, the job 
is assumed to have succeeded.
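
In effect, the check is just a lookup of the marker that Hadoop's 
FileOutputCommitter writes on successful commit (a sketch; the class and 
helper names are assumptions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: infer the prior success of an intermediate job from the
    // _SUCCESS marker that FileOutputCommitter leaves in the output path.
    public class JobSuccessCheck {
        static boolean hasSucceeded(Path outputDir, Configuration conf)
                throws java.io.IOException {
            FileSystem fs = outputDir.getFileSystem(conf);
            return fs.exists(new Path(outputDir, "_SUCCESS"));
        }
    }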


> On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java,
> >  line 532
> > <https://reviews.apache.org/r/39226/diff/1/?file=1095351#file1095351line532>
> >
> >     Creating files that the user is not expecting in output directories 
> > will be a problem.

That is understandable. Another approach could be to store the completion 
state of the job along with the output path. We can opt not to rerun the job 
if the last state was successful and the directory is still present. Should we 
also note the timestamp of the directory?
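
A minimal sketch of what such a persisted record could look like (all class 
and field names here are assumptions, not the actual MRJobState layout):

    import java.io.Serializable;

    // Hypothetical record of a job's last completion state; fields are
    // illustrative only.
    public class PersistedJobState implements Serializable {
        private static final long serialVersionUID = 1L;

        public final String outputPath;     // final output location of the job
        public final boolean succeeded;     // last observed completion state
        public final long outputTimestamp;  // mtime of the output directory when recorded

        public PersistedJobState(String outputPath, boolean succeeded,
                                 long outputTimestamp) {
            this.outputPath = outputPath;
            this.succeeded = succeeded;
            this.outputTimestamp = outputTimestamp;
        }

        // Skip rerunning only if the job succeeded and the output directory
        // has not been modified since the state was recorded.
        public boolean canSkip(long currentTimestamp) {
            return succeeded && currentTimestamp == outputTimestamp;
        }
    }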


> On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java,
> >  line 103
> > <https://reviews.apache.org/r/39226/diff/1/?file=1095353#file1095353line103>
> >
> >     This might have to do more checks to skip some settings that usually 
> > change between runs but do not affect recovery. For example, when running 
> > through Oozie, a rerun will get a different launcher job id in the config.

I probably missed this configuration. Skipping custom hard-coded settings is 
not feasible. We could give the user an option to skip certain configurations, 
but that would make things complex for the user. I am now inclining toward an 
approach similar to Oozie's: for a rerun, the user explicitly specifies a 
rerun option, and Pig simply uses the new configuration and recovers the job. 
At least then the behavior is easier to explain.
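
For reference, the kind of filtered comparison Rohini is describing might look 
like this (the skip list below is an assumption; the real set of volatile keys 
would need to be curated):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;
    import org.apache.hadoop.conf.Configuration;

    // Hypothetical sketch: diff two configurations while skipping keys that
    // legitimately change between runs. The SKIP entries are examples, not
    // a vetted list.
    public class ConfDiff {
        private static final Set<String> SKIP = new HashSet<>(
                Arrays.asList("mapreduce.job.dir", "oozie.job.id"));

        static boolean sameEffectiveConf(Configuration oldConf,
                                         Configuration newConf) {
            for (Map.Entry<String, String> e : oldConf) {
                if (SKIP.contains(e.getKey())) {
                    continue; // ignore settings that vary across reruns
                }
                if (!e.getValue().equals(newConf.get(e.getKey()))) {
                    return false;
                }
            }
            return true;
        }
    }

The difficulty is exactly that this list grows with every scheduler or 
launcher in the environment, which is why the explicit rerun option seems 
easier to reason about.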


> On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java,
> >  line 119
> > <https://reviews.apache.org/r/39226/diff/1/?file=1095353#file1095353line119>
> >
> >     This might not be as simple as removing the operators. You might also 
> > have to traverse the plan, remove any predecessors, and handle other 
> > corner cases.

Since we are walking in dependency order, the predecessor should already have 
been removed. If it has not, the current node will not be recovered.
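
The pruning rule I am relying on, sketched with hypothetical names (not the 
actual MRJobRecovery API): walking in dependency order, a node is dropped only 
when it succeeded and all of its predecessors were already dropped.

    import java.util.Collections;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Hypothetical sketch of the recovery pruning rule: in dependency order,
    // drop a node only if it succeeded AND every predecessor was dropped.
    public class RecoveryPrune {
        static Set<String> prunable(List<String> depOrder,
                                    Map<String, List<String>> preds,
                                    Set<String> succeeded) {
            Set<String> removed = new HashSet<>();
            for (String node : depOrder) {
                boolean allPredsRemoved = true;
                for (String p : preds.getOrDefault(node,
                        Collections.emptyList())) {
                    if (!removed.contains(p)) {
                        allPredsRemoved = false;
                        break;
                    }
                }
                if (allPredsRemoved && succeeded.contains(node)) {
                    removed.add(node); // safe to skip this job on rerun
                }
            }
            return removed;
        }
    }

Any node not in the returned set, along with all of its successors, reruns.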


- Abhishek


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39226/#review103384
-----------------------------------------------------------


On Oct. 12, 2015, 11:30 a.m., Abhishek Agarwal wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/39226/
> -----------------------------------------------------------
> 
> (Updated Oct. 12, 2015, 11:30 a.m.)
> 
> 
> Review request for pig and Rohini Palaniswamy.
> 
> 
> Repository: pig-git
> 
> 
> Description
> -------
> 
> Pig scripts can have multiple ETL jobs in the DAG, which may take hours to 
> finish. In case of transient errors, the job fails. When the job is rerun, 
> all the nodes in the job graph rerun, even though some of them may have 
> already run successfully. These redundant runs waste cluster capacity and 
> delay pipelines.
> 
> In case of failure, we can persist the graph state. In the next run, only 
> the failed nodes and their successors will rerun. This is of course subject 
> to preconditions such as:
>          > The Pig script has not changed
>          > Input locations have not changed
>          > Output data from the previous run is intact
>          > Configuration has not changed
> 
> 
> Diffs
> -----
> 
>   src/org/apache/pig/PigConfiguration.java 03b36a5 
>   
> src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java
>  595e68c 
>   
> src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRIntermediateDataVisitor.java
>  4b62112 
>   
> src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java
>  PRE-CREATION 
>   
> src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobState.java
>  PRE-CREATION 
>   src/org/apache/pig/impl/io/FileLocalizer.java f0f9b43 
>   src/org/apache/pig/tools/grunt/GruntParser.java 439d087 
>   src/org/apache/pig/tools/pigstats/ScriptState.java 03a12b1 
> 
> Diff: https://reviews.apache.org/r/39226/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Abhishek Agarwal
> 
>
