Re: Review Request 39226: PIG-4680 [Pig workflows can checkpoint the state and can resume from the last successful node]

Abhishek Agarwal Fri, 27 Nov 2015 05:54:08 -0800


> On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java,
> >  line 313
> > <https://reviews.apache.org/r/39226/diff/1/?file=1095353#file1095353line313>
> >
> >     The staging directory should be something that can be used by all 
> > scripts and different runs of the script with different parameters.
> >     
> >     Have sub-directory (logical plan hash) to store checkpoint files of a 
> > specific script so that you can look under it instead of whole staging 
> > directory. The directory and or the checkpoint files should be deleted once 
> > the script completes successfully or job cannot be recovered because input 
> > files/jars or config has changed and previous one cannot be used.


Checkpoint directory is supposed to be unique across different runs so that a 
rerun of a failed script does not clash with the fresh run of the same script.


> On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java,
> >  line 107
> > <https://reviews.apache.org/r/39226/diff/1/?file=1095353#file1095353line107>
> >
> >     Logical plan signature instead of script hash to take into account 
> > param changes, change in imported macro functions, etc

done


> On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java,
> >  line 103
> > <https://reviews.apache.org/r/39226/diff/1/?file=1095353#file1095353line103>
> >
> >     This might have to do more checks to skip some settings that usually 
> > change between runs but do not affect recovering. For eg: Running through 
> > Oozie, for a rerun you will get a different launcher job id in the config.
> 
> Abhishek Agarwal wrote:
>     I probably missed this configuration. Skipping custom hard-coded settings 
> is not feasible. User can give as an option to skip some configurations but 
> that will make things complex for the user. I am now inclining toward an 
> approach similar to oozie. For a rerun, user can explicitly specifiy rerun 
> option. Pig will simply use the new configuration and recover the job. At 
> least then behavior is easier to explain.

I have removed the check for configuration. Pig script will ignore the 
configuration changes as they are very hard to track.


> On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java,
> >  line 532
> > <https://reviews.apache.org/r/39226/diff/1/?file=1095351#file1095351line532>
> >
> >     Creating files that user is not expecting in output directories will be 
> > a problem.
> 
> Abhishek Agarwal wrote:
>     That is understandable. Another approach could be store the completion 
> state of the job along with output path. We can opt to not rerun the job if 
> last state was successful and the directory is still present. Should we also 
> note the timestamp of the directory?

On further thinking, temporary directories are not visible to the user. So it 
should be fine if we create the _SUCCESS file in them.


> On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java,
> >  line 163
> > <https://reviews.apache.org/r/39226/diff/1/?file=1095353#file1095353line163>
> >
> >     Use LoadStoreFinder like before. Reducing the number of times plan is 
> > traversed will help reduce compile time for large scripts.

Done.


> On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java,
> >  line 184
> > <https://reviews.apache.org/r/39226/diff/1/?file=1095353#file1095353line184>
> >
> >     Just the existence of temporary store directory does not guarantee that 
> > the job ran successfully and it has full output in it. You need to check if 
> > the mapreduce job that wrote to the temp store location was also 
> > successfull. Not sure if that is what you intend to do in 
> > isFailedMROper(previousMR). Currently it is not implemented or does not 
> > have TODO comments.
> >     
> >     Also there might be cases, data is written to side files in UDFs and 
> > temp stores are not used. You will have to check the code. Daniel might 
> > know off the top of his head.

can you give me some examples in code?


> On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/tools/pigstats/ScriptState.java, line 302
> > <https://reviews.apache.org/r/39226/diff/1/?file=1095357#file1095357line302>
> >
> >     You should be going instead with pig.logical.plan.signature instead of 
> > creating a hash of the script.  Most of the scripts are parameterized and 
> > hash of the script is not useful to differentiate when run with different 
> > parameters.

Done


> On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/impl/io/FileLocalizer.java, line 63
> > <https://reviews.apache.org/r/39226/diff/1/?file=1095355#file1095355line63>
> >
> >     Can you avoid wildcard imports? Also set your package import order to - 
> > java, javax, org, com - if that is not the default in your IDE. That is the 
> > standard used in Pig and would avoid bringing in unnecessary changes to all 
> > classes when you are contributing.

Done


- Abhishek


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39226/#review103384
-----------------------------------------------------------


On Oct. 12, 2015, 11:30 a.m., Abhishek Agarwal wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/39226/
> -----------------------------------------------------------
> 
> (Updated Oct. 12, 2015, 11:30 a.m.)
> 
> 
> Review request for pig and Rohini Palaniswamy.
> 
> 
> Repository: pig-git
> 
> 
> Description
> -------
> 
> Pig scripts can have multiple ETL jobs in the DAG which may take hours to 
> finish. In case of transient errors, the job fails. When the job is rerun, 
> all the nodes in Job graph will rerun. Some of these nodes may have already 
> run successfully. Redundant runs lead to wastage of cluster capacity and 
> pipeline delays.
> 
> In case of failure, we can persist the graph state. In next run, only the 
> failed nodes and their successors will rerun. This is of course subject to 
> preconditions such as
>          > Pig script has not changed
>          > Input locations have not changed
>          > Output data from previous run is intact
>          > Configuration has not changed
> 
> 
> Diffs
> -----
> 
>   src/org/apache/pig/PigConfiguration.java 03b36a5 
>   
> src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java
>  595e68c 
>   
> src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRIntermediateDataVisitor.java
>  4b62112 
>   
> src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java
>  PRE-CREATION 
>   
> src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobState.java
>  PRE-CREATION 
>   src/org/apache/pig/impl/io/FileLocalizer.java f0f9b43 
>   src/org/apache/pig/tools/grunt/GruntParser.java 439d087 
>   src/org/apache/pig/tools/pigstats/ScriptState.java 03a12b1 
> 
> Diff: https://reviews.apache.org/r/39226/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Abhishek Agarwal
> 
>

Re: Review Request 39226: PIG-4680 [Pig workflows can checkpoint the state and can resume from the last successful node]

Reply via email to