Re: Review Request 39226: PIG-4680 [Pig workflows can checkpoint the state and can resume from the last successful node]

Rohini Palaniswamy Wed, 21 Oct 2015 04:20:08 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39226/#review103384
-----------------------------------------------------------



Overall approach in general looks good and viable. There might be more of 
condition checks to be added and corner cases to handle as you go along and 
test out the implementation. I have commented on whatever gaps I could think of 
at the moment.


src/org/apache/pig/PigConfiguration.java (line 233)
<https://reviews.apache.org/r/39226/#comment161406>

    Internal configs should go in PigImplConstants.java



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java
 (line 454)
<https://reviews.apache.org/r/39226/#comment161418>

    Even if you skip deleting intermediate files here, it will delete in 
finally block of Main.java



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java
 (line 470)
<https://reviews.apache.org/r/39226/#comment161434>

    Just storing the current plan? how about what part of it has succeeded?



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java
 (line 495)
<https://reviews.apache.org/r/39226/#comment161407>

    Creating files that user is not expecting in output directories will be a 
problem.



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java
 (line 103)
<https://reviews.apache.org/r/39226/#comment161420>

    This might have to do more checks to skip some settings that usually change 
between runs but do not affect recovering. For eg: Running through Oozie, for a 
rerun you will get a different launcher job id in the config.



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java
 (line 107)
<https://reviews.apache.org/r/39226/#comment161422>

    Logical plan signature instead of script hash to take into account param 
changes, change in imported macro functions, etc



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java
 (line 112)
<https://reviews.apache.org/r/39226/#comment161424>

    You will also have to check the size and checksum of jars and cache files.



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java
 (line 119)
<https://reviews.apache.org/r/39226/#comment161439>

    This might not be as simple as removing the operators. You might also have 
to traverse the plan and remove any predecessors and other corner cases.



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java
 (line 163)
<https://reviews.apache.org/r/39226/#comment161427>

    Use LoadStoreFinder like before. Reducing the number of times plan is 
traversed will help reduce compile time for large scripts.



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java
 (line 184)
<https://reviews.apache.org/r/39226/#comment161436>

    Just the existence of temporary store directory does not guarantee that the 
job ran successfully and it has full output in it. You need to check if the 
mapreduce job that wrote to the temp store location was also successfull. Not 
sure if that is what you intend to do in isFailedMROper(previousMR). Currently 
it is not implemented or does not have TODO comments.
    
    Also there might be cases, data is written to side files in UDFs and temp 
stores are not used. You will have to check the code. Daniel might know off the 
top of his head.



src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java
 (line 313)
<https://reviews.apache.org/r/39226/#comment161429>

    The staging directory should be something that can be used by all scripts 
and different runs of the script with different parameters.
    
    Have sub-directory (logical plan hash) to store checkpoint files of a 
specific script so that you can look under it instead of whole staging 
directory. The directory and or the checkpoint files should be deleted once the 
script completes successfully or job cannot be recovered because input 
files/jars or config has changed and previous one cannot be used.



src/org/apache/pig/impl/io/FileLocalizer.java (line 45)
<https://reviews.apache.org/r/39226/#comment161408>

    Can you avoid wildcard imports? Also set your package import order to - 
java, javax, org, com - if that is not the default in your IDE. That is the 
standard used in Pig and would avoid bringing in unnecessary changes to all 
classes when you are contributing.



src/org/apache/pig/tools/pigstats/ScriptState.java (line 259)
<https://reviews.apache.org/r/39226/#comment161410>

    You should be going instead with pig.logical.plan.signature instead of 
creating a hash of the script.  Most of the scripts are parameterized and hash 
of the script is not useful to differentiate when run with different parameters.


- Rohini Palaniswamy


On Oct. 12, 2015, 11:30 a.m., Abhishek Agarwal wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/39226/
> -----------------------------------------------------------
> 
> (Updated Oct. 12, 2015, 11:30 a.m.)
> 
> 
> Review request for pig and Rohini Palaniswamy.
> 
> 
> Repository: pig-git
> 
> 
> Description
> -------
> 
> Pig scripts can have multiple ETL jobs in the DAG which may take hours to 
> finish. In case of transient errors, the job fails. When the job is rerun, 
> all the nodes in Job graph will rerun. Some of these nodes may have already 
> run successfully. Redundant runs lead to wastage of cluster capacity and 
> pipeline delays.
> 
> In case of failure, we can persist the graph state. In next run, only the 
> failed nodes and their successors will rerun. This is of course subject to 
> preconditions such as
>          > Pig script has not changed
>          > Input locations have not changed
>          > Output data from previous run is intact
>          > Configuration has not changed
> 
> 
> Diffs
> -----
> 
>   src/org/apache/pig/PigConfiguration.java 03b36a5 
>   
> src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java
>  595e68c 
>   
> src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRIntermediateDataVisitor.java
>  4b62112 
>   
> src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java
>  PRE-CREATION 
>   
> src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobState.java
>  PRE-CREATION 
>   src/org/apache/pig/impl/io/FileLocalizer.java f0f9b43 
>   src/org/apache/pig/tools/grunt/GruntParser.java 439d087 
>   src/org/apache/pig/tools/pigstats/ScriptState.java 03a12b1 
> 
> Diff: https://reviews.apache.org/r/39226/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Abhishek Agarwal
> 
>

Re: Review Request 39226: PIG-4680 [Pig workflows can checkpoint the state and can resume from the last successful node]

Reply via email to