> On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote: > > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java, > > line 313 > > <https://reviews.apache.org/r/39226/diff/1/?file=1095353#file1095353line313> > > > > The staging directory should be something that can be used by all > > scripts and different runs of the script with different parameters. > > > > Have sub-directory (logical plan hash) to store checkpoint files of a > > specific script so that you can look under it instead of whole staging > > directory. The directory and or the checkpoint files should be deleted once > > the script completes successfully or job cannot be recovered because input > > files/jars or config has changed and previous one cannot be used.
Checkpoint directory is supposed to be unique across different runs so that a rerun of a failed script does not clash with the fresh run of the same script. > On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote: > > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java, > > line 107 > > <https://reviews.apache.org/r/39226/diff/1/?file=1095353#file1095353line107> > > > > Logical plan signature instead of script hash to take into account > > param changes, change in imported macro functions, etc done > On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote: > > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java, > > line 103 > > <https://reviews.apache.org/r/39226/diff/1/?file=1095353#file1095353line103> > > > > This might have to do more checks to skip some settings that usually > > change between runs but do not affect recovering. For eg: Running through > > Oozie, for a rerun you will get a different launcher job id in the config. > > Abhishek Agarwal wrote: > I probably missed this configuration. Skipping custom hard-coded settings > is not feasible. User can give as an option to skip some configurations but > that will make things complex for the user. I am now inclining toward an > approach similar to oozie. For a rerun, user can explicitly specifiy rerun > option. Pig will simply use the new configuration and recover the job. At > least then behavior is easier to explain. I have removed the check for configuration. Pig script will ignore the configuration changes as they are very hard to track. > On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote: > > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java, > > line 532 > > <https://reviews.apache.org/r/39226/diff/1/?file=1095351#file1095351line532> > > > > Creating files that user is not expecting in output directories will be > > a problem. > > Abhishek Agarwal wrote: > That is understandable. Another approach could be store the completion > state of the job along with output path. We can opt to not rerun the job if > last state was successful and the directory is still present. Should we also > note the timestamp of the directory? On further thinking, temporary directories are not visible to the user. So it should be fine if we create the _SUCCESS file in them. > On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote: > > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java, > > line 163 > > <https://reviews.apache.org/r/39226/diff/1/?file=1095353#file1095353line163> > > > > Use LoadStoreFinder like before. Reducing the number of times plan is > > traversed will help reduce compile time for large scripts. Done. > On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote: > > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java, > > line 184 > > <https://reviews.apache.org/r/39226/diff/1/?file=1095353#file1095353line184> > > > > Just the existence of temporary store directory does not guarantee that > > the job ran successfully and it has full output in it. You need to check if > > the mapreduce job that wrote to the temp store location was also > > successfull. Not sure if that is what you intend to do in > > isFailedMROper(previousMR). Currently it is not implemented or does not > > have TODO comments. > > > > Also there might be cases, data is written to side files in UDFs and > > temp stores are not used. You will have to check the code. Daniel might > > know off the top of his head. can you give me some examples in code? > On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote: > > src/org/apache/pig/tools/pigstats/ScriptState.java, line 302 > > <https://reviews.apache.org/r/39226/diff/1/?file=1095357#file1095357line302> > > > > You should be going instead with pig.logical.plan.signature instead of > > creating a hash of the script. Most of the scripts are parameterized and > > hash of the script is not useful to differentiate when run with different > > parameters. Done > On Oct. 21, 2015, 11:19 a.m., Rohini Palaniswamy wrote: > > src/org/apache/pig/impl/io/FileLocalizer.java, line 63 > > <https://reviews.apache.org/r/39226/diff/1/?file=1095355#file1095355line63> > > > > Can you avoid wildcard imports? Also set your package import order to - > > java, javax, org, com - if that is not the default in your IDE. That is the > > standard used in Pig and would avoid bringing in unnecessary changes to all > > classes when you are contributing. Done - Abhishek ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/39226/#review103384 ----------------------------------------------------------- On Oct. 12, 2015, 11:30 a.m., Abhishek Agarwal wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/39226/ > ----------------------------------------------------------- > > (Updated Oct. 12, 2015, 11:30 a.m.) > > > Review request for pig and Rohini Palaniswamy. > > > Repository: pig-git > > > Description > ------- > > Pig scripts can have multiple ETL jobs in the DAG which may take hours to > finish. In case of transient errors, the job fails. When the job is rerun, > all the nodes in Job graph will rerun. Some of these nodes may have already > run successfully. Redundant runs lead to wastage of cluster capacity and > pipeline delays. > > In case of failure, we can persist the graph state. In next run, only the > failed nodes and their successors will rerun. This is of course subject to > preconditions such as > > Pig script has not changed > > Input locations have not changed > > Output data from previous run is intact > > Configuration has not changed > > > Diffs > ----- > > src/org/apache/pig/PigConfiguration.java 03b36a5 > > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MapReduceLauncher.java > 595e68c > > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRIntermediateDataVisitor.java > 4b62112 > > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobRecovery.java > PRE-CREATION > > src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/MRJobState.java > PRE-CREATION > src/org/apache/pig/impl/io/FileLocalizer.java f0f9b43 > src/org/apache/pig/tools/grunt/GruntParser.java 439d087 > src/org/apache/pig/tools/pigstats/ScriptState.java 03a12b1 > > Diff: https://reviews.apache.org/r/39226/diff/ > > > Testing > ------- > > > Thanks, > > Abhishek Agarwal > >
