Well, you can do this manually by adding explicit load/store boundaries to your code. Thinking out loud, automating such a thing could be possible...
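Roughly, the manual version looks like the sketch below (the paths, aliases, and field names are made up purely for illustration): you checkpoint with an explicit STORE, and a separate downstream script resumes from that output with a LOAD instead of recomputing the upstream jobs.

  -- stage1.pig: run the expensive part once and checkpoint it to HDFS
  raw    = LOAD '/data/events' USING PigStorage('\t')
               AS (user:chararray, url:chararray, ts:long);
  grpd   = GROUP raw BY user;
  counts = FOREACH grpd GENERATE group AS user, COUNT(raw) AS visits;
  STORE counts INTO '/tmp/checkpoints/user_counts';  -- explicit M/R boundary

  -- stage2.pig: rerunnable on its own; picks up from the checkpoint
  counts    = LOAD '/tmp/checkpoints/user_counts' USING PigStorage()
                  AS (user:chararray, visits:long);
  top_users = ORDER counts BY visits DESC;
  STORE top_users INTO '/output/top_users';

If something fails in stage2, you only rerun stage2; stage1's output is already sitting in HDFS.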
For the automatic version: at any M/R boundary, you store the intermediate in HDFS, and Pig is aware of this and doesn't automatically delete it (that part alone is not trivial -- what manages the garbage collection? Perhaps that could be part of the configuration of such a feature). Then, when you rerun a job, it would check whether the intermediates it would have saved (it knows them at compile time) already exist, and reuse them instead of recomputing.

There are some tricky caveats here... What if your code changes affect the intermediate data? You could save the logical plan as well, but what if you change a UDF? I'm not sure whether the benefit of automating this inside the language, compared to building a workflow like yours outside of Pig, is worth the complexity. But it is intriguing, and it is a subset of the data caching we have thought a lot about here.

2012/6/15 Russell Jurney <russell.jur...@gmail.com>

> In production I use short Pig scripts and schedule them with Azkaban
> with dependencies set up, so that I can use Azkaban to restart long
> data pipelines at the point of failure. I edit the failing Pig script,
> usually towards the end of the data pipeline, and restart the Azkaban
> job. This saves hours and hours of repeated processing.
>
> I wish Pig could do this: resume at its point of failure when
> re-run from the command line. Is this feasible?
>
> Russell Jurney
> twitter.com/rjurney
> russell.jur...@gmail.com
> datasyndrome.com