Well, you can do this manually by adding explicit load/store boundaries to your code. Thinking out loud, automating such a thing could be possible...
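Roughly, the manual version looks like the sketch below (the paths, aliases, and field names are made up purely for illustration): you checkpoint with an explicit STORE, and a separate downstream script resumes from that output with a LOAD instead of recomputing the upstream jobs.

  -- stage1.pig: run the expensive part once and checkpoint it to HDFS
  raw    = LOAD '/data/events' USING PigStorage('\t')
               AS (user:chararray, url:chararray, ts:long);
  grpd   = GROUP raw BY user;
  counts = FOREACH grpd GENERATE group AS user, COUNT(raw) AS visits;
  STORE counts INTO '/tmp/checkpoints/user_counts';  -- explicit M/R boundary

  -- stage2.pig: rerunnable on its own; picks up from the checkpoint
  counts    = LOAD '/tmp/checkpoints/user_counts' USING PigStorage()
                  AS (user:chararray, visits:long);
  top_users = ORDER counts BY visits DESC;
  STORE top_users INTO '/output/top_users';

If something fails in stage2, you only rerun stage2; stage1's output is already sitting in HDFS.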
For the automatic version: at any M/R boundary, you store the intermediate in HDFS, and Pig is aware of this and doesn't automatically delete it (that part alone is not trivial -- what manages the garbage collection? Perhaps that could be part of the configuration of such a feature). Then, when you rerun a job, it would check whether the intermediates it would have saved (it knows them at compile time) already exist, and reuse them instead of recomputing.

There are some tricky caveats here... What if your code changes affect the intermediate data? You could save the logical plan as well, but what if you change a UDF? I'm not sure whether the benefit of automating this inside the language, compared to building a workflow like yours outside of Pig, is worth the complexity. But it is intriguing, and it is a subset of the data caching we have thought a lot about here.

2012/6/15 Russell Jurney <russell.jur...@gmail.com>

> In production I use short Pig scripts and schedule them with Azkaban
> with dependencies set up, so that I can use Azkaban to restart long
> data pipelines at the point of failure. I edit the failing Pig script,
> usually towards the end of the data pipeline, and restart the Azkaban
> job. This saves hours and hours of repeated processing.
>
> I wish Pig could do this: resume at its point of failure when
> re-run from the command line. Is this feasible?
>
> Russell Jurney
> twitter.com/rjurney
> russell.jur...@gmail.com
> datasyndrome.com