I think this is coming together! I like the idea of a client-side handler method that allows us to look at all errors in aggregate and make decisions based on proportions. How can we guard against catching the wrong mistakes -- say, letting a mapper that's running on a bad node and fails all local disk writes finish "successfully", even though the task really just needs to be rerun on a different node, and normally MR would take care of that for us? Let's put this on a wiki for wider feedback.
P.S. What's a "rror" and why do we only want one of them?

On Wed, Jan 19, 2011 at 3:07 PM, Julien Le Dem <led...@yahoo-inc.com> wrote:
> Some more thoughts.
>
> * Looking at the existing keywords:
> http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Reserved+Keywords
> It seems ONERROR would be better than ON_ERROR for consistency. There is an
> existing ONSCHEMA but no _ based keyword.
>
> * The default behavior should be to die on error; it can be overridden as
> follows:
> DEFAULT ONERROR <error handler>;
>
> * Built-in error handlers:
> Ignore() => ignores errors by dropping the records that cause exceptions
> Fail() => fails the script on error (default)
> FailOnThreshold(threshold) => fails if the number of errors is above the threshold
>
> * The error handler interface needs a method called on the client side after
> the relation is computed, to decide what to do next.
> Typically FailOnThreshold will throw an exception if
> (#errors/#input) > threshold, using counters.
>
> public interface ErrorHandler<T> {
>
>   // input is not the input of the UDF, it's the tuple from the relation
>   T handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws IOException;
>
>   Schema outputSchema(Schema input);
>
>   // called afterwards on the client side
>   void collectResult() throws IOException;
>
> }
>
> * SPLIT is optional
>
> example:
> DEFAULT ONERROR Ignore();
> ...
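As a concreteness check on the FailOnThreshold idea above, here is a minimal standalone sketch of its client-side decision logic. Plain long fields stand in for Hadoop job counters, and the Pig types (EvalFunc, Tuple, Schema) are omitted; class and method names are illustrative, not an actual Pig API.

```java
import java.io.IOException;

// Sketch only: in the real handler, collectResult() would read the
// #errors and #input counts from job counters after the relation is
// computed, rather than from fields updated per record.
class FailOnThreshold {
    private final double threshold;
    private long errors = 0;
    private long input = 0;

    FailOnThreshold(double threshold) {
        this.threshold = threshold;
    }

    // Called once per record; failed marks a record whose UDF threw.
    void record(boolean failed) {
        input++;
        if (failed) errors++;
    }

    // Client-side check after the relation is computed:
    // fail the script if the error ratio exceeds the threshold.
    void collectResult() throws IOException {
        if (input > 0 && (double) errors / input > threshold) {
            throw new IOException("error ratio " + ((double) errors / input)
                    + " exceeded threshold " + threshold);
        }
    }
}
```

With FailOnThreshold(0.01), 5 bad records out of 1000 pass silently, while 5 out of 100 fail the script.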
>
> DESCRIBE A;
> A: {name: chararray, age: int, gpa: float}
>
> -- fail if more than 1% errors
> B1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR FailOnThreshold(0.01);
>
> -- need to make sure the twitter infrastructure can handle the load
> C1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR Tweet();
>
> -- custom handler that counts errors and logs on the client side
> D1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR CountMyErrors();
>
> -- uses the default handler and SPLIT
> B2 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR SPLIT INTO B2_ERRORS;
>
> -- B2_ERRORS cannot really contain the input to the UDF, as it would have a
> -- different schema depending on which UDF failed
> DESCRIBE B2_ERRORS;
> B2_ERRORS: {input: (name: chararray, age: int, gpa: float), udf: chararray,
> error: (class: chararray, message: chararray, stacktrace: chararray)}
>
> -- example of filtering on the udf
> C2 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR SPLIT INTO
> C2_FOO_ERRORS IF udf='Foo', C2_BAR_ERRORS IF udf='Bar';
>
> Julien
>
> On 1/18/11 3:24 PM, "Dmitriy Ryaboy" <dvrya...@gmail.com> wrote:
>
> We should think more about the interface.
> For example, the "Tuple input" argument -- is that the tuple that was passed to
> the udf, or the whole tuple that was being processed? I can see wanting both.
> Also the Handler should probably have init and finish methods, in case some
> accumulation is happening or state needs to get set up...
>
> I'm not sure about "splitting" into a table. Maybe more like:
>
> A = FOREACH FOO GENERATE Bar(*) ON_ERROR [use] MyHandler SPLIT [into] A_ERRORS;
>
> "use" and "into" are optional syntactic sugar.
>
> This allows us to do any combination of:
> - die
> - put the original record into a table
> - process the error using a custom handler (which can increment counters,
> write to dbs, send tweets... definitely send tweets...)
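For the B2_ERRORS schema proposed above, the error side of each emitted record could be assembled roughly as in this plain-Java sketch. The field names mirror the schema {input, udf, error: (class, message, stacktrace)}; the ErrorRecord class itself is hypothetical, with Object[] standing in for a Pig Tuple.

```java
// Hypothetical plain-Java shape of one B2_ERRORS record.
class ErrorRecord {
    final Object[] input;    // the tuple from the relation, not the UDF's arguments
    final String udf;        // e.g. "Foo" or "Bar", usable for SPLIT ... IF udf='Foo'
    final String errorClass;
    final String message;
    final String stacktrace;

    ErrorRecord(Object[] input, String udf, Throwable t) {
        this.input = input;
        this.udf = udf;
        this.errorClass = t.getClass().getName();
        this.message = String.valueOf(t.getMessage());
        StringBuilder sb = new StringBuilder();
        for (StackTraceElement e : t.getStackTrace()) {
            sb.append(e).append('\n');
        }
        this.stacktrace = sb.toString();
    }
}
```

Keeping the whole relation tuple in `input` sidesteps the schema problem noted above: the error relation's schema stays fixed no matter which UDF failed.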
>
> D
>
> On Tue, Jan 18, 2011 at 10:27 AM, Julien Le Dem <led...@yahoo-inc.com> wrote:
>
> > That would be nice.
> > Also letting the error handler output the result to a relation would be
> > useful.
> > (To let the script output application error metrics.)
> > For example it could (optionally) use the keyword INTO just like the SPLIT
> > operator.
> >
> > FOO = LOAD ...;
> > A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT MyHandler INTO A_ERRORS;
> >
> > ErrorHandler would look a little more like EvalFunc:
> >
> > public interface ErrorHandler<T> {
> >
> >   public T handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws
> >   IOException;
> >
> >   public Schema outputSchema(Schema input);
> >
> > }
> >
> > There could be a built-in handler to output the skipped record (input:
> > tuple, funcname: chararray, errorMessage: chararray):
> >
> > A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT INTO A_ERRORS;
> >
> > Julien
> >
> > On 1/16/11 12:22 AM, "Dmitriy Ryaboy" <dvrya...@gmail.com> wrote:
> >
> > I was thinking about this...
> >
> > We add an optional ON_ERROR clause to operators, which allows a user to
> > specify error handling. The error handler would be a udf that would
> > implement an interface along these lines:
> >
> > public interface ErrorHandler {
> >
> >   public void handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws
> >   IOException;
> >
> > }
> >
> > I think it makes sense not to make this a static method, so that users could
> > keep required state and, for example, have the handler throw its own
> > IOException if it's been invoked too many times.
> >
> > D
> >
> > On Sat, Jan 15, 2011 at 11:53 PM, Santhosh Srinivasan <s...@yahoo-inc.com> wrote:
> >
> > > Thanks for the clarification Ashutosh.
> > >
> > > Implementing this in the user realm is tricky, as Dmitriy states.
> > > Sensitivity to error thresholds will require support from the system. We can
> > > probably provide a taxonomy of records (good, bad, incomplete, etc.)
> > > to let users classify each record. The system can then track counts of each
> > > record type to facilitate the computation of thresholds. The last part is to
> > > allow users to specify thresholds and appropriate actions (interrupt, exit,
> > > continue, etc.). A possible mechanism to realize this is the
> > > ErrorHandlingUDF described by Dmitriy.
> > >
> > > Santhosh
> > >
> > > -----Original Message-----
> > > From: Ashutosh Chauhan [mailto:hashut...@apache.org]
> > > Sent: Friday, January 14, 2011 7:35 PM
> > > To: u...@pig.apache.org
> > > Subject: Re: Exception Handling in Pig Scripts
> > >
> > > Santhosh,
> > >
> > > The way you are proposing, it will kill the pig script. I think what the user
> > > wants is to ignore a few "bad records", process the rest, and get results.
> > > The problem here is how to let the user tell Pig the definition of a "bad
> > > record", and how to let him specify the threshold for the % of bad records at
> > > which Pig should fail the script.
> > >
> > > Ashutosh
> > >
> > > On Fri, Jan 14, 2011 at 18:18, Santhosh Srinivasan <s...@yahoo-inc.com> wrote:
> > > > Sorry about the late response.
> > > >
> > > > Hadoop n00b is proposing a language extension for error handling, similar
> > > > to the mechanisms in other well known languages like C++, Java, etc.
> > > >
> > > > For now, can't the error semantics be handled by the UDF? For exceptional
> > > > scenarios you could throw an ExecException with the right details. The
> > > > physical operator that handles the execution of UDFs traps it for you and
> > > > propagates the error back to the client. You can take a look at any of the
> > > > builtin UDFs to see how Pig handles it internally.
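The record taxonomy described above (good, bad, incomplete) could be tracked with per-class counters so thresholds can be computed afterwards on the client. A plain-Java sketch, with the Kind names taken from the discussion and everything else illustrative:

```java
import java.util.EnumMap;

// Sketch: classify each record and keep per-class counts; the system
// would do this with job counters, one counter per record kind.
class RecordTaxonomy {
    enum Kind { GOOD, BAD, INCOMPLETE }

    private final EnumMap<Kind, Long> counts = new EnumMap<>(Kind.class);

    void classify(Kind kind) {
        counts.merge(kind, 1L, Long::sum);
    }

    long count(Kind kind) {
        return counts.getOrDefault(kind, 0L);
    }

    // Fraction of non-GOOD records, for client-side threshold checks.
    double badFraction() {
        long total = 0, bad = 0;
        for (Kind k : Kind.values()) {
            long c = count(k);
            total += c;
            if (k != Kind.GOOD) bad += c;
        }
        return total == 0 ? 0.0 : (double) bad / total;
    }
}
```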
> > > >
> > > > Santhosh
> > > >
> > > > -----Original Message-----
> > > > From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
> > > > Sent: Tuesday, January 11, 2011 10:41 AM
> > > > To: u...@pig.apache.org
> > > > Subject: Re: Exception Handling in Pig Scripts
> > > >
> > > > Right now error handling is controlled by the UDFs themselves, and there
> > > > is no way to direct it externally.
> > > > You can make an ErrorHandlingUDF that would take a udf spec, invoke it,
> > > > trap errors, and then do the specified error handling behavior... that's a
> > > > bit ugly though.
> > > >
> > > > There is a problem with trapping general exceptions, of course, in that if
> > > > they happen 0.000001% of the time you can probably just ignore them, but if
> > > > they happen in half your dataset, you want the job to tell you something is
> > > > wrong. So this stuff gets non-trivial. If anyone wants to propose a design
> > > > to solve this general problem, I think that would be a welcome addition.
> > > >
> > > > D
> > > >
> > > > On Tue, Jan 11, 2011 at 12:47 AM, hadoop n00b <new2h...@gmail.com> wrote:
> > > >
> > > >> Thanks, I sometimes get a date like 0001-01-01. This would be a valid
> > > >> date format, but when I try to get the seconds between this and
> > > >> another date, say 2011-01-01, I get an error that the value is too
> > > >> large to fit into an int, and the process stops. Do we have something
> > > >> like ifError(x-y, null, x-y)? Or would I have to implement this as a
> > > >> UDF?
> > > >>
> > > >> Thanks
> > > >>
> > > >> On Tue, Jan 11, 2011 at 11:40 AM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
> > > >>
> > > >> > Create a UDF that verifies the format, and go through a filtering
> > > >> > step first.
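The ErrorHandlingUDF wrapper mentioned above (take a udf spec, invoke it, trap errors) could look roughly like this sketch, with java.util.function.Function standing in for Pig's EvalFunc; the class and its names are invented for illustration. It drops failing records and gives up once too many have failed, as suggested earlier in the thread.

```java
import java.io.IOException;
import java.util.function.Function;

// Sketch: trap each failure, return null for the bad record, and
// throw once more than maxErrors records have failed.
class TrappingWrapper<I, O> {
    private final Function<I, O> delegate;
    private final long maxErrors;
    private long errors = 0;

    TrappingWrapper(Function<I, O> delegate, long maxErrors) {
        this.delegate = delegate;
        this.maxErrors = maxErrors;
    }

    O apply(I input) throws IOException {
        try {
            return delegate.apply(input);
        } catch (RuntimeException e) {
            if (++errors > maxErrors) {
                throw new IOException("too many failures: " + errors, e);
            }
            return null; // drop this record and keep going
        }
    }
}
```

The ugly part Dmitriy mentions shows up here too: the wrapper hides which UDF failed unless it also records the delegate's name alongside the count.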
> > > >> > If you would like to save the malformed records so you can look
> > > >> > at them later, you can use the SPLIT operator to route the good
> > > >> > records to your regular workflow, and the bad records some place on
> > > >> > HDFS.
> > > >> >
> > > >> > -D
> > > >> >
> > > >> > On Mon, Jan 10, 2011 at 9:58 PM, hadoop n00b <new2h...@gmail.com> wrote:
> > > >> >
> > > >> > > Hello,
> > > >> > >
> > > >> > > I have a pig script that uses piggybank to calculate date differences.
> > > >> > > Sometimes, when I get a weird date or a wrong format in the input,
> > > >> > > the script throws an error and aborts.
> > > >> > >
> > > >> > > Is there a way I could trap these errors and move on without
> > > >> > > stopping the execution?
> > > >> > >
> > > >> > > Thanks
> > > >> > >
> > > >> > > PS: I'm using CDH2 with Pig 0.5
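On the 0001-01-01 problem specifically, the verify-then-filter step suggested above could be built around a check like this java.time sketch: accept a pair of dates only if they parse as ISO yyyy-MM-dd and their difference in seconds fits in an int. DateDiffGuard and safeToDiff are made-up names, and the real filter would be a Pig FilterFunc wrapping this logic.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;

// Sketch: pre-filter for a seconds-between computation that returns int.
class DateDiffGuard {
    // Integer.MAX_VALUE seconds is roughly 68 years.
    private static final long MAX_SECONDS = Integer.MAX_VALUE;

    static boolean safeToDiff(String a, String b) {
        try {
            LocalDate da = LocalDate.parse(a);
            LocalDate db = LocalDate.parse(b);
            long days = Math.abs(db.toEpochDay() - da.toEpochDay());
            return days * 86400L <= MAX_SECONDS;
        } catch (DateTimeParseException e) {
            return false; // malformed date: route to the error branch
        }
    }
}
```

A 0001-01-01 vs 2011-01-01 pair fails the check (about 2010 years of seconds), so SPLIT can route it to the bad-records relation instead of letting the difference overflow.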