That would be nice.
Also, letting the error handler output its result to a relation would be useful
(so the script can emit application error metrics). For example, it could
optionally use the INTO keyword, just like the SPLIT operator.

FOO = LOAD ...;
A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT MyHandler INTO A_ERRORS;

ErrorHandler would look a little more like EvalFunc:

public interface ErrorHandler<T> {

  public T handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws
      IOException;

  public Schema outputSchema(Schema input);

}

There could be a built-in handler that outputs the skipped record as
(input:tuple, funcname:chararray, errorMessage:chararray):

A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT INTO A_ERRORS;
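To make the shape of such a built-in handler concrete, here is a minimal,
self-contained sketch. The class name is hypothetical, and Pig's Tuple is
replaced by List&lt;Object&gt; and the EvalFunc reference by its name, purely for
illustration; the real implementation would use the types from the interface
above.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for the proposed interface: Tuple -> List<Object>,
// EvalFunc -> its name. For illustration only.
interface ErrorHandler<T> {
    T handle(IOException ioe, String funcName, List<Object> input) throws IOException;
}

// Built-in handler sketch: instead of failing the script, emit one record
// per skipped input, shaped as (input, funcname, errorMessage).
class SkippedRecordHandler implements ErrorHandler<List<Object>> {
    @Override
    public List<Object> handle(IOException ioe, String funcName, List<Object> input) {
        return Arrays.asList(input.toString(), funcName, ioe.getMessage());
    }
}
```

Records routed through such a handler would land in the relation named by
INTO (A_ERRORS above).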

Julien

On 1/16/11 12:22 AM, "Dmitriy Ryaboy" <dvrya...@gmail.com> wrote:

I was thinking about this...

We add an optional ON_ERROR clause to operators, which allows a user to
specify error handling. The error handler would be a udf that would
implement an interface along these lines:

public interface ErrorHandler {

  public void handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws
      IOException;

}

I think it makes sense not to make this a static method, so that users can
keep required state and, for example, have the handler throw its own
IOException if it's been invoked too many times.
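A self-contained sketch of such a stateful handler (the class name, the
maxErrors constructor parameter, and the simplified signature are all
hypothetical; the real one would take EvalFunc and Tuple as in the interface
above):

```java
import java.io.IOException;

// Hypothetical stateful handler: skips individual bad records, but rethrows
// once the wrapped UDF has failed more than maxErrors times.
class ThresholdErrorHandler {
    private final long maxErrors;
    private long errorCount = 0;

    ThresholdErrorHandler(long maxErrors) {
        this.maxErrors = maxErrors;
    }

    // Called by the framework each time the wrapped UDF throws.
    void handle(IOException ioe, String funcName, Object input) throws IOException {
        if (++errorCount > maxErrors) {
            throw new IOException(funcName + " failed " + errorCount
                    + " times; aborting", ioe);
        }
        // Below the threshold: swallow the error and let processing continue.
    }
}
```

Because the handler instance persists across invocations, the error count is
exactly the kind of "required state" a static method could not keep.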

D


On Sat, Jan 15, 2011 at 11:53 PM, Santhosh Srinivasan <s...@yahoo-inc.com>wrote:

> Thanks for the clarification Ashutosh.
>
> Implementing this in the user realm is tricky as Dmitriy states.
> Sensitivity to error thresholds will require support from the system. We can
> probably provide a taxonomy of records (good, bad, incomplete, etc.) to let
> users classify each record. The system can then track counts of each record
> type to facilitate the computation of thresholds. The last part is to allow
> users to specify thresholds and appropriate actions (interrupt, exit,
> continue, etc.). A possible mechanism to realize this is the
> ErrorHandlingUDF described by Dmitriy.
>
> Santhosh
>
> -----Original Message-----
> From: Ashutosh Chauhan [mailto:hashut...@apache.org]
> Sent: Friday, January 14, 2011 7:35 PM
> To: u...@pig.apache.org
> Subject: Re: Exception Handling in Pig Scripts
>
> Santhosh,
>
> The way you are proposing it, it will kill the Pig script. I think what the
> user wants is to ignore a few "bad records", process the rest, and get
> results. The problem here is how to let the user tell Pig the definition of
> a "bad record", and how to let him specify the threshold for the % of bad
> records at which Pig should fail the script.
>
> Ashutosh
>
> On Fri, Jan 14, 2011 at 18:18, Santhosh Srinivasan <s...@yahoo-inc.com>
> wrote:
> > Sorry about the late response.
> >
> > Hadoop n00b is proposing a language extension for error handling, similar
> to the mechanisms in other well known languages like C++, Java, etc.
> >
> > For now, can't the error semantics be handled by the UDF? For exceptional
> scenarios you could throw an ExecException with the right details. The
> physical operator that handles the execution of UDF's traps it for you and
> propagates the error back to the client. You can take a look at any of the
> builtin UDFs to see how Pig handles it internally.
> >
> > Santhosh
> >
> > -----Original Message-----
> > From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
> > Sent: Tuesday, January 11, 2011 10:41 AM
> > To: u...@pig.apache.org
> > Subject: Re: Exception Handling in Pig Scripts
> >
> > Right now error handling is controlled by the UDFs themselves, and there
> is no way to direct it externally.
> > You can make an ErrorHandlingUDF that would take a udf spec, invoke it,
> trap errors, and then do the specified error handling behavior.. that's a
> bit ugly though.
> >
> > There is a problem with trapping general exceptions of course, in that if
> they happen 0.000001% of the time you can probably just ignore them, but if
> they happen in half your dataset, you want the job to tell you something is
> wrong. So this stuff gets non-trivial. If anyone wants to propose a design
> to solve this general problem, I think that would be a welcome addition.
> >
> > D
> >
> > On Tue, Jan 11, 2011 at 12:47 AM, hadoop n00b <new2h...@gmail.com>
> wrote:
> >
> >> Thanks, I sometimes get a date like 0001-01-01. This would be a valid
> >> date format, but when I try to get the seconds between this and
> >> another date, say 2011-01-01, I get an error that the value is too
> >> large to fit into an int and the process stops. Do we have something
> >> like ifError(x-y, null, x-y)? Or would I have to implement this as a
> >> UDF?
> >>
> >> Thanks
> >>
> >> On Tue, Jan 11, 2011 at 11:40 AM, Dmitriy Ryaboy <dvrya...@gmail.com>
> >> wrote:
> >>
> >> > Create a UDF that verifies the format, and go through a filtering
> >> > step first.
> >> > If you would like to save the malformed records so you can look
> >> > at them later, you can use the SPLIT operator to route the good
> >> > records to your regular workflow, and the bad records to some place
> >> > on HDFS.
> >> >
> >> > -D
> >> >
> >> > On Mon, Jan 10, 2011 at 9:58 PM, hadoop n00b <new2h...@gmail.com>
> wrote:
> >> >
> >> > > Hello,
> >> > >
> >> > > I have a Pig script that uses PiggyBank to calculate date
> >> > > differences. Sometimes, when I get a weird date or a wrong format
> >> > > in the input, the script throws an error and aborts.
> >> > >
> >> > > Is there a way I could trap these errors and move on without
> >> > > stopping the execution?
> >> > >
> >> > > Thanks
> >> > >
> >> > > PS: I'm using CDH2 with Pig 0.5
> >> > >
> >> >
> >>
> >
>
