That would be nice. Also, letting the error handler output its result to a relation would be useful (so the script can emit application error metrics). For example, it could optionally use the keyword INTO, just like the SPLIT operator.
FOO = LOAD ...;
A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT MyHandler INTO A_ERRORS;

ErrorHandler would look a little more like EvalFunc:

public interface ErrorHandler<T> {
    public T handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws IOException;
    public Schema outputSchema(Schema input);
}

There could be a built-in handler to output the skipped record (input: tuple, funcname: chararray, errorMessage: chararray):

A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT INTO A_ERRORS;

Julien

On 1/16/11 12:22 AM, "Dmitriy Ryaboy" <dvrya...@gmail.com> wrote:

I was thinking about this. We add an optional ON_ERROR clause to operators, which allows a user to specify error handling. The error handler would be a UDF that implements an interface along these lines:

public interface ErrorHandler {
    public void handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws IOException;
}

I think it makes sense not to make this a static method, so that users can keep required state and, for example, have the handler throw its own IOException if it's been invoked too many times.

D

On Sat, Jan 15, 2011 at 11:53 PM, Santhosh Srinivasan <s...@yahoo-inc.com> wrote:

> Thanks for the clarification, Ashutosh.
>
> Implementing this in the user realm is tricky, as Dmitriy states.
> Sensitivity to error thresholds will require support from the system. We can
> probably provide a taxonomy of records (good, bad, incomplete, etc.) to let
> users classify each record. The system can then track counts of each record
> type to facilitate the computation of thresholds. The last part is to allow
> users to specify thresholds and appropriate actions (interrupt, exit,
> continue, etc.). A possible mechanism to realize this is the
> ErrorHandlingUDF described by Dmitriy.
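The stateful-handler idea above can be sketched as follows. This is purely illustrative: ON_ERROR and ErrorHandler were proposals on this thread, not a released Pig API, and the names here (CountingErrorHandler, maxErrors, getErrorCount) are invented for the example. The Pig-specific parameters (EvalFunc, Tuple) are omitted to keep the sketch self-contained.

```java
import java.io.IOException;

// Illustrative sketch only: models Dmitriy's suggestion of a non-static,
// stateful handler that tolerates a few failures but gives up (by throwing
// its own IOException) once it has been invoked too many times.
class CountingErrorHandler {
    private final long maxErrors;   // threshold chosen by the script author
    private long errorCount = 0;

    CountingErrorHandler(long maxErrors) {
        this.maxErrors = maxErrors;
    }

    // In the proposed interface this would also receive the failing EvalFunc
    // and the input Tuple; here we only keep the counting logic.
    void handle(IOException cause) throws IOException {
        errorCount++;
        if (errorCount > maxErrors) {
            throw new IOException(
                "Too many record-level failures (" + errorCount + ")", cause);
        }
        // Otherwise swallow the error; the record is skipped.
    }

    long getErrorCount() {
        return errorCount;
    }
}
```

A handler registered via the proposed ON_ERROR ... SPLIT INTO clause could additionally route each skipped record to the error relation, along the lines of the built-in handler Julien describes.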
>
> Santhosh
>
> -----Original Message-----
> From: Ashutosh Chauhan [mailto:hashut...@apache.org]
> Sent: Friday, January 14, 2011 7:35 PM
> To: u...@pig.apache.org
> Subject: Re: Exception Handling in Pig Scripts
>
> Santhosh,
>
> The way you are proposing, it will kill the pig script. I think what the user
> wants is to ignore a few "bad records", process the rest, and get the
> results. The problem here is how to let the user tell Pig the definition of a
> "bad record", and how to let him specify the threshold for the % of bad
> records at which Pig should fail the script.
>
> Ashutosh
>
> On Fri, Jan 14, 2011 at 18:18, Santhosh Srinivasan <s...@yahoo-inc.com> wrote:
> > Sorry about the late response.
> >
> > Hadoop n00b is proposing a language extension for error handling, similar
> > to the mechanisms in other well-known languages like C++, Java, etc.
> >
> > For now, can't the error semantics be handled by the UDF? For exceptional
> > scenarios you could throw an ExecException with the right details. The
> > physical operator that handles the execution of UDFs traps it for you and
> > propagates the error back to the client. You can take a look at any of the
> > builtin UDFs to see how Pig handles it internally.
> >
> > Santhosh
> >
> > -----Original Message-----
> > From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
> > Sent: Tuesday, January 11, 2011 10:41 AM
> > To: u...@pig.apache.org
> > Subject: Re: Exception Handling in Pig Scripts
> >
> > Right now error handling is controlled by the UDFs themselves, and there
> > is no way to direct it externally.
> > You can make an ErrorHandlingUDF that would take a UDF spec, invoke it,
> > trap errors, and then do the specified error-handling behavior... that's a
> > bit ugly though.
> >
> > There is a problem with trapping general exceptions, of course, in that if
> > they happen 0.000001% of the time you can probably just ignore them, but if
> > they happen in half your dataset, you want the job to tell you something is
> > wrong.
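The tension Dmitriy describes (failures at 0.000001% can be ignored, while failures in half the dataset should abort the job) amounts to a bad-record ratio threshold, which pairs naturally with Santhosh's good/bad/incomplete taxonomy. A minimal sketch; all names here (ThresholdTracker, RecordClass, shouldFail) are invented for illustration and are not a Pig API:

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch of the taxonomy-plus-threshold idea from the thread:
// classify each record, count per class, and signal failure once the
// fraction of non-GOOD records exceeds a user-chosen limit.
class ThresholdTracker {
    enum RecordClass { GOOD, BAD, INCOMPLETE }

    private final Map<RecordClass, Long> counts = new EnumMap<>(RecordClass.class);
    private final double maxBadFraction;  // e.g. 0.10 to tolerate up to 10% bad records

    ThresholdTracker(double maxBadFraction) {
        this.maxBadFraction = maxBadFraction;
        for (RecordClass c : RecordClass.values()) {
            counts.put(c, 0L);
        }
    }

    void classify(RecordClass c) {
        counts.put(c, counts.get(c) + 1);
    }

    // True once non-GOOD records exceed the threshold: the job should fail
    // loudly rather than keep silently skipping records.
    boolean shouldFail() {
        long total = 0;
        for (long v : counts.values()) {
            total += v;
        }
        if (total == 0) {
            return false;
        }
        long bad = total - counts.get(RecordClass.GOOD);
        return (double) bad / total > maxBadFraction;
    }
}
```

As Santhosh notes, in practice these counts would have to be tracked by the system (e.g. via counters aggregated across tasks), since no single UDF instance sees the whole dataset.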
> > So this stuff gets non-trivial. If anyone wants to propose a design
> > to solve this general problem, I think that would be a welcome addition.
> >
> > D
> >
> > On Tue, Jan 11, 2011 at 12:47 AM, hadoop n00b <new2h...@gmail.com> wrote:
> >
>> Thanks. I sometimes get a date like 0001-01-01. This would be a valid
>> date format, but when I try to get the seconds between this and
>> another date, say 2011-01-01, I get an error that the value is too
>> large to fit into an int, and the process stops. Do we have something
>> like ifError(x-y, null, x-y)? Or would I have to implement this as a
>> UDF?
>>
>> Thanks
>>
>> On Tue, Jan 11, 2011 at 11:40 AM, Dmitriy Ryaboy <dvrya...@gmail.com>
>> wrote:
>>
>> > Create a UDF that verifies the format, and go through a filtering
>> > step first.
>> > If you would like to save the malformed records so you can look
>> > at them later, you can use the SPLIT operator to route the good
>> > records to your regular workflow, and the bad records some place on HDFS.
>> >
>> > -D
>> >
>> > On Mon, Jan 10, 2011 at 9:58 PM, hadoop n00b <new2h...@gmail.com> wrote:
>> >
>> > > Hello,
>> > >
>> > > I have a pig script that uses piggybank to calculate date differences.
>> > > Sometimes, when I get a weird date or a wrong format in the input,
>> > > the script throws an error and aborts.
>> > >
>> > > Is there a way I could trap these errors and move on without
>> > > stopping the execution?
>> > >
>> > > Thanks
>> > >
>> > > PS: I'm using CDH2 with Pig 0.5
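The filter-then-SPLIT approach Dmitriy suggests boils down to a predicate like the one below. The sketch is hypothetical: DateSanity and isUsable are invented names, the 1970-01-01 lower bound is an assumed cutoff (chosen because dates like 0001-01-01 produce differences too large for int-seconds arithmetic), and a real Pig UDF would extend FilterFunc and use the date APIs available in the Pig 0.5 era rather than java.time.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;

// Sketch of the validation logic a filter UDF could wrap: accept only
// strings that parse as yyyy-MM-dd and fall on or after an assumed floor,
// so sentinel dates like 0001-01-01 are routed to the bad-records branch.
class DateSanity {
    // Assumed floor; pick whatever bound keeps your date arithmetic in range.
    private static final LocalDate MIN = LocalDate.of(1970, 1, 1);

    static boolean isUsable(String s) {
        try {
            LocalDate d = LocalDate.parse(s);  // expects ISO yyyy-MM-dd
            return !d.isBefore(MIN);           // rejects 0001-01-01 etc.
        } catch (DateTimeParseException e) {
            return false;                      // malformed input
        }
    }
}
```

In the Pig script this predicate would drive a SPLIT, sending records that pass to the regular workflow and the rest to a bad-records relation stored on HDFS for later inspection.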