If it's not already been discussed, how does this interact with Hadoop's feature of skipping bad records?
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/SkipBadRecords.html

Ashutosh
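For reference, a minimal sketch (independent of Pig, values illustrative) of what enabling the record-skipping feature linked above looks like on a raw MapReduce job:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

public class SkippingJobSetup {
  public static void configure(JobConf conf) {
    // enter skipping mode after two failed attempts of the same task
    SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
    // tolerate skipping up to 100 records around a bad record per map task
    SkipBadRecords.setMapperMaxSkipRecords(conf, 100L);
    // write the skipped records to HDFS for later inspection
    SkipBadRecords.setSkipOutputPath(conf, new Path("/tmp/skipped"));
  }
}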
On Thu, Jan 20, 2011 at 12:53, Olga Natkovich <ol...@yahoo-inc.com> wrote:

Hi guys,

Could you put a quick wiki with your proposal together? I think it would make it much easier than following the email discussion.

Thanks,
Olga

-----Original Message-----
From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
Sent: Thursday, January 20, 2011 11:52 AM
To: dev@pig.apache.org
Subject: Re: Exception Handling in Pig Scripts

Right, what I am saying is that the tasks would not fail because we'd catch the errors.

Thanks for the lmyit link.. learn something new every day.

On Thu, Jan 20, 2011 at 11:31 AM, Julien Le Dem <led...@yahoo-inc.com> wrote:

Doesn't Hadoop discard the increments to counters done by failed tasks? (I would expect that, but I don't know.)
Also, using counters we should make sure we don't mix up multiple relations being combined by the optimizer.

P.S.: Regarding rror, I don't see why you would want two of these: http://lmyit.com/rror :P

On 1/20/11 4:54 AM, "Dmitriy Ryaboy" <dvrya...@gmail.com> wrote:

I think this is coming together! I like the idea of a client-side handler method that allows us to look at all errors in aggregate and make decisions based on proportions. How can we guard against catching the wrong mistakes -- say, letting a mapper that's running on a bad node and fails all local disk writes finish "successfully", even though the task really just needs to be rerun on a different node, which MR would normally take care of?
Let's put this on a wiki for wider feedback.

P.S. What's a "rror" and why do we only want one of them?

On Wed, Jan 19, 2011 at 3:07 PM, Julien Le Dem <led...@yahoo-inc.com> wrote:

Some more thoughts.

* Looking at the existing keywords:
http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Reserved+Keywords
It seems ONERROR would be better than ON_ERROR for consistency. There is an existing ONSCHEMA but no _-based keyword.

* The default behavior should be to die on error; it can be overridden as follows:
DEFAULT ONERROR <error handler>;

* Built-in error handlers:
Ignore() => ignores errors by dropping records that cause exceptions
Fail() => fails the script on error (default)
FailOnThreshold(threshold) => fails if the number of errors is above the threshold

* The error handler interface needs a method called on the client side after the relation is computed, to decide what to do next. Typically FailOnThreshold will throw an exception if (#errors/#input) > threshold, using counters (see the sketch after this list).

public interface ErrorHandler<T> {

  // input is not the input of the UDF, it's the tuple from the relation
  T handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws IOException;

  Schema outputSchema(Schema input);

  // called afterwards on the client side
  void collectResult() throws IOException;

}

* SPLIT is optional
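For concreteness, a minimal sketch of how FailOnThreshold might implement the interface above. This is not part of the proposal: the counter plumbing is elided, and the totalInput field is hypothetical, standing in for a value the framework would supply from the job's counters.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class FailOnThreshold implements ErrorHandler<Tuple> {
  private final double threshold;
  private long errors = 0;
  private long totalInput = 0;  // hypothetical: would be read from MR counters

  public FailOnThreshold(double threshold) {
    this.threshold = threshold;
  }

  public Tuple handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws IOException {
    errors++;      // a real implementation would bump an MR counter here
    return null;   // drop the failing record
  }

  public Schema outputSchema(Schema input) {
    return input;  // pass-through: this handler emits nothing of its own
  }

  // called on the client side once the relation has been computed
  public void collectResult() throws IOException {
    if (totalInput > 0 && (double) errors / totalInput > threshold) {
      throw new IOException("error rate " + errors + "/" + totalInput
          + " exceeded threshold " + threshold);
    }
  }
}

Whether the handler or the framework owns the counter bookkeeping is exactly the kind of detail a wiki proposal would need to pin down.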
Example:

DEFAULT ONERROR Ignore();
...

DESCRIBE A;
A: {name: chararray, age: int, gpa: float}

-- fail if more than 1% errors
B1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR FailOnThreshold(0.01);

-- need to make sure the twitter infrastructure can handle the load
C1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR Tweet();

-- custom handler that counts errors and logs on the client side
D1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR CountMyErrors();

-- uses the default handler and SPLIT
B2 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR SPLIT INTO B2_ERRORS;

-- B2_ERRORS cannot really contain the input to the UDF, as it would have a different schema depending on which UDF failed
DESCRIBE B2_ERRORS;
B2_ERRORS: {input: (name: chararray, age: int, gpa: float), udf: chararray, error: (class: chararray, message: chararray, stacktrace: chararray)}

-- example of filtering on the udf
C2 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR SPLIT INTO C2_FOO_ERRORS IF udf='Foo', C2_BAR_ERRORS IF udf='Bar';

Julien

On 1/18/11 3:24 PM, "Dmitriy Ryaboy" <dvrya...@gmail.com> wrote:

We should think more about the interface.
For example, the "Tuple input" argument -- is that the tuple that was passed to the udf, or the whole tuple that was being processed? I can see wanting both.
Also, the Handler should probably have init and finish methods in case some accumulation is happening, or state needs to get set up...

Not sure about "splitting" into a table. Maybe more like:

A = FOREACH FOO GENERATE Bar(*) ON_ERROR [use] MyHandler SPLIT [into] A_ERRORS;

"use" and "into" are optional syntactic sugar.

This allows us to do any combination of:
- die
- put the original record into a table
- process the error using a custom handler (which can increment counters, write to dbs, send tweets... definitely send tweets...)

D

On Tue, Jan 18, 2011 at 10:27 AM, Julien Le Dem <led...@yahoo-inc.com> wrote:

That would be nice.
Also, letting the error handler output its result to a relation would be useful.
(To let the script output application error metrics.)
For example, it could (optionally) use the keyword INTO, just like the SPLIT operator:

FOO = LOAD ...;
A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT MyHandler INTO A_ERRORS;

ErrorHandler would look a little more like EvalFunc:

public interface ErrorHandler<T> {

  public T handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws IOException;

  public Schema outputSchema(Schema input);

}

There could be a built-in handler to output the skipped record (input: tuple, funcname: chararray, errorMessage: chararray):

A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT INTO A_ERRORS;

Julien
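A minimal sketch of the built-in handler described above, emitting (input, funcname, errorMessage) against the two-method interface; the class name ErrorRecord is illustrative, not an existing Pig class:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class ErrorRecord implements ErrorHandler<Tuple> {
  private static final TupleFactory tf = TupleFactory.getInstance();

  public Tuple handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws IOException {
    Tuple t = tf.newTuple(3);
    t.set(0, input);                          // the skipped record
    t.set(1, evalFunc.getClass().getName());  // funcname
    t.set(2, ioe.getMessage());               // errorMessage
    return t;
  }

  public Schema outputSchema(Schema input) {
    try {
      Schema s = new Schema();
      s.add(new Schema.FieldSchema("input", input, DataType.TUPLE));
      s.add(new Schema.FieldSchema("funcname", DataType.CHARARRAY));
      s.add(new Schema.FieldSchema("errorMessage", DataType.CHARARRAY));
      return s;
    } catch (Exception e) {  // FieldSchema construction can throw
      throw new RuntimeException(e);
    }
  }
}

With ON_ERROR SPLIT INTO A_ERRORS, the relation A_ERRORS would then carry exactly these three fields.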
On 1/16/11 12:22 AM, "Dmitriy Ryaboy" <dvrya...@gmail.com> wrote:

I was thinking about this..

We add an optional ON_ERROR clause to operators, which allows a user to specify error handling. The error handler would be a udf that would implement an interface along these lines:

public interface ErrorHandler {

  public void handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws IOException;

}

I think it makes sense not to make this a static method, so that users can keep required state and, for example, have the handler throw its own IOException if it's been invoked too many times.

D

On Sat, Jan 15, 2011 at 11:53 PM, Santhosh Srinivasan <s...@yahoo-inc.com> wrote:

Thanks for the clarification, Ashutosh.

Implementing this in the user realm is tricky, as Dmitriy states. Sensitivity to error thresholds will require support from the system. We can probably provide a taxonomy of records (good, bad, incomplete, etc.) to let users classify each record. The system can then track counts of each record type to facilitate the computation of thresholds. The last part is to allow users to specify thresholds and appropriate actions (interrupt, exit, continue, etc.). A possible mechanism to realize this is the ErrorHandlingUDF described by Dmitriy.

Santhosh

-----Original Message-----
From: Ashutosh Chauhan [mailto:hashut...@apache.org]
Sent: Friday, January 14, 2011 7:35 PM
To: u...@pig.apache.org
Subject: Re: Exception Handling in Pig Scripts

Santhosh,

The way you are proposing it, it will kill the pig script. I think what the user wants is to ignore a few "bad records", process the rest, and get results. The problem here is how to let the user tell Pig the definition of a "bad record", and how to let him specify the threshold percentage of bad records at which Pig should fail the script.

Ashutosh

On Fri, Jan 14, 2011 at 18:18, Santhosh Srinivasan <s...@yahoo-inc.com> wrote:

Sorry about the late response.

Hadoop n00b is proposing a language extension for error handling, similar to the mechanisms in other well-known languages like C++, Java, etc.

For now, can't the error semantics be handled by the UDF? For exceptional scenarios you could throw an ExecException with the right details. The physical operator that handles the execution of UDFs traps it for you and propagates the error back to the client. You can take a look at any of the builtin UDFs to see how Pig handles it internally.

Santhosh
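As an illustration of handling errors inside the UDF itself, a hedged sketch: malformed input is swallowed by returning null (so the record is skipped), while a genuinely exceptional call signature throws ExecException for the physical operator to trap. SafeDateDiff and its fixed date format are hypothetical, not a PiggyBank UDF.

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

public class SafeDateDiff extends EvalFunc<Long> {
  @Override
  public Long exec(Tuple input) throws IOException {
    if (input == null || input.size() != 2) {
      // truly exceptional: wrong arity is a script bug, so fail loudly
      throw new ExecException("SafeDateDiff expects exactly two chararray dates");
    }
    try {
      return diffSeconds((String) input.get(0), (String) input.get(1));
    } catch (Exception e) {
      return null;  // malformed date: skip this record rather than kill the script
    }
  }

  private long diffSeconds(String a, String b) throws ParseException {
    SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
    fmt.setLenient(false);
    return (fmt.parse(a).getTime() - fmt.parse(b).getTime()) / 1000L;
  }
}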
-----Original Message-----
From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
Sent: Tuesday, January 11, 2011 10:41 AM
To: u...@pig.apache.org
Subject: Re: Exception Handling in Pig Scripts

Right now error handling is controlled by the UDFs themselves, and there is no way to direct it externally.
You can make an ErrorHandlingUDF that would take a udf spec, invoke it, trap errors, and then do the specified error handling behavior.. that's a bit ugly though.

There is a problem with trapping general exceptions, of course, in that if they happen 0.000001% of the time you can probably just ignore them, but if they happen in half your dataset, you want the job to tell you something is wrong. So this stuff gets non-trivial. If anyone wants to propose a design to solve this general problem, I think that would be a welcome addition.

D

On Tue, Jan 11, 2011 at 12:47 AM, hadoop n00b <new2h...@gmail.com> wrote:

Thanks. I sometimes get a date like 0001-01-01. This would be a valid date format, but when I try to get the seconds between this and another date, say 2011-01-01, I get an error that the value is too large to fit into an int, and the process stops. Do we have something like ifError(x-y, null, x-y)? Or would I have to implement this as a UDF?

Thanks

On Tue, Jan 11, 2011 at 11:40 AM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:

Create a UDF that verifies the format, and go through a filtering step first.
If you would like to save the malformed records so you can look at them later, you can use the SPLIT operator to route the good records to your regular workflow, and the bad records someplace on HDFS.

-D

On Mon, Jan 10, 2011 at 9:58 PM, hadoop n00b <new2h...@gmail.com> wrote:

Hello,

I have a pig script that uses PiggyBank to calculate date differences. Sometimes, when I get a weird date or wrong format in the input, the script throws an error and aborts.

Is there a way I could trap these errors and move on without stopping the execution?

Thanks

PS: I'm using CDH2 with Pig 0.5
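A minimal sketch of the validate-then-filter approach suggested above: a boolean UDF that checks the date format up front, so malformed records can be routed aside before any date arithmetic runs. The class name IsGoodDate and the accepted format are illustrative, not an existing PiggyBank UDF.

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class IsGoodDate extends FilterFunc {
  @Override
  public Boolean exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return false;  // missing value counts as bad
    }
    SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
    fmt.setLenient(false);  // reject dates like 2011-13-45
    try {
      fmt.parse((String) input.get(0));
      return true;
    } catch (ParseException e) {
      return false;  // malformed date: route this record to the bad branch
    }
  }
}

A script can then route records before the arithmetic, e.g. SPLIT raw INTO good IF IsGoodDate(dt), bad IF NOT IsGoodDate(dt); and STORE bad somewhere on HDFS for later inspection.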