If it hasn't already been discussed, how does this interact with
Hadoop's feature of skipping bad records:
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/SkipBadRecords.html

Ashutosh
On Thu, Jan 20, 2011 at 12:53, Olga Natkovich <ol...@yahoo-inc.com> wrote:
> Hi guys,
>
> Could you put a quick wiki with your proposal together? I think it would make
> it much easier than following the email discussion.
>
> Thanks,
>
> Olga
>
> -----Original Message-----
> From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
> Sent: Thursday, January 20, 2011 11:52 AM
> To: dev@pig.apache.org
> Subject: Re: Exception Handling in Pig Scripts
>
> Right, what I am saying is that the tasks would not fail because we'd catch
> the errors.
>
> Thanks for the lmyit link.. learn something new every day.
>
> On Thu, Jan 20, 2011 at 11:31 AM, Julien Le Dem <led...@yahoo-inc.com>wrote:
>
>> Doesn't Hadoop discard the increments to counters done by failed tasks?
>> (I would expect that, but I don't know.)
>> Also, when using counters, we should make sure we don't mix up multiple
>> relations being combined by the optimizer.
>>
>> P.S.: Regarding rror, I don't see why you would want two of these:
>> http://lmyit.com/rror
>> :P
>>
>>
>> On 1/20/11 4:54 AM, "Dmitriy Ryaboy" <dvrya...@gmail.com> wrote:
>>
>> I think this is coming together! I like the idea of a client-side handler
>> method that allows us to look at all errors in aggregate and make
>> decisions based on proportions. How can we guard against catching the
>> wrong mistakes -- say, letting a mapper that's running on a bad node and
>> failing all its local disk writes finish "successfully", when really the
>> task just needs to be rerun on a different node and MR would normally
>> take care of it?
>> Let's put this on a wiki for wider feedback.
>>
>> P.S. What's a "rror" and why do we only want one of them?
>>
>> On Wed, Jan 19, 2011 at 3:07 PM, Julien Le Dem <led...@yahoo-inc.com>
>> wrote:
>>
>> > Some more thoughts.
>> >
>> > * Looking at the existing keywords:
>> > http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Reserved+Keywords
>> > It seems ONERROR would be better than ON_ERROR for consistency. There is
>> > an existing ONSCHEMA but no underscore-based keyword.
>> >
>> > * The default behavior should be to die on error; it can be overridden
>> > as follows:
>> > DEFAULT ONERROR <error handler>;
>> >
>> > * Built in error handlers:
>> > Ignore() => ignores errors by dropping records that cause exceptions
>> > Fail() => fails the script on error. (default)
>> > FailOnThreshold(threshold) => fails if the number of errors is above the threshold
>> >
>> > * The error handler interface needs a method called on the client side
>> > after the relation is computed, to decide what to do next.
>> > Typically FailOnThreshold will use counters to throw an exception if
>> > (#errors / #input) > threshold.
>> > public interface ErrorHandler<T> {
>> >
>> > // input is not the input of the UDF, it's the tuple from the relation
>> > T handle(IOException ioe, EvalFunc evalFunc, Tuple input)
>> > throws IOException;
>> >
>> > Schema outputSchema(Schema input);
>> >
>> > // called afterwards on the client side
>> > void collectResult() throws IOException;
>> >
>> > }
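>> >
>> > For concreteness, here is a minimal sketch of what FailOnThreshold could
>> > look like against this interface. The ErrorCounters helper and the
>> > counter names are made up for illustration; the real counter plumbing
>> > would have to come from the runtime, which I assume also counts total
>> > input records:
>> >
>> > public class FailOnThreshold implements ErrorHandler<Void> {
>> >
>> >   private final double threshold;
>> >
>> >   public FailOnThreshold(double threshold) { this.threshold = threshold; }
>> >
>> >   // task side: count the error; returning null drops the record
>> >   public Void handle(IOException ioe, EvalFunc evalFunc, Tuple input)
>> >   throws IOException {
>> >     ErrorCounters.increment("ONERROR", "errors", 1); // hypothetical hook
>> >     return null;
>> >   }
>> >
>> >   public Schema outputSchema(Schema input) {
>> >     return null; // emits nothing; errors are only counted
>> >   }
>> >
>> >   // client side: compare the aggregated counters once the relation is
>> >   // computed
>> >   public void collectResult() throws IOException {
>> >     long errors = ErrorCounters.get("ONERROR", "errors");  // hypothetical
>> >     long records = ErrorCounters.get("ONERROR", "input");  // hypothetical
>> >     if (records > 0 && (double) errors / records > threshold) {
>> >       throw new IOException("error rate " + errors + "/" + records
>> >           + " is above threshold " + threshold);
>> >     }
>> >   }
>> > }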
>> >
>> > * SPLIT is optional
>> >
>> > example:
>> > DEFAULT ONERROR Ignore();
>> > ...
>> >
>> > DESCRIBE A;
>> > A: {name: chararray, age: int, gpa: float}
>> >
>> > -- fail if more than 1% errors
>> > B1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR
>> > FailOnThreshold(0.01) ;
>> >
>> > -- need to make sure the twitter infrastructure can handle the load
>> > C1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR Tweet() ;
>> >
>> > -- custom handler that counts errors and logs on the client side
>> > D1 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR CountMyErrors();
>> >
>> > -- uses default handler and SPLIT
>> > B2 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR SPLIT INTO
>> > B2_ERRORS;
>> >
>> > -- B2_ERRORS cannot really contain the input to the UDF, as it would
>> > have a different schema depending on which UDF failed
>> > DESCRIBE B2_ERRORS;
>> > B2_ERRORS: {input: (name: chararray, age: int, gpa: float),
>> > udf: chararray,
>> > error: (class: chararray, message: chararray, stacktrace: chararray)}
>> >
>> > -- example of filtering on the udf
>> > C2 = FOREACH A GENERATE Foo(age, gpa), Bar(name) ONERROR SPLIT INTO
>> > C2_FOO_ERRORS IF udf='Foo', C2_BAR_ERRORS IF udf='Bar';
>> >
>> > Julien
>> >
>> > On 1/18/11 3:24 PM, "Dmitriy Ryaboy" <dvrya...@gmail.com> wrote:
>> >
>> > We should think more about the interface.
>> > For example, the "Tuple input" argument -- is that the tuple that was
>> > passed to the udf, or the whole tuple that was being processed? I can
>> > see wanting both.
>> > Also, the Handler should probably have init and finish methods, in case
>> > some accumulation is happening or state needs to get set up...
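>> >
>> > Something like this, say (the method names are just a strawman):
>> >
>> > public interface ErrorHandler<T> {
>> >
>> >   // called once per task, before any records are processed
>> >   void init();
>> >
>> >   T handle(IOException ioe, EvalFunc evalFunc, Tuple input)
>> >   throws IOException;
>> >
>> >   Schema outputSchema(Schema input);
>> >
>> >   // called once per task, after the last record, e.g. to flush any
>> >   // accumulated state
>> >   void finish() throws IOException;
>> > }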
>> >
>> > not sure about "splitting" into a table. Maybe more like
>> >
>> > A = FOREACH FOO GENERATE Bar(*) ON_ERROR [use] MyHandler SPLIT [into]
>> > A_ERRORS;
>> >
>> > "use" and "into" are optional syntactic sugar.
>> >
>> > This allows us to do any combination of:
>> > - die
>> > - put original record into a table
>> > - process the error using a custom handler (which can increment counters,
>> > write to dbs, send tweets... definitely send tweets...)
>> >
>> > D
>> >
>> > On Tue, Jan 18, 2011 at 10:27 AM, Julien Le Dem <led...@yahoo-inc.com
>> > >wrote:
>> >
>> > > That would be nice.
>> > > Also letting the error handler output the result to a relation would be
>> > > useful.
>> > > (To let the script output application error metrics)
>> > > For example it could (optionally) use the keyword INTO, just like the
>> > > SPLIT operator.
>> > >
>> > > FOO = LOAD ...;
>> > > A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT MyHandler INTO A_ERRORS;
>> > >
>> > > ErrorHandler would look a little more like EvalFunc:
>> > >
>> > > public interface ErrorHandler<T> {
>> > >
>> > >  public T handle(IOException ioe, EvalFunc evalFunc, Tuple input)
>> > > throws IOException;
>> > >
>> > > public Schema outputSchema(Schema input);
>> > >
>> > > }
>> > >
>> > > There could be a built-in handler to output the skipped record (input:
>> > > tuple, funcname:chararray, errorMessage:chararray)
>> > >
>> > > A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT INTO A_ERRORS;
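>> > >
>> > > A sketch of what that built-in handler could look like (the class name
>> > > is hypothetical; TupleFactory, Schema, FieldSchema, DataType and
>> > > FrontendException are the usual Pig classes):
>> > >
>> > > public class SkippedRecord implements ErrorHandler<Tuple> {
>> > >
>> > >   public Tuple handle(IOException ioe, EvalFunc evalFunc, Tuple input)
>> > >   throws IOException {
>> > >     Tuple t = TupleFactory.getInstance().newTuple(3);
>> > >     t.set(0, input);
>> > >     t.set(1, evalFunc.getClass().getName());
>> > >     t.set(2, ioe.getMessage());
>> > >     return t;
>> > >   }
>> > >
>> > >   public Schema outputSchema(Schema input) {
>> > >     try {
>> > >       Schema s = new Schema();
>> > >       // the whole input tuple, the failing UDF, and the error message
>> > >       s.add(new Schema.FieldSchema("input", input, DataType.TUPLE));
>> > >       s.add(new Schema.FieldSchema("funcname", DataType.CHARARRAY));
>> > >       s.add(new Schema.FieldSchema("errorMessage", DataType.CHARARRAY));
>> > >       return s;
>> > >     } catch (FrontendException e) {
>> > >       throw new RuntimeException(e);
>> > >     }
>> > >   }
>> > > }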
>> > >
>> > > Julien
>> > >
>> > > On 1/16/11 12:22 AM, "Dmitriy Ryaboy" <dvrya...@gmail.com> wrote:
>> > >
>> > > I was thinking about this..
>> > >
>> > > We add an optional ON_ERROR clause to operators, which allows a user to
>> > > specify error handling. The error handler would be a udf that would
>> > > implement an interface along these lines:
>> > >
>> > > public interface ErrorHandler {
>> > >
>> > >  public void handle(IOException ioe, EvalFunc evalFunc, Tuple input)
>> > > throws IOException;
>> > >
>> > > }
>> > >
>> > > I think it makes sense not to make this a static method, so that users
>> > > could keep required state and, for example, have the handler throw its
>> > > own IOException if it's been invoked too many times.
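>> > >
>> > > For example, purely as an illustration (the hardcoded cap would really
>> > > be configurable):
>> > >
>> > > public class GiveUpEventually implements ErrorHandler {
>> > >
>> > >   private int errors = 0;
>> > >
>> > >   public void handle(IOException ioe, EvalFunc evalFunc, Tuple input)
>> > >   throws IOException {
>> > >     // swallow the first 1000 errors, then fail the task
>> > >     if (++errors > 1000) {
>> > >       throw new IOException("too many errors, giving up", ioe);
>> > >     }
>> > >   }
>> > > }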
>> > >
>> > > D
>> > >
>> > >
>> > > On Sat, Jan 15, 2011 at 11:53 PM, Santhosh Srinivasan <
>> s...@yahoo-inc.com
>> > > >wrote:
>> > >
>> > > > Thanks for the clarification Ashutosh.
>> > > >
>> > > > Implementing this in the user realm is tricky, as Dmitriy states.
>> > > > Sensitivity to error thresholds will require support from the system.
>> > > > We can probably provide a taxonomy of records (good, bad, incomplete,
>> > > > etc.) to let users classify each record. The system can then track
>> > > > counts of each record type to facilitate the computation of
>> > > > thresholds. The last part is to allow users to specify thresholds and
>> > > > appropriate actions (interrupt, exit, continue, etc.). A possible
>> > > > mechanism to realize this is the ErrorHandlingUDF described by
>> > > > Dmitriy.
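>> > > >
>> > > > A rough sketch of what the classification side could look like (all
>> > > > names here are made up):
>> > > >
>> > > > public enum RecordClass { GOOD, BAD, INCOMPLETE }
>> > > >
>> > > > public interface RecordClassifier {
>> > > >   // the system would bump a counter per RecordClass and check the
>> > > >   // user-specified thresholds against those counts
>> > > >   RecordClass classify(Tuple t);
>> > > > }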
>> > > >
>> > > > Santhosh
>> > > >
>> > > > -----Original Message-----
>> > > > From: Ashutosh Chauhan [mailto:hashut...@apache.org]
>> > > > Sent: Friday, January 14, 2011 7:35 PM
>> > > > To: u...@pig.apache.org
>> > > > Subject: Re: Exception Handling in Pig Scripts
>> > > >
>> > > > Santhosh,
>> > > >
>> > > > The way you are proposing it, it will kill the Pig script. I think
>> > > > what the user wants is to ignore a few "bad records" and to process
>> > > > the rest and get results. The problem here is how to let the user
>> > > > tell Pig the definition of a "bad record", and how to let them
>> > > > specify the threshold for the % of bad records at which Pig should
>> > > > fail the script.
>> > > >
>> > > > Ashutosh
>> > > >
>> > > > On Fri, Jan 14, 2011 at 18:18, Santhosh Srinivasan <
>> s...@yahoo-inc.com>
>> > > > wrote:
>> > > > > Sorry about the late response.
>> > > > >
>> > > > > Hadoop n00b is proposing a language extension for error handling,
>> > > > > similar to the mechanisms in other well-known languages like C++,
>> > > > > Java, etc.
>> > > > >
>> > > > > For now, can't the error semantics be handled by the UDF? For
>> > > > > exceptional scenarios you could throw an ExecException with the
>> > > > > right details. The physical operator that handles the execution of
>> > > > > UDFs traps it for you and propagates the error back to the client.
>> > > > > You can take a look at any of the builtin UDFs to see how Pig
>> > > > > handles this internally.
>> > > > >
>> > > > > Santhosh
>> > > > >
>> > > > > -----Original Message-----
>> > > > > From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
>> > > > > Sent: Tuesday, January 11, 2011 10:41 AM
>> > > > > To: u...@pig.apache.org
>> > > > > Subject: Re: Exception Handling in Pig Scripts
>> > > > >
>> > > > > Right now error handling is controlled by the UDFs themselves, and
>> > > > > there is no way to direct it externally.
>> > > > > You can make an ErrorHandlingUDF that would take a udf spec, invoke
>> > > > > it, trap errors, and then do the specified error-handling
>> > > > > behavior.. that's a bit ugly though.
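>> > > > >
>> > > > > Roughly like this, as a sketch (simplistic on purpose -- a real
>> > > > > version would resolve the udf spec the way Pig does internally,
>> > > > > not via a bare Class.forName):
>> > > > >
>> > > > > public class ErrorHandlingUDF extends EvalFunc<Object> {
>> > > > >
>> > > > >   private final EvalFunc wrapped;
>> > > > >
>> > > > >   public ErrorHandlingUDF(String udfClass) throws Exception {
>> > > > >     wrapped = (EvalFunc) Class.forName(udfClass).newInstance();
>> > > > >   }
>> > > > >
>> > > > >   public Object exec(Tuple input) throws IOException {
>> > > > >     try {
>> > > > >       return wrapped.exec(input);
>> > > > >     } catch (IOException e) {
>> > > > >       // the specified error-handling behavior would go here;
>> > > > >       // returning null just drops the value
>> > > > >       return null;
>> > > > >     }
>> > > > >   }
>> > > > > }
>> > > > >
>> > > > > used via something like DEFINE Wrapped
>> > > > > ErrorHandlingUDF('com.example.SomeUDF');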
>> > > > >
>> > > > > There is a problem with trapping general exceptions, of course, in
>> > > > > that if they happen 0.000001% of the time you can probably just
>> > > > > ignore them, but if they happen in half your dataset, you want the
>> > > > > job to tell you something is wrong. So this stuff gets non-trivial.
>> > > > > If anyone wants to propose a design to solve this general problem,
>> > > > > I think that would be a welcome addition.
>> > > > >
>> > > > > D
>> > > > >
>> > > > > On Tue, Jan 11, 2011 at 12:47 AM, hadoop n00b <new2h...@gmail.com>
>> > > > wrote:
>> > > > >
>> > > > >> Thanks. I sometimes get a date like 0001-01-01. This would be a
>> > > > >> valid date format, but when I try to get the seconds between this
>> > > > >> and another date, say 2011-01-01, I get an error that the value is
>> > > > >> too large to fit into an int, and the process stops. Do we have
>> > > > >> something like ifError(x-y, null, x-y)? Or would I have to
>> > > > >> implement this as a UDF?
>> > > > >>
>> > > > >> Thanks
>> > > > >>
>> > > > >> On Tue, Jan 11, 2011 at 11:40 AM, Dmitriy Ryaboy <
>> > dvrya...@gmail.com>
>> > > > >> wrote:
>> > > > >>
>> > > > >> > Create a UDF that verifies the format, and go through a
>> > > > >> > filtering step first.
>> > > > >> > If you would like to save the malformed records so you can look
>> > > > >> > at them later, you can use the SPLIT operator to route the good
>> > > > >> > records to your regular workflow, and the bad records to some
>> > > > >> > place on HDFS.
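>> > > > >> >
>> > > > >> > For example (IsValidDate stands in for your validating UDF):
>> > > > >> >
>> > > > >> > raw = LOAD 'dates' AS (name:chararray, d1:chararray, d2:chararray);
>> > > > >> > SPLIT raw INTO good IF IsValidDate(d1), bad IF NOT IsValidDate(d1);
>> > > > >> > STORE bad INTO 'bad_records';
>> > > > >> > -- continue the regular workflow with 'good'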
>> > > > >> >
>> > > > >> > -D
>> > > > >> >
>> > > > >> > On Mon, Jan 10, 2011 at 9:58 PM, hadoop n00b <
>> new2h...@gmail.com>
>> > > > wrote:
>> > > > >> >
>> > > > >> > > Hello,
>> > > > >> > >
>> > > > >> > > I have a Pig script that uses PiggyBank to calculate date
>> > > > >> > > differences. Sometimes, when I get a weird date or a wrong
>> > > > >> > > format in the input, the script throws an error and aborts.
>> > > > >> > >
>> > > > >> > > Is there a way I could trap these errors and move on without
>> > > > >> > > stopping the execution?
>> > > > >> > >
>> > > > >> > > Thanks
>> > > > >> > >
>> > > > >> > > PS: I'm using CDH2 with Pig 0.5
>> > > > >> > >
>> > > > >> >
>> > > > >>
>> > > > >
>> > > >
>> > >
>> > >
>> >
>> >
>>
>>
>
