That's a nice approach. It fits my
Unix-solved-everything-but-needs-syntactic-sugar world-view :-) (e.g. if we had
a 1| and 2| syntax, this would be:
0<./FOO "1| ./bar > A" "2| ./MyHandler > B"
:-)
- milind
On Jan 18, 2011, at 10:27 AM, Julien Le Dem wrote:
> That would be nice.
> Also letting the error handler output the result to a relation would be
> useful.
> (To let the script output application error metrics)
> For example it could (optionally) use the keyword INTO just like the SPLIT
> operator.
>
> FOO = LOAD ...;
> A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT MyHandler INTO A_ERRORS;
>
> ErrorHandler would look a little more like EvalFunc:
>
> public interface ErrorHandler<T> {
>
> public T handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws
> IOException;
>
> public Schema outputSchema(Schema input);
>
> }
>
> There could be a built-in handler to output the skipped record (input: tuple,
> funcname:chararray, errorMessage:chararray)
>
> A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT INTO A_ERRORS;
>
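> A minimal sketch of the built-in "output the skipped record" handler
> described above. This is illustrative, not a Pig API: Pig's Tuple is stood
> in for by List<Object> so the sketch compiles standalone, and the handler
> name is hypothetical.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Sketch of a built-in handler that, instead of failing, emits the skipped
// record as (input: tuple, funcname: chararray, errorMessage: chararray),
// matching the schema proposed in the mail above. List<Object> stands in
// for Pig's Tuple; a real version would use TupleFactory.
class SkippedRecordHandler {
    // Mirrors the proposed ErrorHandler<T>.handle, returning the error row.
    public List<Object> handle(IOException ioe, String funcName, List<Object> input) {
        return Arrays.asList(input, funcName, ioe.getMessage());
    }
}
```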
> Julien
>
> On 1/16/11 12:22 AM, "Dmitriy Ryaboy" <[email protected]> wrote:
>
> I was thinking about this...
>
> We add an optional ON_ERROR clause to operators, which allows a user to
> specify error handling. The error handler would be a udf that would
> implement an interface along these lines:
>
> public interface ErrorHandler {
>
> public void handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws
> IOException;
>
> }
>
> I think it makes sense not to make this a static method, so that users
> could keep required state and, for example, have the handler throw its own
> IOException if it's been invoked too many times.
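> A sketch of that stateful idea, assuming the proposed interface: the
> handler swallows errors until a threshold, then rethrows to fail the job.
> Pig's EvalFunc and Tuple are replaced by Object placeholders so the sketch
> is self-contained; none of this is an existing Pig API.

```java
import java.io.IOException;

// Hypothetical interface from this thread, with Pig's EvalFunc and Tuple
// replaced by Object so the sketch compiles standalone.
interface ErrorHandler {
    void handle(IOException ioe, Object evalFunc, Object input) throws IOException;
}

// Keeps a count of invocations and rethrows once the error budget is
// exhausted -- the "invoked too many times" behavior described above.
class CountingErrorHandler implements ErrorHandler {
    private final int maxErrors;
    private int errorCount = 0;

    CountingErrorHandler(int maxErrors) { this.maxErrors = maxErrors; }

    @Override
    public void handle(IOException ioe, Object evalFunc, Object input) throws IOException {
        errorCount++;
        if (errorCount > maxErrors) {
            throw new IOException("error threshold (" + maxErrors + ") exceeded", ioe);
        }
        // Below the threshold: swallow the error so processing continues.
    }
}
```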
>
> D
>
>
> On Sat, Jan 15, 2011 at 11:53 PM, Santhosh Srinivasan
> <[email protected]>wrote:
>
>> Thanks for the clarification Ashutosh.
>>
>> Implementing this in the user realm is tricky, as Dmitriy states.
>> Sensitivity to error thresholds will require support from the system. We can
>> probably provide a taxonomy of records (good, bad, incomplete, etc.) to let
>> users classify each record. The system can then track counts of each record
>> type to facilitate the computation of thresholds. The last part is to allow
>> users to specify thresholds and appropriate actions (interrupt, exit,
>> continue, etc.). A possible mechanism to realize this is the
>> ErrorHandlingUDF described by Dmitriy.
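>> The bookkeeping behind that taxonomy-plus-threshold idea could look
>> roughly like this. The category names and the threshold policy are
>> illustrative only, not part of Pig.

```java
import java.util.EnumMap;
import java.util.Map;

// Sketch of per-category record counting with a failure threshold, as
// described above: classify each record, track counts, and decide when
// the bad-record fraction is high enough that the script should fail.
class RecordStats {
    enum Category { GOOD, BAD, INCOMPLETE }

    private final Map<Category, Long> counts = new EnumMap<>(Category.class);

    void record(Category c) {
        counts.merge(c, 1L, Long::sum);
    }

    long total() {
        return counts.values().stream().mapToLong(Long::longValue).sum();
    }

    // True when the fraction of BAD records exceeds the given threshold,
    // i.e. the point at which the system should abort rather than continue.
    boolean exceeds(double badFraction) {
        long t = total();
        return t > 0 && (double) counts.getOrDefault(Category.BAD, 0L) / t > badFraction;
    }
}
```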
>>
>> Santhosh
>>
>> -----Original Message-----
>> From: Ashutosh Chauhan [mailto:[email protected]]
>> Sent: Friday, January 14, 2011 7:35 PM
>> To: [email protected]
>> Subject: Re: Exception Handling in Pig Scripts
>>
>> Santhosh,
>>
>> The way you are proposing it, the error will kill the Pig script. I think
>> what the user wants is to ignore a few "bad records" and to process the
>> rest and get results. The problem here is how to let the user tell Pig the
>> definition of a "bad record", and how to let them specify the threshold of
>> bad records at which Pig should fail the script.
>>
>> Ashutosh
>>
>> On Fri, Jan 14, 2011 at 18:18, Santhosh Srinivasan <[email protected]>
>> wrote:
>>> Sorry about the late response.
>>>
>>> Hadoop n00b is proposing a language extension for error handling, similar
>> to the mechanisms in other well-known languages like C++, Java, etc.
>>>
>>> For now, can't the error semantics be handled by the UDF? For exceptional
>> scenarios you could throw an ExecException with the right details. The
>> physical operator that handles the execution of UDFs traps it for you and
>> propagates the error back to the client. You can take a look at any of the
>> builtin UDFs to see how Pig handles it internally.
>>>
>>> Santhosh
>>>
>>> -----Original Message-----
>>> From: Dmitriy Ryaboy [mailto:[email protected]]
>>> Sent: Tuesday, January 11, 2011 10:41 AM
>>> To: [email protected]
>>> Subject: Re: Exception Handling in Pig Scripts
>>>
>>> Right now error handling is controlled by the UDFs themselves, and there
>> is no way to direct it externally.
>>> You can make an ErrorHandlingUDF that would take a UDF spec, invoke it,
>> trap errors, and then do the specified error-handling behavior... that's a
>> bit ugly though.
>>>
>>> There is a problem with trapping general exceptions of course, in that if
>> they happen 0.000001% of the time you can probably just ignore them, but if
>> they happen in half your dataset, you want the job to tell you something is
>> wrong. So this stuff gets non-trivial. If anyone wants to propose a design
>> to solve this general problem, I think that would be a welcome addition.
>>>
>>> D
>>>
>>> On Tue, Jan 11, 2011 at 12:47 AM, hadoop n00b <[email protected]>
>> wrote:
>>>
>>>> Thanks, I sometimes get a date like 0001-01-01. This would be a valid
>>>> date format, but when I try to get the seconds between this and
>>>> another date, say 2011-01-01, I get an error that the value is too
>>>> large to fit into an int and the process stops. Do we have something
>>>> like ifError(x-y, null, x-y)? Or would I have to implement this as a
>>>> UDF?
>>>>
>>>> Thanks
>>>>
>>>> On Tue, Jan 11, 2011 at 11:40 AM, Dmitriy Ryaboy <[email protected]>
>>>> wrote:
>>>>
>>>>> Create a UDF that verifies the format, and go through a filtering
>>>>> step first.
>>>>> If you would like to save the malformed records so you can look
>>>>> at them later, you can use the SPLIT operator to route the good
>>>>> records to your regular workflow, and the bad records to some place
>>>>> on HDFS.
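>>>>> The core of such a validation UDF could be sketched like this, assuming
>>>>> yyyy-MM-dd input. DateValidator is a hypothetical name; a real Pig
>>>>> filter UDF would wrap this logic in a FilterFunc.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

// Strict date-format check: the validation logic a filter UDF would wrap
// to drop malformed dates before the date-difference calculation.
class DateValidator {
    static boolean isValid(String s) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
        fmt.setLenient(false);  // reject impossible dates like 2011-13-01
        try {
            fmt.parse(s);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }
}
```

Note that 0001-01-01 is a well-formed date and would pass this check, so the overflow case mentioned earlier in the thread would also need a range check, not just a format check.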
>>>>>
>>>>> -D
>>>>>
>>>>> On Mon, Jan 10, 2011 at 9:58 PM, hadoop n00b <[email protected]>
>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have a Pig script that uses PiggyBank to calculate date differences.
>>>>>> Sometimes, when I get a weird date or wrong format in the input, the
>>>>>> script throws an error and aborts.
>>>>>>
>>>>>> Is there a way I could trap these errors and move on without stopping
>>>>>> the execution?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> PS: I'm using CDH2 with Pig 0.5
>>>>>>
>>>>>
>>>>
>>>
>>
>
---
Milind Bhandarkar
[email protected]