There is a certain class of errors (out-of-memory types) that cannot be handled
within Hive. For such cases, handling them in Hadoop would make sense. The other
case is handling errors in user scripts. That is especially tricky, and we would
need to borrow Hadoop's retry mechanisms to handle it.

However, out-of-memory exceptions are rare, and from what we have seen, when
they do happen it is usually not possible to fix them by retrying (for example,
joins end up consuming too much memory). We have a controlled execution engine.
If the deserializers don't barf on the input (which is also possible; sometimes
a deserializer will try to allocate a large string and die), then the execution
engine should not hit anything other than regular exceptions.

So most errors are regular exceptions that can be caught, with the bad rows
either ignored and reported in a job counter, or the job failed, whichever the
user requests. We should do this as a first step.
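A minimal sketch of what that first step could look like inside a map-side
operator (the class and counter names are made up for illustration, not actual
Hive code):

  import java.io.IOException;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.mapred.Reporter;

  // Sketch only: catch per-row exceptions, count them, and either skip the row
  // or fail the job depending on a configured limit.
  public class TolerantRowProcessor {
    private static final String COUNTER_GROUP = "HiveErrors";        // hypothetical
    private static final String COUNTER_NAME = "DESERIALIZE_ERRORS"; // hypothetical

    private final long maxBadRows; // negative means never fail because of bad rows
    private long badRows = 0;

    public TolerantRowProcessor(long maxBadRows) {
      this.maxBadRows = maxBadRows;
    }

    public void process(Writable row, Reporter reporter) throws IOException {
      try {
        forward(deserialize(row)); // normal path
      } catch (Exception e) {      // regular per-row exceptions only, not OOM
        badRows++;
        reporter.incrCounter(COUNTER_GROUP, COUNTER_NAME, 1);
        if (maxBadRows >= 0 && badRows > maxBadRows) {
          throw new IOException("Too many bad rows: " + badRows);
        }
        // otherwise drop the row and keep going
      }
    }

    private Object deserialize(Writable row) throws Exception { return row; } // stand-in
    private void forward(Object parsed) { /* hand off to the next operator */ }
  }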

________________________________
From: Qing Yan [mailto:qing...@gmail.com]
Sent: Thursday, February 19, 2009 7:34 PM
To: hive-user@hadoop.apache.org
Subject: Re: Error input handling in Hive

Hi Zheng,

I have opened a JIRA (HIVE-295).

IMHO there are three ways errors can be handled:

1) Always fail. One bad record and the whole job fails, which is the current
Hive behavior.

2) Always succeed. Ignore bad records (saving them somewhere to allow further
analysis) and the job still succeeds.

3) Succeed with conditions. Something in the middle ground, as you described.

What can be done is to make this configurable and let the user decide which
setting is appropriate for his application.
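For example (the property names below are purely hypothetical, not existing
Hive settings; they are only meant to show the shape of the knob):

  import org.apache.hadoop.mapred.JobConf;

  public class BadRecordModeSketch {
    public static void main(String[] args) {
      JobConf conf = new JobConf();
      // Hypothetical keys: "fail" = option 1, "skip" = option 2, "threshold" = option 3.
      conf.set("hive.exec.bad.record.mode", "threshold");
      conf.setLong("hive.exec.bad.record.max", 1000);            // absolute cap for option 3
      conf.setFloat("hive.exec.bad.record.max.percent", 0.001f); // or a fraction of the input
      conf.set("hive.exec.bad.record.path", "/tmp/bad_records"); // where skipped rows are saved
    }
  }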

In practice I would imagine 2) will be the most common case (e.g. a 0.1% error
rate). BTW, just curious: since you guys already use Hive in production, how do
you guarantee the input is 100% clean, given that Hive doesn't do any checking
by itself?

One thing I wasn't sure about is whether the error handling logic belongs in
the Hive layer or the Hadoop layer.

Hadoop 0.19 already supports 2)
http://hadoop.apache.org/core/docs/r0.19.0/mapred_tutorial.html#Skipping+Bad+Records
and may support 3) in the future.
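For reference, this is roughly what those Hadoop 0.19 calls look like on a
plain JobConf (Hive would have to plumb them through; the path and numbers are
arbitrary):

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.SkipBadRecords;

  public class SkipBadRecordsSketch {
    public static void main(String[] args) {
      JobConf conf = new JobConf();
      SkipBadRecords.setMapperMaxSkipRecords(conf, 100);   // acceptable skip window around a bad record
      SkipBadRecords.setAttemptsToStartSkipping(conf, 2);  // start skipping after this many task failures
      SkipBadRecords.setSkipOutputPath(conf, new Path("/tmp/skipped")); // keep skipped records for analysis
    }
  }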

So the black-box way would be for Hive to just expose those API calls, or, as a
more general approach, to allow the user to add an "aspect" to the JobConf
object. Is this allowed in the Hive design?


Regards,

Qing

On Thu, Feb 19, 2009 at 5:59 PM, Zheng Shao <zsh...@gmail.com> wrote:
Hi Qing,

That's a good idea. Can you open a JIRA?
There are lots of details to work out before we can add that feature to Hive.
For example, how should the largest amount of acceptable data corruption be
specified: as an absolute number or as a percentage? And what about
half-corrupted records, in case we only need the non-corrupted part in the
query?
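For what it's worth, the acceptance check itself is simple once the counters
exist; for example (a sketch, with hypothetical user-configurable thresholds):

  // Fail the job only if the corrupt-record count exceeds either an absolute
  // cap or a fraction of the input.
  public class CorruptionTolerance {
    public static boolean withinTolerance(long badRecords, long totalRecords,
                                          long maxBadAbsolute, double maxBadFraction) {
      if (badRecords > maxBadAbsolute) {
        return false;
      }
      return totalRecords == 0
          || (double) badRecords / totalRecords <= maxBadFraction;
    }
  }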


Zheng



On 2/19/09, Qing Yan <qing...@gmail.com> wrote:
> Say I have some bad/ill-formatted records in the input, is there a way to
> configure the default Hive parser to discard those records directly (e.g.
> when an integer column gets a string)?
>
> Besides, is the new skip-bad-records feature in 0.19 accessible from Hive?
> It is a quite handy feature in the real world.
>
> What I see so far is that the Hive parser throws an exception and ultimately
> causes the whole job to fail.
>
> Thanks for the help!
>
> Qing
>
--
Sent from Gmail for mobile | mobile.google.com

Yours,
Zheng
