There is a certain class of errors (out-of-memory types) that cannot be handled within Hive. For such cases, doing it in Hadoop would make sense. The other case is handling errors in user scripts. This is especially tricky, and we would need to borrow/use Hadoop's retry techniques for it.
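The Hadoop-side retry machinery in question is just the standard task-attempt settings. A minimal sketch of where those knobs live (the class name and values below are illustrative placeholders, not a recommendation):

import org.apache.hadoop.mapred.JobConf;

// Illustrative only: the standard Hadoop task-retry settings that a failed
// user-script (TRANSFORM) task would fall back on. Values are placeholders.
public class RetrySettings {
  public static void configure(JobConf conf) {
    // Re-run a failed map/reduce task up to 4 attempts before giving up on it
    conf.setMaxMapAttempts(4);
    conf.setMaxReduceAttempts(4);
    // Optionally tolerate a small percentage of permanently failed map tasks
    // without failing the whole job
    conf.setMaxMapTaskFailuresPercent(1);
  }
}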
However, out-of-memory exceptions are rare, and from what we have seen, when they do happen it's not possible to fix them by retrying (for example, joins end up consuming too much memory). We have a controlled execution engine. If the deserializers don't barf on the input (which is also possible - sometimes a deserializer will try to allocate a large string and die), then the execution engine should not get errors other than regular exceptions. So most errors are regular exceptions that can be caught and then either ignored and reported in a job counter, or used to fail the job, as the user requests. We should do this as a first step. (A rough sketch of the 0.19 skipping hooks discussed below is appended after the quoted thread.)

________________________________
From: Qing Yan [mailto:qing...@gmail.com]
Sent: Thursday, February 19, 2009 7:34 PM
To: hive-user@hadoop.apache.org
Subject: Re: Error input handling in Hive

Hi Zheng,

I have opened a Jira (HIVE-295). IMHO there are three ways errors can be handled:

1) Always fail. One bad record and the whole job fails, which is the current Hive behavior.
2) Always succeed. Ignore bad records (saving them somewhere to allow further analysis) and the job still succeeds.
3) Succeed with conditions. Something in the middle ground, as you described.

What can be done is to make this configurable and let the user decide which setting is appropriate for his application. In practice I would imagine 2) will be the most common case (e.g. a 0.1% error rate).

BTW, just curious: since you guys already use Hive in production, how do you guarantee the input is 100% clean, given that Hive doesn't do any checking by itself?

One thing I wasn't sure about is whether the error handling logic belongs in the Hive layer or the Hadoop layer. Hadoop 0.19 already supports 2):
http://hadoop.apache.org/core/docs/r0.19.0/mapred_tutorial.html#Skipping+Bad+Records
and may support 3) in the future. So the black-box way is for Hive to just expose those API calls, or, as a more general approach, to allow the user to add "aspects" to the JobConf object. Is this allowed in the Hive design?

Regards,
Qing

On Thu, Feb 19, 2009 at 5:59 PM, Zheng Shao <zsh...@gmail.com> wrote:

Hi Qing,

That's a good idea. Can you open a jira?

There are lots of details before we can add that feature to Hive. For example, how to specify the largest amount of data corruption that can be accepted - by absolute number or by percentage, etc. What about half-corrupted records, in case we only need the non-corrupted part in the query, etc.

Zheng

On 2/19/09, Qing Yan <qing...@gmail.com> wrote:
> Say I have some bad/ill-formatted records in the input, is there a way to
> configure the default Hive parser to discard those records directly (e.g.
> when an integer column gets a string)?
>
> Besides, is the new skip-bad-records feature in 0.19 accessible in Hive?
> It is a quite handy feature in the real world.
>
> What I see so far is that the Hive parser throws an exception and causes
> the whole job to fail ultimately.
>
> Thanks for the help!
>
> Qing

--
Sent from Gmail for mobile | mobile.google.com

Yours,
Zheng
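For reference, the 0.19 "Skipping Bad Records" feature linked above is driven through the org.apache.hadoop.mapred.SkipBadRecords helper. A minimal sketch of the calls Hive would have to expose or pass through (the thresholds and output path are placeholder values):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

// Illustrative only: wiring up Hadoop 0.19's skip-bad-records support on a
// JobConf. All numbers and paths below are placeholders.
public class SkipBadRecordsSetup {
  public static void configure(JobConf conf) {
    // Enter skipping mode only after a task attempt has already failed twice
    SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
    // Acceptable number of records/groups skipped around each bad record;
    // the framework retries to narrow the skipped range down to this size
    SkipBadRecords.setMapperMaxSkipRecords(conf, 1000L);
    SkipBadRecords.setReducerMaxSkipGroups(conf, 1000L);
    // Keep the skipped records on HDFS so they can be inspected later
    SkipBadRecords.setSkipOutputPath(conf, new Path("/tmp/hive_skipped_records"));
  }
}

Whether Hive should surface these as its own configuration properties or simply let users set them on the JobConf is essentially the layering question Qing raises above.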