Tom,

Agreed, this is a third-party reader operating on a custom data format, neither of which I control. The error is happening in the reader, and I'm trying to isolate the issue in order to do proper handling.
Thanks!
Justin

On Thu, Oct 13, 2011 at 5:31 PM, Tom White <t...@cloudera.com> wrote:
> Justin,
>
> The skipping feature should really only be used when you are calling
> out to a third-party library that may segfault on corrupt data, and
> even then it's probably better to use a subprocess to handle it, as
> Owen suggested here:
> http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201108.mbox/%3ccafqou9ekv+sbvav-bsf5dorjo68vsj6ztqxywwut+qhs3v3...@mail.gmail.com%3e
>
> In other cases you should handle the corrupt data in your mapper or
> reducer, for example by catching the relevant exception.
>
> Tom
>
> On Thu, Oct 13, 2011 at 5:41 AM, Justin Woody <justin.wo...@gmail.com> wrote:
>> Harsh,
>>
>> Thanks for the info. If I get some time, maybe I can assist; I'm
>> looking over your code now. For now I am failing the files with the
>> mapred.max.map.failures.percent property, but I'm losing a lot of good
>> data going that route.
>>
>> Justin
>>
>> On Wed, Oct 12, 2011 at 4:27 PM, Harsh J <ha...@cloudera.com> wrote:
>>> Justin,
>>>
>>> Unfortunately not. The new API does not yet have a skipping feature
>>> like the older one.
>>>
>>> I did get started on some work on
>>> https://issues.apache.org/jira/browse/MAPREDUCE-1932 to fix this, but I
>>> haven't been able to find time to complete it with proper tests and
>>> such. I'll try to do it within a week from now.
>>>
>>> On Wed, Oct 12, 2011 at 10:06 PM, Justin Woody <justin.wo...@gmail.com> wrote:
>>>> Can anyone confirm whether the skip options work for MR jobs using the
>>>> new API? I have a job using the new API and I cannot get the job to
>>>> skip corrupted records. I tried configuring job properties manually
>>>> and using the SkipBadRecords class.
>>>>
>>>> Thanks,
>>>> Justin
>>>
>>> --
>>> Harsh J
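Tom's suggestion (catch the relevant exception around the failing reader, rather than rely on the record-skipping feature) can be sketched roughly as below. This is a minimal stand-alone sketch, not Justin's actual job: the parser, the exception type, and the skip counter are all stand-ins, since the third-party reader isn't shown in the thread. In a real job the try/catch loop body would live inside Mapper.map(), the emit would be context.write(), and the skip count would be a Hadoop Counter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkipCorruptRecords {

    // Stand-in for the third-party reader's parse call (hypothetical);
    // the real reader and its exception type would come from the library.
    static int parseRecord(String raw) {
        if (raw == null || raw.isEmpty()) {
            throw new IllegalArgumentException("corrupt record");
        }
        return raw.length();
    }

    static long skipped = 0;  // stand-in for a Hadoop Counter

    // Mirrors the body of map(): emit good records, count and skip bad ones.
    static List<Integer> processAll(List<String> records) {
        List<Integer> out = new ArrayList<>();
        for (String raw : records) {
            try {
                // context.write(key, value) in a real mapper
                out.add(parseRecord(raw));
            } catch (IllegalArgumentException e) {
                // context.getCounter(...).increment(1) in a real mapper;
                // corrupt records are dropped instead of failing the task
                skipped++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> good = processAll(Arrays.asList("abc", "", "de"));
        System.out.println(good + " skipped=" + skipped);
    }
}
```

Unlike mapred.max.map.failures.percent, which fails whole map tasks and loses every good record in them, this keeps the good records in a partially corrupt file and only drops the individual records the reader rejects.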