[
https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239193#comment-13239193
]
Jonathan Coveney commented on PIG-2614:
---------------------------------------
Russell,
In Elephant-bird, there is a key
elephantbird.mapred.input.bad.record.threshold. For whatever reason I felt like
doing this, so find attached a patch that adds the functionality you want (note
that it includes PIG-2551, which is more or less good to go... only because
that patch brings in a Counter helper).
The default functionality does not change. On an error, it will die. However,
there are now two keys that can be set:
pig.piggybank.storage.avro.bad.record.threshold
pig.piggybank.storage.avro.bad.record.min
The former sets the acceptable error ratio threshold. The latter sets the
minimum number of errors that must occur before the job can error out.
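To make the interaction of the two keys concrete, here is a rough sketch of the kind of check I mean. It is not the code from the attached patch: the class name BadRecordTracker, its methods, and the exact comparisons (strict ">" on the ratio, "<" on the minimum) are assumptions for illustration only.

```java
// Hypothetical sketch, not the code from the attached patch.
final class BadRecordTracker {
    private final double threshold; // acceptable ratio of errors to records seen
    private final long minErrors;   // errors tolerated before the ratio is checked
    private long records = 0;
    private long errors = 0;

    BadRecordTracker(double threshold, long minErrors) {
        this.threshold = threshold;
        this.minErrors = minErrors;
    }

    /** Call once for every record the reader attempts to read. */
    void recordSeen() {
        records++;
    }

    /** Registers one bad record; throws once both limits are exceeded. */
    void errorSeen(Exception cause) {
        errors++;
        if (errors < minErrors) {
            return; // too few errors so far to fail the job
        }
        double ratio = (double) errors / Math.max(records, 1);
        if (ratio > threshold) {
            throw new RuntimeException(
                "Bad record ratio " + ratio + " exceeds threshold " + threshold, cause);
        }
    }

    long records() { return records; }

    long errors() { return errors; }
}
```

With both values at 0 the first bad record throws, which matches the unchanged default behavior. With a threshold of 0.5 and a minimum of 2, two bad records out of ten are tolerated under these assumed comparisons.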
Here is where you come in:
Currently, the only error I log is on "reader.next()." Are there any other
cases where errors (at least, errors indicating a bad row) can be thrown? And
on an error, what do you want to happen? Skip the row, or return null? It seems
to make sense to me to skip the record (also, the number of records processed
and the number of errors thrown is logged in a Hadoop counter now).
Secondly, someone needs to make tests. It currently passes the tests, but
that's because the default threshold and min are 0. I don't know what is and
isn't a bad Avro file, though, so yeah. Hopefully the fact that I did the work
implementing will motivate someone to add tests ;)
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
> Key: PIG-2614
> URL: https://issues.apache.org/jira/browse/PIG-2614
> Project: Pig
> Issue Type: Bug
> Components: piggybank
> Affects Versions: 0.10, 0.11
> Reporter: Russell Jurney
> Priority: Blocker
> Labels: avro, avrostorage, bad, book, cutting, doug, for, my,
> pig, sadism
> Fix For: 0.10, 0.11
>
> Attachments: PIG-2614_0.patch
>
>
> AvroStorage dies when a single bad record exists, such as one with missing
> fields. This is very bad on 'big data,' where bad records are inevitable.
> See discussion at
> http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss
> for more theory.