[ https://issues.apache.org/jira/browse/PIG-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542691#comment-13542691 ]

Dmitriy V. Ryaboy commented on PIG-3059:
----------------------------------------

I agree with the principle that inspired this patch, but the solution seems to 
fall short of ideal.

Dealing in splits is misleading and hard to reason about:

* Good records read from a split that contains a bad record still get 
processed, so it's not really the case that a "bad split" is ignored, or that 
you are controlling how many bad splits to ignore.
* A single bad record stops the whole *rest* of the split from being 
processed, whether or not your loader could recover. This is unnecessary data 
loss.
* Most users of Pig have no idea (or should have no idea) what a split is.
* Pig combines splits -- but this patch deals with pre-combination splits. 
Especially when combining small but unequal files, splits can differ wildly in 
size: one may contain 100 records while another contains 100,000.

All of this means that no matter what the user sets these values to, they have 
no real idea what error threshold they are telling Pig to tolerate: allowing, 
say, two bad splits could mean dropping anywhere from a handful of records to 
a couple hundred thousand.

I think the Elephant-Bird way of dealing with errors -- a minimal threshold of 
*record* errors plus a percentage of total *records* read -- is quite robust 
and easy to explain. If Avro can't recover from a bad record in a single 
split, it can do whatever is appropriate for Avro -- estimate how many records 
it's dropping and throw that many exceptions, or just pretend that this one 
error was all that was left in the split, or maybe fix the format so that it 
can recover properly (ok, that was a troll comment :)).
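
For concreteness, here is a minimal sketch of that record-level scheme, wired 
to the two thresholds proposed in this issue's description. The class and 
method names are illustrative only -- not Elephant-Bird's or Pig's actual API:

public class BadRecordTracker {
    // Defaults could come from pig.storage.bad.record.min and
    // pig.storage.bad.record.threshold (per the issue description).
    private final long minErrors;
    private final double maxErrorRate;
    private long recordsRead = 0;
    private long errors = 0;

    public BadRecordTracker(long minErrors, double maxErrorRate) {
        this.minErrors = minErrors;
        this.maxErrorRate = maxErrorRate;
    }

    // Call once per record the loader attempts to read.
    public void incRecords() {
        recordsRead++;
    }

    // Call when a record fails to parse; fails the task only once BOTH
    // thresholds are exceeded, so a single bad record in a billion never
    // kills the job, but a loader reading mostly garbage still does.
    public void incErrors(Throwable cause) {
        errors++;
        if (errors > minErrors && errors > maxErrorRate * recordsRead) {
            throw new RuntimeException(
                "Too many bad records: " + errors + " of " + recordsRead, cause);
        }
    }
}

Requiring both conditions is what makes it easy to explain: the absolute floor 
keeps a handful of early errors from killing a small job, while the rate caps 
the fraction of records that can be silently dropped on a large one.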

                
> Global configurable minimum 'bad record' thresholds
> ---------------------------------------------------
>
>                 Key: PIG-3059
>                 URL: https://issues.apache.org/jira/browse/PIG-3059
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.11
>            Reporter: Russell Jurney
>            Assignee: Cheolsoo Park
>             Fix For: 0.12
>
>         Attachments: avro_test_files-2.tar.gz, PIG-3059-2.patch, 
> PIG-3059.patch
>
>
> See PIG-2614. 
> Pig dies when one record in a LOAD of a billion records fails to parse. This 
> is almost certainly not the desired behavior. elephant-bird and some other 
> storage UDFs have minimum thresholds in terms of percent and count that must 
> be exceeded before a job will fail outright.
> We need these limits to be configurable for Pig, globally. I've come to 
> realize what a major problem Pig's crashing on bad records is for new Pig 
> users. I believe this feature can greatly improve Pig.
> An example of a config would look like:
> pig.storage.bad.record.threshold=0.01
> pig.storage.bad.record.min=100
> A thorough discussion of this issue is available here: 
> http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss

