[ 
https://issues.apache.org/jira/browse/PIG-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13540637#comment-13540637
 ] 

Cheolsoo Park commented on PIG-3059:
------------------------------------

I think it depends on the file format. For Avro, one case that we should 
handle is when a sync() call throws an exception. In that case, we can't find 
the next position from which to resume the read. Given that we're implementing 
this logic in PigRecordReader (a wrapper class for the underlying record 
readers), I don't think that skipping records rather than splits is always 
possible. Please correct me if I am wrong.
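
To make the distinction concrete, here is a rough sketch of a wrapper that
skips a single record when the underlying reader can keep going, and gives up
on the rest of the split when it cannot. Everything in it is hypothetical:
SimpleReader, CannotResyncException and TolerantReader are stand-ins invented
for illustration, not the actual PigRecordReader, Hadoop or Avro classes.

```java
import java.io.IOException;

/** Hypothetical stand-in for an underlying record reader; not a Hadoop or Pig type. */
interface SimpleReader {
    /** Returns the next record, or null at the end of the split. */
    String next() throws IOException;
}

/** Hypothetical signal that the reader could not find a position to resume from. */
class CannotResyncException extends IOException {
    CannotResyncException(String message) {
        super(message);
    }
}

/** Wrapper that skips bad records when possible and abandons the split otherwise. */
class TolerantReader {
    private final SimpleReader wrapped;
    long badRecords = 0;            // records skipped one at a time
    boolean splitAbandoned = false; // true once resuming became impossible

    TolerantReader(SimpleReader wrapped) {
        this.wrapped = wrapped;
    }

    /** Returns the next good record, or null at end of split or after giving up. */
    String next() {
        while (!splitAbandoned) {
            try {
                return wrapped.next();
            } catch (CannotResyncException e) {
                // No known position to resume from: the rest of the split is lost.
                splitAbandoned = true;
            } catch (IOException e) {
                // Assumed recoverable: the reader has already moved past the bad record.
                badRecords++;
            }
        }
        return null;
    }
}
```

The wrapper can only skip per record if the underlying reader reports which
failures are recoverable; otherwise it has to treat every failure as the
second case. A real version would also need a cap on skipped records so that
a reader that fails on every call does not loop forever.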

Thanks!
                
> Global configurable minimum 'bad record' thresholds
> ---------------------------------------------------
>
>                 Key: PIG-3059
>                 URL: https://issues.apache.org/jira/browse/PIG-3059
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.11
>            Reporter: Russell Jurney
>            Assignee: Cheolsoo Park
>             Fix For: 0.12
>
>         Attachments: PIG-3059.patch, test_avro_files.tar.gz
>
>
> See PIG-2614. 
> Pig dies when one record in a LOAD of a billion records fails to parse. This 
> is almost certainly not the desired behavior. elephant-bird and some other 
> storage UDFs have minimum thresholds in terms of percent and count that must 
> be exceeded before a job will fail outright.
> We need these limits to be configurable for Pig, globally. I've come to 
> realize what a major problem Pig's crashing on bad records is for new Pig 
> users. I believe this feature can greatly improve Pig.
> An example of a config would look like:
> pig.storage.bad.record.threshold=0.01
> pig.storage.bad.record.min=100
> A thorough discussion of this issue is available here: 
> http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss
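
One possible reading of the two settings quoted above, sketched in Java. The
semantics are an assumption, not something the issue specifies: the job fails
only when the bad-record count exceeds the minimum AND the bad-record
fraction exceeds the threshold. BadRecordPolicy is a hypothetical name, not
an existing Pig class.

```java
/** Hypothetical policy object; the field comments name the config keys from the issue. */
final class BadRecordPolicy {
    private final double threshold; // pig.storage.bad.record.threshold, e.g. 0.01
    private final long minBad;      // pig.storage.bad.record.min, e.g. 100

    BadRecordPolicy(double threshold, long minBad) {
        this.threshold = threshold;
        this.minBad = minBad;
    }

    /**
     * Assumed rule: fail only when both limits are exceeded, so a few bad
     * records in a small input do not kill the job on the ratio alone.
     */
    boolean shouldFail(long badRecords, long totalRecords) {
        if (totalRecords <= 0 || badRecords <= minBad) {
            return false;
        }
        return (double) badRecords / totalRecords > threshold;
    }
}
```

With threshold=0.01 and min=100, 50 bad records out of 100 would not fail the
job under this rule (count below the minimum), while 101 bad records out of
1,000 would (count above 100 and ratio above 1%).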

