[ https://issues.apache.org/jira/browse/PIG-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13541440#comment-13541440 ]

Russell Jurney commented on PIG-3059:
-------------------------------------

Is it also possible to provide an optional interface that counts bad records?
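
Something like this is what I have in mind. This is only a sketch; the 
interface and method names are hypothetical and not part of Pig's API today:

{noformat}
// Hypothetical sketch only: an optional interface a loader could implement so
// that Pig can collect bad-record counts alongside bad-split counts.
public interface ErrorCountingLoader {
  /** Number of records that failed to parse and were skipped so far. */
  long getBadRecordCount();

  /** Total number of records attempted so far, for computing an error rate. */
  long getRecordCount();
}
{noformat}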

I don't yet understand the issues around resuming after a bad record. I believe 
some readers can handle one bad record and continue; I think elephant-bird 
works this way with some formats. If so, the reporting should, where possible, 
include bad record counts as well as bad inputsplits. Not sure if that makes 
sense, but check this out: 
https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/input/LzoRecordReader.java

{noformat}    
    void incErrors(Throwable cause) {
      numErrors++;
      if (numErrors > numRecords) {
        // incorrect use of this class
        throw new RuntimeException("Forgot to invoke incRecords()?");
      }

      if (cause == null) {
        cause = new Exception("Unknown error");
      }

      if (errorThreshold <= 0) { // no errors are tolerated
        throw new RuntimeException("error while reading input records", cause);
      }

      LOG.warn("Error while reading an input record ("
          + numErrors + " out of " + numRecords + " so far ): ", cause);

      double errRate = numErrors/(double)numRecords;

      // will always excuse the first error. We can decide if single
      // error crosses threshold inside close() if we want to.
      if (numErrors >= minErrors  && errRate > errorThreshold) {
        LOG.error(numErrors + " out of " + numRecords
            + " crosses configured threshold (" + errorThreshold + ")");
        throw new RuntimeException("error rate while reading input records 
crossed threshold", cause);
      }
    }
{noformat}
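
For what it's worth, here is a rough sketch of how the global properties 
proposed in this issue (pig.storage.bad.record.threshold and 
pig.storage.bad.record.min) could drive the same kind of check. The class and 
method names are made up for illustration, not Pig's actual implementation:

{noformat}
import org.apache.hadoop.conf.Configuration;

// Sketch only: read the proposed global properties and apply the same style
// of check as incErrors() above.
public class BadRecordPolicy {
  private final double errorThreshold; // max tolerated error rate, e.g. 0.01
  private final long minErrors;        // errors excused before the rate check applies

  public BadRecordPolicy(Configuration conf) {
    this.errorThreshold = conf.getFloat("pig.storage.bad.record.threshold", 0.0f);
    this.minErrors = conf.getLong("pig.storage.bad.record.min", 0L);
  }

  /** Returns true if the job should fail given the counts seen so far. */
  public boolean shouldFail(long numErrors, long numRecords) {
    if (errorThreshold <= 0) {
      return numErrors > 0; // no errors tolerated
    }
    double errRate = numErrors / (double) numRecords;
    return numErrors >= minErrors && errRate > errorThreshold;
  }
}
{noformat}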

Also, when we report a bad split, is it possible to say how large that split 
is? If we could report both the total input size and the size of the 
inputsplits lost, the user gets better context on the overall percentage. If 
computing the job's total input size is hard, a user could at least compare 
the size of the lost inputsplits against 'hadoop fs -ls /my/input' and see the 
difference.
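
As a rough sketch of the reporting I mean (the class and method names are made 
up; InputSplit.getLength() is the only real API used), something like this 
could summarize lost bytes against the total:

{noformat}
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.mapreduce.InputSplit;

// Sketch only: summarize how many input bytes were lost to skipped splits,
// relative to the job's total input, so the user doesn't have to reconstruct
// the percentage from 'hadoop fs -ls' by hand.
public class SplitLossReport {
  public static String summarize(List<InputSplit> allSplits, List<InputSplit> lostSplits)
      throws IOException, InterruptedException {
    long totalBytes = 0;
    for (InputSplit split : allSplits) {
      totalBytes += split.getLength();   // split size in bytes
    }
    long lostBytes = 0;
    for (InputSplit split : lostSplits) {
      lostBytes += split.getLength();
    }
    double pct = totalBytes == 0 ? 0.0 : 100.0 * lostBytes / totalBytes;
    return String.format("lost %d of %d input bytes (%.2f%%) across %d bad splits",
        lostBytes, totalBytes, pct, lostSplits.size());
  }
}
{noformat}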
                
> Global configurable minimum 'bad record' thresholds
> ---------------------------------------------------
>
>                 Key: PIG-3059
>                 URL: https://issues.apache.org/jira/browse/PIG-3059
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>    Affects Versions: 0.11
>            Reporter: Russell Jurney
>            Assignee: Cheolsoo Park
>             Fix For: 0.12
>
>         Attachments: avro_test_files-2.tar.gz, PIG-3059-2.patch, 
> PIG-3059.patch
>
>
> See PIG-2614. 
> Pig dies when one record in a LOAD of a billion records fails to parse. This 
> is almost certainly not the desired behavior. elephant-bird and some other 
> storage UDFs have minimum thresholds in terms of percent and count that must 
> be exceeded before a job will fail outright.
> We need these limits to be configurable for Pig, globally. I've come to 
> realize what a major problem Pig's crashing on bad records is for new Pig 
> users. I believe this feature can greatly improve Pig.
> An example of a config would look like:
> pig.storage.bad.record.threshold=0.01
> pig.storage.bad.record.min=100
> A thorough discussion of this issue is available here: 
> http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
