[jira] [Updated] (HCATALOG-487) HCatalog should tolerate a user-defined amount of bad records

Travis Crawford (JIRA) Thu, 30 Aug 2012 15:58:12 -0700

     [ 
https://issues.apache.org/jira/browse/HCATALOG-487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Travis Crawford updated HCATALOG-487:
-------------------------------------

    Attachment: HCATALOG-487_skip_bad_records.1.patch

This patch adds two new properties that control bad record skipping:

{code}
+  /**
+   * {@value} (default: {@value #HCAT_INPUT_BAD_RECORD_THRESHOLD_DEFAULT}).
+   * Threshold for the ratio of bad records that will be silently skipped without causing a task
+   * failure. This is useful when processing large data sets with corrupt records, when it's
+   * acceptable to skip some bad records.
+   */
+  public static final String HCAT_INPUT_BAD_RECORD_THRESHOLD_KEY = "hcat.input.bad.record.threshold";
+  public static final float HCAT_INPUT_BAD_RECORD_THRESHOLD_DEFAULT = 0.0001f;
+
+  /**
+   * {@value} (default: {@value #HCAT_INPUT_BAD_RECORD_MIN_DEFAULT}).
+   * Number of bad records that will be accepted before applying
+   * {@value #HCAT_INPUT_BAD_RECORD_THRESHOLD_KEY}. This is necessary to prevent an initial bad
+   * record from causing a task failure.
+   */
+  public static final String HCAT_INPUT_BAD_RECORD_MIN_KEY = "hcat.input.bad.record.min";
+  public static final int HCAT_INPUT_BAD_RECORD_MIN_DEFAULT = 2;
{code}
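
To illustrate how the two properties might interact, here is a minimal sketch. It is not taken from the patch: the class name BadRecordSketch, the method shouldFail, and the exact comparison operators are assumptions made for illustration. It assumes the first "min" bad records are always accepted, and that afterwards the task fails once the ratio of bad records to records read so far exceeds the threshold.

```java
// Hypothetical sketch, not the code in HCATALOG-487_skip_bad_records.1.patch.
final class BadRecordSketch {
    // Defaults copied from the patch excerpt above.
    static final float THRESHOLD_DEFAULT = 0.0001f;
    static final int MIN_DEFAULT = 2;

    /**
     * Decides whether a task should fail after seeing another bad record.
     *
     * @param badRecords   bad records seen so far, including the current one
     * @param totalRecords all records read so far, good and bad
     * @param threshold    tolerated ratio of bad records (hcat.input.bad.record.threshold)
     * @param minBad       bad records accepted before the ratio applies (hcat.input.bad.record.min)
     */
    static boolean shouldFail(long badRecords, long totalRecords, float threshold, int minBad) {
        // Assumption: up to minBad bad records are tolerated unconditionally,
        // so an early bad record cannot fail the task by itself.
        if (badRecords <= minBad) {
            return false;
        }
        // Defensive guard: a bad record with no records counted is treated as a failure.
        if (totalRecords <= 0) {
            return true;
        }
        // Assumption: fail only when the observed ratio strictly exceeds the threshold.
        double ratio = (double) badRecords / (double) totalRecords;
        return ratio > threshold;
    }

    public static void main(String[] args) {
        // Third bad record out of ten records read: ratio 0.3 exceeds the default threshold.
        System.out.println(shouldFail(3, 10, THRESHOLD_DEFAULT, MIN_DEFAULT));
        // Third bad record out of one million records read: ratio is below the default threshold.
        System.out.println(shouldFail(3, 1_000_000, THRESHOLD_DEFAULT, MIN_DEFAULT));
    }
}
```

Under these assumptions the minimum count only matters early in a task, when a single bad record would otherwise produce a very high ratio.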
                
> HCatalog should tolerate a user-defined amount of bad records
> -------------------------------------------------------------
>
>                 Key: HCATALOG-487
>                 URL: https://issues.apache.org/jira/browse/HCATALOG-487
>             Project: HCatalog
>          Issue Type: Improvement
>            Reporter: Travis Crawford
>            Assignee: Travis Crawford
>         Attachments: HCATALOG-487_skip_bad_records.1.patch
>
>
> HCatalog tasks currently fail when deserializing corrupt records. In some 
> cases, large data sets have a small number of corrupt records and it is okay to 
> skip them. In fact, Hadoop has support for skipping bad records for exactly 
> this reason.
> However, using the Hadoop-native record skipping feature (like Hive does) is 
> very coarse and leads to a large number of failed tasks, task scheduling 
> overhead, and limited control over the skipping behavior.
> HCatalog should have native support for skipping a user-defined amount of bad 
> records.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
