Hey Hive gurus - I recently had trouble getting Hive to process a partition containing bad records, and I'm curious how others deal with this. From searching around, I learned that Hive relies on the bad-record skipping functionality provided by MapReduce rather than handling bad records itself.
The partition I processed was roughly 87GB, with around 600 million records. The job eventually completed (with 350 task failures) with these settings:

    set mapred.skip.mode.enabled=true;
    set mapred.map.max.attempts=100;
    set mapred.reduce.max.attempts=100;
    set mapred.skip.map.max.skip.records=30000;
    set mapred.skip.attempts.to.start.skipping=1;

I believe this means roughly 350 records (~0.00006% of the partition) caused the job to initially fail?

The code throwing the exception has a TODO to discuss record deserialization errors <https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java#L508>. Has there been a discussion about handling bad records natively in Hive?

As a comparison, Elephant-Bird tolerates some percentage of bad records <https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/input/LzoRecordReader.java> without causing task failures.

Thanks!
Travis
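
P.S. To make the comparison concrete, here is a rough sketch of the kind of behavior I have in mind: skip records that fail to deserialize, and only fail the task once the bad-record rate crosses a threshold. This is not Hive's or Elephant-Bird's actual code. The names tolerant_records, max_error_rate, and min_records are made up for illustration, and the default values are arbitrary.

```python
class TooManyBadRecords(Exception):
    """Raised when the share of undeserializable records exceeds the limit."""


def tolerant_records(raw_records, deserialize, max_error_rate=0.0001, min_records=1000):
    """Yield deserialized records, skipping bad ones up to an error-rate limit.

    Hypothetical helper: `deserialize` is any callable that raises on a bad
    record. The rate check only starts after `min_records` records have been
    seen, so one bad record at the start of a split does not fail the task.
    """
    seen = 0
    bad = 0
    for raw in raw_records:
        seen += 1
        try:
            record = deserialize(raw)
        except Exception:
            bad += 1
            # Fail only when enough records were seen and the rate is too high.
            if seen >= min_records and bad / seen > max_error_rate:
                raise TooManyBadRecords(
                    "%d of %d records failed to deserialize" % (bad, seen)
                )
            continue
        yield record
```

With something like this in the record reader, a partition like mine (about 350 bad records out of 600 million) would finish without any task retries, while a partition that is mostly garbage would still fail quickly.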
