[jira] Commented: (HIVE-1419) Policy on deserialization errors
[ https://issues.apache.org/jira/browse/HIVE-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902563#action_12902563 ]

Vladimir Klimontovich commented on HIVE-1419:
---------------------------------------------

Namit, yes, I can refresh the patch in the next few days.

> Policy on deserialization errors
> --------------------------------
>
>                 Key: HIVE-1419
>                 URL: https://issues.apache.org/jira/browse/HIVE-1419
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Serializers/Deserializers
>    Affects Versions: 0.5.0
>            Reporter: Vladimir Klimontovich
>            Assignee: Vladimir Klimontovich
>            Priority: Minor
>             Fix For: 0.7.0
>
>         Attachments: corrupted_records_0.5.patch, corrupted_records_0.5_ver2.patch, corrupted_records_trunk.patch, corrupted_records_trunk_ver2.patch
>
>
> When the deserializer throws an exception, the whole map task fails (see MapOperator.java). This is not always convenient behavior, especially on huge datasets where a few corrupted lines can be normal practice.
> Proposed solution:
> 1) Keep a counter of corrupted records.
> 2) When the counter exceeds a limit (configurable via the hive.max.deserializer.errors property, 0 by default), throw an exception; otherwise, just log the exception at WARN level.
> Patches for the 0.5 branch and trunk are attached.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
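The counting policy proposed in the description can be sketched roughly as follows. This is a standalone illustration only, not the attached patch or the actual MapOperator code; the class and method names are hypothetical:

```java
// Hypothetical sketch of the proposed deserialization-error policy:
// count corrupted records, fail the task only once the configured
// limit (hive.max.deserializer.errors, default 0) is exceeded.
public class DeserializerErrorPolicy {

    private final long maxErrors; // value of hive.max.deserializer.errors
    private long errorCount = 0;

    public DeserializerErrorPolicy(long maxErrors) {
        this.maxErrors = maxErrors;
    }

    /**
     * Invoked for each record that fails to deserialize. Increments the
     * counter; once the counter exceeds the limit, rethrows to fail the
     * task. Below the limit, the error is only logged at WARN level.
     */
    public void onDeserializationError(Exception e) {
        errorCount++;
        if (errorCount > maxErrors) {
            throw new RuntimeException(
                "Exceeded hive.max.deserializer.errors ("
                    + maxErrors + "): " + errorCount + " corrupted records", e);
        }
        // Stand-in for a real WARN-level logger call.
        System.err.println("WARN: skipping corrupted record: " + e.getMessage());
    }

    public long getErrorCount() {
        return errorCount;
    }
}
```

With the default limit of 0, the first corrupted record still fails the task, which preserves today's behavior for users who never set the property.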
[jira] Commented: (HIVE-1419) Policy on deserialization errors
[ https://issues.apache.org/jira/browse/HIVE-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902559#action_12902559 ]

Namit Jain commented on HIVE-1419:
----------------------------------

@Vladimir/Edward, are you still working on this? It seems like a good feature to have; Oracle sqlldr has very similar functionality, where you can specify the number of bad rows allowed. Can you refresh the patch?
[jira] Commented: (HIVE-1419) Policy on deserialization errors
[ https://issues.apache.org/jira/browse/HIVE-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880909#action_12880909 ]

Vladimir Klimontovich commented on HIVE-1419:
---------------------------------------------

If it works fine for you now, it won't be broken by this patch.
[jira] Commented: (HIVE-1419) Policy on deserialization errors
[ https://issues.apache.org/jira/browse/HIVE-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880908#action_12880908 ]

Edward Capriolo commented on HIVE-1419:
---------------------------------------

I am looking through this and trying to wrap my head around it. Offhand, do you know what happens in this situation? We have a table that we have added columns to over time:

create table tab (a int, b int);

Over time we have added more columns:

alter table tab add columns (c int);

This works fine for us, as selecting column c on older data returns null for that column. Will this behaviour be preserved?