[ https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241989#comment-15241989 ]
Rohini Palaniswamy commented on HIVE-13509: ------------------------------------------- IMHO, Hive should also be throwing an error as well if data does not exist because the results returned is incomplete and wrong. Data integrity is important. If some users are ok with it, then it can be a configurable option for them but it cannot be the default (at least with Pig). For eg: mapred.max.map.failures.percent and mapred.max.reduce.failures.percent are useful for users who are ok with tolerating some amount of failure, but default is 0. Same with pig.error.threshold.percent. > HCatalog getSplits should ignore the partition with invalid path > ---------------------------------------------------------------- > > Key: HIVE-13509 > URL: https://issues.apache.org/jira/browse/HIVE-13509 > Project: Hive > Issue Type: Improvement > Components: HCatalog > Reporter: Chaoyu Tang > Assignee: Chaoyu Tang > Attachments: HIVE-13509.patch > > > It is quite common that there is the discrepancy between partition directory > and its HMS metadata, simply because the directory could be added/deleted > externally using hdfs shell command. Technically it should be fixed by MSCK > and alter table .. add/drop command etc, but sometimes it might not be > practical especially in a multi-tenant env. This discrepancy does not cause > any problem to Hive, Hive returns no rows for a partition with an invalid > (e.g. non-existing) path, but it fails the Pig load with HCatLoader, because > the HCatBaseInputFormat getSplits throws an error when getting a split for a > non-existing path. The error message might looks like: > {code} > Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does > not exist: > hdfs://xyz.com:8020/user/hive/warehouse/xyz/date=2016-01-01/country=BR > at > org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287) > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315) > at > org.apache.hive.hcatalog.mapreduce.HCatBaseInputFormat.getSplits(HCatBaseInputFormat.java:162) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:274) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)