[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path

Rohini Palaniswamy (JIRA) Thu, 14 Apr 2016 14:57:22 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241989#comment-15241989
 ]


Rohini Palaniswamy commented on HIVE-13509:
-------------------------------------------

IMHO, Hive should also throw an error if the data does not exist, because the 
results returned are incomplete and wrong. Data integrity is important. If some 
users are OK with that, it can be a configurable option for them, but it cannot 
be the default (at least with Pig). For example, 
mapred.max.map.failures.percent and mapred.max.reduce.failures.percent are 
useful for users who can tolerate some amount of failure, but their default 
is 0. The same goes for pig.error.threshold.percent. 
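
To make the "opt-in, strict by default" idea concrete, here is a minimal 
sketch. It is not the attached patch and not HCatalog code: the class name, 
the method, and the ignoreInvalidPaths flag are all hypothetical, and it uses 
java.nio.file on the local filesystem instead of the Hadoop FileSystem API so 
that it runs standalone.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Hive or HCatalog.
class PartitionPathCheck {

    /**
     * Returns the partition directories that can be read.
     * With ignoreInvalidPaths == false (the proposed default), a missing
     * directory is an error, so the caller never gets silently incomplete data.
     * With ignoreInvalidPaths == true, missing directories are logged and skipped.
     */
    public static List<Path> resolveInputPaths(List<Path> partitionPaths,
                                               boolean ignoreInvalidPaths)
            throws IOException {
        List<Path> usable = new ArrayList<>();
        for (Path p : partitionPaths) {
            if (Files.isDirectory(p)) {
                usable.add(p);
            } else if (ignoreInvalidPaths) {
                System.err.println("WARN: skipping partition with invalid path: " + p);
            } else {
                throw new IOException("Input path does not exist: " + p);
            }
        }
        return usable;
    }

    public static void main(String[] args) throws IOException {
        Path existing = Files.createTempDirectory("partition");
        Path missing = existing.resolve("country=BR"); // never created
        List<Path> partitions = Arrays.asList(existing, missing);

        // Lenient mode: the user explicitly accepted partial input.
        System.out.println("lenient: " + resolveInputPaths(partitions, true).size()
                + " usable partition(s)");

        // Strict mode (default): fail instead of returning partial results.
        try {
            resolveInputPaths(partitions, false);
        } catch (IOException e) {
            System.out.println("strict: " + e.getMessage());
        }
    }
}
```

The point of the design is that the lenient path has to be requested 
explicitly; a caller that passes nothing special still gets the failure.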

> HCatalog getSplits should ignore the partition with invalid path
> ----------------------------------------------------------------
>
>                 Key: HIVE-13509
>                 URL: https://issues.apache.org/jira/browse/HIVE-13509
>             Project: Hive
>          Issue Type: Improvement
>          Components: HCatalog
>            Reporter: Chaoyu Tang
>            Assignee: Chaoyu Tang
>         Attachments: HIVE-13509.patch
>
>
> It is quite common for a partition directory and its HMS metadata to be out 
> of sync, simply because the directory can be added or deleted externally 
> with hdfs shell commands. Technically this should be fixed with MSCK, 
> alter table .. add/drop commands, etc., but sometimes that is not practical, 
> especially in a multi-tenant environment. The discrepancy does not cause any 
> problem for Hive, which returns no rows for a partition with an invalid 
> (e.g. non-existing) path, but it fails a Pig load with HCatLoader, because 
> HCatBaseInputFormat.getSplits throws an error when getting a split for a 
> non-existing path. The error message might look like:
> {code}
> Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://xyz.com:8020/user/hive/warehouse/xyz/date=2016-01-01/country=BR
>       at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
>       at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
>       at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
>       at org.apache.hive.hcatalog.mapreduce.HCatBaseInputFormat.getSplits(HCatBaseInputFormat.java:162)
>       at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:274)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
