[ https://issues.apache.org/jira/browse/HIVE-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907058#action_12907058 ]

He Yongqiang commented on HIVE-1610:
------------------------------------

Sammy, there are mainly 2 problems:
1) iterating over the whole map is not efficient, and 2) using startsWith to
do a prefix match is a bug that was fixed in HIVE-1510.
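
To see why startsWith is dangerous here: one partition path can be a plain
string prefix of another, so a prefix match can pair a dir with the wrong
PartitionDesc. A made-up example (the paths below are hypothetical, just for
illustration):

    // Hypothetical paths, only to illustrate the startsWith pitfall:
    String partitionPath = "/user/root/domain_keywords/account=41";
    String inputDir = "/user/root/domain_keywords/account=417/week=201035";
    // true, although account=41 and account=417 are different partitions
    boolean wrongMatch = inputDir.startsWith(partitionPath);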

Sammy, can you change the logic as follows:

Right now, Hive generates another pathToPartitionInfo map by removing the 
paths' scheme information, and puts it in a cacheMap. 
We can keep the same logic but change the new pathToPartitionInfo map's values 
to be lists of PartitionDesc. 
Then we can just remove the scheme check, and once we get a match, we go 
through the list of PartitionDesc to find the best one.

This also solves another problem: if there are 2 PartitionDescs whose path 
parts are the same but whose schemes are different, today only one of them 
ends up in the new pathToPartitionInfo map.
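
Something like this (just a sketch; I'm assuming the existing map is a
Map<String, PartitionDesc> and using java.net.URI to strip the scheme, the
real code lives in HiveFileFormatUtils):

    import java.net.URI;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Build the scheme-stripped cache map with a list of PartitionDesc per
    // path, so two partitions whose paths differ only in scheme both survive.
    static Map<String, List<PartitionDesc>> stripSchemes(
        Map<String, PartitionDesc> pathToPartitionInfo) {
      Map<String, List<PartitionDesc>> cache =
          new HashMap<String, List<PartitionDesc>>();
      for (Map.Entry<String, PartitionDesc> e : pathToPartitionInfo.entrySet()) {
        // URI.getPath() drops scheme://host:port and keeps only the path part.
        String pathOnly = URI.create(e.getKey()).getPath();
        List<PartitionDesc> descs = cache.get(pathOnly);
        if (descs == null) {
          descs = new ArrayList<PartitionDesc>();
          cache.put(pathOnly, descs);
        }
        descs.add(e.getValue());
      }
      return cache;
    }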

About how to go through the list of PartitionDesc to find the best one (see 
the sketch below):
if the list contains only 1 element, return list.get(0);
1) if the original input does not have any scheme information: if the list 
contains more than 1 element, report an error.
2) if the original input contains scheme information: a) if the list contains 
an element that is an exact match (same scheme, host, and port as the input), 
return the exact match; b) otherwise ignore the port part but keep the scheme 
and host, and go through the list again.
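
Roughly like the following sketch. (PartitionDesc does not really expose its
own path, so descPath(desc) below is a hypothetical helper standing in for
however we associate each PartitionDesc with its original, scheme-qualified
key.)

    // Pick the best PartitionDesc for an input URI from the candidate list.
    static PartitionDesc findBest(URI input, List<PartitionDesc> candidates)
        throws IOException {
      if (candidates.size() == 1) {
        return candidates.get(0);
      }
      if (input.getScheme() == null) {
        // No scheme on the input: more than one candidate is ambiguous.
        throw new IOException("ambiguous dir " + input);
      }
      // 1) exact match, including scheme, host, and port.
      for (PartitionDesc desc : candidates) {
        if (input.equals(URI.create(descPath(desc)))) {
          return desc;
        }
      }
      // 2) ignore the port, but keep the scheme and host.
      for (PartitionDesc desc : candidates) {
        URI cand = URI.create(descPath(desc));
        if (input.getScheme().equals(cand.getScheme())
            && input.getHost() != null
            && input.getHost().equals(cand.getHost())) {
          return desc;
        }
      }
      throw new IOException("cannot find dir " + input
          + " in pathToPartitionInfo");
    }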

what do you think?

> Using CombinedHiveInputFormat causes partToPartitionInfo IOException  
> ----------------------------------------------------------------------
>
>                 Key: HIVE-1610
>                 URL: https://issues.apache.org/jira/browse/HIVE-1610
>             Project: Hadoop Hive
>          Issue Type: Bug
>         Environment: Hadoop 0.20.2
>            Reporter: Sammy Yu
>         Attachments: 
> 0002-HIVE-1610.-Added-additional-schema-check-to-doGetPar.patch, 
> 0003-HIVE-1610.patch, 0004-hive.patch
>
>
> I have a relatively complicated Hive query using CombineHiveInputFormat:
> set hive.exec.dynamic.partition.mode=nonstrict;
> set hive.exec.dynamic.partition=true; 
> set hive.exec.max.dynamic.partitions=1000;
> set hive.exec.max.dynamic.partitions.pernode=300;
> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
> INSERT OVERWRITE TABLE keyword_serp_results_no_dups PARTITION(week)
> select distinct keywords.keyword, keywords.domain, keywords.url,
>   keywords.rank, keywords.universal_rank, keywords.serp_type,
>   keywords.date_indexed, keywords.search_engine_type, keywords.week
> from keyword_serp_results keywords
> JOIN (
>   select domain, keyword, search_engine_type, week, max_date_indexed,
>     min(rank) as best_rank
>   from (
>     select keywords1.domain, keywords1.keyword, keywords1.search_engine_type,
>       keywords1.week, keywords1.rank, dupkeywords1.max_date_indexed
>     from keyword_serp_results keywords1
>     JOIN (
>       select domain, keyword, search_engine_type, week,
>         max(date_indexed) as max_date_indexed
>       from keyword_serp_results
>       group by domain, keyword, search_engine_type, week
>     ) dupkeywords1
>     on keywords1.keyword = dupkeywords1.keyword
>       AND keywords1.domain = dupkeywords1.domain
>       AND keywords1.search_engine_type = dupkeywords1.search_engine_type
>       AND keywords1.week = dupkeywords1.week
>       AND keywords1.date_indexed = dupkeywords1.max_date_indexed
>   ) dupkeywords2
>   group by domain, keyword, search_engine_type, week, max_date_indexed
> ) dupkeywords3
> on keywords.keyword = dupkeywords3.keyword
>   AND keywords.domain = dupkeywords3.domain
>   AND keywords.search_engine_type = dupkeywords3.search_engine_type
>   AND keywords.week = dupkeywords3.week
>   AND keywords.date_indexed = dupkeywords3.max_date_indexed
>   AND keywords.rank = dupkeywords3.best_rank;
>  
> This query used to work fine until I updated to r991183 on trunk and started 
> getting this error:
> java.io.IOException: cannot find dir = 
> hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/tmp/hive-root/hive_2010-09-01_10-57-41_396_1409145025949924904/-mr-10002/000000_0
> in partToPartitionInfo: 
> [hdfs://ec2-75-101-174-245.compute-1.amazonaws.com:8020/tmp/hive-root/hive_2010-09-01_10-57-41_396_1409145025949924904/-mr-10002,
> hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=417/week=201035/day=20100829,
> hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=418/week=201035/day=20100829,
> hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=419/week=201035/day=20100829,
> hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=422/week=201035/day=20100829,
> hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=422/week=201035/day=20100831]
> at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:277)
> at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat$CombineHiveInputSplit.<init>(CombineHiveInputFormat.java:100)
> at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:312)
> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:610)
> at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:120)
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:108)
> This query works if I don't change hive.input.format, i.e. if I leave out:
> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
> I've narrowed this issue down to the commit for HIVE-1510.  If I take out the 
> changeset from r987746, everything works as before.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.