Thanks Namit. Filed as HIVE-1006 and HIVE-1007.
On 12/22/09 12:36 AM, "Namit Jain" <[email protected]> wrote:

> Thanks David,
> It would be very useful if you can file JIRAs and patches for the same.
>
> Thanks,
> -namit
>
> On 12/21/09 6:58 PM, "David Lerman" <[email protected]> wrote:
>
>> Thanks Zheng. We're using trunk, r888452.
>>
>> We actually ended up making three changes to CombineHiveInputFormat.java
>> to get it working in our environment. If these aren't known issues, let
>> me know and I can file bugs and patches in JIRA.
>>
>> 1. The issue mentioned below. Along the lines you mentioned, we fixed it
>> by changing:
>>
>>     combine.createPool(job, new CombineFilter(paths[i]));
>>
>> to:
>>
>>     combine.createPool(job,
>>         new CombineFilter(new Path(paths[i].toUri().getPath())));
>>
>> and then removing the code that strips the "file:" prefix in
>> Hadoop20Shims.getInputPathsShim and having it just call
>> CombineFileInputFormat.getInputPaths(job).
>>
>> 2. When HiveInputFormat.getPartitionDescFromPath was called from
>> CombineHiveInputFormat, it sometimes failed to return a matching
>> partitionDesc, which then caused an exception down the line because the
>> split didn't have an inputFormatClassName. The issue was that the path
>> format used as the key in pathToPartitionInfo varies between stages: in
>> the first stage it was the complete path as returned from the table
>> definitions (e.g. hdfs://server/path), and in subsequent stages it was
>> the complete path with port (e.g. hdfs://server:8020/path) of the result
>> of the previous stage. This isn't a problem in HiveInputFormat, since the
>> directory you're looking up always uses the same format as the keys, but
>> in CombineHiveInputFormat you take that path and look up its children in
>> the file system to get all the block information, and then use one of the
>> returned paths to get the partition info -- and that returned path does
>> not include the port.
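Both fixes above hinge on the same normalization: compare only the path component of a URI, so that "file:", "hdfs://server", and "hdfs://server:8020" forms of the same directory agree. A minimal standalone sketch (plain java.net.URI, no Hive or Hadoop classes; class and sample paths are illustrative):

```java
import java.net.URI;

// Sketch: reduce a URI string to its path component so that scheme,
// host, and port differences don't affect comparisons.
public class PathNormalize {
    static String pathOnly(String uriString) {
        return URI.create(uriString).getPath();
    }

    public static void main(String[] args) {
        System.out.println(pathOnly("hdfs://server/warehouse/t1"));      // /warehouse/t1
        System.out.println(pathOnly("hdfs://server:8020/warehouse/t1")); // /warehouse/t1
        System.out.println(pathOnly("file:/warehouse/t1"));              // /warehouse/t1
    }
}
```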
>> So, in any stage after the first, we were looking for a path without the
>> port, but all the keys in the map contained a port, so we didn't find
>> anything.
>>
>> Since I didn't fully understand the logic for when the port was included
>> in the path and when it wasn't, my hack fix was just to give
>> CombineHiveInputFormat its own implementation of getPartitionDescFromPath
>> which just walks through pathToPartitionInfo and compares using only the
>> path:
>>
>>     protected static partitionDesc getPartitionDescFromPath(
>>         Map<String, partitionDesc> pathToPartitionInfo, Path dir)
>>         throws IOException {
>>       for (Map.Entry<String, partitionDesc> entry :
>>           pathToPartitionInfo.entrySet()) {
>>         try {
>>           if (new URI(entry.getKey()).getPath()
>>               .equals(dir.toUri().getPath())) {
>>             return entry.getValue();
>>           }
>>         } catch (URISyntaxException e2) {
>>           // skip keys that aren't valid URIs
>>         }
>>       }
>>       throw new IOException("cannot find dir = " + dir.toString()
>>           + " in pathToPartitionInfo!");
>>     }
>>
>> 3. In a multi-stage query, when one stage returned no data (resulting in
>> a bunch of output files with size 0), the next stage would hang in Hadoop
>> because it would have 0 mappers in the job definition. The issue was that
>> CombineHiveInputFormat would look for blocks, find none, and return 0
>> splits, which would hang Hadoop.
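The lookup workaround in (2) can be exercised on its own. Here is a sketch with Hive's partitionDesc stubbed as a String (an assumption for illustration only), showing a key that carries a port matching a portless directory:

```java
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the path-only lookup: walk the map and match on the URI path
// component, ignoring scheme, host, and port.
public class PartitionLookup {
    static String findByPath(Map<String, String> pathToPartitionInfo, String dir)
            throws IOException {
        for (Map.Entry<String, String> entry : pathToPartitionInfo.entrySet()) {
            try {
                if (new URI(entry.getKey()).getPath()
                        .equals(URI.create(dir).getPath())) {
                    return entry.getValue();
                }
            } catch (URISyntaxException e) {
                // skip keys that aren't valid URIs
            }
        }
        throw new IOException("cannot find dir = " + dir);
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> info = new LinkedHashMap<>();
        info.put("hdfs://server:8020/warehouse/t1", "t1-partition");
        // a portless path from a later stage still finds the ported key
        System.out.println(findByPath(info, "hdfs://server/warehouse/t1"));
    }
}
```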
>> There may be a good way to just skip that job altogether, but as a quick
>> hack to get it working, when there were no splits I'd just create a
>> single empty one so that the job wouldn't hang. At the end of getSplits,
>> I added:
>>
>>     if (result.size() == 0) {
>>       Path firstChild =
>>           paths[0].getFileSystem(job).listStatus(paths[0])[0].getPath();
>>
>>       CombineFileSplit emptySplit = new CombineFileSplit(
>>           job, new Path[] { firstChild },
>>           new long[] { 0L }, new long[] { 0L },
>>           new String[0]);
>>       FixedCombineHiveInputSplit emptySplitWrapper =
>>           new FixedCombineHiveInputSplit(job,
>>               new Hadoop20Shims.InputSplitShim(emptySplit));
>>
>>       result.add(emptySplitWrapper);
>>     }
>>
>> With those three changes, it's working beautifully -- some of our queries
>> which previously had thousands of mappers loading very small data files
>> now have a hundred or so and are running about 10x faster. Many thanks
>> for the new functionality!
>>
>> On 12/21/09 2:44 AM, "Zheng Shao" <[email protected]> wrote:
>>
>>>> Sorry about the delay.
>>>>
>>>> Are you using Hive trunk?
>>>>
>>>> Filed https://issues.apache.org/jira/browse/HIVE-1001
>>>> We should use (new Path(str)).getPath() instead of chopping off the
>>>> first 5 chars.
>>>>
>>>> Zheng
>>>>
>>>> On Mon, Dec 14, 2009 at 4:43 PM, David Lerman <[email protected]> wrote:
>>>>>> I'm running into errors where CombineHiveInputFormat is combining
>>>>>> data from two different tables, which is causing problems because
>>>>>> the tables have different input formats.
>>>>>>
>>>>>> It looks like the problem is in
>>>>>> org.apache.hadoop.hive.shims.Hadoop20Shims.getInputPathsShim. It
>>>>>> calls CombineFileInputFormat.getInputPaths, which returns the list
>>>>>> of input paths and then chops off the first 5 characters to remove
>>>>>> "file:" from the beginning, but the return value I'm getting from
>>>>>> getInputPaths is actually hdfs://domain/path.
>>>>>> So then when it creates the pools using these paths, none of the
>>>>>> input paths match the pools (since they're just the file path,
>>>>>> without protocol or domain).
>>>>>>
>>>>>> Any suggestions?
>>>>>>
>>>>>> Thanks!
>>>>
>>>> --
>>>> Yours,
>>>> Zheng
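Zheng's suggestion above can be seen in miniature: a fixed substring(5) only strips a literal "file:" prefix, while parsing the string and taking the path works for any scheme. A small sketch (using java.net.URI here instead of Hadoop's Path, to keep it dependency-free; class and sample paths are illustrative):

```java
// Sketch contrasting the two approaches to stripping a URI scheme.
public class ChopVsParse {
    // old approach: assumes a 5-character "file:" prefix
    static String chop(String uriString) {
        return uriString.substring(5);
    }

    // suggested approach: parse and take the path, whatever the scheme
    static String parse(String uriString) {
        return java.net.URI.create(uriString).getPath();
    }

    public static void main(String[] args) {
        System.out.println(chop("file:/tmp/data"));              // /tmp/data -- fine
        System.out.println(chop("hdfs://server/warehouse/t1"));  // //server/warehouse/t1 -- broken
        System.out.println(parse("hdfs://server/warehouse/t1")); // /warehouse/t1
    }
}
```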
