Thanks David, it would be very useful if you could file JIRAs and patches for the same.
Thanks,
-namit

On 12/21/09 6:58 PM, "David Lerman" <[email protected]> wrote:

Thanks Zheng. We're using trunk, r888452. We actually ended up making three changes to CombineHiveInputFormat.java to get it working in our environment. If these aren't known issues, let me know and I can file bugs and patches in Jira.

1. The issue mentioned below. Along the lines you mentioned, we fixed it by changing:

    combine.createPool(job, new CombineFilter(paths[i]));

to:

    combine.createPool(job, new CombineFilter(new Path(paths[i].toUri().getPath())));

and then getting rid of the code that strips the "file:" prefix in Hadoop20Shims.getInputPathsShim, having it just call CombineFileInputFormat.getInputPaths(job).

2. When HiveInputFormat.getPartitionDescFromPath was called from CombineHiveInputFormat, it sometimes failed to return a matching partitionDesc, which then caused an exception down the line since the split didn't have an inputFormatClassName. The issue was that the path format used as the key in pathToPartitionInfo varies between stages: in the first stage it was the complete path as returned from the table definitions (e.g. hdfs://server/path), and in subsequent stages it was the complete path with port (e.g. hdfs://server:8020/path) of the result of the previous stage. This isn't a problem in HiveInputFormat, since the directory you look up always uses the same format as the keys. But in CombineHiveInputFormat, you take that path and look up its children in the file system to get all the block information, and then use one of the returned paths to get the partition info -- and that returned path does not include the port. So, in any stage after the first, we were looking for a path without the port, but all the keys in the map contained a port, so we found nothing.
Since I didn't fully understand the logic for when the port was included in the path and when it wasn't, my hack fix was just to give CombineHiveInputFormat its own implementation of getPartitionDescFromPath which walks through pathToPartitionInfo and compares using just the path:

    protected static partitionDesc getPartitionDescFromPath(
        Map<String, partitionDesc> pathToPartitionInfo, Path dir)
        throws IOException {
      for (Map.Entry<String, partitionDesc> entry : pathToPartitionInfo.entrySet()) {
        try {
          if (new URI(entry.getKey()).getPath().equals(dir.toUri().getPath())) {
            return entry.getValue();
          }
        } catch (URISyntaxException e2) {
        }
      }
      throw new IOException("cannot find dir = " + dir.toString()
          + " in pathToPartitionInfo!");
    }

3. In a multi-stage query, when one stage returned no data (resulting in a bunch of output files with size 0), the next stage would hang in Hadoop because it had 0 mappers in the job definition. The issue was that CombineHiveInputFormat would look for blocks, find none, and return 0 splits, which would hang Hadoop. There may be a good way to just skip that job altogether, but as a quick hack to get it working, when there were no splits, I just created a single empty one so that the job wouldn't hang. At the end of getSplits, I added:

    if (result.size() == 0) {
      Path firstChild = paths[0].getFileSystem(job).listStatus(paths[0])[0].getPath();
      CombineFileSplit emptySplit = new CombineFileSplit(
          job, new Path[] { firstChild }, new long[] { 0l }, new long[] { 0l },
          new String[0]);
      FixedCombineHiveInputSplit emptySplitWrapper = new FixedCombineHiveInputSplit(
          job, new Hadoop20Shims.InputSplitShim(emptySplit));
      result.add(emptySplitWrapper);
    }

With those three changes, it's working beautifully -- some of our queries which previously had thousands of mappers loading very small data files now have a hundred or so and are running about 10x faster. Many thanks for the new functionality!
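The port-insensitive lookup in the hack above can be exercised outside Hadoop with a plain map -- a sketch only, with String standing in for partitionDesc, java.net.URI standing in for Hadoop's Path, and made-up keys and values (the original method throws IOException where this sketch throws an unchecked exception):

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the path-only partition lookup: compare only the path
// component of each key, ignoring scheme, host, and port.
public class PartitionLookup {
    static String getPartitionDescFromPath(Map<String, String> pathToPartitionInfo,
                                           String dir) {
        String dirPath = URI.create(dir).getPath();
        for (Map.Entry<String, String> entry : pathToPartitionInfo.entrySet()) {
            // "hdfs://server:8020/p" and "hdfs://server/p" both yield "/p" here.
            if (URI.create(entry.getKey()).getPath().equals(dirPath)) {
                return entry.getValue();
            }
        }
        throw new IllegalStateException("cannot find dir = " + dir);
    }

    public static void main(String[] args) {
        Map<String, String> info = new LinkedHashMap<>();
        info.put("hdfs://server:8020/warehouse/t1", "desc-t1"); // key carries a port
        // The lookup path has no port, but still matches on the path component.
        System.out.println(getPartitionDescFromPath(info, "hdfs://server/warehouse/t1")); // desc-t1
    }
}
```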
On 12/21/09 2:44 AM, "Zheng Shao" <[email protected]> wrote:
> Sorry about the delay.
>
> Are you using Hive trunk?
>
> Filed https://issues.apache.org/jira/browse/HIVE-1001
> We should use (new Path(str)).getPath() instead of chopping off the
> first 5 chars.
>
> Zheng
>
> On Mon, Dec 14, 2009 at 4:43 PM, David Lerman <[email protected]> wrote:
>> I'm running into errors where CombinedHiveInputFormat is combining data from
>> two different tables, which is causing problems because the tables have
>> different input formats.
>>
>> It looks like the problem is in
>> org.apache.hadoop.hive.shims.Hadoop20Shims.getInputPathsShim. It calls
>> CombineFileInputFormat.getInputPaths, which returns the list of input paths,
>> and then chops off the first 5 characters to remove "file:" from the
>> beginning. But the return value I'm getting from getInputPaths is actually
>> hdfs://domain/path, so when it creates the pools using these paths,
>> none of the input paths match the pools (since they're just the file path
>> without protocol or domain).
>>
>> Any suggestions?
>>
>> Thanks!
>>
>
> --
> Yours,
> Zheng
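The prefix-chopping problem Zheng filed as HIVE-1001 is easy to see in isolation -- a sketch with illustrative paths, using java.net.URI in place of Hadoop's Path:

```java
import java.net.URI;

// Sketch: chopping a fixed five-character "file:" prefix breaks for
// hdfs:// URIs, while URI-based parsing yields the bare path for both.
public class PrefixChop {
    public static void main(String[] args) {
        String local = "file:/tmp/data";
        String dfs   = "hdfs://domain/path";

        // substring(5) only happens to work for the "file:" scheme...
        System.out.println(local.substring(5)); // /tmp/data
        // ...and mangles anything else:
        System.out.println(dfs.substring(5));   // //domain/path

        // Parsing the URI strips scheme and authority correctly in both cases.
        System.out.println(URI.create(local).getPath()); // /tmp/data
        System.out.println(URI.create(dfs).getPath());   // /path
    }
}
```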
