Thanks Namit.  Filed as HIVE-1006 and HIVE-1007.

On 12/22/09 12:36 AM, "Namit Jain" <[email protected]> wrote:

> Thanks David,
> It would be very useful if you could file JIRAs and patches for these.
> 
> 
> Thanks,
> -namit
> 
> 
> On 12/21/09 6:58 PM, "David Lerman" <[email protected]> wrote:
> 
>> Thanks Zheng.  We're using trunk, r888452.
>> 
>> We actually ended up making three changes to CombineHiveInputFormat.java to
>> get it working in our environment.  If these aren't known issues, let me
>> know and I can file bugs and patches in Jira.
>> 
>> 1.  The issue mentioned below.  Along the lines you mentioned, we fixed it
>> by changing:
>> 
>> combine.createPool(job, new CombineFilter(paths[i]));
>> 
>> to:
>> 
>> combine.createPool(job,
>>     new CombineFilter(new Path(paths[i].toUri().getPath())));
>> 
>> and then getting rid of the code that strips the "file:" in
>> Hadoop20Shims.getInputPathsShim and having it just call
>> CombineFileInputFormat.getInputPaths(job);
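>> 
>> For reference, here is a rough sketch of the simplified shim method (the
>> exact signature and surrounding code in Hadoop20Shims may differ slightly):
>> 
>> // Sketch only: return the input paths untouched instead of stripping "file:".
>> public Path[] getInputPathsShim(JobConf job) {
>>   // CombineFileInputFormat inherits getInputPaths from FileInputFormat.
>>   return CombineFileInputFormat.getInputPaths(job);
>> }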
>> 
>> 2.  When HiveInputFormat.getPartitionDescFromPath was called from
>> CombineHiveInputFormat, it was sometimes failing to return a matching
>> partitionDesc which then caused an Exception down the line since the split
>> didn't have an inputFormatClassName.  The issue was that the path format
>> used as the key in pathToPartitionInfo varies between stages - in the first
>> stage it was the complete path as returned from the table definitions (e.g.
>> hdfs://server/path), and in subsequent stages it was the complete path with
>> port (e.g. hdfs://server:8020/path) of the previous stage's output.
>> This isn't a problem in HiveInputFormat since the directory you're looking
>> up always uses the same format as the keys, but in CombineHiveInputFormat,
>> you take that path and look up its children in the file system to get all
>> the block information, and then use one of the returned paths to get the
>> partition info -- and that returned path does not include the port.  So, in
>> any stage after the first, we were looking for a path without the port, but
>> all the keys in the map contained a port, so we didn't find anything.
>> 
>> Since I didn't fully understand the logic for when the port was included in
>> the path and when it wasn't, my hack fix was to give CombineHiveInputFormat
>> its own implementation of getPartitionDescFromPath which walks through
>> pathToPartitionInfo and compares using only the path component:
>> 
>> protected static partitionDesc getPartitionDescFromPath(
>>     Map<String, partitionDesc> pathToPartitionInfo, Path dir)
>>     throws IOException {
>>   for (Map.Entry<String, partitionDesc> entry : pathToPartitionInfo.entrySet()) {
>>     try {
>>       // Compare only the path component, ignoring scheme, host, and port.
>>       if (new URI(entry.getKey()).getPath().equals(dir.toUri().getPath())) {
>>         return entry.getValue();
>>       }
>>     } catch (URISyntaxException e2) {
>>       // Skip keys that aren't valid URIs and keep looking.
>>     }
>>   }
>>   throw new IOException("cannot find dir = " + dir.toString()
>>       + " in pathToPartitionInfo!");
>> }
>> 
>> 3. In a multi-stage query, when one stage returned no data (resulting in a
>> bunch of output files with size 0), the next stage would hang in Hadoop
>> because it would have 0 mappers in the job definition.  The issue was that
>> CombineHiveInputFormat would look for blocks, find none, and return 0 splits
>> which would hang Hadoop.  There may be a good way to just skip that job
>> altogether, but as a quick hack to get it working, when there were no
>> splits, I'd just create a single empty one so that the job wouldn't hang: at
>> the end of getSplits, I just added:
>> 
>> if (result.size() == 0) {
>>   Path firstChild =
>>     paths[0].getFileSystem(job).listStatus(paths[0])[0].getPath();
>> 
>>   CombineFileSplit emptySplit = new CombineFileSplit(
>>     job, new Path[] {firstChild}, new long[] {0L}, new long[] {0L},
>>     new String[0]);
>>   FixedCombineHiveInputSplit emptySplitWrapper =
>>     new FixedCombineHiveInputSplit(job,
>>       new Hadoop20Shims.InputSplitShim(emptySplit));
>> 
>>   result.add(emptySplitWrapper);
>> }
>> 
>> With those three changes, it's working beautifully -- some of our queries
>> which previously had thousands of mappers loading very small data files now
>> have a hundred or so and are running about 10x faster.  Many thanks for the
>> new functionality!
>> 
>> On 12/21/09 2:44 AM, "Zheng Shao" <[email protected]> wrote:
>> 
>>>> Sorry about the delay.
>>>> 
>>>> Are you using Hive trunk?
>>>> 
>>>> Filed https://issues.apache.org/jira/browse/HIVE-1001
>>>> We should use (new Path(str)).toUri().getPath() instead of chopping off
>>>> the first 5 chars.
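>>>> 
>>>> As a rough sketch of that idea (the values below are illustrative only,
>>>> assuming org.apache.hadoop.fs.Path):
>>>> 
>>>> // Strip the scheme and authority rather than assuming a "file:" prefix.
>>>> String str = "hdfs://namenode:8020/warehouse/mytable/part=1";  // example
>>>> String poolPath = new Path(str).toUri().getPath();
>>>> // poolPath is "/warehouse/mytable/part=1", regardless of scheme, host, or port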
>>>> 
>>>> Zheng
>>>> 
>>>> On Mon, Dec 14, 2009 at 4:43 PM, David Lerman <[email protected]> wrote:
>>>>>> I'm running into errors where CombineHiveInputFormat is combining data
>>>>>> from two different tables, which is causing problems because the tables
>>>>>> have different input formats.
>>>>>> 
>>>>>> It looks like the problem is in
>>>>>> org.apache.hadoop.hive.shims.Hadoop20Shims.getInputPathsShim.  It calls
>>>>>> CombineFileInputFormat.getInputPaths, which returns the list of input
>>>>>> paths, and then chops off the first 5 characters to remove "file:" from
>>>>>> the beginning, but the return value I'm getting from getInputPaths is
>>>>>> actually hdfs://domain/path.  So then when it creates the pools using
>>>>>> these paths, none of the input paths match the pools (since they're just
>>>>>> the file path without the protocol or domain).
>>>>>> 
>>>>>> Any suggestions?
>>>>>> 
>>>>>> Thanks!
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Yours,
>>>> Zheng
>> 
>> 
