Thanks David, it would be very useful if you could file JIRAs and patches for the same.
Thanks,
-namit

On 12/21/09 6:58 PM, "David Lerman" <[email protected]> wrote:

Thanks Zheng. We're using trunk, r888452. We actually ended up making three changes to CombineHiveInputFormat.java to get it working in our environment. If these aren't known issues, let me know and I can file bugs and patches in Jira.

1. The issue mentioned below. Along the lines you mentioned, we fixed it by changing:

    combine.createPool(job, new CombineFilter(paths[i]));

to:

    combine.createPool(job, new CombineFilter(new Path(paths[i].toUri().getPath())));

and then getting rid of the code that strips the "file:" prefix in Hadoop20Shims.getInputPathsShim, having it just call CombineFileInputFormat.getInputPaths(job).

2. When HiveInputFormat.getPartitionDescFromPath was called from CombineHiveInputFormat, it sometimes failed to return a matching partitionDesc, which then caused an exception down the line since the split didn't have an inputFormatClassName. The issue was that the path format used as the key in pathToPartitionInfo varies between stages: in the first stage it was the complete path as returned from the table definitions (e.g. hdfs://server/path), and in subsequent stages it was the complete path with port (e.g. hdfs://server:8020/path) of the result of the previous stage. This isn't a problem in HiveInputFormat, since the directory you look up always uses the same format as the keys. But in CombineHiveInputFormat, you take that path and look up its children in the file system to get all the block information, and then use one of the returned paths to get the partition info -- and that returned path does not include the port. So, in any stage after the first, we were looking for a path without the port, but all the keys in the map contained a port, so we found nothing.
Since I didn't fully understand the logic for when the port was included in the path and when it wasn't, my hack fix was just to give CombineHiveInputFormat its own implementation of getPartitionDescFromPath which walks through pathToPartitionInfo and compares using just the path:

    protected static partitionDesc getPartitionDescFromPath(
        Map<String, partitionDesc> pathToPartitionInfo, Path dir)
        throws IOException {
      for (Map.Entry<String, partitionDesc> entry : pathToPartitionInfo.entrySet()) {
        try {
          if (new URI(entry.getKey()).getPath().equals(dir.toUri().getPath())) {
            return entry.getValue();
          }
        } catch (URISyntaxException e2) {
        }
      }
      throw new IOException("cannot find dir = " + dir.toString()
          + " in pathToPartitionInfo!");
    }

3. In a multi-stage query, when one stage returned no data (resulting in a bunch of output files with size 0), the next stage would hang in Hadoop because it had 0 mappers in the job definition. The issue was that CombineHiveInputFormat would look for blocks, find none, and return 0 splits, which would hang Hadoop. There may be a good way to just skip that job altogether, but as a quick hack to get it working, when there were no splits, I just created a single empty one so that the job wouldn't hang. At the end of getSplits, I added:

    if (result.size() == 0) {
      Path firstChild = paths[0].getFileSystem(job).listStatus(paths[0])[0].getPath();
      CombineFileSplit emptySplit = new CombineFileSplit(
          job, new Path[] { firstChild }, new long[] { 0l }, new long[] { 0l },
          new String[0]);
      FixedCombineHiveInputSplit emptySplitWrapper = new FixedCombineHiveInputSplit(
          job, new Hadoop20Shims.InputSplitShim(emptySplit));
      result.add(emptySplitWrapper);
    }

With those three changes, it's working beautifully -- some of our queries which previously had thousands of mappers loading very small data files now have a hundred or so and are running about 10x faster. Many thanks for the new functionality!
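The port-insensitive lookup in the hack above can be exercised outside Hadoop with a plain map -- a sketch only, with String standing in for partitionDesc, java.net.URI standing in for Hadoop's Path, and made-up keys and values (the original method throws IOException where this sketch throws an unchecked exception):

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the path-only partition lookup: compare only the path
// component of each key, ignoring scheme, host, and port.
public class PartitionLookup {
    static String getPartitionDescFromPath(Map<String, String> pathToPartitionInfo,
                                           String dir) {
        String dirPath = URI.create(dir).getPath();
        for (Map.Entry<String, String> entry : pathToPartitionInfo.entrySet()) {
            // "hdfs://server:8020/p" and "hdfs://server/p" both yield "/p" here.
            if (URI.create(entry.getKey()).getPath().equals(dirPath)) {
                return entry.getValue();
            }
        }
        throw new IllegalStateException("cannot find dir = " + dir);
    }

    public static void main(String[] args) {
        Map<String, String> info = new LinkedHashMap<>();
        info.put("hdfs://server:8020/warehouse/t1", "desc-t1"); // key carries a port
        // The lookup path has no port, but still matches on the path component.
        System.out.println(getPartitionDescFromPath(info, "hdfs://server/warehouse/t1")); // desc-t1
    }
}
```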
On 12/21/09 2:44 AM, "Zheng Shao" <[email protected]> wrote:
> Sorry about the delay.
>
> Are you using Hive trunk?
>
> Filed https://issues.apache.org/jira/browse/HIVE-1001
> We should use (new Path(str)).getPath() instead of chopping off the
> first 5 chars.
>
> Zheng
>
> On Mon, Dec 14, 2009 at 4:43 PM, David Lerman <[email protected]> wrote:
>> I'm running into errors where CombinedHiveInputFormat is combining data from
>> two different tables, which is causing problems because the tables have
>> different input formats.
>>
>> It looks like the problem is in
>> org.apache.hadoop.hive.shims.Hadoop20Shims.getInputPathsShim. It calls
>> CombineFileInputFormat.getInputPaths, which returns the list of input paths,
>> and then chops off the first 5 characters to remove "file:" from the
>> beginning. But the return value I'm getting from getInputPaths is actually
>> hdfs://domain/path, so when it creates the pools using these paths,
>> none of the input paths match the pools (since they're just the file path
>> without protocol or domain).
>>
>> Any suggestions?
>>
>> Thanks!
>>
>
> --
> Yours,
> Zheng
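The prefix-chopping problem Zheng filed as HIVE-1001 is easy to see in isolation -- a sketch with illustrative paths, using java.net.URI in place of Hadoop's Path:

```java
import java.net.URI;

// Sketch: chopping a fixed five-character "file:" prefix breaks for
// hdfs:// URIs, while URI-based parsing yields the bare path for both.
public class PrefixChop {
    public static void main(String[] args) {
        String local = "file:/tmp/data";
        String dfs   = "hdfs://domain/path";

        // substring(5) only happens to work for the "file:" scheme...
        System.out.println(local.substring(5)); // /tmp/data
        // ...and mangles anything else:
        System.out.println(dfs.substring(5));   // //domain/path

        // Parsing the URI strips scheme and authority correctly in both cases.
        System.out.println(URI.create(local).getPath()); // /tmp/data
        System.out.println(URI.create(dfs).getPath());   // /path
    }
}
```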
