[jira] [Created] (HIVE-2088) Improve Hive Query split calculation performance over large partitions (potentially 5x speedup)

Vaibhav Aggarwal (JIRA) Thu, 31 Mar 2011 18:20:45 -0700

Improve Hive Query split calculation performance over large partitions 
(potentially 5x speedup)
-----------------------------------------------------------------------------------------------


                 Key: HIVE-2088
                 URL: https://issues.apache.org/jira/browse/HIVE-2088
             Project: Hive
          Issue Type: Improvement
            Reporter: Vaibhav Aggarwal


While working on improving start up time for query spanning over large number 
of partitions I found a particular inefficiency in CombineHiveInputFormat.

It seems that for each input partition we create a CombineFilter to make sure 
that files are combined within a particular partition.

CombineHiveInputFormat:

  Path filterPath = path;
  if (!poolSet.contains(filterPath)) {
        LOG.info("CombineHiveInputSplit creating pool for " + path +
            "; using filter path " + filterPath);
        combine.createPool(job, new CombineFilter(filterPath));
        poolSet.add(filterPath);
  }

These filters are passed then passed to CombineFileInputFormat along with all 
the input paths.
CombineFileInputFormat computes a list of all the files in the input paths.
It them loops over each filter and then checks whether a particular file 
belongs to a particular filter.

ConbineFileInputFormat:

for (MultiPathFilter onepool : pools) {
      ArrayList<Path> myPaths = new ArrayList<Path>();
      
      // pick one input path. If it matches all the filters in a pool,
      // add it to the output set
      for (int i = 0; i < paths.length; i++) {
      ...

This results in computation of the order O(p*f) where p is the number of 
partitions and f is the total number of files in all partitions.
For a case of 10,000 partitions with 10 files in each partition, this results 
in 1,000,000,000 iterations.

We can replace this code with processing splits for one input path at a time 
without passing any filters at all.

Do you happen to see a case where this approach will not work?


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Created] (HIVE-2088) Improve Hive Query split calculation performance over large partitions (potentially 5x speedup)

Reply via email to