Improve Hive Query split calculation performance over large partitions
(potentially 5x speedup)
-----------------------------------------------------------------------------------------------
Key: HIVE-2088
URL: https://issues.apache.org/jira/browse/HIVE-2088
Project: Hive
Issue Type: Improvement
Reporter: Vaibhav Aggarwal
While working on improving start up time for query spanning over large number
of partitions I found a particular inefficiency in CombineHiveInputFormat.
It seems that for each input partition we create a CombineFilter to make sure
that files are combined within a particular partition.
CombineHiveInputFormat:
Path filterPath = path;
if (!poolSet.contains(filterPath)) {
LOG.info("CombineHiveInputSplit creating pool for " + path +
"; using filter path " + filterPath);
combine.createPool(job, new CombineFilter(filterPath));
poolSet.add(filterPath);
}
These filters are passed then passed to CombineFileInputFormat along with all
the input paths.
CombineFileInputFormat computes a list of all the files in the input paths.
It them loops over each filter and then checks whether a particular file
belongs to a particular filter.
ConbineFileInputFormat:
for (MultiPathFilter onepool : pools) {
ArrayList<Path> myPaths = new ArrayList<Path>();
// pick one input path. If it matches all the filters in a pool,
// add it to the output set
for (int i = 0; i < paths.length; i++) {
...
This results in computation of the order O(p*f) where p is the number of
partitions and f is the total number of files in all partitions.
For a case of 10,000 partitions with 10 files in each partition, this results
in 1,000,000,000 iterations.
We can replace this code with processing splits for one input path at a time
without passing any filters at all.
Do you happen to see a case where this approach will not work?
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira