[
https://issues.apache.org/jira/browse/HIVE-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13133720#comment-13133720
]
Ashutosh Chauhan commented on HIVE-2088:
----------------------------------------
@Vaibhav,
Looks like this could be a very useful thing if we can make it work. If you got
a chance to experiment with this, it will be good to post your observations
here.
> Improve Hive Query split calculation performance over large partitions
> (potentially 5x speedup)
> -----------------------------------------------------------------------------------------------
>
> Key: HIVE-2088
> URL: https://issues.apache.org/jira/browse/HIVE-2088
> Project: Hive
> Issue Type: Improvement
> Reporter: Vaibhav Aggarwal
>
> While working on improving start up time for query spanning over large number
> of partitions I found a particular inefficiency in CombineHiveInputFormat.
> It seems that for each input partition we create a CombineFilter to make sure
> that files are combined within a particular partition.
> CombineHiveInputFormat:
> Path filterPath = path;
> if (!poolSet.contains(filterPath)) {
> LOG.info("CombineHiveInputSplit creating pool for " + path +
> "; using filter path " + filterPath);
> combine.createPool(job, new CombineFilter(filterPath));
> poolSet.add(filterPath);
> }
> These filters are passed then passed to CombineFileInputFormat along with all
> the input paths.
> CombineFileInputFormat computes a list of all the files in the input paths.
> It them loops over each filter and then checks whether a particular file
> belongs to a particular filter.
> ConbineFileInputFormat:
> for (MultiPathFilter onepool : pools) {
> ArrayList<Path> myPaths = new ArrayList<Path>();
>
> // pick one input path. If it matches all the filters in a pool,
> // add it to the output set
> for (int i = 0; i < paths.length; i++) {
> ...
> This results in computation of the order O(p*f) where p is the number of
> partitions and f is the total number of files in all partitions.
> For a case of 10,000 partitions with 10 files in each partition, this results
> in 1,000,000,000 iterations.
> We can replace this code with processing splits for one input path at a time
> without passing any filters at all.
> Do you happen to see a case where this approach will not work?
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira