[ 
https://issues.apache.org/jira/browse/HIVE-15339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15750971#comment-15750971
 ] 

Jesus Camacho Rodriguez commented on HIVE-15339:
------------------------------------------------

[~rajesh.balamohan], I was thinking about e.g. folding and transitively 
inferring the value of the join column, constant folding for some columns, etc. 
I do not know if there is such case in tpc-ds. Even if prefetching does not 
result in performance regression for those kind of queries (because batching 
improves over multiple trips to metastore), I think we can still improve this 
patch as we might retrieve stats for more columns than necessary at the very 
beginning of the planning phase (first call to the field trimmer is done in 
L1019 in CalcitePlanner).

In particular, I think we could improve it the following way. The optimizations 
in {{applyPreJoinOrderingTransforms}} should not need column stats. Thus, I 
would batch/prefetch the stats in the last call to the column trimmer in that 
method (L1328 in CalcitePlanner). For the rest of the calls to the trimmer, I 
would proceed as we used to do.
This would imply adding a boolean _fetchStats_ to HiveRelFieldTrimmer, which is 
_true_ when we create the trimmer in L1328. By default, it should be _false_. I 
think this will get us the best of both worlds: batching/prefetching, but once 
we know the minimum set of columns that we need from each table. Does this 
proposal make sense?

> Batch metastore calls to get column stats for fields needed in 
> FilterSelectivityEstimator
> -----------------------------------------------------------------------------------------
>
>                 Key: HIVE-15339
>                 URL: https://issues.apache.org/jira/browse/HIVE-15339
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Priority: Minor
>         Attachments: HIVE-15339.1.patch, HIVE-15339.3.patch
>
>
> Based on query pattern, {{FilterSelectivityEstimator}} gets column statistics 
> from metastore in multiple calls. For instance, in the following query, it 
> ends up getting individual column statistics for for flights multiple number 
> of times.
> When the table has large number of partitions, getting statistics for columns 
> via multiple calls can be very expensive. This would adversely impact the 
> overall compilation time. The following query took 14 seconds to compile.
> {noformat}
> SELECT COUNT(`flights`.`flightnum`) AS `cnt_flightnum_ok`,
> YEAR(`flights`.`dateofflight`) AS `yr_flightdate_ok`
> FROM `flights` as `flights`
> JOIN `airlines` ON (`flights`.`uniquecarrier` = `airlines`.`code`)
> JOIN `airports` as `source_airport` ON (`flights`.`origin` = 
> `source_airport`.`iata`)
> JOIN `airports` as `dest_airport` ON (`flights`.`dest` = 
> `dest_airport`.`iata`)
> GROUP BY YEAR(`flights`.`dateofflight`);
> {noformat}
> It may be helpful to club all columns that need statistics and fetch these 
> details in single remote call.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to