[ https://issues.apache.org/jira/browse/HIVE-15339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15750971#comment-15750971 ]
Jesus Camacho Rodriguez commented on HIVE-15339: ------------------------------------------------ [~rajesh.balamohan], I was thinking about e.g. folding and transitively inferring the value of the join column, constant folding for some columns, etc. I do not know if there is such case in tpc-ds. Even if prefetching does not result in performance regression for those kind of queries (because batching improves over multiple trips to metastore), I think we can still improve this patch as we might retrieve stats for more columns than necessary at the very beginning of the planning phase (first call to the field trimmer is done in L1019 in CalcitePlanner). In particular, I think we could improve it the following way. The optimizations in {{applyPreJoinOrderingTransforms}} should not need column stats. Thus, I would batch/prefetch the stats in the last call to the column trimmer in that method (L1328 in CalcitePlanner). For the rest of the calls to the trimmer, I would proceed as we used to do. This would imply adding a boolean _fetchStats_ to HiveRelFieldTrimmer, which is _true_ when we create the trimmer in L1328. By default, it should be _false_. I think this will get us the best of both worlds: batching/prefetching, but once we know the minimum set of columns that we need from each table. Does this proposal make sense? > Batch metastore calls to get column stats for fields needed in > FilterSelectivityEstimator > ----------------------------------------------------------------------------------------- > > Key: HIVE-15339 > URL: https://issues.apache.org/jira/browse/HIVE-15339 > Project: Hive > Issue Type: Improvement > Reporter: Rajesh Balamohan > Priority: Minor > Attachments: HIVE-15339.1.patch, HIVE-15339.3.patch > > > Based on query pattern, {{FilterSelectivityEstimator}} gets column statistics > from metastore in multiple calls. For instance, in the following query, it > ends up getting individual column statistics for for flights multiple number > of times. > When the table has large number of partitions, getting statistics for columns > via multiple calls can be very expensive. This would adversely impact the > overall compilation time. The following query took 14 seconds to compile. > {noformat} > SELECT COUNT(`flights`.`flightnum`) AS `cnt_flightnum_ok`, > YEAR(`flights`.`dateofflight`) AS `yr_flightdate_ok` > FROM `flights` as `flights` > JOIN `airlines` ON (`flights`.`uniquecarrier` = `airlines`.`code`) > JOIN `airports` as `source_airport` ON (`flights`.`origin` = > `source_airport`.`iata`) > JOIN `airports` as `dest_airport` ON (`flights`.`dest` = > `dest_airport`.`iata`) > GROUP BY YEAR(`flights`.`dateofflight`); > {noformat} > It may be helpful to club all columns that need statistics and fetch these > details in single remote call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)