Hi Anja,

As you noticed, Hive has only limited support for cost-based optimization. One of the reasons is that Hive used to have a very small number of alternative execution plans to choose from. One exception is mapjoin vs. common join. Liying Tang did some work during his last internship to convert common joins to mapjoins in a rule-based fashion. One item of his future work is to automatically convert common joins to mapjoins based on stats. There is also ongoing work on indexes in Hive. With support for indexes, CBO will be much needed.

In order for a decent CBO to work, we need stats and cost models. There is some work on stats: table/partition-level stats are already supported, and there is a JIRA open for column-level stats (HIVE-1362). The cost model is much more complex in a Hadoop environment and closely dependent on the mapjoin/index implementations. With all of these in place, we can then talk about plan enumeration etc.

So yes, we are interested in CBO, but it is a large area and many missing pieces still need to be filled in for Hive. If you have a particular interest in some area, you can propose your ideas on the hive-...@hive.apache.org mailing list, or even apply for an internship at FB if you would like to work closely with us.

Thanks,
Ning

On Jan 31, 2011, at 2:04 PM, Anja Gruenheid wrote:

> Hi!
>
> I'm a graduate student from Georgia Tech and I'm working with Hive for a
> research project. I am interested in query optimization and the Hive
> MetaStore in that context. Working through the documentation and code, I
> noticed that the implementation right now is using a rule-based optimization
> system. Therefore, I was wondering whether cost-based query optimization will
> be a future task in the development of Hive and if it would be possible for
> me to cooperate with the developers of Hive to advance the project in general.
>
> Best regards,
> Anja Gruenheid