Re: Query Optimization in Hive

Ning Zhang Mon, 31 Jan 2011 19:53:24 -0800

Hi Anja,

As you noticed, Hive has only limited support for cost-based optimization 
(CBO). One of the reasons is that Hive used to have only a small number of 
alternative execution plans to choose from. One exception is mapjoin vs. 
common join. Liying Tang did some work during his last internship on 
converting common joins to mapjoins in a rule-based fashion. One of his 
future work items is to automatically convert common joins to mapjoins based 
on stats. There is also ongoing work on indexes in Hive. Once indexes are 
supported, CBO will be much needed.
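
As a rough illustration of what a stats-based conversion could look like, here 
is a minimal Python sketch. It is not Hive's implementation: the TableStats 
class, the choose_join function, and the 25 MB threshold are all hypothetical 
names and values made up for this example. The idea is simply that a join 
becomes a mapjoin when the smaller input is small enough to be held in memory 
by each mapper.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical cutoff for "small enough to broadcast"; not a real Hive setting.
SMALL_TABLE_BYTES = 25 * 1024 * 1024


@dataclass
class TableStats:
    """Table-level stats as a metastore might expose them (assumed shape)."""
    name: str
    total_bytes: int


def choose_join(left: TableStats, right: TableStats,
                small_table_bytes: int = SMALL_TABLE_BYTES
                ) -> Tuple[str, Optional[str]]:
    """Return (join_type, table_to_load_in_memory).

    Picks a mapjoin when the smaller side fits under the threshold,
    otherwise falls back to the common (shuffle) join.
    """
    smaller = min(left, right, key=lambda t: t.total_bytes)
    if smaller.total_bytes <= small_table_bytes:
        return ("mapjoin", smaller.name)
    return ("common_join", None)
```

With a large fact table and a tiny dimension table this picks a mapjoin on the 
dimension table; with two large tables it keeps the common join. A real CBO 
would compare estimated costs rather than use a fixed size cutoff.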

In order for a decent CBO to work, we need stats and cost models. There is 
some work under way on stats. Table/partition-level stats are already 
supported. There is a JIRA open for column-level stats (HIVE-1362). The cost 
model is much more complex in a Hadoop environment and closely dependent on 
the mapjoin/index implementations. Once all of these are in place, we can 
talk about plan enumeration etc.

So yes, we are interested in CBO, but it is a large area and many missing 
pieces still need to be filled in for Hive. If you have a particular interest 
in some area, you can propose your ideas on the hive-...@hive.apache.org 
mailing list or even apply for an internship at FB if you would like to work 
closely with us.

Thanks,
Ning

On Jan 31, 2011, at 2:04 PM, Anja Gruenheid wrote:

> Hi!
> 
> I'm a graduate student from Georgia Tech and I'm working with Hive for a 
> research project. I am interested in query optimization and the Hive 
> MetaStore in that context. Working through the documentation and code, I 
> noticed that the implementation right now is using a rule-based optimization 
> system. Therefore, I was wondering whether cost-based query optimization will 
> be a future task in the development of Hive and if it would be possible for 
> me to cooperate with the developers of Hive to advance the project in general.
> 
> Best regards,
> Anja Gruenheid
