Re: Hive optimizer

2016-02-04 Thread Ashok Kumar
Thank you for the link. 

On Thursday, 4 February 2016, 8:07, Lefty Leverenz 
 wrote:
 

 You can find Hive CBO information in Cost Based Optimizer in Hive.
-- Lefty

On Wed, Feb 3, 2016 at 11:48 AM, John Pullokkaran 
 wrote:

Its both.Some of the optimizations are rule based and some are cost based.
John
From: Ashok Kumar 
Reply-To: "user@hive.apache.org" , Ashok Kumar 

Date: Wednesday, February 3, 2016 at 11:45 AM
To: User 
Subject: Hive optimizer

  Hi,
Is Hive optimizer a cost based Optimizer (CBO) or a rule based optimizer (CBO) 
or none of them.
thanks



  

Re: Hive optimizer

2016-02-04 Thread Lefty Leverenz
You can find Hive CBO information in Cost Based Optimizer in Hive
<https://cwiki.apache.org/confluence/display/Hive/Cost-based+optimization+in+Hive>
.

-- Lefty


On Wed, Feb 3, 2016 at 11:48 AM, John Pullokkaran <
jpullokka...@hortonworks.com> wrote:

> Its both.
> Some of the optimizations are rule based and some are cost based.
>
> John
>
> From: Ashok Kumar 
> Reply-To: "user@hive.apache.org" , Ashok Kumar <
> ashok34...@yahoo.com>
> Date: Wednesday, February 3, 2016 at 11:45 AM
> To: User 
> Subject: Hive optimizer
>
>   Hi,
>
> Is Hive optimizer a cost based Optimizer (CBO) or a rule based optimizer
> (CBO) or none of them.
>
> thanks
>


Re: Hive optimizer

2016-02-03 Thread John Pullokkaran
Its both.
Some of the optimizations are rule based and some are cost based.

John

From: Ashok Kumar mailto:ashok34...@yahoo.com>>
Reply-To: "user@hive.apache.org<mailto:user@hive.apache.org>" 
mailto:user@hive.apache.org>>, Ashok Kumar 
mailto:ashok34...@yahoo.com>>
Date: Wednesday, February 3, 2016 at 11:45 AM
To: User mailto:user@hive.apache.org>>
Subject: Hive optimizer

  Hi,

Is Hive optimizer a cost based Optimizer (CBO) or a rule based optimizer (CBO) 
or none of them.

thanks


Hive optimizer

2016-02-03 Thread Ashok Kumar
  Hi,
Is Hive optimizer a cost based Optimizer (CBO) or a rule based optimizer (CBO) 
or none of them.
thanks

Re: Hive Optimizer Questions

2011-07-24 Thread Alexander Alexandrov
I am reposting a message from friday to the Hive developer group to broaden
the audience. Hopefully you won't mind the duplicate:

Hello Hive users,

I study informatics at the TU Berlin. As part of my master's thesis a couple
of months ago I conducted some experiments on Hive (version 0.7.0).
Unfortunately, I don't have access to some critical compile-time data, so I
was hoping you can help me make some sense of some numbers by filling the
gaps or pointing me to sources with information regarding the way the Hive
optimizer works.

For all generated datasets, I used a plain TEXTFILE storage format without a
table-space partitioning (that is, my data was stored in the HDFS inside
large logical CSV like files). I defined a set of minimalistic queries (up
to 4 logical operators working on 1-2 input relations) organized into
distinct subsets depending on the nature of the performed operations (e.g.
embarrassingly parallel queries, parallel aggregation queries, parallel
joins, etc.). For all queries, I performed experiments with increasing
selectivity factor to see how the system reacts to an increase in the amount
of processed data.

I will try to illustrate my questions with short abstract examples -
consider the following general query with a range based filter and an
additive aggregation function:

SELECT   x, AVG(y)
FROM A
WHEREz < {alpha}
GROUP BY x;

The group key column x has a relatively small number of distinct values,
i.e. there are few reduce groups with high number of records in each one of
them. I used increasing {alpha} values to range the selectivity of A from
0.4 to 1.0, but I don't see a corresponding increase in the amount of output
records emitted by the mappers. *Since the value of the
"COMBINE_OUTPUT_RECORDS" counter for all jobs is also zero, I assume that
you somehow push the combine logic inside the Hive-subplan executed by the
mapper - is that correct?*

Similar observations occur if I substitute the additive aggregation function
(AVG) with a built-in "holistic" one (PERCENTILE):

set hive.map.aggr.hash.percentmemory=0.05;
set mapred.child.java.opts=-Xmx24576m;

SELECT   x, PERCENTILE(y, array(0.25, 0.50, 0.75))
FROM A
WHEREz < {alpha}
GROUP BY x;

Again, log stats suggest that there is no Hadoop combiner (I can understand
that in this case), although the number of map output records is the same as
in the previous case. This suggests that somehow (maybe because there are
only 10 distinct groups) the aggregated y-values can be compacted into a
single record per group inside the mapper - I assume this is done by the
hashtable optimization -- *can someone briefly explain when and how exactly
the hash optimization in the PERCENTILE function works?*

I'll be very grateful to any sort of help!

Regards,
Alexander Alexandrov