Re: Hive optimizer

2016-02-03 Thread John Pullokkaran
It's both.
Some of the optimizations are rule-based and some are cost-based.
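As a toy illustration (outside Hive; all names here are illustrative, not Hive APIs), the difference between the two styles can be sketched like this: a rule-based optimizer fires a rewrite whenever its pattern matches, while a cost-based optimizer compares estimated costs of alternatives and keeps the cheapest.

```python
# Toy contrast between the two optimizer styles, sketched outside Hive.

def rule_based_rewrite(expr):
    """Rule-based: apply a rewrite whenever its pattern matches (no costing).
    Here the rule is the algebraic identity x * 1 -> x."""
    op, left, right = expr
    if op == "*" and right == 1:
        return left
    return expr

def cost_based_choice(alternatives, cost_of):
    """Cost-based: enumerate alternative plans and keep the cheapest."""
    return min(alternatives, key=cost_of)

print(rule_based_rewrite(("*", "col", 1)))        # rewritten to "col"
costs = {"hash_join": 10.0, "sort_merge_join": 25.0}
print(cost_based_choice(list(costs), costs.get))  # picks the cheaper plan
```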

John

From: Ashok Kumar <ashok34...@yahoo.com>
Reply-To: "user@hive.apache.org", Ashok Kumar <ashok34...@yahoo.com>
Date: Wednesday, February 3, 2016 at 11:45 AM
To: User <user@hive.apache.org>
Subject: Hive optimizer

  Hi,

Is the Hive optimizer a cost-based optimizer (CBO), a rule-based optimizer (RBO),
or neither?

thanks


Re: VolcanoPlanner vs HepPlanner

2015-10-07 Thread John Pullokkaran
Yes, we used VolcanoPlanner before, but we switched away in the Hive 0.14 time
frame (2014) due to its performance limitations.
I am not sure why you are after VolcanoPlanner, but yes, you can switch back to
VolcanoPlanner and probably get it to work.

I know this because I did most of those changes back in 2014.

HTH
John




From: Raajay Viswanathan <raaja...@gmail.com>
Reply-To: "user@hive.apache.org"
Date: Wednesday, October 7, 2015 at 10:11 AM
To: "user@hive.apache.org"
Subject: Re: VolcanoPlanner vs HepPlanner

my requirements. Internally, while enumerating the different join orders, it 
destroys "bad" query plans.


Re: VolcanoPlanner vs HepPlanner

2015-10-07 Thread John Pullokkaran
This would be a broad change.
HepPlanner does enumerate different join orders through "LoptOptimizeJoinRule".

Volcano planner is not used as it has scalability issues.

Thanks
John
From: Raajay <raaja...@gmail.com>
Reply-To: "user@hive.apache.org"
Date: Tuesday, October 6, 2015 at 2:06 PM
To: "user@hive.apache.org"
Subject: VolcanoPlanner vs HepPlanner

Hello -

While browsing the code in the "apply" function of CalcitePlannerAction
(CalcitePlanner.java), I see that most rules are passed to a HepPlanner to
arrive at the "least cost" operator tree.

The HiveVolcanoPlanner, although defined and initialized, is never invoked to
find the best operator tree. I want to change the planner from Hep to Volcano
for the following reason: to obtain a collection of costlier join orders.

Passing the rules used by the HepPlanner as-is to the HiveVolcanoPlanner does
not help. The query plan is never altered.

Any ideas what rules need to be passed to the HiveVolcanoPlanner for effective
CBO?



Thanks,
Raajay


Re: [ANNOUNCE] New Hive PMC Chair - Ashutosh Chauhan

2015-09-16 Thread John Pullokkaran
Congrats Ashutosh!

From: Vaibhav Gumashta <vgumas...@hortonworks.com>
Reply-To: "user@hive.apache.org"
Date: Wednesday, September 16, 2015 at 1:01 PM
To: "user@hive.apache.org", "d...@hive.apache.org"
Cc: Ashutosh Chauhan <hashut...@apache.org>
Subject: Re: [ANNOUNCE] New Hive PMC Chair - Ashutosh Chauhan

Congrats Ashutosh!

-Vaibhav

From: Prasanth Jayachandran <pjayachand...@hortonworks.com>
Reply-To: "user@hive.apache.org"
Date: Wednesday, September 16, 2015 at 12:50 PM
To: "d...@hive.apache.org", "user@hive.apache.org"
Cc: "d...@hive.apache.org", Ashutosh Chauhan <hashut...@apache.org>
Subject: Re: [ANNOUNCE] New Hive PMC Chair - Ashutosh Chauhan

Congratulations Ashutosh!





On Wed, Sep 16, 2015 at 12:48 PM -0700, "Xuefu Zhang" <xzh...@cloudera.com> wrote:

Congratulations, Ashutosh! Well deserved.

Thanks to Carl also for the hard work in the past few years!

--Xuefu

On Wed, Sep 16, 2015 at 12:39 PM, Carl Steinbach <c...@apache.org> wrote:

> I am very happy to announce that Ashutosh Chauhan is taking over as the
> new VP of the Apache Hive project. Ashutosh has been a longtime contributor
> to Hive and has played a pivotal role in many of the major advances that
> have been made over the past couple of years. Please join me in
> congratulating Ashutosh on his new role!
>


Re: CBO - get cost of the plan

2015-08-25 Thread John Pullokkaran
#1 The row count estimate for "tableA" inner join "tableC"
This depends on the selectivity of the join.
The formula is Cardinality(A) * Cardinality(C) * Selectivity
We do have logic to infer the PK-FK relationship based on cardinality & NDV.

#2 What is the definition of cumulative cost?
This is the total cost, bottom up (including the current operator).
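For illustration, the two answers above can be sketched in a few lines of Python (a sketch, not Hive code; the selectivity value here is a hypothetical constant, whereas Hive derives it from column stats such as the NDV of the join keys):

```python
# Sketch of the join-cardinality estimate described above:
#   rows(A join C) = rows(A) * rows(C) * selectivity
# and of "cumulative cost" as the bottom-up sum of per-operator costs.

def join_rowcount(rows_a, rows_c, selectivity):
    """Estimated output cardinality of A join C."""
    return rows_a * rows_c * selectivity

def cumulative_cost(op_cost, child_costs):
    """Total cost bottom up: the operator's own cost plus its children's."""
    return op_cost + sum(child_costs)

# With no NDV information, a selectivity of 1.0 degenerates to the
# cross-product upper bound:
print(join_rowcount(700.0, 721.0, 1.0))   # 504700.0

# Cumulative cost of a join whose own cost is ~2.48e7 rows and whose two
# inputs accumulated 1.4e7 and 0.0 rows (mirrors the plan printout below):
print(cumulative_cost(2.4838122487221785e7, [1.4e7, 0.0]))
```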

Thanks
John

From: Raajay <raaja...@gmail.com>
Reply-To: "user@hive.apache.org"
Date: Monday, August 24, 2015 at 8:09 PM
To: "user@hive.apache.org"
Subject: Re: CBO - get cost of the plan

Ah okay. Thanks a lot! Now I can get non-default values after collecting table 
level stats.


"CalciteOptimizedPlan" looks like this:

HiveProject(a_day=[$4], a_product=[$5], b_alternate=[$2], total_sales=[+(+($6, 
$3), $9)]): rowcount = 6.112516920555744E9, cumulative cost = 
{3.8838122487221785E7 rows, 0.0 cpu, 0.0 io}, id = 155
  HiveJoin(condition=[AND(=($0, $7), =($2, $8))], joinType=[inner], 
algorithm=[none], cost=[{2.4838122487221785E7 rows, 0.0 cpu, 0.0 io}]): 
rowcount = 6.112516920555744E9, cumulative cost = {3.8838122487221785E7 rows, 
0.0 cpu, 0.0 io}, id = 153
HiveJoin(condition=[AND(=($4, $0), =($5, $1))], joinType=[inner], 
algorithm=[none], cost=[{1.4E7 rows, 0.0 cpu, 0.0 io}]): rowcount = 
1.7628122487221785E7, cumulative cost = {1.4E7 rows, 0.0 cpu, 0.0 io}, id = 148
  HiveProject(b_day=[$0], b_product=[$1], b_alternate=[$2], b_sales=[$3]): 
rowcount = 700.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 143
HiveTableScan(table=[[default.tableb]]): rowcount = 700.0, 
cumulative cost = {0}, id = 44
  HiveProject(a_day=[$0], a_product=[$1], a_sales=[$3]): rowcount = 
700.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 146
HiveTableScan(table=[[default.tablea]]): rowcount = 700.0, 
cumulative cost = {0}, id = 42
HiveProject(c_day=[$0], c_alternate=[$2], c_sales=[$3]): rowcount = 
721.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 151
  HiveTableScan(table=[[default.tablec]]): rowcount = 721.0, cumulative 
cost = {0}, id = 47


The row count estimate for  "tableA" inner join "tableC" is higher than 
expected. Ideally, the size of the join should be less than the cardinality of 
the smallest table, because the keys over which we join are distinct in each 
table. But here the estimates are

|tableA| = |tableB| = 7E6 and |tableA join tableB| = 1.76E7 (the values shown
in the log snippet above)

Should I be able to explicitly specify this somewhere, so that stats gathering
is accurate?

Also, what is the definition of cumulative cost?

Thanks for the help,
Raajay

On Mon, Aug 24, 2015 at 8:51 PM, John Pullokkaran <jpullokka...@hortonworks.com> wrote:
From the text below it seems like you are not collecting table-level stats.
You can collect table-level stats as follows:
analyze table <table_name> compute statistics;


Thanks
John



On 8/24/15, 6:24 PM, "Raajay" <raaja...@gmail.com> wrote:

>Hi John,
>
>I am on Hive 2.0.0. I forked off the Hive master branch two weeks back
>(commit id: 763cb02b5eafb0ecd3fd0eb512636a1b092df671).
>
>I actually have "analyze" before I execute the query. I left it out for
>brevity. Please find the entire query (sent to hive in a file) below.
>Without the analyze commands, I find that CBO optimization is ignored as
>expected. Perhaps I am missing some configuration.
>
>I print out the calcite optimized plans, using the "RelOptUtil.toString()"
>helper on "calciteOptimizedPlan" at the end of "apply" function in
>CalcitePlannerAction.
>
>
>- Raajay
>
>
>
>Query
>=
>
>-- Set the hive configuration
>
>-- clear out the existing tables
>DROP TABLE tableA;
>DROP TABLE tableB;
>DROP TABLE tableC;
>DROP TABLE output_tab;
>
>-- create the tables and load the data
>create external table tableA (a_day int, a_product string, a_alternate
>string, a_sales int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED
>AS TEXTFILE;
>LOAD DATA LOCAL INPATH
>'${hiveconf:CODE_DIR}/data/test/tableA_sc1000_nd7.txt' OVERWRITE INTO
>TABLE
>tableA;
>
>create external table tableB (b_day int, b_product string, b_alternate
>string, b_sales int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED
>AS TEXTFILE;
>LOAD DATA LOCAL INPATH
>'${hiveconf:CODE_DIR}/data/test/tableB_sc1000_nd7.txt' OVERWRITE INTO
>TABLE
>tableB;
>
>create external table tableC (c_day int, c_product string, c_alternate
>string, c_sales int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED
>AS TEXTFILE;
>LOAD DATA LOCAL I

Re: CBO - get cost of the plan

2015-08-24 Thread John Pullokkaran
From the text below it seems like you are not collecting table-level stats.
You can collect table-level stats as follows:
analyze table <table_name> compute statistics;


Thanks
John



On 8/24/15, 6:24 PM, "Raajay" <raaja...@gmail.com> wrote:

>Hi John,
>
>I am on Hive 2.0.0. I forked off the Hive master branch two weeks back
>(commit id: 763cb02b5eafb0ecd3fd0eb512636a1b092df671).
>
>I actually have "analyze" before I execute the query. I left it out for
>brevity. Please find the entire query (sent to hive in a file) below.
>Without the analyze commands, I find that CBO optimization is ignored as
>expected. Perhaps I am missing some configuration.
>
>I print out the calcite optimized plans, using the "RelOptUtil.toString()"
>helper on "calciteOptimizedPlan" at the end of "apply" function in
>CalcitePlannerAction.
>
>
>- Raajay
>
>
>
>Query
>=
>
>-- Set the hive configuration
>
>-- clear out the existing tables
>DROP TABLE tableA;
>DROP TABLE tableB;
>DROP TABLE tableC;
>DROP TABLE output_tab;
>
>-- create the tables and load the data
>create external table tableA (a_day int, a_product string, a_alternate
>string, a_sales int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED
>AS TEXTFILE;
>LOAD DATA LOCAL INPATH
>'${hiveconf:CODE_DIR}/data/test/tableA_sc1000_nd7.txt' OVERWRITE INTO
>TABLE
>tableA;
>
>create external table tableB (b_day int, b_product string, b_alternate
>string, b_sales int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED
>AS TEXTFILE;
>LOAD DATA LOCAL INPATH
>'${hiveconf:CODE_DIR}/data/test/tableB_sc1000_nd7.txt' OVERWRITE INTO
>TABLE
>tableB;
>
>create external table tableC (c_day int, c_product string, c_alternate
>string, c_sales int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED
>AS TEXTFILE;
>LOAD DATA LOCAL INPATH
>'${hiveconf:CODE_DIR}/data/test/tableC_sc1000_nd7.txt' OVERWRITE INTO
>TABLE
>tableC;
>
>-- compute statistics to be used by calcite CBO
>analyze table tableA compute statistics for columns;
>analyze table tableB compute statistics for columns;
>analyze table tableC compute statistics for columns;
>
>-- create output tables
>create table output_tab (a_day int, a_product string, a_alternate string,
>total_sales int);
>
>-- the query
>insert overwrite table output_tab
>select
>a_day, a_product, b_alternate, (a_sales + b_sales + c_sales) as
>total_sales
>from
>tableA a join tableB b
>on a.a_day = b.b_day and a.a_product = b.b_product
>join tableC c
>on b.b_day = c.c_day and b.b_alternate = c.c_alternate;
>
>
>On Mon, Aug 24, 2015 at 7:25 PM, John Pullokkaran <jpullokka...@hortonworks.com> wrote:
>
>> In addition to col stats you also need table stats.
>>
>> From: John Pullokkaran 
>> Reply-To: "user@hive.apache.org" 
>> Date: Monday, August 24, 2015 at 5:23 PM
>> To: "user@hive.apache.org" 
>> Cc: "d...@hive.apache.org" 
>> Subject: Re: CBO - get cost of the plan
>>
>> Raajay,
>>
>> You don't have col stats, hence it assumes 1 for row count.
>> What version of Hive are you on?
>>
>> Thanks
>> John
>>
>> From: Raajay 
>> Reply-To: "user@hive.apache.org" 
>> Date: Monday, August 24, 2015 at 5:19 PM
>> To: "user@hive.apache.org" 
>> Cc: "d...@hive.apache.org" 
>> Subject: CBO - get cost of the plan
>>
>> Hello,
>>
>> I am interested to get the cost of the query plans as calculated by the
>> CBO. How can I get that information ? For example, consider a query
>>with a
>> three way join of the following form:
>>
>> Query
>> =
>>
>> insert overwrite table output_tab
>> select
>> a_day, a_product, b_alternate, (a_sales + b_sales + c_sales) as
>>total_sales
>> from
>> tableA a join tableB b
>> on a.a_day = b.b_day and a.a_product = b.b_product
>> join tableC c
>> on b.b_day = c.c_day and b.b_alternate = c.c_alternate;
>>
>>
>> The number of rows for tableA, tableB, and tableC are of the order of
>> 1. I believe, that by "analyzing columns" of all the tables Hive
>>will
>> have statistics regarding the number of rows, distinct values, etc.
>> However, when I try to print out the operator tree as determined by the
>> CalcitePlanner, I get the following output.
>>
>> Print out of the Operator Tree
>> ==
>>
>> HiveProject(a_day=[$4], a_product=[$5], b_alternate=[$2],
>> total_sales=[+(+($6, $3), $9)]): rowcount 

Re: CBO - get cost of the plan

2015-08-24 Thread John Pullokkaran
In addition to col stats you also need table stats.

From: John Pullokkaran <jpullokka...@hortonworks.com>
Reply-To: "user@hive.apache.org"
Date: Monday, August 24, 2015 at 5:23 PM
To: "user@hive.apache.org"
Cc: "d...@hive.apache.org"
Subject: Re: CBO - get cost of the plan

Raajay,

You don't have col stats, hence it assumes 1 for row count.
What version of Hive are you on?

Thanks
John

From: Raajay <raaja...@gmail.com>
Reply-To: "user@hive.apache.org"
Date: Monday, August 24, 2015 at 5:19 PM
To: "user@hive.apache.org"
Cc: "d...@hive.apache.org"
Subject: CBO - get cost of the plan

Hello,

I am interested in getting the cost of the query plans as calculated by the CBO.
How can I get that information? For example, consider a query with a three-way
join of the following form:

Query
=

insert overwrite table output_tab
select
a_day, a_product, b_alternate, (a_sales + b_sales + c_sales) as total_sales
from
tableA a join tableB b
on a.a_day = b.b_day and a.a_product = b.b_product
join tableC c
on b.b_day = c.c_day and b.b_alternate = c.c_alternate;


The number of rows for tableA, tableB, and tableC are of the order of 1. I
believe that by "analyzing columns" of all the tables, Hive will have
statistics regarding the number of rows, distinct values, etc. However, when I
try to print out the operator tree as determined by the CalcitePlanner, I get 
the following output.

Print out of the Operator Tree
==

HiveProject(a_day=[$4], a_product=[$5], b_alternate=[$2], total_sales=[+(+($6, 
$3), $9)]): rowcount = 1.0, cumulative cost = {4.0 rows, 0.0 cpu, 0.0 io}, id = 
150
  HiveJoin(condition=[AND(=($0, $7), =($2, $8))], joinType=[inner], 
algorithm=[none], cost=[{2.0 rows, 0.0 cpu, 0.0 io}]): rowcount = 1.0, 
cumulative cost = {4.0 rows, 0.0 cpu, 0.0 io}, id = 148
HiveJoin(condition=[AND(=($4, $0), =($5, $1))], joinType=[inner], 
algorithm=[none], cost=[{2.0 rows, 0.0 cpu, 0.0 io}]): rowcount = 1.0, 
cumulative cost = {2.0 rows, 0.0 cpu, 0.0 io}, id = 143
  HiveProject(b_day=[$0], b_product=[$1], b_alternate=[$2], b_sales=[$3]): 
rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 138
HiveTableScan(table=[[default.tableb]]): rowcount = 1.0, cumulative 
cost = {0}, id = 44
  HiveProject(a_day=[$0], a_product=[$1], a_sales=[$3]): rowcount = 1.0, 
cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 141
HiveTableScan(table=[[default.tablea]]): rowcount = 1.0, cumulative 
cost = {0}, id = 42
HiveProject(c_day=[$0], c_alternate=[$2], c_sales=[$3]): rowcount = 1.0, 
cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 146
  HiveTableScan(table=[[default.tablec]]): rowcount = 1.0, cumulative cost 
= {0}, id = 47


The number of rows as displayed here is 1.0, which is clearly not the correct 
value.

- Raajay.




Re: CBO - get cost of the plan

2015-08-24 Thread John Pullokkaran
Raajay,

You don't have col stats, hence it assumes 1 for row count.
What version of Hive are you on?

Thanks
John

From: Raajay <raaja...@gmail.com>
Reply-To: "user@hive.apache.org"
Date: Monday, August 24, 2015 at 5:19 PM
To: "user@hive.apache.org"
Cc: "d...@hive.apache.org"
Subject: CBO - get cost of the plan

Hello,

I am interested in getting the cost of the query plans as calculated by the CBO.
How can I get that information? For example, consider a query with a three-way
join of the following form:

Query
=

insert overwrite table output_tab
select
a_day, a_product, b_alternate, (a_sales + b_sales + c_sales) as total_sales
from
tableA a join tableB b
on a.a_day = b.b_day and a.a_product = b.b_product
join tableC c
on b.b_day = c.c_day and b.b_alternate = c.c_alternate;


The number of rows for tableA, tableB, and tableC are of the order of 1. I
believe that by "analyzing columns" of all the tables, Hive will have
statistics regarding the number of rows, distinct values, etc. However, when I
try to print out the operator tree as determined by the CalcitePlanner, I get 
the following output.

Print out of the Operator Tree
==

HiveProject(a_day=[$4], a_product=[$5], b_alternate=[$2], total_sales=[+(+($6, 
$3), $9)]): rowcount = 1.0, cumulative cost = {4.0 rows, 0.0 cpu, 0.0 io}, id = 
150
  HiveJoin(condition=[AND(=($0, $7), =($2, $8))], joinType=[inner], 
algorithm=[none], cost=[{2.0 rows, 0.0 cpu, 0.0 io}]): rowcount = 1.0, 
cumulative cost = {4.0 rows, 0.0 cpu, 0.0 io}, id = 148
HiveJoin(condition=[AND(=($4, $0), =($5, $1))], joinType=[inner], 
algorithm=[none], cost=[{2.0 rows, 0.0 cpu, 0.0 io}]): rowcount = 1.0, 
cumulative cost = {2.0 rows, 0.0 cpu, 0.0 io}, id = 143
  HiveProject(b_day=[$0], b_product=[$1], b_alternate=[$2], b_sales=[$3]): 
rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 138
HiveTableScan(table=[[default.tableb]]): rowcount = 1.0, cumulative 
cost = {0}, id = 44
  HiveProject(a_day=[$0], a_product=[$1], a_sales=[$3]): rowcount = 1.0, 
cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 141
HiveTableScan(table=[[default.tablea]]): rowcount = 1.0, cumulative 
cost = {0}, id = 42
HiveProject(c_day=[$0], c_alternate=[$2], c_sales=[$3]): rowcount = 1.0, 
cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 146
  HiveTableScan(table=[[default.tablec]]): rowcount = 1.0, cumulative cost 
= {0}, id = 47


The number of rows as displayed here is 1.0, which is clearly not the correct 
value.

- Raajay.




Re: Hive CBO - Calcite Interface

2015-08-14 Thread John Pullokkaran
Hi Raajay,

#1 No, there is no API for this.
#2 If you enable logging (BaseSemanticAnalyzer), CalcitePlanner will print out
the plan with cost.

John

From: Raajay <raaja...@gmail.com>
Reply-To: "user@hive.apache.org"
Date: Monday, August 10, 2015 at 8:48 AM
To: "user@hive.apache.org", "d...@hive.apache.org"
Subject: Hive CBO - Calcite Interface

Is there an interface for Hive to get the absolute cost (based on the Hive Cost
Factory) of an operator tree returned by Calcite?


Re: Hive indexing optimization

2015-06-30 Thread John Pullokkaran
The index doesn't seem to be kicking in in this case.
Please file a bug for this.

Thanks
John

From: Bennie Leo <tben...@hotmail.com>
Reply-To: "user@hive.apache.org"
Date: Monday, June 29, 2015 at 5:25 PM
To: "user@hive.apache.org"
Subject: RE: Hive indexing optimization


I've attached the output. Thanks.


B




Subject: Re: Hive indexing optimization
From: jpullokka...@hortonworks.com
To: user@hive.apache.org
Date: Mon, 29 Jun 2015 19:17:44 +

Could you post explain extended output?

From: Bennie Leo <tben...@hotmail.com>
Reply-To: "user@hive.apache.org"
Date: Monday, June 29, 2015 at 10:35 AM
To: "user@hive.apache.org"
Subject: RE: Hive indexing optimization

Here is the explain output:

STAGE PLANS:
  Stage: Stage-1
Tez
  Edges:
Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE)
  Vertices:
Map 1
Map Operator Tree:
TableScan
  alias: logontable
  filterExpr: isipv4(ip) (type: boolean)
  Statistics: Num rows: 0 Data size: 550 Basic stats: PARTIAL 
Column stats: NONE
  Filter Operator
predicate: isipv4(ip) (type: boolean)
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
Column stats: NONE
Select Operator
  expressions: ip (type: bigint)
  outputColumnNames: _col0
  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
Column stats: NONE
  Reduce Output Operator
sort order:
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
Column stats: NONE
value expressions: _col0 (type: bigint)
Map 3
Map Operator Tree:
TableScan
  alias: ipv4geotable
  Statistics: Num rows: 41641243 Data size: 5144651200 Basic 
stats: COMPLETE Column stats: NONE
  Select Operator
expressions: startip (type: bigint), endip (type: bigint), 
country (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 41641243 Data size: 5144651200 Basic 
stats: COMPLETE Column stats: NONE
Reduce Output Operator
  sort order:
  Statistics: Num rows: 41641243 Data size: 5144651200 
Basic stats: COMPLETE Column stats: NONE
  value expressions: _col0 (type: bigint), _col1 (type: 
bigint), _col2 (type: string)
Reducer 2
Reduce Operator Tree:
  Join Operator
condition map:
 Left Outer Join0 to 1
condition expressions:
  0 {VALUE._col0}
  1 {VALUE._col0} {VALUE._col1} {VALUE._col2}
filter predicates:
  0 {isipv4(VALUE._col0)}
  1
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 43281312 Data size: 5020632576 Basic 
stats: COMPLETE Column stats: NONE
Filter Operator
  predicate: ((_col1 <= _col0) and (_col0 <= _col2)) (type: 
boolean)
  Statistics: Num rows: 5209034 Data size: 497855986 Basic 
stats: COMPLETE Column stats: NONE
  Select Operator
expressions: _col0 (type: bigint), _col3 (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 5209034 Data size: 497855986 Basic 
stats: COMPLETE Column stats: NONE
File Output Operator
  compressed: false
  Statistics: Num rows: 5209034 Data size: 497855986 Basic 
stats: COMPLETE Column stats: NONE
  table:
  input format: org.apache.hadoop.mapred.TextInputFormat
  output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
  serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
Fetch Operator
  limit: -1

Thank you,
B

> Subject: Re: Hive indexing optimization
> From: jpullokka...@hortonworks.com
> To: user@hive.apache.org
> CC: tben...@hotmail.com
> Date: Sat, 27 Jun 2015 16:02:08 +
>
> "SELECT StartIp, EndIp, Country FROM ipv4geotable" should have been
> rewritten as a scan against index table.
>
> BitMap Indexes seems to support inequalities (<=, 

Re: Hive indexing optimization

2015-06-29 Thread John Pullokkaran
Could you post explain extended output?

From: Bennie Leo <tben...@hotmail.com>
Reply-To: "user@hive.apache.org"
Date: Monday, June 29, 2015 at 10:35 AM
To: "user@hive.apache.org"
Subject: RE: Hive indexing optimization

Here is the explain output:

STAGE PLANS:
  Stage: Stage-1
Tez
  Edges:
Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE)
  Vertices:
Map 1
Map Operator Tree:
TableScan
  alias: logontable
  filterExpr: isipv4(ip) (type: boolean)
  Statistics: Num rows: 0 Data size: 550 Basic stats: PARTIAL 
Column stats: NONE
  Filter Operator
predicate: isipv4(ip) (type: boolean)
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
Column stats: NONE
Select Operator
  expressions: ip (type: bigint)
  outputColumnNames: _col0
  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
Column stats: NONE
  Reduce Output Operator
sort order:
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
Column stats: NONE
value expressions: _col0 (type: bigint)
Map 3
Map Operator Tree:
TableScan
  alias: ipv4geotable
  Statistics: Num rows: 41641243 Data size: 5144651200 Basic 
stats: COMPLETE Column stats: NONE
  Select Operator
expressions: startip (type: bigint), endip (type: bigint), 
country (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 41641243 Data size: 5144651200 Basic 
stats: COMPLETE Column stats: NONE
Reduce Output Operator
  sort order:
  Statistics: Num rows: 41641243 Data size: 5144651200 
Basic stats: COMPLETE Column stats: NONE
  value expressions: _col0 (type: bigint), _col1 (type: 
bigint), _col2 (type: string)
Reducer 2
Reduce Operator Tree:
  Join Operator
condition map:
 Left Outer Join0 to 1
condition expressions:
  0 {VALUE._col0}
  1 {VALUE._col0} {VALUE._col1} {VALUE._col2}
filter predicates:
  0 {isipv4(VALUE._col0)}
  1
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 43281312 Data size: 5020632576 Basic 
stats: COMPLETE Column stats: NONE
Filter Operator
  predicate: ((_col1 <= _col0) and (_col0 <= _col2)) (type: 
boolean)
  Statistics: Num rows: 5209034 Data size: 497855986 Basic 
stats: COMPLETE Column stats: NONE
  Select Operator
expressions: _col0 (type: bigint), _col3 (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 5209034 Data size: 497855986 Basic 
stats: COMPLETE Column stats: NONE
File Output Operator
  compressed: false
  Statistics: Num rows: 5209034 Data size: 497855986 Basic 
stats: COMPLETE Column stats: NONE
  table:
  input format: org.apache.hadoop.mapred.TextInputFormat
  output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
  serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
Fetch Operator
  limit: -1

Thank you,
B

> Subject: Re: Hive indexing optimization
> From: jpullokka...@hortonworks.com
> To: user@hive.apache.org
> CC: tben...@hotmail.com
> Date: Sat, 27 Jun 2015 16:02:08 +
>
> "SELECT StartIp, EndIp, Country FROM ipv4geotable" should have been
> rewritten as a scan against index table.
>
> BitMap indexes seem to support inequalities (<=, <, >=).
>
> Post the explain plan.
>
> On 6/26/15, 8:56 PM, "Gopal Vijayaraghavan" <gop...@apache.org> wrote:
>
> >Hi,
> >
> >Hive indexes won't really help you speed up that query right now, because
> >of the plan it generates due to the <= clauses.
> >
> >> CREATE TABLE ipv4table
> >> AS
> >> SELECT logon.IP, ipv4.Country
> >> FROM
> >> (SELECT * FROM logontable WHERE isIpv4(IP)) logon
> >> LEFT OUTER JOIN
> >> (SELECT StartIp, EndIp, Country FROM ipv4geotable) ipv4 ON
> >> isIpv4(logon.IP)
> >> WHERE ipv4.StartIp <= logon.IP AND logon.IP <= ipv4.EndIp;
> >
> >That's a cross-product join, which can't be distributed at all & will take
> >forever, even if you 

Re: Hive indexing optimization

2015-06-27 Thread John Pullokkaran
"SELECT StartIp, EndIp, Country FROM ipv4geotable" should have been
rewritten as a scan against index table.

BitMap indexes seem to support inequalities (<=, <, >=).

Post the explain plan.

On 6/26/15, 8:56 PM, "Gopal Vijayaraghavan" <gop...@apache.org> wrote:

>Hi,
>
>Hive indexes won't really help you speed up that query right now, because
>of the plan it generates due to the <= clauses.
>
>> CREATE TABLE ipv4table
>> AS
>> SELECT logon.IP, ipv4.Country
>> FROM
>> (SELECT * FROM logontable WHERE isIpv4(IP)) logon
>> LEFT OUTER JOIN
>> (SELECT StartIp, EndIp, Country FROM ipv4geotable) ipv4 ON
>> isIpv4(logon.IP)
>> WHERE ipv4.StartIp <= logon.IP AND logon.IP <= ipv4.EndIp;
>
>That's a cross-product join, which can't be distributed at all & will take
>forever, even if you use Tez/LLAP.
>
>Range join queries have partial distribution rewrites, but AFAIK none of
>them use filter indexes in Hive.
>
>But before I suggest a rewrite, can you post the "explain <query>;" for
>the above query, so that I can check if this is indeed producing a
>cross-product + filter?
>
>Cheers,
>Gopal



Re: Cost based optimization

2015-06-26 Thread John Pullokkaran
Thanks for your interest.
Please look into CalcitePlanner.java and the
org.apache.hadoop.hive.ql.optimizer.calcite.* packages for the source files.

Thanks
John

From: Raajay <raaja...@gmail.com>
Reply-To: "user@hive.apache.org"
Date: Thursday, June 25, 2015 at 11:59 PM
To: "user@hive.apache.org"
Subject: Re: Cost based optimization

Awesome! Thanks John.

I would be grateful if you could point me to the files in the source code that
are primarily responsible for query planning.

Thanks,
Raajay

On Thu, Jun 25, 2015 at 4:45 PM, John Pullokkaran <jpullokka...@hortonworks.com> wrote:
Hive does look into alternate join orders and picks the best plan that
minimizes cost.
It uses a greedy algorithm to enumerate the plan space.
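A minimal sketch of what a greedy join-order enumeration can look like (a toy outside Hive, not Hive's actual implementation; the cardinalities and the `sel` selectivity function below are hypothetical): at each step, merge the pair of relations whose join has the lowest estimated cardinality.

```python
# Toy greedy join-order enumeration: repeatedly merge the pair of
# relations with the smallest estimated join cardinality.
from itertools import combinations

def greedy_join_order(card, sel):
    """card: {name: rowcount}. sel(a, b): join selectivity for joining the
    relation sets a and b (frozensets of base-table names)."""
    rels = {frozenset([n]): float(r) for n, r in card.items()}
    order = []
    while len(rels) > 1:
        best = None
        for (a, ra), (b, rb) in combinations(list(rels.items()), 2):
            cost = ra * rb * sel(a, b)       # rows(a) * rows(b) * selectivity
            if best is None or cost < best[0]:
                best = (cost, a, b)
        cost, a, b = best
        order.append((sorted(a), sorted(b)))  # record the chosen join
        del rels[a], rels[b]
        rels[a | b] = cost                    # merged relation, estimated rows
    return order

# Hypothetical cardinalities and selectivities:
card = {"A": 700, "B": 700, "C": 721}
def sel(a, b):
    # joins touching both A and B are selective; anything else is a cross product
    return 1e-5 if {"A", "B"} <= (a | b) else 1.0

plan = greedy_join_order(card, sel)
print(plan)  # joins A with B first (cheapest), then the result with C
```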

Thanks
John

From: Raajay <raaja...@gmail.com>
Reply-To: "user@hive.apache.org"
Date: Thursday, June 25, 2015 at 2:30 PM
To: "user@hive.apache.org"
Subject: Cost based optimization

Hello Everyone,

A quick question on the cost-based optimization module in Hive. Does the latest 
version support query plan generation with alternate join orders?

Thanks
Raajay



Re: Hive indexing optimization

2015-06-25 Thread John Pullokkaran
Set hive.optimize.index.filter=true;

Thanks
John

From: Bennie Leo <tben...@hotmail.com>
Reply-To: "user@hive.apache.org"
Date: Thursday, June 25, 2015 at 5:48 PM
To: "user@hive.apache.org"
Subject: Hive indexing optimization


Hi,

I am attempting to optimize a query using indexing. My current query converts 
an ipv4 address to a country using a geolocation table. However, the 
geolocation table is fairly large and the query takes an impractical amount of 
time. I have created indexes and set the binary search parameter to true 
(default), but the query is not faster.

Here is how I set up indexing:

DROP INDEX IF EXISTS ipv4indexes ON ipv4geotable;
CREATE INDEX ipv4indexes
ON TABLE ipv4geotable (StartIp, EndIp)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD
IDXPROPERTIES ('hive.index.compact.binary.search'='true');

ALTER INDEX ipv4indexes ON ipv4geotable REBUILD;

And here is my query:

DROP TABLE IF EXISTS ipv4table;
CREATE TABLE ipv4table AS
SELECT logon.IP, ipv4.Country
FROM
(SELECT * FROM logontable WHERE isIpv4(IP)) logon
LEFT OUTER JOIN
(SELECT StartIp, EndIp, Country FROM ipv4geotable) ipv4 ON isIpv4(logon.IP)
WHERE ipv4.StartIp <= logon.IP AND logon.IP <= ipv4.EndIp;

What the query is doing is extracting an IP from logontable and finding in 
which range it lies within the geolocation table (which is sorted). When a 
range is found, the corresponding country is returned. The problem is that Hive 
goes through the whole table row by row rather than performing a smart search
(e.g., a binary search).
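For what it's worth, when the ranges are sorted and non-overlapping, the row-by-row scan described above can indeed be replaced by a binary search; a sketch outside Hive (with hypothetical geolocation rows) looks like:

```python
import bisect

# Binary search over sorted, non-overlapping (start_ip, end_ip, country)
# ranges -- the "smart search" asked about above, sketched outside Hive.
ranges = [
    (16777216, 16777471, "AU"),   # hypothetical geolocation rows,
    (16777472, 16778239, "CN"),   # sorted by start_ip
    (16778240, 16779263, "AU"),
]
starts = [r[0] for r in ranges]

def lookup(ip):
    """Return the country whose [start, end] range contains ip, else None."""
    i = bisect.bisect_right(starts, ip) - 1   # last range starting <= ip
    if i >= 0 and ranges[i][0] <= ip <= ranges[i][1]:
        return ranges[i][2]
    return None

print(lookup(16777473))   # falls in the second range -> "CN"
print(lookup(5))          # before every range -> None
```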

Any suggestions on how to speed things up?

Thank you,
B


Re: Cost based optimization

2015-06-25 Thread John Pullokkaran
Hive does look into alternate join orders and picks the best plan that
minimizes cost.
It uses a greedy algorithm to enumerate the plan space.

Thanks
John

From: Raajay <raaja...@gmail.com>
Reply-To: "user@hive.apache.org"
Date: Thursday, June 25, 2015 at 2:30 PM
To: "user@hive.apache.org"
Subject: Cost based optimization

Hello Everyone,

A quick question on the cost-based optimization module in Hive. Does the latest 
version support query plan generation with alternate join orders?

Thanks
Raajay


Re: Orc file and Hive Optimiser

2015-04-19 Thread John Pullokkaran
If you wish to contribute to CBO, there is a CBO branch on which current dev 
work is being done.
Current dev work is captured by 
HIVE-9132<https://issues.apache.org/jira/browse/HIVE-9132>.

Looking forward to your contributions.

Thanks
John

From: Mich Talebzadeh <m...@peridale.co.uk>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>
Date: Sunday, April 19, 2015 at 12:48 PM
To: "user@hive.apache.org" <user@hive.apache.org>
Subject: RE: Orc file and Hive Optimiser

Thanks John,

I have already registered my interest in development work for Hive, so 
hopefully I may be able to contribute at some level.

Regards,


Mich Talebzadeh

http://talebzadehmich.wordpress.com

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7.
co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4
Publications due shortly:
Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and 
Coherence Cache
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Ltd, its subsidiaries 
or their employees, unless expressly so stated. It is the responsibility of the 
recipient to ensure that this email is virus free, therefore neither Peridale 
Ltd, its subsidiaries nor their employees accept any responsibility.

From: John Pullokkaran [mailto:jpullokka...@hortonworks.com]
Sent: 19 April 2015 20:37
To: user@hive.apache.org
Subject: Re: Orc file and Hive Optimiser

ORC format is transparent to CBO.
Currently we are working on a new cost model which might reflect ORC’s 
performance advantages in optimization decisions.

Thanks
John

From: Mich Talebzadeh <m...@peridale.co.uk>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>
Date: Sunday, April 19, 2015 at 12:32 PM
To: "user@hive.apache.org" <user@hive.apache.org>
Subject: Orc file and Hive Optimiser

My understanding is that the Optimized Row Columnar (ORC) file format provides 
a highly efficient way to store Hive data.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC


In a nutshell, columnar storage allows pretty efficient compression of 
columns, on par with what data warehouse databases like Sybase IQ provide. In 
short, if a normal Hive table is a "row-based implementation of the relational 
model", then ORC is the equivalent "columnar-based implementation of the 
relational model".

I find the ORC file format pretty interesting as it provides more efficient 
performance compared to other Hive file formats (try testing it). My only 
question is whether the Cost Based Optimiser (CBO) of Hive is aware of the ORC 
storage format and treats the table accordingly?

Finally, this is more of a speculative question. If we have ORC files that 
provide good functionality, is there any reason why one should deploy a 
columnar database such as HBase or Cassandra if Hive can do the job as well?

Thanks,


Mich Talebzadeh





Re: [ANNOUNCE] New Hive Committer - John Pullokkaran

2014-10-01 Thread John Pullokkaran
Thanks everyone for all the wishes.

On Wed, Oct 1, 2014 at 1:35 PM, Xiaobing Zhou  wrote:

> Congrats John!
>
> On Wed, Oct 1, 2014 at 12:54 PM, Matthew McCline wrote:
>
>> Congrats!
>>
>> On Wed, Oct 1, 2014 at 8:19 AM, Carl Steinbach  wrote:
>>
>> > The Apache Hive PMC has voted to make John Pullokkaran a committer on
>> the
>> > Apache Hive Project.
>> >
>> > Please join me in congratulating John!
>> >
>> > Thanks.
>> >
>> > - Carl
>> >
>> >
>>
>> --
>> CONFIDENTIALITY NOTICE
>> NOTICE: This message is intended for the use of the individual or entity
>> to
>> which it is addressed and may contain information that is confidential,
>> privileged and exempt from disclosure under applicable law. If the reader
>> of this message is not the intended recipient, you are hereby notified
>> that
>> any printing, copying, dissemination, distribution, disclosure or
>> forwarding of this communication is strictly prohibited. If you have
>> received this communication in error, please contact the sender
>> immediately
>> and delete it from your system. Thank You.
>>
>
>



Re: [ANNOUNCE] New Hive Committer - Gunther Hagleitner

2013-07-22 Thread John Pullokkaran
Congrats Gunther!


On Mon, Jul 22, 2013 at 8:24 AM, Vikram Dixit wrote:

> Congrats Gunther! Great news!
>
> On Jul 21, 2013, at 8:52 PM, Vaibhav Gumashta wrote:
>
> > Congratulations Gunther!
> >
> > On Sun, Jul 21, 2013 at 8:36 PM, David Schorow 
> wrote:
> > Cool.  Congratulations Gunther.  This is great news.
> >
> > David
> >
> > On Jul 21, 2013, at 9:52 AM, Yin Huai wrote:
> >
> >> Congratulations Gunther!!!
> >>
> >> -- Forwarded message --
> >> From: Yin Huai 
> >> Date: Sun, Jul 21, 2013 at 9:51 AM
> >> Subject: Fwd: [ANNOUNCE] New Hive Committer - Gunther Hagleitner
> >> To: yh...@hortonworks.com
> >>
> >>
> >>
> >>
> >> -- Forwarded message --
> >> From: Carl Steinbach 
> >> Date: Sun, Jul 21, 2013 at 1:00 AM
> >> Subject: [ANNOUNCE] New Hive Committer - Gunther Hagleitner
> >> To: "user@hive.apache.org" , d...@hive.apache.org,
> gunt...@apache.org
> >>
> >>
> >> The Apache Hive PMC has voted to make Gunther Hagleitner a
> >> committer on the Apache Hive project.
> >>
> >> Congratulations Gunther!
> >>
> >> Carl
> >>
> >>
> >
> >
>
>