Re: Hive on TEZ + LLAP

2016-07-19 Thread Mich Talebzadeh
Sounds like, if I am correct, you are joining a fact table (store_sales) with two
dimension tables?

cool

thanks






On 19 July 2016 at 18:31, Gopal Vijayaraghavan  wrote:

> > What was the type (Parquet, text, ORC, etc.) and row count for each of the three
> > tables above?
>
> I always use ORC for flat columnar data.
>
> ORC is designed to be ideal if you have measure/dimensions normalized into
> tables - most SQL workloads don't start with an indefinite depth tree.
>
> hive> select count(1) from store_sales;
> OK
> 2879987999
> Time taken: 2.603 seconds, Fetched: 1 row(s)
> hive> select count(1) from store;
> OK
> 1002
> Time taken: 0.213 seconds, Fetched: 1 row(s)
> hive> select count(1) from date_dim;
> OK
> 73049
> Time taken: 0.186 seconds, Fetched: 1 row(s)
> hive>
>
> The DPP semi-join for date_dim is very fast, so out of the ~2.8 billion
> records only 93 million are read into the cache.
>
> Standard TPC-DS data-set at 1000 scale - same layout you can get from
> hive-testbench && ./tpcds-setup.sh 1000;
>
> Cheers,
> Gopal
>
>
>


Re: Hive on TEZ + LLAP

2016-07-19 Thread Gopal Vijayaraghavan
> What was the type (Parquet, text, ORC, etc.) and row count for each of the three
> tables above?

I always use ORC for flat columnar data.

ORC is designed to be ideal if you have measure/dimensions normalized into
tables - most SQL workloads don't start with an indefinite depth tree.
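
As a minimal sketch of what that looks like (store_sales_staging is just a
placeholder name here, not part of the TPC-DS setup):

-- hypothetical: land a flat fact table as ORC via CTAS
CREATE TABLE store_sales_orc
  STORED AS ORC
  TBLPROPERTIES ('orc.compress'='ZLIB')
AS SELECT * FROM store_sales_staging;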

hive> select count(1) from store_sales;
OK
2879987999
Time taken: 2.603 seconds, Fetched: 1 row(s)
hive> select count(1) from store;
OK
1002
Time taken: 0.213 seconds, Fetched: 1 row(s)
hive> select count(1) from date_dim;
OK
73049
Time taken: 0.186 seconds, Fetched: 1 row(s)
hive> 

The DPP semi-join for date_dim is very fast, so out of the ~2.8 billion
records only 93 million are read into the cache.
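
For context, a rough sketch of the knobs involved on the Tez side (setting
names from memory for Hive 2.x; the bloom-filter semi-join reduction variant
only landed in later 2.x builds, so treat this as an assumption to verify):

-- push the date_dim filter into the store_sales scan at runtime
set hive.execution.engine=tez;
set hive.tez.dynamic.partition.pruning=true;
set hive.optimize.ppd=true;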

Standard TPC-DS data-set at 1000 scale - same layout you can get from
hive-testbench && ./tpcds-setup.sh 1000;
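
Roughly, the steps are (script names as I remember them from the
hive-testbench README; double-check there):

# clone, build the TPC-DS generator, then generate + load the 1000-scale tables
git clone https://github.com/hortonworks/hive-testbench.git
cd hive-testbench
./tpcds-build.sh
./tpcds-setup.sh 1000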

Cheers,
Gopal




Re: Hive on TEZ + LLAP

2016-07-19 Thread Mich Talebzadeh
Thanks

In this sample query

select  i_brand_id brand_id, i_brand brand,
sum(ss_ext_sales_price) ext_price
 from date_dim, store_sales, item
 where date_dim.d_date_sk = store_sales.ss_sold_date_sk
and store_sales.ss_item_sk = item.i_item_sk
and i_manager_id=36
and d_moy=12
and d_year=2001
 group by i_brand, i_brand_id
 order by ext_price desc, i_brand_id
limit 100 ;

What was the type (Parquet, text, ORC, etc.) and row count for each of the three
tables above?

thanks





On 19 July 2016 at 02:17, Gopal Vijayaraghavan  wrote:

>
> > These look pretty impressive. What execution mode were you running
> > these in? YARN client mode, maybe?
>
> There is no other mode - everything runs on YARN.
>
> > 53 times
>
>
> The factor is actually bigger in actual execution.
>
> The MRv2 version takes 2.47s to prep a query, while the LLAP version takes
> 1.64s.
>
> The MRv2 version takes 200.319s to execute the query, while the LLAP
> version takes 1.02s.
>
> The execution factor is nearly ~200x, but the compile time becomes significant
> as you scale down the latencies.
>
> > My calculations on Hive 2 on Spark 1.3.1
>
> Not sure where Hive2-on-Spark is going - the last commit to SparkCompiler
> was late last year, before there was a Hive2.
>
> On the speed front, I'm pretty sure you have most of the Hive2
> optimizations disabled; even the most basic of the Stinger optimizations
> might be missing for you.
>
> Check if you have
>
> set hive.vectorized.execution.enabled=true;
>
>
> Some of these new optimizations don't work on H-o-S, because Hive-on-Spark
> does not implement a true broadcast join - instead it uses a
> SparkHashTableSinkOperator, which actually writes to HDFS instead of sending
> it directly to the downstream task.
>
>
> I don't understand why that is the case instead of RDD broadcast, but that
> prevents the JOIN optimizations which convert the 34 sec query into a 3.8
> sec query from applying to Spark execution.
>
> A couple of examples would be
>
> set hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled=true;
> set hive.vectorized.execution.mapjoin.minmax.enabled=true;
>
> Those two make easy work of joins in LLAP, particularly semi-joins which
> are common in BI queries.
>
>
> Once LLAP is out of tech preview, we can enable most of them by default
> for Tez+LLAP, but that would not mean all of them apply to
> Hive-on-(Spark/MR).
>
> Getting these new features onto another engine takes active effort from
> the engine's devs.
>
> Cheers,
> Gopal
>


Re: Hive on TEZ + LLAP

2016-07-18 Thread Gopal Vijayaraghavan

> These look pretty impressive. What execution mode were you running
> these in? YARN client mode, maybe?

There is no other mode - everything runs on YARN.

> 53 times


The factor is actually bigger in actual execution.

The MRv2 version takes 2.47s to prep a query, while the LLAP version takes
1.64s.

The MRv2 version takes 200.319s to execute the query, while the LLAP
version takes 1.02s.

The execution factor is nearly ~200x, but the compile time becomes significant
as you scale down the latencies.

> My calculations on Hive 2 on Spark 1.3.1

Not sure where Hive2-on-Spark is going - the last commit to SparkCompiler
was late last year, before there was a Hive2.

On the speed front, I'm pretty sure you have most of the Hive2
optimizations disabled; even the most basic of the Stinger optimizations
might be missing for you.

Check if you have

set hive.vectorized.execution.enabled=true;
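
A quick, non-authoritative way to confirm it actually kicked in (the query
below is just a placeholder):

-- on Tez, a vectorized vertex is reported as "Execution mode: vectorized" in the plan
explain select count(1) from store_sales where ss_item_sk = 1;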


Some of these new optimizations don't work on H-o-S, because Hive-on-Spark
does not implement a true broadcast join - instead it uses a
SparkHashTableSinkOperator, which actually writes to HDFS instead of sending
it directly to the downstream task.


I don't understand why that is the case instead of RDD broadcast, but that
prevents the JOIN optimizations which convert the 34 sec query into a 3.8
sec query from applying to Spark execution.

A couple of examples would be

set hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled=true;
set hive.vectorized.execution.mapjoin.minmax.enabled=true;

Those two make easy work of joins in LLAP, particularly semi-joins which
are common in BI queries.


Once LLAP is out of tech preview, we can enable most of them by default
for Tez+LLAP, but that would not mean all of them apply to
Hive-on-(Spark/MR).

Getting these new features onto another engine takes active effort from
the engine's devs.

Cheers,
Gopal












Re: Hive on TEZ + LLAP

2016-07-18 Thread Mich Talebzadeh
These look pretty impressive. What execution mode were you running these in?
YARN client mode, maybe?

Query                       MR/sec     TEZ/sec    TEZ+LLAP/sec
                            203.317    13.681     3.809
Order of Magnitude faster   ---        15 times   53 times


My calculations for Hive 2 on Spark 1.3.1 (obviously we are comparing
different bases but it is interesting as a sample) show the following:

Table     MR/sec    Spark/sec   Order of Magnitude faster
Parquet   239.53    14.38       16 times
ORC       202.33    17.77       11 times

So the hybrid engine seems to make a big difference; if I just compare
Tez only against Tez + LLAP, the gain is more than 3 times.

Cheers,





On 18 July 2016 at 23:53, Gopal Vijayaraghavan <gop...@apache.org> wrote:

>
> > Also, have there been simple benchmarks comparing:
> >
> > 1. Hive on MR
> > 2. Hive on Tez
> > 3. Hive on Tez with LLAP
>
> I ran one today, with a small BI query in my test suite against a 1Tb
> data-set.
>
> TL;DR - MRv2 (203.317 seconds), Tez (13.681s), LLAP (3.809s).
>
> *Warning*: This is not a historical view; all engines are using the same
> new & improved vectorized operators from 2.2.0-SNAPSHOT, and only the physical
> planner and the physical scheduling differ between runs.
>
> The difference between pre-Stinger, Stinger and Stinger.next is much much
> larger than this.
>
> <https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query55.sql>
>
>
> select  i_brand_id brand_id, i_brand brand,
> sum(ss_ext_sales_price) ext_price
>  from date_dim, store_sales, item
>  where date_dim.d_date_sk = store_sales.ss_sold_date_sk
> and store_sales.ss_item_sk = item.i_item_sk
> and i_manager_id=36
> and d_moy=12
> and d_year=2001
>  group by i_brand, i_brand_id
>  order by ext_price desc, i_brand_id
> limit 100 ;
>
>
> =MRv2==
>
>
> set hive.execution.engine=mr;
>
> ...
> 2016-07-18 22:22:57 Uploaded 1 File to:
> file:/tmp/gopal/b58a60d6-ff05-47bc-ad02-428aaa15779d/hive_2016-07-18_22-22-
> 43_389_3112118969207749230-1/-local-10007/HashTable-Stage-3/MapJoin-mapfile
> 131--.hashtable (914 bytes)
>
> 2016-07-18 22:22:57 End of local task; Time Taken: 2.47 sec.
> ...
> Time taken: 203.317 seconds, Fetched: 100 row(s)
>
> =Tez===
>
>
>
> set hive.execution.engine=tez;
> set hive.llap.execution.mode=none;
>
> Time taken: 13.681 seconds, Fetched: 100 row(s)
>
> =LLAP==
>
>
> set hive.llap.execution.mode=all;
>
>
>
> Task Execution Summary
> ----------------------------------------------------------------------------------------
>   VERTICES      DURATION(ms)  CPU_TIME(ms)  GC_TIME(ms)  INPUT_RECORDS  OUTPUT_RECORDS
> ----------------------------------------------------------------------------------------
>   Map 1              1016.00             0            0     93,123,704           9,048
>   Map 4                 0.00             0            0         10,000              31
>   Map 5                 0.00             0            0        296,344           2,675
>   Reducer 2           207.00             0            0          9,048             100
>   Reducer 3             0.00             0            0            100               0
> ----------------------------------------------------------------------------------------
>
>
> Query Execution Summary
> -----------------------------------------------------------
> OPERATION                                  DURATION
> -----------------------------------------------------------
> Compile Query                                 1.64s
> Prepare Plan                                  0.32s
> Submit Plan                                   0.57s
> Start DAG                                     0.21s
> Run DAG                                       1.02s
> 

Re: Hive on TEZ + LLAP

2016-07-18 Thread Gopal Vijayaraghavan

> Also, have there been simple benchmarks comparing:
> 
> 1. Hive on MR
> 2. Hive on Tez
> 3. Hive on Tez with LLAP

I ran one today, with a small BI query in my test suite against a 1Tb
data-set.

TL;DR - MRv2 (203.317 seconds), Tez (13.681s), LLAP (3.809s).

*Warning*: This is not a historical view; all engines are using the same
new & improved vectorized operators from 2.2.0-SNAPSHOT, and only the physical
planner and the physical scheduling differ between runs.

The difference between pre-Stinger, Stinger and Stinger.next is much much
larger than this.

<https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query55.sql>


select  i_brand_id brand_id, i_brand brand,
sum(ss_ext_sales_price) ext_price
 from date_dim, store_sales, item
 where date_dim.d_date_sk = store_sales.ss_sold_date_sk
and store_sales.ss_item_sk = item.i_item_sk
and i_manager_id=36
and d_moy=12
and d_year=2001
 group by i_brand, i_brand_id
 order by ext_price desc, i_brand_id
limit 100 ;


=MRv2==


set hive.execution.engine=mr;

...
2016-07-18 22:22:57 Uploaded 1 File to:
file:/tmp/gopal/b58a60d6-ff05-47bc-ad02-428aaa15779d/hive_2016-07-18_22-22-
43_389_3112118969207749230-1/-local-10007/HashTable-Stage-3/MapJoin-mapfile
131--.hashtable (914 bytes)

2016-07-18 22:22:57 End of local task; Time Taken: 2.47 sec.
...
Time taken: 203.317 seconds, Fetched: 100 row(s)

=Tez===



set hive.execution.engine=tez;
set hive.llap.execution.mode=none;

Time taken: 13.681 seconds, Fetched: 100 row(s)

=LLAP==


set hive.llap.execution.mode=all;



Task Execution Summary
----------------------------------------------------------------------------------------
  VERTICES      DURATION(ms)  CPU_TIME(ms)  GC_TIME(ms)  INPUT_RECORDS  OUTPUT_RECORDS
----------------------------------------------------------------------------------------
  Map 1              1016.00             0            0     93,123,704           9,048
  Map 4                 0.00             0            0         10,000              31
  Map 5                 0.00             0            0        296,344           2,675
  Reducer 2           207.00             0            0          9,048             100
  Reducer 3             0.00             0            0            100               0
----------------------------------------------------------------------------------------


Query Execution Summary
-----------------------------------------------------------
OPERATION                                  DURATION
-----------------------------------------------------------
Compile Query                                 1.64s
Prepare Plan                                  0.32s
Submit Plan                                   0.57s
Start DAG                                     0.21s
Run DAG                                       1.02s
-----------------------------------------------------------


Time taken: 3.809 seconds, Fetched: 100 row(s)


Annoyingly now, the 1.64s to compile the query is a huge fraction, since
it only takes 1.02s to execute the join+aggregate over 93 million rows.

Hopefully in a couple of weeks, we'll cut that 1.64s into nearly nothing
once we merge HIVE-13995 into master.


More on the historical view: the new vectorization codepaths are a big
part of this speed-up, whether you compare historically or against an
incompletely vectorized format like Parquet (HIVE-8128 looks abandoned).

set hive.vectorized.execution.mapjoin.native.enabled=false;


Time taken: 34.372 seconds, Fetched: 100 row(s)
hive>


Cheers,
Gopal











Re: Hive on TEZ + LLAP

2016-07-16 Thread Mich Talebzadeh
Hi,

This is interesting. Are there any recent presentations on Hive on Tez and
Hive on Tez with LLAP?

Also, have there been simple benchmarks comparing:


   1. Hive on MR
   2. Hive on Tez
   3. Hive on Tez with LLAP

It would be interesting to see how these three fare.

Thanks




On 16 July 2016 at 00:06, Gopal Vijayaraghavan <gop...@apache.org> wrote:

>
> > I have also heard about Hortonworks with Tez + LLAP but that is a distro?
>
> Yes. AFAIK, during Hadoop Summit there was a HDP 2.5 techpreview sandbox
> instance which shipped Hive2 (scroll down all the way to end in the
> downloads page).
>
> Enable the "interactive mode" in Ambari for a HiveServer2 config group &
> HiveServer2 switches over to LLAP.
>
> Though if you're interested in measuring performance, I question the
> usefulness of an in-memory buffer-cache for a 1-node & cpu/memory
> constrained VM.
>
> > Is it complicated work to build it Do-It-Yourself, so to speak?
>
> Complicated enough that I have automated it (at least for myself & most of
> the devs).
>
> https://github.com/t3rmin4t0r/tez-autobuild/blob/llap/README.md
>
> That setup should work as long as you have a base Apache compatible
> hadoop-2.7.1 install.
>
> Because the way to deploy LLAP is a "yarn jar" & then have YARN run the
> instances, no part of the actual deploy requires root on any worker node.
>
> All you need is access to the metastore db (new features in the metastore)
> and a single Zk ensemble to register LLAP onto.
>
> That makes it really easy to "drop into" an existing YARN cluster where
> you're not an admin, but the LLAP install is then tied to a single user
> (you).
>
> That's set up a bit unconventionally since LLAP was never meant to hijack
> a user like this and allow access from the CLI.
>
> The real reason for that is so that I can do hive --debug and debug the
> CLI remotely much more easily than with HiveServer2's massive number of
> threads.
>
> I did put up a demo GIF earlier during the Summit, which should give you
> an idea of how fast/slow LLAP is with S3 data (which is when the
> read-through cache really comes into the limelight).
>
> <https://twitter.com/t3rmin4t0r/status/748630764959338497/photo/1>
>
>
> Cheers,
> Gopal
>


Re: Hive on TEZ + LLAP

2016-07-15 Thread Gopal Vijayaraghavan

> I have also heard about Hortonworks with Tez + LLAP but that is a distro?

Yes. AFAIK, during Hadoop Summit there was a HDP 2.5 techpreview sandbox
instance which shipped Hive2 (scroll down all the way to end in the
downloads page).

Enable the "interactive mode" in Ambari for a HiveServer2 config group &
HiveServer2 switches over to LLAP.

Though if you're interested in measuring performance, I question the
usefulness of an in-memory buffer-cache for a 1-node & cpu/memory
constrained VM.

> Is it complicated work to build it Do-It-Yourself, so to speak?

Complicated enough that I have automated it (at least for myself & most of
the devs).

https://github.com/t3rmin4t0r/tez-autobuild/blob/llap/README.md

That setup should work as long as you have a base Apache compatible
hadoop-2.7.1 install.

Because the way to deploy LLAP is a "yarn jar" & then have YARN run the
instances, no part of the actual deploy requires root on any worker node.

All you need is access to the metastore db (new features in the metastore)
and a single Zk ensemble to register LLAP onto.
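
As a rough, from-memory sketch of the sequence (the flag values and the llap0
name are illustrative; check the LLAP service driver's help output and your
own ZK/metastore endpoints before relying on any of this):

# package an LLAP instance and let YARN run the daemons
hive --service llap --name llap0 --instances 2 --size 8g --xmx 6g \
     --executors 4 --cache 2g --loglevel INFO
# the driver writes a run.sh into a generated llap-* directory; running it
# launches the daemons on YARN under your user
#
# client side, point at the registered instance:
#   set hive.zookeeper.quorum=<your ZK ensemble>;
#   set hive.llap.daemon.service.hosts=@llap0;
#   set hive.llap.execution.mode=all;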

That makes it really easy to "drop into" an existing YARN cluster where
you're not an admin, but the LLAP install is then tied to a single user
(you).

That's set up a bit unconventionally since LLAP was never meant to hijack
a user like this and allow access from the CLI.

The real reason for that is so that I can do hive --debug and debug the
CLI remotely much more easily than with HiveServer2's massive number of
threads.

I did put up a demo GIF earlier during the Summit, which should give you
an idea of how fast/slow LLAP is with S3 data (which is when the
read-through cache really comes into the limelight).

<https://twitter.com/t3rmin4t0r/status/748630764959338497/photo/1>


Cheers,
Gopal

















Re: Hive on TEZ + LLAP

2016-07-15 Thread Andrew Sears
HDP 2.5 includes LLAP.

Cheers,
Andrew

On Fri, Jul 15, 2016 at 11:36 AM, Jörn Franke <jornfra...@gmail.com> wrote:
I would recommend a distribution such as Hortonworks where everything is already
configured. As far as I know LLAP is currently not part of any distribution.
On 15 Jul 2016, at 17:04, Ashok Kumar <ashok34...@yahoo.com> wrote:

Hi,
Has anyone managed to make Hive work with Tez + LLAP as the query engine in 
place of Map-reduce please?
If you configured it yourself, which versions of Tez and LLAP work with Hive 2?
Do I need to build Tez from source, for example?
Thanks

Re: Hive on TEZ + LLAP

2016-07-15 Thread Ashok Kumar
thanks.
I have also heard about Hortonworks with Tez + LLAP but that is a distro?
Is it complicated work to build it Do-It-Yourself, so to speak?
 

On Friday, 15 July 2016, 19:23, "Long, Andrew" <loand...@amazon.com> wrote:
 

Amazon AWS has recently released EMR with Hive + Tez as well.

Cheers
Andrew

From: Jörn Franke <jornfra...@gmail.com>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>
Date: Friday, July 15, 2016 at 8:36 AM
To: "user@hive.apache.org" <user@hive.apache.org>
Subject: Re: Hive on TEZ + LLAP

I would recommend a distribution such as Hortonworks where everything is already
configured. As far as I know LLAP is currently not part of any distribution.

On 15 Jul 2016, at 17:04, Ashok Kumar <ashok34...@yahoo.com> wrote:

Hi,

Has anyone managed to make Hive work with Tez + LLAP as the query engine in
place of Map-reduce please?

If you configured it yourself, which versions of Tez and LLAP work with Hive 2?
Do I need to build Tez from source, for example?

Thanks

Re: Hive on TEZ + LLAP

2016-07-15 Thread Jörn Franke
I would recommend a distribution such as Hortonworks where everything is already
configured. As far as I know LLAP is currently not part of any distribution.

> On 15 Jul 2016, at 17:04, Ashok Kumar <ashok34...@yahoo.com> wrote:
> 
> Hi,
> 
> Has anyone managed to make Hive work with Tez + LLAP as the query engine in 
> place of Map-reduce please?
> 
> If you configured it yourself, which versions of Tez and LLAP work with Hive 2?
> Do I need to build Tez from source, for example?
> 
> Thanks


Hive on TEZ + LLAP

2016-07-15 Thread Ashok Kumar
Hi,
Has anyone managed to make Hive work with Tez + LLAP as the query engine in 
place of Map-reduce please?
If you configured it yourself, which versions of Tez and LLAP work with Hive 2?
Do I need to build Tez from source, for example?
Thanks