Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Mich Talebzadeh
I just read further notes on LLAP.

As Gopal explained, LLAP has more to do than just in-memory, and I quote
Gopal:

"...  LLAP is designed to be hammered by multiple user sessions running
different queries, designed to automate the cache eviction & selection
process. There's no user visible explicit .cache() to remember - it's
automatic and concurrent. ..."

Sounds like what classic Oracle or SAP ASE do in terms of buffer management
strategy. As I understand it, Spark does not have this concept of a hot area
(an MRU/LRU chain). It loads data into its memory when needed and then gets rid
of it. If ten users read the same table, the blocks from that table will be
loaded ten times, which is not efficient.

LLAP is more intelligent in this respect. Somehow it maintains a Most Recently
Used (MRU) / Least Recently Used (LRU) chain, and it maintains this buffer
management strategy throughout the cluster. It must be using some clever
algorithm to do so.
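
To make the contrast concrete, here is a minimal spark-shell sketch (assuming
Spark 2.x's SparkSession, spark, and the dummy table I use for the tests later
in this thread). In Spark, caching is an explicit, per-application affair:

import org.apache.spark.sql.functions.max

val df = spark.table("oraclehadoop.dummy")
df.cache()                   // explicit request; nothing is cached until an action runs
df.agg(max("id")).show()     // first action populates this application's executor cache
df.agg(max("id")).show()     // later actions in the SAME application reuse the cache

// A second user's Spark application has its own executors and its own cache,
// so it reads the table from HDFS again, unless everyone shares one long-running
// service such as the Spark Thrift Server. LLAP's cache, by contrast, lives in
// its daemons and is shared across queries and sessions automatically.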

Cheers




Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 12 July 2016 at 15:59, Mich Talebzadeh  wrote:

> Thanks Alan. Point taken.
>
> In mitigation, here are members in Spark forum who have shown (interest)
> in using Hive directly and I quote one:
>
> "Did you have any benchmark for using Spark as backend engine for Hive vs
> using Spark thrift server (and run spark code for hive queries)? We are
> using later but it will be very useful to remove thriftserver, if we can. "
>
> Cheers,
>
> Mich
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 12 July 2016 at 15:39, Alan Gates  wrote:
>
>>
>> > On Jul 11, 2016, at 16:22, Mich Talebzadeh 
>> wrote:
>> >
>> > 
>> >   • If I add LLAP, will that be more efficient in terms of memory
>> usage compared to Hive or not? Will it keep the data in memory for reuse or
>> not.
>> >
>> Yes, this is exactly what LLAP does.  It keeps a cache of hot data (hot
>> columns of hot partitions) and shares that across queries.  Unlike many MPP
>> caches it will cache the same data on multiple nodes if it has more workers
>> that want to access the data than can be run on a single node.
>>
>> As a side note, it is considered bad form in Apache to send a message to
>> two lists.  It causes a lot of background noise for people on the Spark
>> list who probably aren’t interested in Hive performance.
>>
>> Alan.
>>
>>
>>
>


Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Mich Talebzadeh
Thanks Alan. Point taken.

In mitigation, there are members of the Spark forum who have shown interest in
using Hive directly, and I quote one:

"Did you have any benchmark for using Spark as backend engine for Hive vs
using Spark thrift server (and run spark code for hive queries)? We are
using later but it will be very useful to remove thriftserver, if we can. "

Cheers,

Mich

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 12 July 2016 at 15:39, Alan Gates  wrote:

>
> > On Jul 11, 2016, at 16:22, Mich Talebzadeh 
> wrote:
> >
> > 
> >   • If I add LLAP, will that be more efficient in terms of memory
> usage compared to Hive or not? Will it keep the data in memory for reuse or
> not.
> >
> Yes, this is exactly what LLAP does.  It keeps a cache of hot data (hot
> columns of hot partitions) and shares that across queries.  Unlike many MPP
> caches it will cache the same data on multiple nodes if it has more workers
> that want to access the data than can be run on a single node.
>
> As a side note, it is considered bad form in Apache to send a message to
> two lists.  It causes a lot of background noise for people on the Spark
> list who probably aren’t interested in Hive performance.
>
> Alan.
>
>
>


Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Alan Gates

> On Jul 11, 2016, at 16:22, Mich Talebzadeh  wrote:
> 
> 
>   • If I add LLAP, will that be more efficient in terms of memory usage 
> compared to Hive or not? Will it keep the data in memory for reuse or not.
>   
Yes, this is exactly what LLAP does.  It keeps a cache of hot data (hot columns 
of hot partitions) and shares that across queries.  Unlike many MPP caches it 
will cache the same data on multiple nodes if it has more workers that want to 
access the data than can be run on a single node.
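
For example (a rough sketch only; the JDBC endpoint, credentials and table name
below are placeholders, and it assumes the Hive JDBC driver is on the classpath),
two separate HiveServer2 sessions issuing the same scan both end up being served
from that shared cache:

import java.sql.DriverManager

val url = "jdbc:hive2://hs2-host:10000/default"   // placeholder HiveServer2/LLAP endpoint

def runOnce(): Unit = {
  val conn = DriverManager.getConnection(url, "user", "")
  val rs   = conn.createStatement().executeQuery("SELECT max(id) FROM some_table")
  while (rs.next()) println(rs.getLong(1))
  conn.close()                                    // each call is its own session
}

runOnce()   // first session: the LLAP daemons read the data from HDFS and populate their cache
runOnce()   // second session: the same scan is served largely from the shared cache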

As a side note, it is considered bad form in Apache to send a message to two 
lists.  It causes a lot of background noise for people on the Spark list who 
probably aren’t interested in Hive performance.

Alan.




Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Mich Talebzadeh
his
>>>>> is a wrong oversimplification and I do not think this is useful for the
>>>>> community, but better is to understand when something can be used and when
>>>>> not. In-memory is also not the solution to everything and if you look for
>>>>> example behind SAP Hana or NoSql there is much more around this, which is
>>>>> not even on the roadmap of Spark.
>>>>>
>>>>> Anyway, discovering good use case patterns should be done on
>>>>> standardized benchmarks going beyond the select count etc
>>>>>
>>>>> On 12 Jul 2016, at 11:16, Mich Talebzadeh 
>>>>> wrote:
>>>>>
>>>>> That is only a plan not what execution engine is doing.
>>>>>
>>>>> As I stated before Spark uses DAG + in-memory computing. MR is serial
>>>>> on disk.
>>>>>
>>>>> The key is the execution here or rather the execution engine.
>>>>>
>>>>> In general
>>>>>
>>>>> The standard MapReduce  as I know reads the data from HDFS, apply
>>>>> map-reduce algorithm and writes back to HDFS. If there are many iterations
>>>>> of map-reduce then, there will be many intermediate writes to HDFS. This 
>>>>> is
>>>>> all serial writes to disk. Each map-reduce step is completely independent
>>>>> of other steps, and the executing engine does not have any global 
>>>>> knowledge
>>>>> of what map-reduce steps are going to come after each map-reduce step. For
>>>>> many iterative algorithms this is inefficient as the data between each
>>>>> map-reduce pair gets written and read from the file system.
>>>>>
>>>>> The equivalent to parallelism in Big Data is deploying what is known
>>>>> as Directed Acyclic Graph (DAG
>>>>> <https://en.wikipedia.org/wiki/Directed_acyclic_graph>) algorithm. In
>>>>> a nutshell deploying DAG results in a fuller picture of global 
>>>>> optimisation
>>>>> by deploying parallelism, pipelining consecutive map steps into one and 
>>>>> not
>>>>> writing intermediate data to HDFS. So in short this prevents writing data
>>>>> back and forth after every reduce step which for me is a significant
>>>>> improvement, compared to the classical MapReduce algorithm.
>>>>>
>>>>> Now Tez is basically MR with DAG. With Spark you get DAG + in-memory
>>>>> computing. Think of it as a comparison between a classic RDBMS like Oracle
>>>>> and IMDB like Oracle TimesTen with in-memory processing.
>>>>>
>>>>> The outcome is that Hive using Spark as execution engine is pretty
>>>>> impressive. You have the advantage of Hive CBO + In-memory computing. If
>>>>> you use Spark for all this (say Spark SQL) but no Hive, Spark uses its own
>>>>> optimizer called Catalyst that does not have CBO yet plus in memory
>>>>> computing.
>>>>>
>>>>> As usual your mileage varies.
>>>>>
>>>>> HTH
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn * 
>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>> On 12 July 2016 at 09:33, Markovitz, Dudu 
>>>>> wrote:
>>>>>
>>>>>> I don’t see how this explains the time differences.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Dudu
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* Mich Talebzadeh [mailto

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Marcin Tustin
HDFS, apply
>>>> map-reduce algorithm and writes back to HDFS. If there are many iterations
>>>> of map-reduce then, there will be many intermediate writes to HDFS. This is
>>>> all serial writes to disk. Each map-reduce step is completely independent
>>>> of other steps, and the executing engine does not have any global knowledge
>>>> of what map-reduce steps are going to come after each map-reduce step. For
>>>> many iterative algorithms this is inefficient as the data between each
>>>> map-reduce pair gets written and read from the file system.
>>>>
>>>> The equivalent to parallelism in Big Data is deploying what is known as
>>>> Directed Acyclic Graph (DAG
>>>> <https://en.wikipedia.org/wiki/Directed_acyclic_graph>) algorithm. In
>>>> a nutshell deploying DAG results in a fuller picture of global optimisation
>>>> by deploying parallelism, pipelining consecutive map steps into one and not
>>>> writing intermediate data to HDFS. So in short this prevents writing data
>>>> back and forth after every reduce step which for me is a significant
>>>> improvement, compared to the classical MapReduce algorithm.
>>>>
>>>> Now Tez is basically MR with DAG. With Spark you get DAG + in-memory
>>>> computing. Think of it as a comparison between a classic RDBMS like Oracle
>>>> and IMDB like Oracle TimesTen with in-memory processing.
>>>>
>>>> The outcome is that Hive using Spark as execution engine is pretty
>>>> impressive. You have the advantage of Hive CBO + In-memory computing. If
>>>> you use Spark for all this (say Spark SQL) but no Hive, Spark uses its own
>>>> optimizer called Catalyst that does not have CBO yet plus in memory
>>>> computing.
>>>>
>>>> As usual your mileage varies.
>>>>
>>>> HTH
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>> On 12 July 2016 at 09:33, Markovitz, Dudu 
>>>> wrote:
>>>>
>>>>> I don’t see how this explains the time differences.
>>>>>
>>>>>
>>>>>
>>>>> Dudu
>>>>>
>>>>>
>>>>>
>>>>> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
>>>>> *Sent:* Tuesday, July 12, 2016 10:56 AM
>>>>> *To:* user 
>>>>> *Cc:* user @spark 
>>>>>
>>>>> *Subject:* Re: Using Spark on Hive with Hive also using Spark as its
>>>>> execution engine
>>>>>
>>>>>
>>>>>
>>>>> This the whole idea. Spark uses DAG + IM, MR is classic
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> This is for Hive on Spark
>>>>>
>>>>>
>>>>>
>>>>> hive> explain select max(id) from dummy_parquet;
>>>>> OK
>>>>> STAGE DEPENDENCIES:
>>>>>   Stage-1 is a root stage
>>>>>   Stage-0 depends on stages: Stage-1
>>>>>
>>>>> STAGE PLANS:
>>>>>   Stage: Stage-1
>>>>> Spark
>>>>>   Edges:
>>>>> Reducer 2 <- Map 1 (GROUP, 1)
>>>>> *  DagName:
>>>>> hduser_20160712083219_632c2749-7387-478f-972d-9eaadd9932c6:1*
>>>>>   Vertices:
>>>>> Map 1
>>>>> Map Operator Tree:
>>>>> TableScan
>>>>>   alias: dummy_parquet
>>>>>   Statistics: Num rows: 1 Data 

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Mich Talebzadeh
gets written and read from the file system.
>>>
>>> The equivalent to parallelism in Big Data is deploying what is known as
>>> Directed Acyclic Graph (DAG
>>> <https://en.wikipedia.org/wiki/Directed_acyclic_graph>) algorithm. In a
>>> nutshell deploying DAG results in a fuller picture of global optimisation
>>> by deploying parallelism, pipelining consecutive map steps into one and not
>>> writing intermediate data to HDFS. So in short this prevents writing data
>>> back and forth after every reduce step which for me is a significant
>>> improvement, compared to the classical MapReduce algorithm.
>>>
>>> Now Tez is basically MR with DAG. With Spark you get DAG + in-memory
>>> computing. Think of it as a comparison between a classic RDBMS like Oracle
>>> and IMDB like Oracle TimesTen with in-memory processing.
>>>
>>> The outcome is that Hive using Spark as execution engine is pretty
>>> impressive. You have the advantage of Hive CBO + In-memory computing. If
>>> you use Spark for all this (say Spark SQL) but no Hive, Spark uses its own
>>> optimizer called Catalyst that does not have CBO yet plus in memory
>>> computing.
>>>
>>> As usual your mileage varies.
>>>
>>> HTH
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 12 July 2016 at 09:33, Markovitz, Dudu  wrote:
>>>
>>>> I don’t see how this explains the time differences.
>>>>
>>>>
>>>>
>>>> Dudu
>>>>
>>>>
>>>>
>>>> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
>>>> *Sent:* Tuesday, July 12, 2016 10:56 AM
>>>> *To:* user 
>>>> *Cc:* user @spark 
>>>>
>>>> *Subject:* Re: Using Spark on Hive with Hive also using Spark as its
>>>> execution engine
>>>>
>>>>
>>>>
>>>> This the whole idea. Spark uses DAG + IM, MR is classic
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> This is for Hive on Spark
>>>>
>>>>
>>>>
>>>> hive> explain select max(id) from dummy_parquet;
>>>> OK
>>>> STAGE DEPENDENCIES:
>>>>   Stage-1 is a root stage
>>>>   Stage-0 depends on stages: Stage-1
>>>>
>>>> STAGE PLANS:
>>>>   Stage: Stage-1
>>>> Spark
>>>>   Edges:
>>>> Reducer 2 <- Map 1 (GROUP, 1)
>>>> *  DagName:
>>>> hduser_20160712083219_632c2749-7387-478f-972d-9eaadd9932c6:1*
>>>>   Vertices:
>>>> Map 1
>>>> Map Operator Tree:
>>>> TableScan
>>>>   alias: dummy_parquet
>>>>   Statistics: Num rows: 1 Data size: 7
>>>> Basic stats: COMPLETE Column stats: NONE
>>>>   Select Operator
>>>> expressions: id (type: int)
>>>> outputColumnNames: id
>>>> Statistics: Num rows: 1 Data size:
>>>> 7 Basic stats: COMPLETE Column stats: NONE
>>>> Group By Operator
>>>>   aggregations: max(id)
>>>>   mode: hash
>>>>   outputColumnNames: _col0
>>>>   Statistics: Num rows: 1 Data size: 4 Basic stats:
>>>> COMPLETE Column stats: NONE
>>>>   Reduce Output Operator
>>>> sort order:
>>>> Statistics: Num rows: 1 Data size: 4 Basic
>>>> stats:

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Marcin Tustin
ses its own
>> optimizer called Catalyst that does not have CBO yet plus in memory
>> computing.
>>
>> As usual your mileage varies.
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 12 July 2016 at 09:33, Markovitz, Dudu  wrote:
>>
>>> I don’t see how this explains the time differences.
>>>
>>>
>>>
>>> Dudu
>>>
>>>
>>>
>>> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
>>> *Sent:* Tuesday, July 12, 2016 10:56 AM
>>> *To:* user 
>>> *Cc:* user @spark 
>>>
>>> *Subject:* Re: Using Spark on Hive with Hive also using Spark as its
>>> execution engine
>>>
>>>
>>>
>>> This the whole idea. Spark uses DAG + IM, MR is classic
>>>
>>>
>>>
>>>
>>>
>>> This is for Hive on Spark
>>>
>>>
>>>
>>> hive> explain select max(id) from dummy_parquet;
>>> OK
>>> STAGE DEPENDENCIES:
>>>   Stage-1 is a root stage
>>>   Stage-0 depends on stages: Stage-1
>>>
>>> STAGE PLANS:
>>>   Stage: Stage-1
>>> Spark
>>>   Edges:
>>> Reducer 2 <- Map 1 (GROUP, 1)
>>> *  DagName:
>>> hduser_20160712083219_632c2749-7387-478f-972d-9eaadd9932c6:1*
>>>   Vertices:
>>> Map 1
>>> Map Operator Tree:
>>> TableScan
>>>   alias: dummy_parquet
>>>   Statistics: Num rows: 1 Data size: 7
>>> Basic stats: COMPLETE Column stats: NONE
>>>   Select Operator
>>> expressions: id (type: int)
>>> outputColumnNames: id
>>> Statistics: Num rows: 1 Data size: 7
>>> Basic stats: COMPLETE Column stats: NONE
>>> Group By Operator
>>>   aggregations: max(id)
>>>   mode: hash
>>>   outputColumnNames: _col0
>>>   Statistics: Num rows: 1 Data size: 4 Basic stats:
>>> COMPLETE Column stats: NONE
>>>   Reduce Output Operator
>>> sort order:
>>> Statistics: Num rows: 1 Data size: 4 Basic
>>> stats: COMPLETE Column stats: NONE
>>> value expressions: _col0 (type: int)
>>> Reducer 2
>>> Reduce Operator Tree:
>>>   Group By Operator
>>> aggregations: max(VALUE._col0)
>>> mode: mergepartial
>>> outputColumnNames: _col0
>>> Statistics: Num rows: 1 Data size: 4 Basic stats:
>>> COMPLETE Column stats: NONE
>>> File Output Operator
>>>   compressed: false
>>>   Statistics: Num rows: 1 Data size: 4 Basic stats:
>>> COMPLETE Column stats: NONE
>>>   table:
>>>   input format:
>>> org.apache.hadoop.mapred.TextInputFormat
>>>   output format:
>>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>>   serde:
>>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>>>
>>>   Stage: Stage-0
>>> Fetch Operator
>>>   limit: -1
>>>   Processor Tree:
>>> ListSink
>>>
>>> Time taken: 2.801 seconds, Fetched: 50 row(s)
>>>
>>>
>>>
>>> And this is with setting the execution engine to MR
>>>
>>>
>>>
>>> hive> set hive.execution.engine=mr;
>>> Hive-on-MR is deprecated in Hive 2 and may 

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Mich Talebzadeh
chnical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 12 July 2016 at 09:33, Markovitz, Dudu  wrote:
>
>> I don’t see how this explains the time differences.
>>
>>
>>
>> Dudu
>>
>>
>>
>> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
>> *Sent:* Tuesday, July 12, 2016 10:56 AM
>> *To:* user 
>> *Cc:* user @spark 
>>
>> *Subject:* Re: Using Spark on Hive with Hive also using Spark as its
>> execution engine
>>
>>
>>
>> This the whole idea. Spark uses DAG + IM, MR is classic
>>
>>
>>
>>
>>
>> This is for Hive on Spark
>>
>>
>>
>> hive> explain select max(id) from dummy_parquet;
>> OK
>> STAGE DEPENDENCIES:
>>   Stage-1 is a root stage
>>   Stage-0 depends on stages: Stage-1
>>
>> STAGE PLANS:
>>   Stage: Stage-1
>> Spark
>>   Edges:
>> Reducer 2 <- Map 1 (GROUP, 1)
>> *  DagName:
>> hduser_20160712083219_632c2749-7387-478f-972d-9eaadd9932c6:1*
>>   Vertices:
>> Map 1
>> Map Operator Tree:
>> TableScan
>>   alias: dummy_parquet
>>   Statistics: Num rows: 1 Data size: 7
>> Basic stats: COMPLETE Column stats: NONE
>>   Select Operator
>> expressions: id (type: int)
>> outputColumnNames: id
>> Statistics: Num rows: 1 Data size: 7
>> Basic stats: COMPLETE Column stats: NONE
>> Group By Operator
>>   aggregations: max(id)
>>   mode: hash
>>   outputColumnNames: _col0
>>   Statistics: Num rows: 1 Data size: 4 Basic stats:
>> COMPLETE Column stats: NONE
>>   Reduce Output Operator
>> sort order:
>> Statistics: Num rows: 1 Data size: 4 Basic stats:
>> COMPLETE Column stats: NONE
>> value expressions: _col0 (type: int)
>> Reducer 2
>> Reduce Operator Tree:
>>   Group By Operator
>> aggregations: max(VALUE._col0)
>> mode: mergepartial
>> outputColumnNames: _col0
>> Statistics: Num rows: 1 Data size: 4 Basic stats:
>> COMPLETE Column stats: NONE
>> File Output Operator
>>   compressed: false
>>   Statistics: Num rows: 1 Data size: 4 Basic stats:
>> COMPLETE Column stats: NONE
>>   table:
>>   input format:
>> org.apache.hadoop.mapred.TextInputFormat
>>   output format:
>> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>   serde:
>> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>>
>>   Stage: Stage-0
>> Fetch Operator
>>   limit: -1
>>   Processor Tree:
>> ListSink
>>
>> Time taken: 2.801 seconds, Fetched: 50 row(s)
>>
>>
>>
>> And this is with setting the execution engine to MR
>>
>>
>>
>> hive> set hive.execution.engine=mr;
>> Hive-on-MR is deprecated in Hive 2 and may not be available in the future
>> versions. Consider using a different execution engine (i.e. spark, tez) or
>> using Hive 1.X releases.
>>
>>
>>
>> hive> explain select max(id) from dummy_parquet;
>> OK
>> STAGE DEPENDENCIES:
>>   Stage-1 is a root stage
>>   Stage-0 depends on stages: Stage-1
>>
>> STAGE PLANS:
>>   Stage: Stage-1
>> Map Reduce
>>   Map Operator Tree:
>>   TableScan
>> alias: dummy_parquet
>> Statistics: Num rows: 1 Data size: 7 Basic
>> stats: COMPLETE Column stats: NONE
>> Select Operator
>>   expressions: id (type: int)
>>   outputColumnNames: id
>>   Statistics: Num rows: 1 Data size: 7 Basic
>> stats: COMPLETE Column stats: NONE
>>   Group By Operator
>> aggregations: max(id)
>> mode: hash
>> outputColumnNames: _col0
>> Statistics: Num rows: 1 Data size: 4 Basic stats:
>> COMPLETE Column stat

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Jörn Franke

I think the comparison with the Oracle RDBMS and Oracle TimesTen is not so good.
There are times when the in-memory database of Oracle is slower than the RDBMS
(especially in the case of Exadata), because in-memory - as in Spark - means
everything is in memory and everything is always processed (no storage indexes,
no bloom filters etc.), which explains this behavior quite well.

Hence, I do not agree with the statement that Tez is basically MR with DAG (or
that LLAP is basically in-memory, which is also not correct). This is a wrong
oversimplification and I do not think it is useful for the community; it is
better to understand when something can be used and when not. In-memory is also
not the solution to everything, and if you look, for example, behind SAP HANA or
NoSQL there is much more around this, which is not even on the roadmap of Spark.

Anyway, discovering good use case patterns should be done on standardized
benchmarks going beyond the select count etc.

> On 12 Jul 2016, at 11:16, Mich Talebzadeh  wrote:
> 
> That is only a plan not what execution engine is doing.
> 
> As I stated before Spark uses DAG + in-memory computing. MR is serial on 
> disk. 
> 
> The key is the execution here or rather the execution engine.
> 
> In general
> 
> The standard MapReduce  as I know reads the data from HDFS, apply map-reduce 
> algorithm and writes back to HDFS. If there are many iterations of map-reduce 
> then, there will be many intermediate writes to HDFS. This is all serial 
> writes to disk. Each map-reduce step is completely independent of other 
> steps, and the executing engine does not have any global knowledge of what 
> map-reduce steps are going to come after each map-reduce step. For many 
> iterative algorithms this is inefficient as the data between each map-reduce 
> pair gets written and read from the file system.
> 
> The equivalent to parallelism in Big Data is deploying what is known as 
> Directed Acyclic Graph (DAG) algorithm. In a nutshell deploying DAG results 
> in a fuller picture of global optimisation by deploying parallelism, 
> pipelining consecutive map steps into one and not writing intermediate data 
> to HDFS. So in short this prevents writing data back and forth after every 
> reduce step which for me is a significant improvement, compared to the 
> classical MapReduce algorithm.
> 
> Now Tez is basically MR with DAG. With Spark you get DAG + in-memory 
> computing. Think of it as a comparison between a classic RDBMS like Oracle 
> and IMDB like Oracle TimesTen with in-memory processing.
> 
> The outcome is that Hive using Spark as execution engine is pretty 
> impressive. You have the advantage of Hive CBO + In-memory computing. If you 
> use Spark for all this (say Spark SQL) but no Hive, Spark uses its own 
> optimizer called Catalyst that does not have CBO yet plus in memory computing.
> 
> As usual your mileage varies.
> 
> HTH
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
>> On 12 July 2016 at 09:33, Markovitz, Dudu  wrote:
>> I don’t see how this explains the time differences.
>> 
>>  
>> 
>> Dudu
>> 
>>  
>> 
>> From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com] 
>> Sent: Tuesday, July 12, 2016 10:56 AM
>> To: user 
>> Cc: user @spark 
>> 
>> 
>> Subject: Re: Using Spark on Hive with Hive also using Spark as its execution 
>> engine
>>  
>> 
>> This the whole idea. Spark uses DAG + IM, MR is classic
>> 
>>  
>> 
>>  
>> 
>> This is for Hive on Spark
>> 
>>  
>> 
>> hive> explain select max(id) from dummy_parquet;
>> OK
>> STAGE DEPENDENCIES:
>>   Stage-1 is a root stage
>>   Stage-0 depends on stages: Stage-1
>> 
>> STAGE PLANS:
>>   Stage: Stage-1
>> Spark
>>   Edges:
>> Reducer 2 <- Map 1 (GROUP, 1)
>>   DagName: hduser_20160712083219_632c2749-7387-478f-972d-9eaadd9932c6:1
>>   Vertices:
>> Map 1
>> Map Operator Tree:
>> TableScan
>>   alias: dummy_parquet
>>   Statistics: Num rows: 1 Data size: 7 Basic 
>

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Mich Talebzadeh
I suggest that you try it for yourself then

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 12 July 2016 at 10:35, Markovitz, Dudu  wrote:

> The principals are very clear and if our use-case was a complex one,
> combined from many stages I would expect performance benefits from the
> Spark engine.
>
> Since our use-case is a simple one and most of the work here is just
> reading the files, I don’t see how we can explain the performance
> differences unless the data was already cached in the Spark test.
>
> Clearly, we’re missing something.
>
>
>
> Dudu
>
>
>
> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
> *Sent:* Tuesday, July 12, 2016 12:16 PM
>
> *To:* user 
> *Cc:* user @spark 
> *Subject:* Re: Using Spark on Hive with Hive also using Spark as its
> execution engine
>
>
>
> That is only a plan not what execution engine is doing.
>
>
>
> As I stated before Spark uses DAG + in-memory computing. MR is serial on
> disk.
>
>
>
> The key is the execution here or rather the execution engine.
>
>
>
> In general
>
>
>
>
> The standard MapReduce  as I know reads the data from HDFS, apply
> map-reduce algorithm and writes back to HDFS. If there are many iterations
> of map-reduce then, there will be many intermediate writes to HDFS. This is
> all serial writes to disk. Each map-reduce step is completely independent
> of other steps, and the executing engine does not have any global knowledge
> of what map-reduce steps are going to come after each map-reduce step. For
> many iterative algorithms this is inefficient as the data between each
> map-reduce pair gets written and read from the file system.
>
>
>
> The equivalent to parallelism in Big Data is deploying what is known as
> Directed Acyclic Graph (DAG
> <https://en.wikipedia.org/wiki/Directed_acyclic_graph>) algorithm. In a
> nutshell deploying DAG results in a fuller picture of global optimisation
> by deploying parallelism, pipelining consecutive map steps into one and not
> writing intermediate data to HDFS. So in short this prevents writing data
> back and forth after every reduce step which for me is a significant
> improvement, compared to the classical MapReduce algorithm.
>
>
>
> Now Tez is basically MR with DAG. With Spark you get DAG + in-memory
> computing. Think of it as a comparison between a classic RDBMS like Oracle
> and IMDB like Oracle TimesTen with in-memory processing.
>
>
>
> The outcome is that Hive using Spark as execution engine is pretty
> impressive. You have the advantage of Hive CBO + In-memory computing. If
> you use Spark for all this (say Spark SQL) but no Hive, Spark uses its own
> optimizer called Catalyst that does not have CBO yet plus in memory
> computing.
>
>
>
> As usual your mileage varies.
>
>
>
> HTH
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
> On 12 July 2016 at 09:33, Markovitz, Dudu  wrote:
>
> I don’t see how this explains the time differences.
>
>
>
> Dudu
>
>
>
> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
> *Sent:* Tuesday, July 12, 2016 10:56 AM
> *To:* user 
> *Cc:* user @spark 
>
>
> *Subject:* Re: Using Spark on Hive with Hive also using Spark as its
> execution engine
>
>
>
> This the whole idea. Spark uses DAG + IM, MR is classic
>
>
>
>
>
> This is for Hive on Spark
>
>
>
> hive> explain select max(id) from dummy_parquet;
> OK
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: St

RE: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Markovitz, Dudu
The principles are very clear, and if our use case were a complex one, combined
from many stages, I would expect performance benefits from the Spark engine.
Since our use case is a simple one and most of the work here is just reading
the files, I don’t see how we can explain the performance differences unless
the data was already cached in the Spark test.
Clearly, we’re missing something.

Dudu

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Tuesday, July 12, 2016 12:16 PM
To: user 
Cc: user @spark 
Subject: Re: Using Spark on Hive with Hive also using Spark as its execution 
engine

That is only a plan not what execution engine is doing.

As I stated before Spark uses DAG + in-memory computing. MR is serial on disk.

The key is the execution here or rather the execution engine.

In general

The standard MapReduce  as I know reads the data from HDFS, apply map-reduce 
algorithm and writes back to HDFS. If there are many iterations of map-reduce 
then, there will be many intermediate writes to HDFS. This is all serial writes 
to disk. Each map-reduce step is completely independent of other steps, and the 
executing engine does not have any global knowledge of what map-reduce steps 
are going to come after each map-reduce step. For many iterative algorithms 
this is inefficient as the data between each map-reduce pair gets written and 
read from the file system.

The equivalent to parallelism in Big Data is deploying what is known as 
Directed Acyclic Graph 
(DAG<https://en.wikipedia.org/wiki/Directed_acyclic_graph>) algorithm. In a 
nutshell deploying DAG results in a fuller picture of global optimisation by 
deploying parallelism, pipelining consecutive map steps into one and not 
writing intermediate data to HDFS. So in short this prevents writing data back 
and forth after every reduce step which for me is a significant improvement, 
compared to the classical MapReduce algorithm.

Now Tez is basically MR with DAG. With Spark you get DAG + in-memory computing. 
Think of it as a comparison between a classic RDBMS like Oracle and IMDB like 
Oracle TimesTen with in-memory processing.

The outcome is that Hive using Spark as execution engine is pretty impressive. 
You have the advantage of Hive CBO + In-memory computing. If you use Spark for 
all this (say Spark SQL) but no Hive, Spark uses its own optimizer called 
Catalyst that does not have CBO yet plus in memory computing.

As usual your mileage varies.

HTH



Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.



On 12 July 2016 at 09:33, Markovitz, Dudu 
mailto:dmarkov...@paypal.com>> wrote:
I don’t see how this explains the time differences.

Dudu

From: Mich Talebzadeh 
[mailto:mich.talebza...@gmail.com<mailto:mich.talebza...@gmail.com>]
Sent: Tuesday, July 12, 2016 10:56 AM
To: user mailto:user@hive.apache.org>>
Cc: user @spark mailto:u...@spark.apache.org>>

Subject: Re: Using Spark on Hive with Hive also using Spark as its execution 
engine

This the whole idea. Spark uses DAG + IM, MR is classic


This is for Hive on Spark

hive> explain select max(id) from dummy_parquet;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
Spark
  Edges:
Reducer 2 <- Map 1 (GROUP, 1)
  DagName: hduser_20160712083219_632c2749-7387-478f-972d-9eaadd9932c6:1
  Vertices:
Map 1
Map Operator Tree:
TableScan
  alias: dummy_parquet
  Statistics: Num rows: 1 Data size: 7 Basic 
stats: COMPLETE Column stats: NONE
  Select Operator
expressions: id (type: int)
outputColumnNames: id
Statistics: Num rows: 1 Data size: 7 Basic 
stats: COMPLETE Column stats: NONE
Group By Operator
  aggregations: max(id)
  mode: hash
  outputColumnNames: _col0
  Statistics: Num rows: 1 Data size: 4 Basic stats: 
COMPLETE Column stats: NONE
  Reduce Output Operator
sort order:
Statistics: Num rows: 1 Data size: 4 Basic stats: 
COMPLETE Column stats: NONE
value expressions: _col0 (type: int)
Reducer 2
Reduce Operator Tree:
  Group By Operator
aggregations: max(VALUE._col0)
mode: mergepartial

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Mich Talebzadeh
That is only a plan, not what the execution engine is doing.

As I stated before, Spark uses DAG + in-memory computing. MR is serial, on
disk.

The key here is the execution, or rather the execution engine.

In general

The standard MapReduce, as I know it, reads the data from HDFS, applies the
map-reduce algorithm and writes back to HDFS. If there are many iterations of
map-reduce, there will be many intermediate writes to HDFS. These are all serial
writes to disk. Each map-reduce step is completely independent of the other
steps, and the executing engine does not have any global knowledge of which
map-reduce steps are going to come after each map-reduce step. For many
iterative algorithms this is inefficient, as the data between each map-reduce
pair gets written to and read from the file system.

The equivalent of parallelism in Big Data is deploying what is known as a
Directed Acyclic Graph (DAG
<https://en.wikipedia.org/wiki/Directed_acyclic_graph>) algorithm. In a
nutshell, deploying a DAG gives a fuller picture for global optimisation:
parallelism is deployed, consecutive map steps are pipelined into one, and
intermediate data is not written to HDFS. In short this prevents writing data
back and forth after every reduce step, which for me is a significant
improvement compared to the classical MapReduce algorithm.
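
To make the pipelining point concrete, here is a minimal sketch in Spark's
Scala API (the input path is hypothetical). The consecutive narrow steps are
collapsed into one stage, and data is materialised only at the shuffle and at
the final write, whereas classic MapReduce would run each map/reduce pair as its
own job with an HDFS round trip in between:

val counts = sc.textFile("hdfs:///data/events")   // assumed input path
  .map(_.split(",")(1))                           // extract a field
  .filter(_.nonEmpty)                             // drop empties
  .map(word => (word, 1))                         // key it
  .reduceByKey(_ + _)                             // single shuffle, no intermediate HDFS write

counts.saveAsTextFile("hdfs:///out/counts")       // HDFS is written only once, at the end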

Now Tez is basically MR with DAG. With Spark you get DAG + in-memory
computing. Think of it as a comparison between a classic RDBMS like Oracle
and an IMDB like Oracle TimesTen with in-memory processing.

The outcome is that Hive using Spark as its execution engine is pretty
impressive. You have the advantage of Hive's CBO + in-memory computing. If
you use Spark for all of this (say Spark SQL) without Hive, Spark uses its own
optimizer, called Catalyst, which does not yet have a CBO, plus in-memory
computing.
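
For reference, here is a minimal spark-shell sketch (assuming Spark 2.x) of how
to see Catalyst's plan for the same query, analogous to the hive> EXPLAIN output
quoted elsewhere in this thread:

val q = spark.sql("SELECT max(id) FROM oraclehadoop.dummy")
q.explain(true)   // prints the parsed, analysed and optimised logical plans plus the physical plan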

As usual your mileage varies.

HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 12 July 2016 at 09:33, Markovitz, Dudu  wrote:

> I don’t see how this explains the time differences.
>
>
>
> Dudu
>
>
>
> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
> *Sent:* Tuesday, July 12, 2016 10:56 AM
> *To:* user 
> *Cc:* user @spark 
>
> *Subject:* Re: Using Spark on Hive with Hive also using Spark as its
> execution engine
>
>
>
> This the whole idea. Spark uses DAG + IM, MR is classic
>
>
>
>
>
> This is for Hive on Spark
>
>
>
> hive> explain select max(id) from dummy_parquet;
> OK
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
>
> STAGE PLANS:
>   Stage: Stage-1
> Spark
>   Edges:
> Reducer 2 <- Map 1 (GROUP, 1)
> *  DagName:
> hduser_20160712083219_632c2749-7387-478f-972d-9eaadd9932c6:1*
>   Vertices:
> Map 1
> Map Operator Tree:
> TableScan
>   alias: dummy_parquet
>   Statistics: Num rows: 1 Data size: 7
> Basic stats: COMPLETE Column stats: NONE
>   Select Operator
> expressions: id (type: int)
> outputColumnNames: id
> Statistics: Num rows: 1 Data size: 7
> Basic stats: COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: max(id)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 4 Basic stats:
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> sort order:
> Statistics: Num rows: 1 Data size: 4 Basic stats:
> COMPLETE Column stats: NONE
> value expressions: _col0 (type: int)
> Reducer 2
> Reduce Operator Tree:
>   Group By Operator
> aggregations: max(VALUE._col0)
> mode: mergepartial
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE
> Column stats: NONE
> File Output Operator
>   compressed: false
>   Statistics: Num rows: 1 Data size: 4 Basic stats:
> COMPLETE Colum

RE: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Markovitz, Dudu
I don’t see how this explains the time differences.

Dudu

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Tuesday, July 12, 2016 10:56 AM
To: user 
Cc: user @spark 
Subject: Re: Using Spark on Hive with Hive also using Spark as its execution 
engine

This the whole idea. Spark uses DAG + IM, MR is classic


This is for Hive on Spark

hive> explain select max(id) from dummy_parquet;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
Spark
  Edges:
Reducer 2 <- Map 1 (GROUP, 1)
  DagName: hduser_20160712083219_632c2749-7387-478f-972d-9eaadd9932c6:1
  Vertices:
Map 1
Map Operator Tree:
TableScan
  alias: dummy_parquet
  Statistics: Num rows: 1 Data size: 7 Basic 
stats: COMPLETE Column stats: NONE
  Select Operator
expressions: id (type: int)
outputColumnNames: id
Statistics: Num rows: 1 Data size: 7 Basic 
stats: COMPLETE Column stats: NONE
Group By Operator
  aggregations: max(id)
  mode: hash
  outputColumnNames: _col0
  Statistics: Num rows: 1 Data size: 4 Basic stats: 
COMPLETE Column stats: NONE
  Reduce Output Operator
sort order:
Statistics: Num rows: 1 Data size: 4 Basic stats: 
COMPLETE Column stats: NONE
value expressions: _col0 (type: int)
Reducer 2
Reduce Operator Tree:
  Group By Operator
aggregations: max(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE 
Column stats: NONE
File Output Operator
  compressed: false
  Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE 
Column stats: NONE
  table:
  input format: org.apache.hadoop.mapred.TextInputFormat
  output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
Fetch Operator
  limit: -1
  Processor Tree:
ListSink
Time taken: 2.801 seconds, Fetched: 50 row(s)

And this is with setting the execution engine to MR

hive> set hive.execution.engine=mr;
Hive-on-MR is deprecated in Hive 2 and may not be available in the future 
versions. Consider using a different execution engine (i.e. spark, tez) or 
using Hive 1.X releases.

hive> explain select max(id) from dummy_parquet;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
Map Reduce
  Map Operator Tree:
  TableScan
alias: dummy_parquet
Statistics: Num rows: 1 Data size: 7 Basic stats: 
COMPLETE Column stats: NONE
Select Operator
  expressions: id (type: int)
  outputColumnNames: id
  Statistics: Num rows: 1 Data size: 7 Basic stats: 
COMPLETE Column stats: NONE
  Group By Operator
aggregations: max(id)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE 
Column stats: NONE
Reduce Output Operator
  sort order:
  Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE 
Column stats: NONE
  value expressions: _col0 (type: int)
  Reduce Operator Tree:
Group By Operator
  aggregations: max(VALUE._col0)
  mode: mergepartial
  outputColumnNames: _col0
  Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column 
stats: NONE
  File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column 
stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
Fetch Operator
  limit: -1
  Processor Tree:
ListSink
Time taken: 0.1 seconds, Fetched: 44 row(s)


HTH



Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical cont

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Mich Talebzadeh
> This is a simple task –
>
> Read the files, find the local max value and combine the results (find the
> global max value).
>
> How do you explain the differences in the results? Spark reads the files
> and finds a local max 10X (+) faster than MR?
>
> Can you please attach the execution plan?
>
>
>
> Thanks
>
>
>
> Dudu
>
>
>
>
>
>
>
> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
> *Sent:* Monday, July 11, 2016 11:55 PM
> *To:* user ; user @spark 
> *Subject:* Re: Using Spark on Hive with Hive also using Spark as its
> execution engine
>
>
>
> In my test I did like for like keeping the systematic the same namely:
>
>
>
>1. Table was a parquet table of 100 Million rows
>2. The same set up was used for both Hive on Spark and Hive on MR
>3. Spark was very impressive compared to MR on this particular test.
>
>
>
> Just to see any issues I created an ORC table in in the image of Parquet
> (insert/select from Parquet to ORC) with stats updated for columns etc
>
>
>
> These were the results of the same run using ORC table this time:
>
>
>
> hive> select max(id) from oraclehadoop.dummy;
>
> Starting Spark Job = b886b869-5500-4ef7-aab9-ae6fb4dad22b
>
> Query Hive on Spark job[1] stages:
> 2
> 3
>
> Status: Running (Hive on Spark job[1])
> Job Progress Format
> CurrentTime StageId_StageAttemptId:
> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
> [StageCost]
> 2016-07-11 21:35:45,020 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1
> 2016-07-11 21:35:48,033 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1
> 2016-07-11 21:35:51,046 Stage-2_0: 1(+8)/23 Stage-3_0: 0/1
> 2016-07-11 21:35:52,050 Stage-2_0: 3(+8)/23 Stage-3_0: 0/1
> 2016-07-11 21:35:53,055 Stage-2_0: 8(+4)/23 Stage-3_0: 0/1
> 2016-07-11 21:35:54,060 Stage-2_0: 11(+1)/23Stage-3_0: 0/1
> 2016-07-11 21:35:55,065 Stage-2_0: 12(+0)/23Stage-3_0: 0/1
> 2016-07-11 21:35:56,071 Stage-2_0: 12(+8)/23Stage-3_0: 0/1
> 2016-07-11 21:35:57,076 Stage-2_0: 13(+8)/23Stage-3_0: 0/1
> 2016-07-11 21:35:58,081 Stage-2_0: 20(+3)/23Stage-3_0: 0/1
> 2016-07-11 21:35:59,085 Stage-2_0: 23/23 Finished   Stage-3_0: 0(+1)/1
> 2016-07-11 21:36:00,089 Stage-2_0: 23/23 Finished   Stage-3_0: 1/1
> Finished
> Status: Finished successfully in 16.08 seconds
> OK
> 1
> Time taken: 17.775 seconds, Fetched: 1 row(s)
>
>
>
> Repeat with MR engine
>
>
>
> hive> set hive.execution.engine=mr;
> Hive-on-MR is deprecated in Hive 2 and may not be available in the future
> versions. Consider using a different execution engine (i.e. spark, tez) or
> using Hive 1.X releases.
>
>
>
> hive> select max(id) from oraclehadoop.dummy;
> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in
> the future versions. Consider using a different execution engine (i.e.
> spark, tez) or using Hive 1.X releases.
> Query ID = hduser_20160711213100_8dc2afae-8644-4097-ba33-c7bd3c304bf8
> Total jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=
> In order to set a constant number of reducers:
>   set mapreduce.job.reduces=
> Starting Job = job_1468226887011_0008, Tracking URL =
> http://rhes564:8088/proxy/application_1468226887011_0008/
> Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill
> job_1468226887011_0008
> Hadoop job information for Stage-1: number of mappers: 23; number of
> reducers: 1
> 2016-07-11 21:37:00,061 Stage-1 map = 0%,  reduce = 0%
> 2016-07-11 21:37:06,440 Stage-1 map = 4%,  reduce = 0%, Cumulative CPU
> 16.48 sec
> 2016-07-11 21:37:14,751 Stage-1 map = 9%,  reduce = 0%, Cumulative CPU
> 40.63 sec
> 2016-07-11 21:37:22,048 Stage-1 map = 13%,  reduce = 0%, Cumulative CPU
> 58.88 sec
> 2016-07-11 21:37:30,412 Stage-1 map = 17%,  reduce = 0%, Cumulative CPU
> 80.72 sec
> 2016-07-11 21:37:37,707 Stage-1 map = 22%,  reduce = 0%, Cumulative CPU
> 103.43 sec
> 2016-07-11 21:37:45,999 Stage-1 map = 26%,  reduce = 0%, Cumulative CPU
> 125.93 sec
> 2016-07-11 21:37:54,300 Stage-1 map = 30%,  reduce = 0%, Cumulative CPU
> 147.17 sec
> 2016-07-11 21:38:01,538 Stage-1 map = 35%,  reduce = 0%, Cumulative CPU
> 166.56 sec
> 2016-07-11 21:38:08,807 Stage-1 map = 39%,  reduce = 0%, Cumulative CPU
> 189.29 sec
> 2016-07-11 21:38:17,115 Stage-1 map = 43%,  reduce = 0%, Cumulative CPU
> 211.03 sec
> 2016-07-11 21:38:24,363 Stage-1 map = 48%,  reduce = 0%, Cumulative CPU
>

RE: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Markovitz, Dudu
This is a simple task –
Read the files, find the local max value and combine the results (find the 
global max value).
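
As a rough sketch in Spark's Scala API (the input path and element type are
assumed for illustration), that computation is just a per-partition maximum
followed by a merge of the partial results - the same shape as the "hash" and
"mergepartial" Group By operators in the EXPLAIN plans quoted elsewhere in this
thread:

val ids = sc.textFile("hdfs:///path/to/dummy").map(_.toLong)    // assumed input

val localMaxes = ids.mapPartitions { it =>
  if (it.hasNext) Iterator(it.max) else Iterator.empty          // one local max per split
}
val globalMax = localMaxes.reduce(_ max _)                      // combine the local maxima
println(globalMax)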
How do you explain the differences in the results? Spark reads the files and 
finds a local max 10X (+) faster than MR?
Can you please attach the execution plan?

Thanks

Dudu



From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Monday, July 11, 2016 11:55 PM
To: user ; user @spark 
Subject: Re: Using Spark on Hive with Hive also using Spark as its execution 
engine

In my test I did like for like keeping the systematic the same namely:


  1.  Table was a parquet table of 100 Million rows
  2.  The same set up was used for both Hive on Spark and Hive on MR
  3.  Spark was very impressive compared to MR on this particular test.

Just to see any issues I created an ORC table in in the image of Parquet 
(insert/select from Parquet to ORC) with stats updated for columns etc

These were the results of the same run using ORC table this time:

hive> select max(id) from oraclehadoop.dummy;

Starting Spark Job = b886b869-5500-4ef7-aab9-ae6fb4dad22b
Query Hive on Spark job[1] stages:
2
3
Status: Running (Hive on Spark job[1])
Job Progress Format
CurrentTime StageId_StageAttemptId: 
SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount 
[StageCost]
2016-07-11 21:35:45,020 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1
2016-07-11 21:35:48,033 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1
2016-07-11 21:35:51,046 Stage-2_0: 1(+8)/23 Stage-3_0: 0/1
2016-07-11 21:35:52,050 Stage-2_0: 3(+8)/23 Stage-3_0: 0/1
2016-07-11 21:35:53,055 Stage-2_0: 8(+4)/23 Stage-3_0: 0/1
2016-07-11 21:35:54,060 Stage-2_0: 11(+1)/23Stage-3_0: 0/1
2016-07-11 21:35:55,065 Stage-2_0: 12(+0)/23Stage-3_0: 0/1
2016-07-11 21:35:56,071 Stage-2_0: 12(+8)/23Stage-3_0: 0/1
2016-07-11 21:35:57,076 Stage-2_0: 13(+8)/23Stage-3_0: 0/1
2016-07-11 21:35:58,081 Stage-2_0: 20(+3)/23Stage-3_0: 0/1
2016-07-11 21:35:59,085 Stage-2_0: 23/23 Finished   Stage-3_0: 0(+1)/1
2016-07-11 21:36:00,089 Stage-2_0: 23/23 Finished   Stage-3_0: 1/1 Finished
Status: Finished successfully in 16.08 seconds
OK
1
Time taken: 17.775 seconds, Fetched: 1 row(s)

Repeat with MR engine

hive> set hive.execution.engine=mr;
Hive-on-MR is deprecated in Hive 2 and may not be available in the future 
versions. Consider using a different execution engine (i.e. spark, tez) or 
using Hive 1.X releases.

hive> select max(id) from oraclehadoop.dummy;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the 
future versions. Consider using a different execution engine (i.e. spark, tez) 
or using Hive 1.X releases.
Query ID = hduser_20160711213100_8dc2afae-8644-4097-ba33-c7bd3c304bf8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapreduce.job.reduces=
Starting Job = job_1468226887011_0008, Tracking URL = 
http://rhes564:8088/proxy/application_1468226887011_0008/
Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill 
job_1468226887011_0008
Hadoop job information for Stage-1: number of mappers: 23; number of reducers: 1
2016-07-11 21:37:00,061 Stage-1 map = 0%,  reduce = 0%
2016-07-11 21:37:06,440 Stage-1 map = 4%,  reduce = 0%, Cumulative CPU 16.48 sec
2016-07-11 21:37:14,751 Stage-1 map = 9%,  reduce = 0%, Cumulative CPU 40.63 sec
2016-07-11 21:37:22,048 Stage-1 map = 13%,  reduce = 0%, Cumulative CPU 58.88 
sec
2016-07-11 21:37:30,412 Stage-1 map = 17%,  reduce = 0%, Cumulative CPU 80.72 
sec
2016-07-11 21:37:37,707 Stage-1 map = 22%,  reduce = 0%, Cumulative CPU 103.43 
sec
2016-07-11 21:37:45,999 Stage-1 map = 26%,  reduce = 0%, Cumulative CPU 125.93 
sec
2016-07-11 21:37:54,300 Stage-1 map = 30%,  reduce = 0%, Cumulative CPU 147.17 
sec
2016-07-11 21:38:01,538 Stage-1 map = 35%,  reduce = 0%, Cumulative CPU 166.56 
sec
2016-07-11 21:38:08,807 Stage-1 map = 39%,  reduce = 0%, Cumulative CPU 189.29 
sec
2016-07-11 21:38:17,115 Stage-1 map = 43%,  reduce = 0%, Cumulative CPU 211.03 
sec
2016-07-11 21:38:24,363 Stage-1 map = 48%,  reduce = 0%, Cumulative CPU 235.68 
sec
2016-07-11 21:38:32,638 Stage-1 map = 52%,  reduce = 0%, Cumulative CPU 258.27 
sec
2016-07-11 21:38:40,916 Stage-1 map = 57%,  reduce = 0%, Cumulative CPU 278.44 
sec
2016-07-11 21:38:49,206 Stage-1 map = 61%,  reduce = 0%, Cumulative CPU 300.35 
sec
2016-07-11 21:38:58,524 Stage-1 map = 65%,  reduce = 0%, Cumulative CPU 322.89 
sec
2016-07-11 21:39:07,889 Stage-1 map = 70%,  reduce = 0%, Cumulative CPU 344.8 
sec
2016-07-11 21:39:16,151 Stage-1 map = 74%,  reduce = 0%, Cumulative CPU 367.77 
sec
2016-07-11 21:39:25,456 Stage-1 map = 78%,  reduce = 0%, Cumulative CPU 391.82 
sec
2016-07-11 21:39:33,725 Stage-1 map = 83%,  reduce = 0%, C

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Mich Talebzadeh
Another point on Hive on Spark and Hive on Tez + LLAP; I am thinking out loud
:)


   1. I am using Hive on Spark and I have a table of, say, 10GB, with 100
   users concurrently accessing the same partition of an ORC table (the last
   hour or so).
   2. Spark takes the data and puts it in memory. I gather only the data for
   that partition will be loaded, but once per user, so for 100 users there
   will be 100 copies.
   3. Spark, unlike an RDBMS, does not have the notion of a hot cache or a
   Most Recently Used (MRU) / Least Recently Used (LRU) chain. So once a user
   finishes, the data is released from Spark memory, and the next user will
   load that data again. Potentially this is somewhat wasteful of resources?
   4. With Tez we only have DAG. It is MR with DAG. So the same algorithm
   will be applied to the 100 user sessions, but with no in-memory reuse.
   5. If I add LLAP, will that be more efficient in terms of memory usage
   compared to Hive or not? Will it keep the data in memory for reuse or not?
   6. What I don't understand is what makes Tez and LLAP more efficient
   compared to Spark!

Cheers

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 11 July 2016 at 21:54, Mich Talebzadeh  wrote:

> In my test I did like for like, keeping the setup the same, namely:
>
>
>1. Table was a parquet table of 100 Million rows
>2. The same set up was used for both Hive on Spark and Hive on MR
>3. Spark was very impressive compared to MR on this particular test.
>
>
> Just to see any issues I created an ORC table in the image of Parquet
> (insert/select from Parquet to ORC) with stats updated for columns etc
>
> These were the results of the same run using ORC table this time:
>
> hive> select max(id) from oraclehadoop.dummy;
>
> Starting Spark Job = b886b869-5500-4ef7-aab9-ae6fb4dad22b
> Query Hive on Spark job[1] stages:
> 2
> 3
> Status: Running (Hive on Spark job[1])
> Job Progress Format
> CurrentTime StageId_StageAttemptId:
> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
> [StageCost]
> 2016-07-11 21:35:45,020 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1
> 2016-07-11 21:35:48,033 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1
> 2016-07-11 21:35:51,046 Stage-2_0: 1(+8)/23 Stage-3_0: 0/1
> 2016-07-11 21:35:52,050 Stage-2_0: 3(+8)/23 Stage-3_0: 0/1
> 2016-07-11 21:35:53,055 Stage-2_0: 8(+4)/23 Stage-3_0: 0/1
> 2016-07-11 21:35:54,060 Stage-2_0: 11(+1)/23Stage-3_0: 0/1
> 2016-07-11 21:35:55,065 Stage-2_0: 12(+0)/23Stage-3_0: 0/1
> 2016-07-11 21:35:56,071 Stage-2_0: 12(+8)/23Stage-3_0: 0/1
> 2016-07-11 21:35:57,076 Stage-2_0: 13(+8)/23Stage-3_0: 0/1
> 2016-07-11 21:35:58,081 Stage-2_0: 20(+3)/23Stage-3_0: 0/1
> 2016-07-11 21:35:59,085 Stage-2_0: 23/23 Finished   Stage-3_0: 0(+1)/1
> 2016-07-11 21:36:00,089 Stage-2_0: 23/23 Finished   Stage-3_0: 1/1
> Finished
> Status: Finished successfully in 16.08 seconds
> OK
> 1
> Time taken: 17.775 seconds, Fetched: 1 row(s)
>
> Repeat with MR engine
>
> hive> set hive.execution.engine=mr;
> Hive-on-MR is deprecated in Hive 2 and may not be available in the future
> versions. Consider using a different execution engine (i.e. spark, tez) or
> using Hive 1.X releases.
>
> hive> select max(id) from oraclehadoop.dummy;
> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in
> the future versions. Consider using a different execution engine (i.e.
> spark, tez) or using Hive 1.X releases.
> Query ID = hduser_20160711213100_8dc2afae-8644-4097-ba33-c7bd3c304bf8
> Total jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=
> In order to set a constant number of reducers:
>   set mapreduce.job.reduces=
> Starting Job = job_1468226887011_0008, Tracking URL =
> http://rhes564:8088/proxy/application_1468226887011_0008/
> Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill
> job_1468226887011_0008
> Hadoop job information for Stage-1: number of mappers: 23; number of
> reducers: 1
> 2016-07-11 21:37:00,061 Stage-1 map = 0%,  reduce = 0%
> 2016-07-11 21:37:06,440 Stage-1 map = 4%,  reduce = 0%, Cumulative CPU
> 16.48 sec
> 2016-07-11 21:37:14,751 Stage-1 map = 9%,  reduce = 0%, Cumulative CPU
> 40.63 sec
> 2016-07-11 21:37:22,048 Stage-1 map = 13%,  reduce = 0%, Cumulative CPU
> 58.88 sec
> 20

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Mich Talebzadeh
In my test I did like for like, keeping the setup the same, namely:


   1. Table was a parquet table of 100 Million rows
   2. The same set up was used for both Hive on Spark and Hive on MR
   3. Spark was very impressive compared to MR on this particular test.


Just to see any issues I created an ORC table in the image of Parquet
(insert/select from Parquet to ORC) with stats updated for columns etc.
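
For reference, one way of doing that copy, sketched roughly (the actual DDL,
column list and any partitioning are not reproduced here):

CREATE TABLE oraclehadoop.dummy STORED AS ORC
AS SELECT * FROM oraclehadoop.dummy_parquet;

ANALYZE TABLE oraclehadoop.dummy COMPUTE STATISTICS;
ANALYZE TABLE oraclehadoop.dummy COMPUTE STATISTICS FOR COLUMNS;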

These were the results of the same run using ORC table this time:

hive> select max(id) from oraclehadoop.dummy;

Starting Spark Job = b886b869-5500-4ef7-aab9-ae6fb4dad22b
Query Hive on Spark job[1] stages:
2
3
Status: Running (Hive on Spark job[1])
Job Progress Format
CurrentTime StageId_StageAttemptId:
SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
[StageCost]
2016-07-11 21:35:45,020 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1
2016-07-11 21:35:48,033 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1
2016-07-11 21:35:51,046 Stage-2_0: 1(+8)/23 Stage-3_0: 0/1
2016-07-11 21:35:52,050 Stage-2_0: 3(+8)/23 Stage-3_0: 0/1
2016-07-11 21:35:53,055 Stage-2_0: 8(+4)/23 Stage-3_0: 0/1
2016-07-11 21:35:54,060 Stage-2_0: 11(+1)/23Stage-3_0: 0/1
2016-07-11 21:35:55,065 Stage-2_0: 12(+0)/23Stage-3_0: 0/1
2016-07-11 21:35:56,071 Stage-2_0: 12(+8)/23Stage-3_0: 0/1
2016-07-11 21:35:57,076 Stage-2_0: 13(+8)/23Stage-3_0: 0/1
2016-07-11 21:35:58,081 Stage-2_0: 20(+3)/23Stage-3_0: 0/1
2016-07-11 21:35:59,085 Stage-2_0: 23/23 Finished   Stage-3_0: 0(+1)/1
2016-07-11 21:36:00,089 Stage-2_0: 23/23 Finished   Stage-3_0: 1/1
Finished
Status: Finished successfully in 16.08 seconds
OK
1
Time taken: 17.775 seconds, Fetched: 1 row(s)

Repeat with MR engine

hive> set hive.execution.engine=mr;
Hive-on-MR is deprecated in Hive 2 and may not be available in the future
versions. Consider using a different execution engine (i.e. spark, tez) or
using Hive 1.X releases.

hive> select max(id) from oraclehadoop.dummy;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the
future versions. Consider using a different execution engine (i.e. spark,
tez) or using Hive 1.X releases.
Query ID = hduser_20160711213100_8dc2afae-8644-4097-ba33-c7bd3c304bf8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapreduce.job.reduces=
Starting Job = job_1468226887011_0008, Tracking URL =
http://rhes564:8088/proxy/application_1468226887011_0008/
Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill
job_1468226887011_0008
Hadoop job information for Stage-1: number of mappers: 23; number of
reducers: 1
2016-07-11 21:37:00,061 Stage-1 map = 0%,  reduce = 0%
2016-07-11 21:37:06,440 Stage-1 map = 4%,  reduce = 0%, Cumulative CPU
16.48 sec
2016-07-11 21:37:14,751 Stage-1 map = 9%,  reduce = 0%, Cumulative CPU
40.63 sec
2016-07-11 21:37:22,048 Stage-1 map = 13%,  reduce = 0%, Cumulative CPU
58.88 sec
2016-07-11 21:37:30,412 Stage-1 map = 17%,  reduce = 0%, Cumulative CPU
80.72 sec
2016-07-11 21:37:37,707 Stage-1 map = 22%,  reduce = 0%, Cumulative CPU
103.43 sec
2016-07-11 21:37:45,999 Stage-1 map = 26%,  reduce = 0%, Cumulative CPU
125.93 sec
2016-07-11 21:37:54,300 Stage-1 map = 30%,  reduce = 0%, Cumulative CPU
147.17 sec
2016-07-11 21:38:01,538 Stage-1 map = 35%,  reduce = 0%, Cumulative CPU
166.56 sec
2016-07-11 21:38:08,807 Stage-1 map = 39%,  reduce = 0%, Cumulative CPU
189.29 sec
2016-07-11 21:38:17,115 Stage-1 map = 43%,  reduce = 0%, Cumulative CPU
211.03 sec
2016-07-11 21:38:24,363 Stage-1 map = 48%,  reduce = 0%, Cumulative CPU
235.68 sec
2016-07-11 21:38:32,638 Stage-1 map = 52%,  reduce = 0%, Cumulative CPU
258.27 sec
2016-07-11 21:38:40,916 Stage-1 map = 57%,  reduce = 0%, Cumulative CPU
278.44 sec
2016-07-11 21:38:49,206 Stage-1 map = 61%,  reduce = 0%, Cumulative CPU
300.35 sec
2016-07-11 21:38:58,524 Stage-1 map = 65%,  reduce = 0%, Cumulative CPU
322.89 sec
2016-07-11 21:39:07,889 Stage-1 map = 70%,  reduce = 0%, Cumulative CPU
344.8 sec
2016-07-11 21:39:16,151 Stage-1 map = 74%,  reduce = 0%, Cumulative CPU
367.77 sec
2016-07-11 21:39:25,456 Stage-1 map = 78%,  reduce = 0%, Cumulative CPU
391.82 sec
2016-07-11 21:39:33,725 Stage-1 map = 83%,  reduce = 0%, Cumulative CPU
415.48 sec
2016-07-11 21:39:43,037 Stage-1 map = 87%,  reduce = 0%, Cumulative CPU
436.09 sec
2016-07-11 21:39:51,292 Stage-1 map = 91%,  reduce = 0%, Cumulative CPU
459.4 sec
2016-07-11 21:39:59,563 Stage-1 map = 96%,  reduce = 0%, Cumulative CPU
477.92 sec
2016-07-11 21:40:05,760 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU
491.72 sec
2016-07-11 21:40:10,921 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU
499.37 sec
MapReduce Total cumulative CPU time: 8 minutes 19 seconds 370 msec
Ended Job = job_1468226887

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Gopal Vijayaraghavan

> Status: Finished successfully in 14.12 seconds
> OK
> 1
> Time taken: 14.38 seconds, Fetched: 1 row(s)

That might be an improvement over MR, but that still feels far too slow.


Parquet numbers are in general bad in Hive, but that's because the Parquet
reader gets no actual love from the devs. The community, if it wants to
keep using Parquet heavily, needs a Hive dev to go over to Parquet-mr and
cut a significant number of memory copies out of the reader.

The Spark 2.0 build for instance, has a custom Parquet reader for SparkSQL
which does this. SPARK-12854 does for Spark+Parquet what Hive 2.0 does for
ORC (actually, it looks more like hive's VectorizedRowBatch than
Tungsten's flat layouts).

But that reader cannot be used in Hive-on-Spark, because it is not a
public reader impl.


Not to pick an arbitrary dataset, my workhorse example is a TPC-H lineitem
at 10Gb scale with a single 16 core box.

hive(tpch_flat_orc_10)> select max(l_discount) from lineitem;
Query ID = gopal_20160711175917_f96371aa-2721-49c8-99a0-f7c4a1eacfda
Total jobs = 1
Launching Job 1 out of 1


Status: Running (Executing on YARN cluster with App id
application_1466700718395_0256)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 ..........      llap     SUCCEEDED     13         13        0        0       0       0
Reducer 2 ......      llap     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 0.71 s
----------------------------------------------------------------------------------------------
Status: DAG finished successfully in 0.71 seconds

Query Execution Summary
----------------------------------------------------------------------------------------------
OPERATION                                                                    DURATION
----------------------------------------------------------------------------------------------
Compile Query                                                                   0.21s
Prepare Plan                                                                    0.13s
Submit Plan                                                                     0.34s
Start DAG                                                                       0.23s
Run DAG                                                                         0.71s
----------------------------------------------------------------------------------------------

Task Execution Summary
----------------------------------------------------------------------------------------------
  VERTICES   DURATION(ms)  CPU_TIME(ms)  GC_TIME(ms)  INPUT_RECORDS  OUTPUT_RECORDS
----------------------------------------------------------------------------------------------
     Map 1         604.00             0            0     59,957,438              13
 Reducer 2         105.00             0            0             13               0
----------------------------------------------------------------------------------------------

LLAP IO Summary
----------------------------------------------------------------------------------------------
  VERTICES  ROWGROUPS  META_HIT  META_MISS  DATA_HIT  DATA_MISS  ALLOCATION      USED  TOTAL_IO
----------------------------------------------------------------------------------------------
     Map 1       6036         0        146        0B    68.86MB    491.00MB  479.89MB     7.94s
----------------------------------------------------------------------------------------------

OK
0.1
Time taken: 1.669 seconds, Fetched: 1 row(s)
hive(tpch_flat_orc_10)>


This is running against a single 16 core box & I would assume it would
take <1.4s to read twice as much (13 tasks is barely touching the load
factors).

It would probably be a bit faster if the cache had hits, but in general 14s
to read 100M rows is nearly an order of magnitude off from where Hive 2.2.0 is.

Cheers,
Gopal













Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Mich Talebzadeh
Appreciate all the comments.

Hive on Spark: Spark runs as an execution engine and is only used when you
query Hive. Otherwise it is not running. I run it in YARN client mode. Let
me show you an example.

In hive-site.xml set the execution engine to spark. It requires
some configuration but it does work :)

Alternatively log in to hive and do the setting there


set hive.execution.engine=spark;
set spark.home=/usr/lib/spark-1.3.1-bin-hadoop2.6;
set spark.master=yarn-client;
set spark.executor.memory=3g;
set spark.driver.memory=3g;
set spark.executor.cores=8;
set spark.ui.port=;

Small test ride

First using Hive 2 on Spark 1.3.1 to find max(id) for a 100million rows
parquet table

hive> select max(id) from oraclehadoop.dummy_parquet;

Starting Spark Job = a7752b2b-d73a-45de-aced-ddf02810938d
Query Hive on Spark job[1] stages:
2
3
Status: Running (Hive on Spark job[1])
Job Progress Format
CurrentTime StageId_StageAttemptId:
SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
[StageCost]
2016-07-11 17:41:52,386 Stage-2_0: 0(+8)/24 Stage-3_0: 0/1
2016-07-11 17:41:55,409 Stage-2_0: 1(+8)/24 Stage-3_0: 0/1
2016-07-11 17:41:56,420 Stage-2_0: 8(+4)/24 Stage-3_0: 0/1
2016-07-11 17:41:58,434 Stage-2_0: 10(+2)/24Stage-3_0: 0/1
2016-07-11 17:41:59,440 Stage-2_0: 12(+8)/24Stage-3_0: 0/1
2016-07-11 17:42:01,455 Stage-2_0: 17(+7)/24Stage-3_0: 0/1
2016-07-11 17:42:02,462 Stage-2_0: 20(+4)/24Stage-3_0: 0/1
2016-07-11 17:42:04,476 Stage-2_0: 23(+1)/24Stage-3_0: 0/1
2016-07-11 17:42:05,483 Stage-2_0: 24/24 Finished   Stage-3_0: 1/1
Finished

Status: Finished successfully in 14.12 seconds
OK
1
Time taken: 14.38 seconds, Fetched: 1 row(s)

--simply switch the engine in hive to MR

hive> set hive.execution.engine=mr;
Hive-on-MR is deprecated in Hive 2 and may not be available in the future
versions. Consider using a different execution engine (i.e. spark, tez) or
using Hive 1.X releases.

hive> select max(id) from oraclehadoop.dummy_parquet;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the
future versions. Consider using a different execution engine (i.e. spark,
tez) or using Hive 1.X releases.
Starting Job = job_1468226887011_0005, Tracking URL =
http://rhes564:8088/proxy/application_1468226887011_0005/
Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill
job_1468226887011_0005
Hadoop job information for Stage-1: number of mappers: 24; number of
reducers: 1
2016-07-11 17:42:46,904 Stage-1 map = 0%,  reduce = 0%
2016-07-11 17:42:56,328 Stage-1 map = 4%,  reduce = 0%, Cumulative CPU
31.76 sec
2016-07-11 17:43:05,676 Stage-1 map = 8%,  reduce = 0%, Cumulative CPU
61.78 sec
2016-07-11 17:43:16,091 Stage-1 map = 13%,  reduce = 0%, Cumulative CPU
95.44 sec
2016-07-11 17:43:24,419 Stage-1 map = 17%,  reduce = 0%, Cumulative CPU
121.6 sec
2016-07-11 17:43:32,734 Stage-1 map = 21%,  reduce = 0%, Cumulative CPU
149.37 sec
2016-07-11 17:43:41,031 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU
177.62 sec
2016-07-11 17:43:48,305 Stage-1 map = 29%,  reduce = 0%, Cumulative CPU
204.92 sec
2016-07-11 17:43:56,580 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU
235.34 sec
2016-07-11 17:44:05,917 Stage-1 map = 38%,  reduce = 0%, Cumulative CPU
262.18 sec
2016-07-11 17:44:14,222 Stage-1 map = 42%,  reduce = 0%, Cumulative CPU
286.21 sec
2016-07-11 17:44:22,502 Stage-1 map = 46%,  reduce = 0%, Cumulative CPU
310.34 sec
2016-07-11 17:44:32,923 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU
346.26 sec
2016-07-11 17:44:43,301 Stage-1 map = 54%,  reduce = 0%, Cumulative CPU
379.11 sec
2016-07-11 17:44:53,674 Stage-1 map = 58%,  reduce = 0%, Cumulative CPU
417.9 sec
2016-07-11 17:45:04,001 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU
450.73 sec
2016-07-11 17:45:13,327 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU
476.7 sec
2016-07-11 17:45:22,656 Stage-1 map = 71%,  reduce = 0%, Cumulative CPU
508.97 sec
2016-07-11 17:45:33,002 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU
535.69 sec
2016-07-11 17:45:43,355 Stage-1 map = 79%,  reduce = 0%, Cumulative CPU
573.33 sec
2016-07-11 17:45:52,613 Stage-1 map = 83%,  reduce = 0%, Cumulative CPU
605.01 sec
2016-07-11 17:46:02,962 Stage-1 map = 88%,  reduce = 0%, Cumulative CPU
632.38 sec
2016-07-11 17:46:13,316 Stage-1 map = 92%,  reduce = 0%, Cumulative CPU
666.45 sec
2016-07-11 17:46:23,656 Stage-1 map = 96%,  reduce = 0%, Cumulative CPU
693.72 sec
2016-07-11 17:46:31,919 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU
714.71 sec
2016-07-11 17:46:36,060 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU
721.83 sec
MapReduce Total cumulative CPU time: 12 minutes 1 seconds 830 msec
Ended Job = job_1468226887011_0005
MapReduce Jobs Launched:
Stage-Stage-1: Map: 24  Reduce: 1   Cumulative CPU: 721.83 sec   HDFS Read:
400442823 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 12 minutes 1 seconds 830 msec
OK
1
Time taken: 239.532 seconds, Fetched: 1 row(s)


I leave it t

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Michael Segel
Just a clarification. 

Tez is ‘vendor’ independent.  ;-) 

Yeah… I know…  Anyone can support it.  Only Hortonworks has stacked the deck in 
their favor. 

Drill could be in the same boat, although there are now more committers who are not 
working for MapR. I’m not sure who outside of HW is supporting Tez. 

But I digress. 

Here in the Spark user list, I have to ask how do you run hive on spark? Is the 
execution engine … the spark context always running? (Client mode I assume) 
Are the executors always running?   Can you run multiple queries from multiple 
users in parallel? 

These are some of the questions that should be asked and answered when 
considering how viable spark is going to be as the engine under Hive… 

Thx

-Mike

> On May 29, 2016, at 3:35 PM, Mich Talebzadeh  
> wrote:
> 
> thanks I think the problem is that the TEZ user group is exceptionally quiet. 
> Just sent an email to Hive user group to see anyone has managed to built a 
> vendor independent version.
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 29 May 2016 at 21:23, Jörn Franke  > wrote:
> Well I think it is different from MR. It has some optimizations which you do 
> not find in MR. Especially the LLAP option in Hive2 makes it interesting. 
> 
> I think hive 1.2 works with 0.7 and 2.0 with 0.8 . At least for 1.2 it is 
> integrated in the Hortonworks distribution. 
> 
> 
> On 29 May 2016, at 21:43, Mich Talebzadeh  > wrote:
> 
>> Hi Jorn,
>> 
>> I started building apache-tez-0.8.2 but got few errors. Couple of guys from 
>> TEZ user group kindly gave a hand but I could not go very far (or may be I 
>> did not make enough efforts) making it work.
>> 
>> That TEZ user group is very quiet as well.
>> 
>> My understanding is TEZ is MR with DAG but of course Spark has both plus 
>> in-memory capability.
>> 
>> It would be interesting to see what version of TEZ works as execution engine 
>> with Hive.
>> 
>> Vendors are divided on this (use Hive with TEZ) or use Impala instead of 
>> Hive etc as I am sure you already know.
>> 
>> Cheers,
>> 
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>>  
>> 
>> On 29 May 2016 at 20:19, Jörn Franke > > wrote:
>> Very interesting do you plan also a test with TEZ?
>> 
>> On 29 May 2016, at 13:40, Mich Talebzadeh > > wrote:
>> 
>>> Hi,
>>> 
>>> I did another study of Hive using Spark engine compared to Hive with MR.
>>> 
>>> Basically took the original table imported using Sqoop and created and 
>>> populated a new ORC table partitioned by year and month into 48 partitions 
>>> as follows:
>>> 
>>> 
>>>
>>> Connections use JDBC via beeline. Now for each partition using MR it takes 
>>> an average of 17 minutes as seen below for each PARTITION..  Now that is 
>>> just an individual partition and there are 48 partitions.
>>> 
>>> In contrast doing the same operation with Spark engine took 10 minutes all 
>>> inclusive. I just gave up on MR. You can see the StartTime and FinishTime 
>>> from below
>>> 
>>> 
>>> 
>>> This is by no means indicate that Spark is much better than MR but shows 
>>> that some very good results can ve achieved using Spark engine.
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> 
>>>  
>>> http://talebzadehmich.wordpress.com 
>>>  
>>> 
>>> On 24 May 2016 at 08:03, Mich Talebzadeh >> > wrote:
>>> Hi,
>>> 
>>> We use Hive as the database and use Spark as an all purpose query tool.
>>> 
>>> Whether Hive is the right database for the purpose or one is better off with 
>>> something like Phoenix on HBase, well the answer is it depends and your 
>>> mileage varies. 
>>> 
>>> So fit for purpose.
>>> 
>>> Ideally what one wants is to use the fastest method to get the results. How 
>>> fast, we confine to our SLA agreements in production, and that saves us 
>>> from unnecessary further work, as we technologists like to play around.
>>> 
>>> So in short, we use Spark most of the time and use Hive as the backend 
>>> engine for data storage, mainly ORC tables.
>>> 
>>> We use Hive on Spark and with Hive 2 on Spark 1.3.1 for now we have a 
>>> combination that works. Granted it helps to use Hive 2 on Spark

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Jörn Franke
I think LLAP should in the future be a general component, so LLAP + Spark can 
make sense. I see Tez and Spark not as competitors; they have different 
purposes. Hive+Tez+LLAP is not the same as Hive+Spark; I think it goes beyond 
that for interactive queries.
Tez - you should use a distribution (e.g. Hortonworks) - generally I would use a 
distribution for anything related to performance, testing etc., because doing 
your own installation is more complex and more difficult to maintain. Performance 
and also features will be worse if you do not use a distribution. Which one you 
use is up to you.

> On 11 Jul 2016, at 17:09, Mich Talebzadeh  wrote:
> 
> The presentation will go deeper into the topic. Otherwise some thoughts of 
> mine. Feel free to comment or criticise :) 
> 
> I am a member of Spark Hive and Tez user groups plus one or two others
> Spark is by far the biggest in terms of community interaction
> Tez, typically one thread in a month
> Personally started building Tez for Hive from Tez source and gave up as it 
> was not working. This was my own build as opposed to a distro
> if Hive says you should use Spark or Tez then using Spark is a perfectly 
> valid choice
> If Tez & LLAP offers you a Spark (DAG + in-memory caching) under the bonnet 
> why bother.
> Yes I have seen some test results (Hive on Spark vs Hive on Tez) etc. but 
> they are a bit dated (not being unkind) and cannot be taken as is today. One 
> their concern if I recall was excessive CPU and memory usage of Spark but 
> then with the same token LLAP will add additional need for resources
> Essentially I am more comfortable to use less of technology stack than more.  
> With Hive and Spark (in this context) we have two. With Hive, Tez and LLAP, 
> we have three stacks to look after that add to skill cost as well.
> Yep. It is still good to keep it simple
> 
> My thoughts on this are that if you have a viable open source product like 
> Spark which is becoming a sort of Vogue in Big Data space and moving very 
> fast, why look for another one. Hive does what it says on the Tin and good 
> reliable Data Warehouse.
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
>> On 11 July 2016 at 15:22, Ashok Kumar  wrote:
>> Hi Mich,
>> 
>> Your recent presentation in London on this topic "Running Spark on Hive or 
>> Hive on Spark"
>> 
>> Have you made any more interesting findings that you like to bring up?
>> 
>> If Hive is offering both Spark and Tez in addition to MR, what is stopping one 
>> from using Spark? I still don't get why TEZ + LLAP is going to be a better 
>> choice from what you mentioned.
>> 
>> thanking you 
>> 
>> 
>> 
>> On Tuesday, 31 May 2016, 20:22, Mich Talebzadeh  
>> wrote:
>> 
>> 
>> Couple of points if I may and kindly bear with my remarks.
>> 
>> Whilst it will be very interesting to try TEZ with LLAP. As I read from LLAP
>> 
>> "Sub-second queries require fast query execution and low setup cost. The 
>> challenge for Hive is to achieve this without giving up on the scale and 
>> flexibility that users depend on. This requires a new approach using a 
>> hybrid engine that leverages Tez and something new called  LLAP (Live Long 
>> and Process, #llap online).
>> 
>> LLAP is an optional daemon process running on multiple nodes, that provides 
>> the following:
>> Caching and data reuse across queries with compressed columnar data 
>> in-memory (off-heap)
>> Multi-threaded execution including reads with predicate pushdown and hash 
>> joins
>> High throughput IO using Async IO Elevator with dedicated thread and core 
>> per disk
>> Granular column level security across applications
>> "
>> OK so we have added an in-memory capability to TEZ by way of LLAP, In other 
>> words what Spark does already and BTW it does not require a daemon running 
>> on any host. Don't take me wrong. It is interesting but this sounds to me 
>> (without testing myself) adding caching capability to TEZ to bring it on par 
>> with SPARK.
>> 
>> Remember:
>> 
>> Spark -> DAG + in-memory caching
>> TEZ = MR on DAG
>> TEZ + LLAP => DAG + in-memory caching
>> 
>> OK it is another way getting the same result. However, my concerns:
>> 
>> Spark has a wide user base. I judge this from Spark user group traffic
>> TEZ user group has no traffic I am afraid
>> LLAP I don't know
>> Sounds like Hortonworks promote TEZ and Cloudera does not want to know 
>> anything about Hive. and they promote Impala but that sounds like a sinking 
>> ship these days.
>> 
>> Having said that I will tr

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Mich Talebzadeh
The presentation will go deeper into the topic. Otherwise some thoughts of
mine. Feel free to comment or criticise :)


   1. I am a member of Spark Hive and Tez user groups plus one or two others
   2. Spark is by far the biggest in terms of community interaction
   3. Tez, typically one thread in a month
   4. Personally started building Tez for Hive from Tez source and gave up
   as it was not working. This was my own build as opposed to a distro
   5. If Hive says you should use Spark or Tez then using Spark is a
   perfectly valid choice
   6. If Tez & LLAP offer you what Spark does (DAG + in-memory caching) under
   the bonnet, why bother?
   7. Yes, I have seen some test results (Hive on Spark vs Hive on Tez) etc.,
   but they are a bit dated (not being unkind) and cannot be taken as is
   today. One of their concerns, if I recall, was excessive CPU and memory
   usage by Spark, but by the same token LLAP will add additional need for
   resources
   8. Essentially I am more comfortable using less of the technology stack
   than more. With Hive and Spark (in this context) we have two. With Hive,
   Tez and LLAP, we have three stacks to look after, which adds to skill cost
   as well.
   9. Yep. It is still good to keep it simple


My thoughts on this are that if you have a viable open source product like
Spark, which is becoming something of a vogue in the Big Data space and moving
very fast, why look for another one? Hive does what it says on the tin and is a
good, reliable Data Warehouse.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 11 July 2016 at 15:22, Ashok Kumar  wrote:

> Hi Mich,
>
> Your recent presentation in London on this topic "Running Spark on Hive or
> Hive on Spark"
>
> Have you made any more interesting findings that you like to bring up?
>
> If Hive is offering both Spark and Tez in addition to MR, what is stopping
> one from using Spark? I still don't get why TEZ + LLAP is going to be a
> better choice from what you mentioned.
>
> thanking you
>
>
>
> On Tuesday, 31 May 2016, 20:22, Mich Talebzadeh 
> wrote:
>
>
> Couple of points if I may and kindly bear with my remarks.
>
> Whilst it will be very interesting to try TEZ with LLAP. As I read from
> LLAP
>
> "Sub-second queries require fast query execution and low setup cost. The
> challenge for Hive is to achieve this without giving up on the scale and
> flexibility that users depend on. This requires a new approach using a
> hybrid engine that leverages Tez and something new called  LLAP (Live Long
> and Process, #llap online).
>
> LLAP is an optional daemon process running on multiple nodes, that
> provides the following:
>
>- Caching and data reuse across queries with compressed columnar data
>in-memory (off-heap)
>- Multi-threaded execution including reads with predicate pushdown and
>hash joins
>- High throughput IO using Async IO Elevator with dedicated thread and
>core per disk
>- Granular column level security across applications
>- "
>
> OK so we have added an in-memory capability to TEZ by way of LLAP, In
> other words what Spark does already and BTW it does not require a daemon
> running on any host. Don't take me wrong. It is interesting but this sounds
> to me (without testing myself) adding caching capability to TEZ to bring it
> on par with SPARK.
>
> Remember:
>
> Spark -> DAG + in-memory caching
> TEZ = MR on DAG
> TEZ + LLAP => DAG + in-memory caching
>
> OK it is another way getting the same result. However, my concerns:
>
>
>- Spark has a wide user base. I judge this from Spark user group
>traffic
>- TEZ user group has no traffic I am afraid
>- LLAP I don't know
>
> Sounds like Hortonworks promote TEZ and Cloudera does not want to know
> anything about Hive. and they promote Impala but that sounds like a sinking
> ship these days.
>
> Having said that I will try TEZ + LLAP :) No pun intended
>
> Regards
>
> Dr Mich Talebzadeh
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
> http://talebzadehmich.wordpress.com
>
>
> On 31 May 2016 at 08:19, Jörn Franke  wrote:
>
> Thanks very interesting explanation. Looking forward to test it.
>
> > On 31 May 2016, at 07:51, Gopal Vijayaraghavan 
> wrote:
> >
> >
> >> That being said all systems are evolving. Hive supports tez+llap which
> >> is basically the in-memory support.
> >
> > There is a big 

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Ashok Kumar
Hi Mich,
Your recent presentation in London on this topic "Running Spark on Hive or Hive 
on Spark"
Have you made any more interesting findings that you would like to bring up?
If Hive is offering both Spark and Tez in addition to MR, what is stopping one 
from using Spark? I still don't get why TEZ + LLAP is going to be a better choice 
from what you mentioned.
thanking you 
 

On Tuesday, 31 May 2016, 20:22, Mich Talebzadeh  
wrote:
 

 Couple of points if I may and kindly bear with my remarks. 
Whilst it will be very interesting to try TEZ with LLAP. As I read from LLAP
"Sub-second queries require fast query execution and low setup cost. The 
challenge for Hive is to achieve this without giving up on the scale and 
flexibility that users depend on. This requires a new approach using a hybrid 
engine that leverages Tez and something new called  LLAP (Live Long and 
Process, #llap online).
LLAP is an optional daemon process running on multiple nodes, that provides the 
following:   
   - Caching and data reuse across queries with compressed columnar data 
in-memory (off-heap)
   - Multi-threaded execution including reads with predicate pushdown and hash 
joins
   - High throughput IO using Async IO Elevator with dedicated thread and core 
per disk
   - Granular column level security across applications
   - "
OK so we have added an in-memory capability to TEZ by way of LLAP, In other 
words what Spark does already and BTW it does not require a daemon running on 
any host. Don't take me wrong. It is interesting but this sounds to me (without 
testing myself) adding caching capability to TEZ to bring it on par with SPARK. 
Remember:
Spark -> DAG + in-memory caching
TEZ = MR on DAG
TEZ + LLAP => DAG + in-memory caching
OK it is another way getting the same result. However, my concerns:
   
   - Spark has a wide user base. I judge this from Spark user group traffic
   - TEZ user group has no traffic I am afraid
   - LLAP I don't know
Sounds like Hortonworks promote TEZ and Cloudera does not want to know anything 
about Hive. and they promote Impala but that sounds like a sinking ship these 
days.
Having said that I will try TEZ + LLAP :) No pun intended
Regards
Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 31 May 2016 at 08:19, Jörn Franke  wrote:

Thanks very interesting explanation. Looking forward to test it.

> On 31 May 2016, at 07:51, Gopal Vijayaraghavan  wrote:
>
>
>> That being said all systems are evolving. Hive supports tez+llap which
>> is basically the in-memory support.
>
> There is a big difference between where LLAP & SparkSQL, which has to do
> with access pattern needs.
>
> The first one is related to the lifetime of the cache - the Spark RDD
> cache is per-user-session which allows for further operation in that
> session to be optimized.
>
> LLAP is designed to be hammered by multiple user sessions running
> different queries, designed to automate the cache eviction & selection
> process. There's no user visible explicit .cache() to remember - it's
> automatic and concurrent.
>
> My team works with both engines, trying to improve it for ORC, but the
> goals of both are different.
>
> I will probably have to write a proper academic paper & get it
> edited/reviewed instead of send my ramblings to the user lists like this.
> Still, this needs an example to talk about.
>
> To give a qualified example, let's leave the world of single use clusters
> and take the use-case detailed here
>
> http://hortonworks.com/blog/impala-vs-hive-performance-benchmark/
>
>
> There are two distinct problems there - one is that a single day sees upto
> 100k independent user sessions running queries and that most queries cover
> the last hour (& possibly join/compare against a similar hour aggregate
> from the past).
>
> The problem with having independent 100k user-sessions from different
> connections was that the SparkSQL layer drops the RDD lineage & cache
> whenever a user ends a session.
>
> The scale problem in general for Impala was that even though the data size
> was in multiple terabytes, the actual hot data was approx <20Gb, which
> resides on <10 machines with locality.
>
> The same problem applies when you apply RDD caching with something like
> un-replicated like Tachyon/Alluxio, since the same RDD will be exceeding
> popular that the machines which hold those blocks run extra hot.
>
> A cache model per-user session is entirely wasteful and a common cache +
> MPP model effectively overloads 2-3% of cluster, while leaving the other
> machines idle.
>
> LLAP was designed specifically to prevent that hotspotting, while
> maintaining the common cache model - within a few minutes after an hour
> ticks over, the whole cluster develops temporal popularity for the hot
> data and nearly every rack has at least one cached copy of the same data
> for availability/performance.
>
> Since data stream tend to be extr

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-31 Thread Mich Talebzadeh
Thanks Gopal.

SAP Replication Server (SRS) does it to Hive in real time as well. That is the
main advantage of replication: it is real time. It picks up committed data
from the log and sends it to Hive as well. It is also way ahead of Sqoop,
which really only does the initial load. It does 10k rows at a time with
inserts into the Hive table. The Hive table cannot be transactional to start with.
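
For context, the replicate target is just a plain, non-ACID Hive table. A
rough sketch of what it might look like (column list trimmed, types guessed
from the trace below, storage format an assumption):

CREATE TABLE t (
  owner        STRING,
  object_name  STRING,
  object_id    BIGINT,
  created      TIMESTAMP
  -- ... remaining columns as per the source table
)
STORED AS ORC
TBLPROPERTIES ('transactional' = 'false');  -- the default, spelt out to make the point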

I. 2016/04/08 09:38:23. REPLICATE Replication Server: Dropped subscription
<102_105_t> for replication definition <102_t> with replicate at

I. 2016/04/08 09:38:31. REPLICATE Replication Server: Creating subscription
<102_105_t> for replication definition <102_t> with replicate at

I. 2016/04/08 09:38:31. PRIMARY Replication Server: Creating subscription
<102_105_t> for replication definition <102_t> with replicate at

T. 2016/04/08 09:38:32. (84): Command sent to 'SYB_157.scratchpad':
T. 2016/04/08 09:38:32. (84): 'begin transaction  '
T. 2016/04/08 09:38:32. (84): Command sent to 'SYB_157.scratchpad':
T. 2016/04/08 09:38:32. (84): 'select  count (*) from t  '
T. 2016/04/08 09:38:34. (84): Command sent to 'SYB_157.scratchpad':
T. 2016/04/08 09:38:34. (84): 'select OWNER, OBJECT_NAME, SUBOBJECT_NAME,
OBJECT_ID, DATA_OBJECT_ID, OBJECT_TYPE, CREATED, LAST_DDL_TIME, TIMESTAMP2,
STATUS, TEMPORARY2, GENERATED, SECONDARY, NAMESPACE, EDITION_NA
ME, PADDING1, PADDING2, ATTRIBUTE from t  '
T. 2016/04/08 09:39:54. (86): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:39:54. (86): 'Bulk insert table 't' ( rows affected)'
T. 2016/04/08 09:40:12. (89): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:40:12. (89): 'Bulk insert table 't' ( rows affected)'
T. 2016/04/08 09:40:34. (87): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:40:34. (87): 'Bulk insert table 't' ( rows affected)'
T. 2016/04/08 09:40:52. (88): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:40:52. (88): 'Bulk insert table 't' ( rows affected)'
T. 2016/04/08 09:41:11. (90): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:41:11. (90): 'Bulk insert table 't' ( rows affected)'
T. 2016/04/08 09:41:56. (86): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:41:56. (86): 'Bulk insert table 't' (1 rows affected)'
T. 2016/04/08 09:42:30. (87): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:42:30. (87): 'Bulk insert table 't' (1 rows affected)'
T. 2016/04/08 09:42:53. (89): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:42:53. (89): 'Bulk insert table 't' (1 rows affected)'
T. 2016/04/08 09:43:14. (90): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:43:14. (90): 'Bulk insert table 't' (1 rows affected)'
T. 2016/04/08 09:43:33. (88): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:43:33. (88): 'Bulk insert table 't' (1 rows affected)'
T. 2016/04/08 09:44:25. (86): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:44:25. (86): 'Bulk insert table 't' (1 rows affected)'
T. 2016/04/08 09:44:44. (89): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:44:44. (89): 'Bulk insert table 't' (1 rows affected)'
T. 2016/04/08 09:45:37. (90): Command sent to 'hiveserver2.asehadoop':

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 31 May 2016 at 22:18, Gopal Vijayaraghavan  wrote:

>
> > Can LLAP be used as a caching tool for data from Oracle DB or any RDBMS.
>
> No, LLAP intermediates HDFS. It holds column & index data streams as-is
> (i.e dictionary encoding, RLE, bloom filters etc are preserved).
>
> Because it does not cache row-tuples, it cannot exist as a caching tool
> for another RDBMS.
>
> I have heard of Oracle GoldenGate replicating into Hive, but it is not
> without its own pains of schema compat.
>
> Cheers,
> Gopal
>
>
>
>


Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-31 Thread Gopal Vijayaraghavan

> Can LLAP be used as a caching tool for data from Oracle DB or any RDBMS.

No, LLAP intermediates HDFS. It holds column & index data streams as-is
(i.e dictionary encoding, RLE, bloom filters etc are preserved).

Because it does not cache row-tuples, it cannot exist as a caching tool
for another RDBMS.

I have heard of Oracle GoldenGate replicating into Hive, but it is not
without its own pains of schema compat.

Cheers,
Gopal





Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-31 Thread Mich Talebzadeh
Thanks for that Gopal.

Can LLAP be used as a caching tool for data from Oracle DB or any RDBMS.

In that case does it use JDBC to get the data out from the underlying DB?



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 31 May 2016 at 21:48, Gopal Vijayaraghavan  wrote:

>
> > but this sounds to me (without testing myself) adding caching capability
> >to TEZ to bring it on par with SPARK.
>
> Nope, that was the crux of the earlier email.
>
> "Caching" seems to be catch-all term misused in that comparison.
>
> >> There is a big difference between where LLAP & SparkSQL, which has to do
> >> with access pattern needs.
>
> On another note, LLAP can actually be used inside Spark as well, just use
> LlapContext instead of HiveContext.
>
>
> <http://www.slideshare.net/HadoopSummit/llap-subsecond-analytical-queries-in-hive/30>
>
>
> I even have a Postgres FDW for LLAP, which is mostly used for analytics
> web dashboards which are hooked into Hive.
>
> https://github.com/t3rmin4t0r/llap_fdw
>
>
> LLAP can do 200-400ms queries, but Postgres can get to the sub 10ms when
> it comes to slicing-dicing result sets <100k rows.
>
> Cheers,
> Gopal
>
>
>


Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-31 Thread Gopal Vijayaraghavan

> but this sounds to me (without testing myself) adding caching capability
>to TEZ to bring it on par with SPARK.

Nope, that was the crux of the earlier email.

"Caching" seems to be catch-all term misused in that comparison.

>> There is a big difference between where LLAP & SparkSQL, which has to do
>> with access pattern needs.

On another note, LLAP can actually be used inside Spark as well, just use
LlapContext instead of HiveContext.

<http://www.slideshare.net/HadoopSummit/llap-subsecond-analytical-queries-in-hive/30>



I even have a Postgres FDW for LLAP, which is mostly used for analytics
web dashboards which are hooked into Hive.

https://github.com/t3rmin4t0r/llap_fdw


LLAP can do 200-400ms queries, but Postgres can get to the sub 10ms when
it comes to slicing-dicing result sets <100k rows.

Cheers,
Gopal




Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-31 Thread Mich Talebzadeh
Couple of points if I may and kindly bear with my remarks.

Whilst it will be very interesting to try TEZ with LLAP. As I read from LLAP

"Sub-second queries require fast query execution and low setup cost. The
challenge for Hive is to achieve this without giving up on the scale and
flexibility that users depend on. This requires a new approach using a
hybrid engine that leverages Tez and something new called  LLAP (Live Long
and Process, #llap online).

LLAP is an optional daemon process running on multiple nodes, that provides
the following:

   - Caching and data reuse across queries with compressed columnar data
   in-memory (off-heap)
   - Multi-threaded execution including reads with predicate pushdown and
   hash joins
   - High throughput IO using Async IO Elevator with dedicated thread and
   core per disk
   - Granular column level security across applications
   - "

OK so we have added an in-memory capability to TEZ by way of LLAP, In other
words what Spark does already and BTW it does not require a daemon running
on any host. Don't take me wrong. It is interesting but this sounds to me
(without testing myself) adding caching capability to TEZ to bring it on
par with SPARK.

Remember:

Spark -> DAG + in-memory caching
TEZ = MR on DAG
TEZ + LLAP => DAG + in-memory caching

OK it is another way getting the same result. However, my concerns:


   - Spark has a wide user base. I judge this from Spark user group traffic
   - TEZ user group has no traffic I am afraid
   - LLAP I don't know

Sounds like Hortonworks promote TEZ and Cloudera does not want to know
anything about Hive. and they promote Impala but that sounds like a sinking
ship these days.

Having said that I will try TEZ + LLAP :) No pun intended

Regards

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 31 May 2016 at 08:19, Jörn Franke  wrote:

> Thanks very interesting explanation. Looking forward to test it.
>
> > On 31 May 2016, at 07:51, Gopal Vijayaraghavan 
> wrote:
> >
> >
> >> That being said all systems are evolving. Hive supports tez+llap which
> >> is basically the in-memory support.
> >
> > There is a big difference between where LLAP & SparkSQL, which has to do
> > with access pattern needs.
> >
> > The first one is related to the lifetime of the cache - the Spark RDD
> > cache is per-user-session which allows for further operation in that
> > session to be optimized.
> >
> > LLAP is designed to be hammered by multiple user sessions running
> > different queries, designed to automate the cache eviction & selection
> > process. There's no user visible explicit .cache() to remember - it's
> > automatic and concurrent.
> >
> > My team works with both engines, trying to improve it for ORC, but the
> > goals of both are different.
> >
> > I will probably have to write a proper academic paper & get it
> > edited/reviewed instead of send my ramblings to the user lists like this.
> > Still, this needs an example to talk about.
> >
> > To give a qualified example, let's leave the world of single use clusters
> > and take the use-case detailed here
> >
> > http://hortonworks.com/blog/impala-vs-hive-performance-benchmark/
> >
> >
> > There are two distinct problems there - one is that a single day sees
> upto
> > 100k independent user sessions running queries and that most queries
> cover
> > the last hour (& possibly join/compare against a similar hour aggregate
> > from the past).
> >
> > The problem with having independent 100k user-sessions from different
> > connections was that the SparkSQL layer drops the RDD lineage & cache
> > whenever a user ends a session.
> >
> > The scale problem in general for Impala was that even though the data
> size
> > was in multiple terabytes, the actual hot data was approx <20Gb, which
> > resides on <10 machines with locality.
> >
> > The same problem applies when you apply RDD caching with something like
> > un-replicated like Tachyon/Alluxio, since the same RDD will be exceeding
> > popular that the machines which hold those blocks run extra hot.
> >
> > A cache model per-user session is entirely wasteful and a common cache +
> > MPP model effectively overloads 2-3% of cluster, while leaving the other
> > machines idle.
> >
> > LLAP was designed specifically to prevent that hotspotting, while
> > maintaining the common cache model - within a few minutes after an hour
> > ticks over, the whole cluster develops temporal popularity for the hot
> > data and nearly every rack has at least one cached copy of the same data
> > for availability/performance.
> >
> > Since data stream tend to be extremely wide table (Omniture) comes to
> > mine, so the cache actually does not hold all columns in a table and
> since
> > Zipf distributions are extremely common in these real data sets, the
> cache
> > does not hol

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-31 Thread Jörn Franke
Thanks, very interesting explanation. Looking forward to testing it.

> On 31 May 2016, at 07:51, Gopal Vijayaraghavan  wrote:
> 
> 
>> That being said all systems are evolving. Hive supports tez+llap which
>> is basically the in-memory support.
> 
> There is a big difference between where LLAP & SparkSQL, which has to do
> with access pattern needs.
> 
> The first one is related to the lifetime of the cache - the Spark RDD
> cache is per-user-session which allows for further operation in that
> session to be optimized.
> 
> LLAP is designed to be hammered by multiple user sessions running
> different queries, designed to automate the cache eviction & selection
> process. There's no user visible explicit .cache() to remember - it's
> automatic and concurrent.
> 
> My team works with both engines, trying to improve it for ORC, but the
> goals of both are different.
> 
> I will probably have to write a proper academic paper & get it
> edited/reviewed instead of send my ramblings to the user lists like this.
> Still, this needs an example to talk about.
> 
> To give a qualified example, let's leave the world of single use clusters
> and take the use-case detailed here
> 
> http://hortonworks.com/blog/impala-vs-hive-performance-benchmark/
> 
> 
> There are two distinct problems there - one is that a single day sees upto
> 100k independent user sessions running queries and that most queries cover
> the last hour (& possibly join/compare against a similar hour aggregate
> from the past).
> 
> The problem with having independent 100k user-sessions from different
> connections was that the SparkSQL layer drops the RDD lineage & cache
> whenever a user ends a session.
> 
> The scale problem in general for Impala was that even though the data size
> was in multiple terabytes, the actual hot data was approx <20Gb, which
> resides on <10 machines with locality.
> 
> The same problem applies when you apply RDD caching with something like
> un-replicated like Tachyon/Alluxio, since the same RDD will be exceeding
> popular that the machines which hold those blocks run extra hot.
> 
> A cache model per-user session is entirely wasteful and a common cache +
> MPP model effectively overloads 2-3% of cluster, while leaving the other
> machines idle.
> 
> LLAP was designed specifically to prevent that hotspotting, while
> maintaining the common cache model - within a few minutes after an hour
> ticks over, the whole cluster develops temporal popularity for the hot
> data and nearly every rack has at least one cached copy of the same data
> for availability/performance.
> 
> Since data stream tend to be extremely wide table (Omniture) comes to
> mine, so the cache actually does not hold all columns in a table and since
> Zipf distributions are extremely common in these real data sets, the cache
> does not hold all rows either.
> 
> select count(clicks) from table where zipcode = 695506;
> 
> with ORC data bucketed + *sorted* by zipcode, the row-groups which are in
> the cache will be the only 2 columns (clicks & zipcode) & all bloomfilter
> indexes for all files will be loaded into memory, all misses on the bloom
> will not even feature in the cache.
> 
> A subsequent query for
> 
> select count(clicks) from table where zipcode = 695586;
> 
> will run against the collected indexes, before deciding which files need
> to be loaded into cache.
> 
> 
> Then again, 
> 
> select count(clicks)/count(impressions) from table where zipcode = 695586;
> 
> will load only impressions out of the table into cache, to add it to the
> columnar cache without producing another complete copy (RDDs are not
> mutable, but LLAP cache is additive).
> 
> The column split cache & index-cache separation allows for this to be
> cheaper than a full rematerialization - both are evicted as they fill up,
> with different priorities.
> 
> Following the same vein, LLAP can do a bit of clairvoyant pre-processing,
> with a bit of input from UX patterns observed from Tableau/Microstrategy
> users to give it the impression of being much faster than the engine
> really can be.
> 
> Illusion of performance is likely to be indistinguishable from actual -
> I'm actually looking for subjects for that experiment :)
> 
> Cheers,
> Gopal
> 
> 


Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Gopal Vijayaraghavan

> That being said all systems are evolving. Hive supports tez+llap which
>is basically the in-memory support.

There is a big difference between LLAP & SparkSQL, which has to do
with access pattern needs.

The first one is related to the lifetime of the cache - the Spark RDD
cache is per-user-session which allows for further operation in that
session to be optimized.

LLAP is designed to be hammered by multiple user sessions running
different queries, designed to automate the cache eviction & selection
process. There's no user visible explicit .cache() to remember - it's
automatic and concurrent.

My team works with both engines, trying to improve it for ORC, but the
goals of both are different.

I will probably have to write a proper academic paper & get it
edited/reviewed instead of sending my ramblings to the user lists like this.
Still, this needs an example to talk about.

To give a qualified example, let's leave the world of single use clusters
and take the use-case detailed here

http://hortonworks.com/blog/impala-vs-hive-performance-benchmark/


There are two distinct problems there - one is that a single day sees up to
100k independent user sessions running queries and that most queries cover
the last hour (& possibly join/compare against a similar hour aggregate
from the past).
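
(A purely illustrative shape for that kind of query - the table and column
names below are made up, they are not from the benchmark:)

SELECT cur.zipcode, cur.clicks, prev.clicks AS clicks_same_hour_yesterday
FROM  (SELECT zipcode, count(clicks) AS clicks
       FROM clickstream
       WHERE event_hour = '2016-05-30 14'
       GROUP BY zipcode) cur
JOIN  (SELECT zipcode, count(clicks) AS clicks
       FROM clickstream
       WHERE event_hour = '2016-05-29 14'
       GROUP BY zipcode) prev
ON cur.zipcode = prev.zipcode;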

The problem with having independent 100k user-sessions from different
connections was that the SparkSQL layer drops the RDD lineage & cache
whenever a user ends a session.

The scale problem in general for Impala was that even though the data size
was in multiple terabytes, the actual hot data was approx <20Gb, which
resides on <10 machines with locality.

The same problem applies when you apply RDD caching with something
un-replicated like Tachyon/Alluxio, since the same RDD will be so exceedingly
popular that the machines which hold those blocks run extra hot.

A cache model per-user session is entirely wasteful and a common cache +
MPP model effectively overloads 2-3% of cluster, while leaving the other
machines idle.

LLAP was designed specifically to prevent that hotspotting, while
maintaining the common cache model - within a few minutes after an hour
ticks over, the whole cluster develops temporal popularity for the hot
data and nearly every rack has at least one cached copy of the same data
for availability/performance.

Since data streams tend to be extremely wide tables (Omniture comes to
mind), the cache actually does not hold all columns in a table, and since
Zipf distributions are extremely common in these real data sets, the cache
does not hold all rows either.

select count(clicks) from table where zipcode = 695506;

with ORC data bucketed + *sorted* by zipcode, the row-groups which are in
the cache will be the only 2 columns (clicks & zipcode) & all bloomfilter
indexes for all files will be loaded into memory, all misses on the bloom
will not even feature in the cache.
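
(For concreteness, that kind of layout comes from a table declared roughly as
follows - the name, the bucket count and the extra columns are illustrative,
not the actual benchmark table:)

CREATE TABLE clickstream (
  clicks       BIGINT,
  impressions  BIGINT,
  zipcode      INT
  -- ... plus the many other columns of a wide Omniture-style feed
)
CLUSTERED BY (zipcode) SORTED BY (zipcode) INTO 32 BUCKETS
STORED AS ORC
TBLPROPERTIES ('orc.bloom.filter.columns' = 'zipcode');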

A subsequent query for

select count(clicks) from table where zipcode = 695586;

will run against the collected indexes, before deciding which files need
to be loaded into cache.


Then again, 

select count(clicks)/count(impressions) from table where zipcode = 695586;

will load only impressions out of the table into cache, to add it to the
columnar cache without producing another complete copy (RDDs are not
mutable, but LLAP cache is additive).

The column split cache & index-cache separation allows for this to be
cheaper than a full rematerialization - both are evicted as they fill up,
with different priorities.

Following the same vein, LLAP can do a bit of clairvoyant pre-processing,
with a bit of input from UX patterns observed from Tableau/Microstrategy
users to give it the impression of being much faster than the engine
really can be.

Illusion of performance is likely to be indistinguishable from actual -
I'm actually looking for subjects for that experiment :)

Cheers,
Gopal




Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Marcin Tustin
Mich - it sounds like maybe you should try these benchmarks with alluxio
abstracting the storage layer, and see how much it makes a difference.
Alluxio should (if I understand it right) provide a lot of the optimisation
you're looking for with in memory work.

I've never used it, but I would love to hear the experiences of people who
have.

On Mon, May 30, 2016 at 5:32 PM, Mich Talebzadeh 
wrote:

> I think we are going to move to a model that the computation stack will be
> separate from storage stack and moreover something like Hive that provides
> the means for persistent storage (well HDFS is the one that stores all the
> data) will have an in-memory type capability much like what Oracle TimesTen
> IMDB does with its big brother Oracle. Now TimesTen is effectively designed
> to provide in-memory capability for analytics for Oracle 12c. These two work 
> like
> an index or materialized view.  You write queries against tables - the
> optimizer figures out whether to use row oriented storage and indexes to
> access them (Oracle classic) or column non-indexed storage to answer them
> (TimesTen). Just one optimizer.
>
> I gather Hive will be like that eventually. It will decide based on the
> frequency of access where to look for data. Yes, we may have 10 TB of data
> on disk, but how much of it is frequently accessed (hot data)? The 80-20 rule?
> In reality it may be just 2TB or the most recent partitions etc. The rest is
> cold data.
>
> cheers
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 30 May 2016 at 21:59, Michael Segel  wrote:
>
>> And you have MapR supporting Apache Drill.
>>
>> So these are all alternatives to Spark, and its not necessarily an either
>> or scenario. You can have both.
>>
>> On May 30, 2016, at 12:49 PM, Mich Talebzadeh 
>> wrote:
>>
>> yep Hortonworks supports Tez for one reason or another, which I am going
>> hopefully to test as the query engine for Hive. Though I think Spark
>> will be faster because of its in-memory support.
>>
>> Also if you are independent then you are better off dealing with Spark and
>> Hive without the need to support another stack like Tez.
>>
>> Cloudera supports Impala instead of Hive but it is not something I have
>> used.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 30 May 2016 at 20:19, Michael Segel  wrote:
>>
>>> Mich,
>>>
>>> Most people use vendor releases because they need to have the support.
>>> Hortonworks is the vendor who has the most skin in the game when it
>>> comes to Tez.
>>>
>>> If memory serves, Tez isn’t going to be M/R but a local execution
>>> engine? Then LLAP is the in-memory piece to speed up Tez?
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>> On May 29, 2016, at 1:35 PM, Mich Talebzadeh 
>>> wrote:
>>>
>>> thanks I think the problem is that the TEZ user group is exceptionally
>>> quiet. Just sent an email to Hive user group to see anyone has managed to
>>> built a vendor independent version.
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 29 May 2016 at 21:23, Jörn Franke  wrote:
>>>
 Well I think it is different from MR. It has some optimizations which
 you do not find in MR. Especially the LLAP option in Hive2 makes it
 interesting.

 I think hive 1.2 works with 0.7 and 2.0 with 0.8 . At least for 1.2 it
 is integrated in the Hortonworks distribution.


 On 29 May 2016, at 21:43, Mich Talebzadeh 
 wrote:

 Hi Jorn,

 I started building apache-tez-0.8.2 but got few errors. Couple of guys
 from TEZ user group kindly gave a hand but I could not go very far (or may
 be I did not make enough efforts) making it work.

 That TEZ user group is very quiet as well.

 My understanding is TEZ is MR with DAG but of course Spark has both
 plus in-memory capability.

 It would be interesting to see what version of TEZ works as execution
 engine with Hive.

 Vendors are divided on this (use Hive with TEZ) or use Impala instead
 of Hive etc as I am sure you already know.

 Cheers,




 Dr Mich Talebzadeh


 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *


 http://talebzadehmich.wordpress

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Mich Talebzadeh
I think we are going to move to a model that the computation stack will be
separate from storage stack and moreover something like Hive that provides
the means for persistent storage (well HDFS is the one that stores all the
data) will have an in-memory type capability much like what Oracle TimesTen
IMDB does with its big brother Oracle. Now TimesTen is effectively designed
to provide in-memory capability for analytics for Oracle 12c. These
two work like
an index or materialized view.  You write queries against tables - the
optimizer figures out whether to use row oriented storage and indexes to
access them (Oracle classic) or column non-indexed storage to answer them
(TimesTen). Just one optimizer.

I gather Hive will be like that eventually. It will decide based on the
frequency of access where to look for data. Yes, we may have 10 TB of data
on disk, but how much of it is frequently accessed (hot data)? The 80-20 rule?
In reality it may be just 2TB or the most recent partitions etc. The rest is
cold data.

cheers



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 30 May 2016 at 21:59, Michael Segel  wrote:

> And you have MapR supporting Apache Drill.
>
> So these are all alternatives to Spark, and its not necessarily an either
> or scenario. You can have both.
>
> On May 30, 2016, at 12:49 PM, Mich Talebzadeh 
> wrote:
>
> yep Hortonworks supports Tez for one reason or another, which I am going
> hopefully to test as the query engine for Hive. Though I think Spark
> will be faster because of its in-memory support.
>
> Also if you are independent then you are better off dealing with Spark and
> Hive without the need to support another stack like Tez.
>
> Cloudera supports Impala instead of Hive but it is not something I have
> used.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 30 May 2016 at 20:19, Michael Segel  wrote:
>
>> Mich,
>>
>> Most people use vendor releases because they need to have the support.
>> Hortonworks is the vendor who has the most skin in the game when it comes
>> to Tez.
>>
>> If memory serves, Tez isn’t going to be M/R but a local execution engine?
>> Then LLAP is the in-memory piece to speed up Tez?
>>
>> HTH
>>
>> -Mike
>>
>> On May 29, 2016, at 1:35 PM, Mich Talebzadeh 
>> wrote:
>>
>> thanks I think the problem is that the TEZ user group is exceptionally
>> quiet. Just sent an email to Hive user group to see anyone has managed to
>> built a vendor independent version.
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 29 May 2016 at 21:23, Jörn Franke  wrote:
>>
>>> Well I think it is different from MR. It has some optimizations which
>>> you do not find in MR. Especially the LLAP option in Hive2 makes it
>>> interesting.
>>>
>>> I think hive 1.2 works with 0.7 and 2.0 with 0.8 . At least for 1.2 it
>>> is integrated in the Hortonworks distribution.
>>>
>>>
>>> On 29 May 2016, at 21:43, Mich Talebzadeh 
>>> wrote:
>>>
>>> Hi Jorn,
>>>
>>> I started building apache-tez-0.8.2 but got few errors. Couple of guys
>>> from TEZ user group kindly gave a hand but I could not go very far (or may
>>> be I did not make enough efforts) making it work.
>>>
>>> That TEZ user group is very quiet as well.
>>>
>>> My understanding is TEZ is MR with DAG but of course Spark has both plus
>>> in-memory capability.
>>>
>>> It would be interesting to see what version of TEZ works as execution
>>> engine with Hive.
>>>
>>> Vendors are divided on this (use Hive with TEZ) or use Impala instead of
>>> Hive etc as I am sure you already know.
>>>
>>> Cheers,
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 29 May 2016 at 20:19, Jörn Franke  wrote:
>>>
 Very interesting do you plan also a test with TEZ?

 On 29 May 2016, at 13:40, Mich Talebzadeh 
 wrote:

 Hi,

 I did another study of Hive using Spark engine compared to Hive with MR.

 Basically took the original table imported using Sqoop and created and
 populated a new ORC table partitioned by year and month into 48 partitions
 as follows:

 
 ​
 Connections use JDBC via beeline. Now for each partition using MR it
 takes an av

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Michael Segel
And you have MapR supporting Apache Drill. 

So these are all alternatives to Spark, and its not necessarily an either or 
scenario. You can have both. 

> On May 30, 2016, at 12:49 PM, Mich Talebzadeh  
> wrote:
> 
> yep Hortonworks supports Tez for one reason or another, which I am going 
> hopefully to test as the query engine for Hive. Though I think Spark will 
> be faster because of its in-memory support.
> 
> Also if you are independent then you are better off dealing with Spark and Hive 
> without the need to support another stack like Tez.
> 
> Cloudera supports Impala instead of Hive but it is not something I have used.
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 30 May 2016 at 20:19, Michael Segel  > wrote:
> Mich, 
> 
> Most people use vendor releases because they need to have the support. 
> Hortonworks is the vendor who has the most skin in the game when it comes to 
> Tez. 
> 
> If memory serves, Tez isn’t going to be M/R but a local execution engine? 
> Then LLAP is the in-memory piece to speed up Tez? 
> 
> HTH
> 
> -Mike
> 
>> On May 29, 2016, at 1:35 PM, Mich Talebzadeh > > wrote:
>> 
>> thanks I think the problem is that the TEZ user group is exceptionally 
>> quiet. Just sent an email to Hive user group to see anyone has managed to 
>> built a vendor independent version.
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>>  
>> 
>> On 29 May 2016 at 21:23, Jörn Franke > > wrote:
>> Well I think it is different from MR. It has some optimizations which you do 
>> not find in MR. Especially the LLAP option in Hive2 makes it interesting. 
>> 
>> I think hive 1.2 works with 0.7 and 2.0 with 0.8 . At least for 1.2 it is 
>> integrated in the Hortonworks distribution. 
>> 
>> 
>> On 29 May 2016, at 21:43, Mich Talebzadeh > > wrote:
>> 
>>> Hi Jorn,
>>> 
>>> I started building apache-tez-0.8.2 but got few errors. Couple of guys from 
>>> TEZ user group kindly gave a hand but I could not go very far (or may be I 
>>> did not make enough efforts) making it work.
>>> 
>>> That TEZ user group is very quiet as well.
>>> 
>>> My understanding is TEZ is MR with DAG but of course Spark has both plus 
>>> in-memory capability.
>>> 
>>> It would be interesting to see what version of TEZ works as execution 
>>> engine with Hive.
>>> 
>>> Vendors are divided on this (use Hive with TEZ) or use Impala instead of 
>>> Hive etc as I am sure you already know.
>>> 
>>> Cheers,
>>> 
>>> 
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> 
>>>  
>>> http://talebzadehmich.wordpress.com 
>>>  
>>> 
>>> On 29 May 2016 at 20:19, Jörn Franke >> > wrote:
>>> Very interesting do you plan also a test with TEZ?
>>> 
>>> On 29 May 2016, at 13:40, Mich Talebzadeh >> > wrote:
>>> 
 Hi,
 
 I did another study of Hive using Spark engine compared to Hive with MR.
 
 Basically took the original table imported using Sqoop and created and 
 populated a new ORC table partitioned by year and month into 48 partitions 
 as follows:
 
 
 ​ 
 Connections use JDBC via beeline. Now for each partition using MR it takes 
 an average of 17 minutes as seen below for each PARTITION..  Now that is 
 just an individual partition and there are 48 partitions.
 
 In contrast doing the same operation with Spark engine took 10 minutes all 
 inclusive. I just gave up on MR. You can see the StartTime and FinishTime 
 from below
 
 
 
 This is by no means indicate that Spark is much better than MR but shows 
 that some very good results can ve achieved using Spark engine.
 
 
 Dr Mich Talebzadeh
  
 LinkedIn  
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
  
 
  
 http://talebzadehmich.wordpress.com 
  
 
 On 24 May 2016 at 08:03, Mich Talebzadeh >>> > wrote:
 Hi,
 
 We use Hive as the database and

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Jörn Franke
I do not think that in-memory by itself will make things faster in all cases,
especially if you use Tez with ORC or Parquet. Especially for ad hoc queries on
large datasets (independently of whether they fit in memory or not) this will
have a significant impact. This is an experience I also have with the in-memory
databases of Oracle or SQL Server. It might sound surprising, but it has some
explanations. ORC and Parquet have the min/max indexes, store and process data
very efficiently (important: choose the right datatype; if everything is
varchar then it is your fault that the database is not performing), and only
load into memory what is needed. This is not the case for in-memory systems.
Usually everything is loaded into memory, and not only the parts which are
needed. This means that, due to the absence of min/max indexes, you have to go
through everything.

Let us assume the table has a size of 10 TB. There are different ad hoc queries
that each only process 1 GB (each one addresses different areas). In Hive+Tez
this is currently rather efficient: you load 1 GB (negligible in a cluster) and
process 1 GB. In Spark you would cache 10 TB (you do not know which part will
be addressed), which takes a lot of time to load in the first place, and each
query needs to go through 10 TB in memory. This might be an extreme case, but
it is not uncommon. An exception is of course machine learning algorithms (the
original purpose of Spark), where I see more advantages for Spark. Most of the
traditional companies probably have both use cases (maybe with a bias towards
the first); Internet companies lean more towards the latter.
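To make the 10 TB versus 1 GB point concrete, a sketch of the kind of ad hoc
query that benefits from ORC min/max statistics and predicate pushdown could
look like this (the table and column names are hypothetical):

-- Hint Hive to push the predicate into the ORC reader so that stripe-level
-- min/max statistics can skip most of the 10 TB table.
set hive.optimize.index.filter=true;

select sensor_id, avg(reading)
from sensor_readings                        -- large ORC table
where event_time >= '2016-05-01 00:00:00'
  and event_time <  '2016-05-02 00:00:00'   -- only the matching stripes are read
group by sensor_id;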

That being said, all systems are evolving. Hive supports Tez+LLAP, which is 
basically the in-memory support. Spark stores the data more efficiently in 1.5 
and 1.6 (in the Dataset API and DataFrame - the issue here is that it is not the 
same format as the files on disk). Let's see if there will be a convergence - my 
bet is that both systems will be used, each optimized for its own use cases.

The bottom line is that you have to first optimize and think about what you need 
to do before going in-memory. Never load everything in-memory; you will be 
surprised. Have multiple technologies in your ecosystem and understand them. 
Unfortunately most of the consulting companies have only poor experience and 
understanding of the complete picture and thus they fail with both technologies, 
which is sad, because both can be extremely powerful and a competitive advantage.

> On 30 May 2016, at 21:49, Mich Talebzadeh  wrote:
> 
> yep Hortonworks supports Tez for one reason or another, which I am going 
> hopefully to test as the query engine for Hive. Though I think Spark will 
> be faster because of its in-memory support.
> 
> Also if you are independent then you are better off dealing with Spark and Hive 
> without the need to support another stack like Tez.
> 
> Cloudera supports Impala instead of Hive but it is not something I have used.
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 30 May 2016 at 20:19, Michael Segel  wrote:
>> Mich, 
>> 
>> Most people use vendor releases because they need to have the support. 
>> Hortonworks is the vendor who has the most skin in the game when it comes to 
>> Tez. 
>> 
>> If memory serves, Tez isn’t going to be M/R but a local execution engine? 
>> Then LLAP is the in-memory piece to speed up Tez? 
>> 
>> HTH
>> 
>> -Mike
>> 
>>> On May 29, 2016, at 1:35 PM, Mich Talebzadeh  
>>> wrote:
>>> 
>>> thanks I think the problem is that the TEZ user group is exceptionally 
>>> quiet. Just sent an email to Hive user group to see anyone has managed to 
>>> built a vendor independent version.
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> http://talebzadehmich.wordpress.com
>>>  
>>> 
 On 29 May 2016 at 21:23, Jörn Franke  wrote:
 Well I think it is different from MR. It has some optimizations which you 
 do not find in MR. Especially the LLAP option in Hive2 makes it 
 interesting. 
 
 I think hive 1.2 works with 0.7 and 2.0 with 0.8 . At least for 1.2 it is 
 integrated in the Hortonworks distribution. 
 
 
> On 29 May 2016, at 21:43, Mich Talebzadeh  
> wrote:
> 
> Hi Jorn,
> 
> I started building apache-tez-0.8.2 but got few errors. Couple of guys 
> from TEZ user group kindly gave a hand but I could not go very far (or 
> may be I did not make enough efforts) making it work.
> 
> That TEZ user group is very quiet as well.
> 
> My understanding is TEZ is MR with DAG but of course Spark has both plus 
> in-memory capability.
> 
> It would be interesting to see what version of TEZ works as execution 
> engine with Hive.
> 
> Vendors are divided on this (use Hive with TEZ) or use Imp

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Mich Talebzadeh
yep Hortonworks supports Tez for one reason or another, which I am going
hopefully to test as the query engine for Hive. Though I think Spark
will be faster because of its in-memory support.

Also if you are independent then you are better off dealing with Spark and Hive
without the need to support another stack like Tez.

Cloudera supports Impala instead of Hive but it is not something I have
used.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 30 May 2016 at 20:19, Michael Segel  wrote:

> Mich,
>
> Most people use vendor releases because they need to have the support.
> Hortonworks is the vendor who has the most skin in the game when it comes
> to Tez.
>
> If memory serves, Tez isn’t going to be M/R but a local execution engine?
> Then LLAP is the in-memory piece to speed up Tez?
>
> HTH
>
> -Mike
>
> On May 29, 2016, at 1:35 PM, Mich Talebzadeh 
> wrote:
>
> thanks I think the problem is that the TEZ user group is exceptionally
> quiet. Just sent an email to the Hive user group to see if anyone has managed
> to build a vendor independent version.
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 29 May 2016 at 21:23, Jörn Franke  wrote:
>
>> Well I think it is different from MR. It has some optimizations which you
>> do not find in MR. Especially the LLAP option in Hive2 makes it
>> interesting.
>>
>> I think hive 1.2 works with 0.7 and 2.0 with 0.8 . At least for 1.2 it is
>> integrated in the Hortonworks distribution.
>>
>>
>> On 29 May 2016, at 21:43, Mich Talebzadeh 
>> wrote:
>>
>> Hi Jorn,
>>
>> I started building apache-tez-0.8.2 but got few errors. Couple of guys
>> from TEZ user group kindly gave a hand but I could not go very far (or may
>> be I did not make enough efforts) making it work.
>>
>> That TEZ user group is very quiet as well.
>>
>> My understanding is TEZ is MR with DAG but of course Spark has both plus
>> in-memory capability.
>>
>> It would be interesting to see what version of TEZ works as execution
>> engine with Hive.
>>
>> Vendors are divided on this (use Hive with TEZ) or use Impala instead of
>> Hive etc as I am sure you already know.
>>
>> Cheers,
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 29 May 2016 at 20:19, Jörn Franke  wrote:
>>
>>> Very interesting do you plan also a test with TEZ?
>>>
>>> On 29 May 2016, at 13:40, Mich Talebzadeh 
>>> wrote:
>>>
>>> Hi,
>>>
>>> I did another study of Hive using Spark engine compared to Hive with MR.
>>>
>>> Basically took the original table imported using Sqoop and created and
>>> populated a new ORC table partitioned by year and month into 48 partitions
>>> as follows:
>>>
>>> 
>>> ​
>>> Connections use JDBC via beeline. Now for each partition using MR it
>>> takes an average of 17 minutes as seen below for each PARTITION..  Now that
>>> is just an individual partition and there are 48 partitions.
>>>
>>> In contrast doing the same operation with Spark engine took 10 minutes
>>> all inclusive. I just gave up on MR. You can see the StartTime and
>>> FinishTime from below
>>>
>>> 
>>>
>>> This is by no means indicate that Spark is much better than MR but shows
>>> that some very good results can ve achieved using Spark engine.
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 24 May 2016 at 08:03, Mich Talebzadeh 
>>> wrote:
>>>
 Hi,

 We use Hive as the database and use Spark as an all purpose query tool.

 Whether Hive is the write database for purpose or one is better off
 with something like Phoenix on Hbase, well the answer is it depends and
 your mileage varies.

 So fit for purpose.

 Ideally what wants is to use the fastest  method to get the results.
 How fast we confine it to our SLA agreements in production and that helps
 us from unnecessary further work as we technologists like to play around.

 So in short, we use Spark most of the time and use Hive as the backend
 engine for data storage, mainly ORC tables.

 We use Hive on Spark and with Hive 2 on Spark 1.3.1 for now we have a
 combination that works. Granted it helps to use Hive 2 on Spark 1.6.1 but
 at the 

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Michael Segel
Mich, 

Most people use vendor releases because they need to have the support. 
Hortonworks is the vendor who has the most skin in the game when it comes to 
Tez. 

If memory serves, Tez isn’t going to be M/R but a local execution engine? Then 
LLAP is the in-memory piece to speed up Tez? 

HTH

-Mike

> On May 29, 2016, at 1:35 PM, Mich Talebzadeh  
> wrote:
> 
> thanks I think the problem is that the TEZ user group is exceptionally quiet. 
> Just sent an email to the Hive user group to see if anyone has managed to build 
> a vendor independent version.
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 29 May 2016 at 21:23, Jörn Franke  > wrote:
> Well I think it is different from MR. It has some optimizations which you do 
> not find in MR. Especially the LLAP option in Hive2 makes it interesting. 
> 
> I think hive 1.2 works with 0.7 and 2.0 with 0.8 . At least for 1.2 it is 
> integrated in the Hortonworks distribution. 
> 
> 
> On 29 May 2016, at 21:43, Mich Talebzadeh  > wrote:
> 
>> Hi Jorn,
>> 
>> I started building apache-tez-0.8.2 but got few errors. Couple of guys from 
>> TEZ user group kindly gave a hand but I could not go very far (or may be I 
>> did not make enough efforts) making it work.
>> 
>> That TEZ user group is very quiet as well.
>> 
>> My understanding is TEZ is MR with DAG but of course Spark has both plus 
>> in-memory capability.
>> 
>> It would be interesting to see what version of TEZ works as execution engine 
>> with Hive.
>> 
>> Vendors are divided on this (use Hive with TEZ) or use Impala instead of 
>> Hive etc as I am sure you already know.
>> 
>> Cheers,
>> 
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>>  
>> 
>> On 29 May 2016 at 20:19, Jörn Franke > > wrote:
>> Very interesting do you plan also a test with TEZ?
>> 
>> On 29 May 2016, at 13:40, Mich Talebzadeh > > wrote:
>> 
>>> Hi,
>>> 
>>> I did another study of Hive using Spark engine compared to Hive with MR.
>>> 
>>> Basically took the original table imported using Sqoop and created and 
>>> populated a new ORC table partitioned by year and month into 48 partitions 
>>> as follows:
>>> 
>>> 
>>> ​ 
>>> Connections use JDBC via beeline. Now for each partition using MR it takes 
>>> an average of 17 minutes as seen below for each PARTITION..  Now that is 
>>> just an individual partition and there are 48 partitions. 
>>> 
>>> In contrast doing the same operation with Spark engine took 10 minutes all 
>>> inclusive. I just gave up on MR. You can see the StartTime and FinishTime 
>>> from below
>>> 
>>> 
>>> 
>>> This is by no means indicate that Spark is much better than MR but shows 
>>> that some very good results can ve achieved using Spark engine.
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> 
>>>  
>>> http://talebzadehmich.wordpress.com 
>>>  
>>> 
>>> On 24 May 2016 at 08:03, Mich Talebzadeh >> > wrote:
>>> Hi,
>>> 
>>> We use Hive as the database and use Spark as an all purpose query tool.
>>> 
>>> Whether Hive is the write database for purpose or one is better off with 
>>> something like Phoenix on Hbase, well the answer is it depends and your 
>>> mileage varies. 
>>> 
>>> So fit for purpose.
>>> 
>>> Ideally what wants is to use the fastest  method to get the results. How 
>>> fast we confine it to our SLA agreements in production and that helps us 
>>> from unnecessary further work as we technologists like to play around.
>>> 
>>> So in short, we use Spark most of the time and use Hive as the backend 
>>> engine for data storage, mainly ORC tables.
>>> 
>>> We use Hive on Spark and with Hive 2 on Spark 1.3.1 for now we have a 
>>> combination that works. Granted it helps to use Hive 2 on Spark 1.6.1 but 
>>> at the moment it is one of my projects.
>>> 
>>> We do not use any vendor's products as it enables us to move away  from 
>>> being tied down after years of SAP, Oracle and MS dependency to yet another 
>>> vendor. Besides there is some politics going on with one promoting Tez and 
>>> another Spark as a backend. That is fine but obviously we prefer an 
>>> independent assessment ourselves.
>>> 
>>> My gut feeling 

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-29 Thread Mich Talebzadeh
thanks I think the problem is that the TEZ user group is exceptionally
quiet. Just sent an email to the Hive user group to see if anyone has managed
to build a vendor independent version.


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 29 May 2016 at 21:23, Jörn Franke  wrote:

> Well I think it is different from MR. It has some optimizations which you
> do not find in MR. Especially the LLAP option in Hive2 makes it
> interesting.
>
> I think hive 1.2 works with 0.7 and 2.0 with 0.8 . At least for 1.2 it is
> integrated in the Hortonworks distribution.
>
>
> On 29 May 2016, at 21:43, Mich Talebzadeh 
> wrote:
>
> Hi Jorn,
>
> I started building apache-tez-0.8.2 but got few errors. Couple of guys
> from TEZ user group kindly gave a hand but I could not go very far (or may
> be I did not make enough efforts) making it work.
>
> That TEZ user group is very quiet as well.
>
> My understanding is TEZ is MR with DAG but of course Spark has both plus
> in-memory capability.
>
> It would be interesting to see what version of TEZ works as execution
> engine with Hive.
>
> Vendors are divided on this (use Hive with TEZ) or use Impala instead of
> Hive etc as I am sure you already know.
>
> Cheers,
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 29 May 2016 at 20:19, Jörn Franke  wrote:
>
>> Very interesting do you plan also a test with TEZ?
>>
>> On 29 May 2016, at 13:40, Mich Talebzadeh 
>> wrote:
>>
>> Hi,
>>
>> I did another study of Hive using Spark engine compared to Hive with MR.
>>
>> Basically took the original table imported using Sqoop and created and
>> populated a new ORC table partitioned by year and month into 48 partitions
>> as follows:
>>
>> 
>> ​
>> Connections use JDBC via beeline. Now for each partition using MR it
>> takes an average of 17 minutes as seen below for each PARTITION..  Now that
>> is just an individual partition and there are 48 partitions.
>>
>> In contrast doing the same operation with Spark engine took 10 minutes
>> all inclusive. I just gave up on MR. You can see the StartTime and
>> FinishTime from below
>>
>> 
>>
>> This is by no means indicate that Spark is much better than MR but shows
>> that some very good results can ve achieved using Spark engine.
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 24 May 2016 at 08:03, Mich Talebzadeh 
>> wrote:
>>
>>> Hi,
>>>
>>> We use Hive as the database and use Spark as an all purpose query tool.
>>>
>>> Whether Hive is the write database for purpose or one is better off with
>>> something like Phoenix on Hbase, well the answer is it depends and your
>>> mileage varies.
>>>
>>> So fit for purpose.
>>>
>>> Ideally what wants is to use the fastest  method to get the results. How
>>> fast we confine it to our SLA agreements in production and that helps us
>>> from unnecessary further work as we technologists like to play around.
>>>
>>> So in short, we use Spark most of the time and use Hive as the backend
>>> engine for data storage, mainly ORC tables.
>>>
>>> We use Hive on Spark and with Hive 2 on Spark 1.3.1 for now we have a
>>> combination that works. Granted it helps to use Hive 2 on Spark 1.6.1 but
>>> at the moment it is one of my projects.
>>>
>>> We do not use any vendor's products as it enables us to move away  from
>>> being tied down after years of SAP, Oracle and MS dependency to yet another
>>> vendor. Besides there is some politics going on with one promoting Tez and
>>> another Spark as a backend. That is fine but obviously we prefer an
>>> independent assessment ourselves.
>>>
>>> My gut feeling is that one needs to look at the use case. Recently we
>>> had to import a very large table from Oracle to Hive and decided to use
>>> Spark 1.6.1 with Hive 2 on Spark 1.3.1 and that worked fine. We just used
>>> JDBC connection with temp table and it was good. We could have used sqoop
>>> but decided to settle for Spark so it all depends on use case.
>>>
>>> HTH
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 24 May 2016 at 03:11, ayan guha  wrote:
>>>
 Hi

 Thanks for very useful stats.

 Did you have any benc

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-29 Thread Jörn Franke
Well I think it is different from MR. It has some optimizations which you do 
not find in MR. Especially the LLAP option in Hive2 makes it interesting. 

I think Hive 1.2 works with Tez 0.7 and Hive 2.0 with Tez 0.8. At least for 1.2 it 
is integrated in the Hortonworks distribution. 


> On 29 May 2016, at 21:43, Mich Talebzadeh  wrote:
> 
> Hi Jorn,
> 
> I started building apache-tez-0.8.2 but got few errors. Couple of guys from 
> TEZ user group kindly gave a hand but I could not go very far (or may be I 
> did not make enough efforts) making it work.
> 
> That TEZ user group is very quiet as well.
> 
> My understanding is TEZ is MR with DAG but of course Spark has both plus 
> in-memory capability.
> 
> It would be interesting to see what version of TEZ works as execution engine 
> with Hive. 
> 
> Vendors are divided on this (use Hive with TEZ) or use Impala instead of Hive 
> etc as I am sure you already know.
> 
> Cheers,
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 29 May 2016 at 20:19, Jörn Franke  wrote:
>> Very interesting do you plan also a test with TEZ?
>> 
>>> On 29 May 2016, at 13:40, Mich Talebzadeh  wrote:
>>> 
>>> Hi,
>>> 
>>> I did another study of Hive using Spark engine compared to Hive with MR.
>>> 
>>> Basically took the original table imported using Sqoop and created and 
>>> populated a new ORC table partitioned by year and month into 48 partitions 
>>> as follows:
>>> 
>>> 
>>> ​ 
>>> Connections use JDBC via beeline. Now for each partition using MR it takes 
>>> an average of 17 minutes as seen below for each PARTITION..  Now that is 
>>> just an individual partition and there are 48 partitions.
>>> 
>>> In contrast doing the same operation with Spark engine took 10 minutes all 
>>> inclusive. I just gave up on MR. You can see the StartTime and FinishTime 
>>> from below
>>> 
>>> 
>>> 
>>> This is by no means indicate that Spark is much better than MR but shows 
>>> that some very good results can ve achieved using Spark engine.
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> http://talebzadehmich.wordpress.com
>>>  
>>> 
 On 24 May 2016 at 08:03, Mich Talebzadeh  wrote:
 Hi,
 
 We use Hive as the database and use Spark as an all purpose query tool.
 
 Whether Hive is the write database for purpose or one is better off with 
 something like Phoenix on Hbase, well the answer is it depends and your 
 mileage varies. 
 
 So fit for purpose.
 
 Ideally what wants is to use the fastest  method to get the results. How 
 fast we confine it to our SLA agreements in production and that helps us 
 from unnecessary further work as we technologists like to play around.
 
 So in short, we use Spark most of the time and use Hive as the backend 
 engine for data storage, mainly ORC tables.
 
 We use Hive on Spark and with Hive 2 on Spark 1.3.1 for now we have a 
 combination that works. Granted it helps to use Hive 2 on Spark 1.6.1 but 
 at the moment it is one of my projects.
 
 We do not use any vendor's products as it enables us to move away  from 
 being tied down after years of SAP, Oracle and MS dependency to yet 
 another vendor. Besides there is some politics going on with one promoting 
 Tez and another Spark as a backend. That is fine but obviously we prefer 
 an independent assessment ourselves.
 
 My gut feeling is that one needs to look at the use case. Recently we had 
 to import a very large table from Oracle to Hive and decided to use Spark 
 1.6.1 with Hive 2 on Spark 1.3.1 and that worked fine. We just used JDBC 
 connection with temp table and it was good. We could have used sqoop but 
 decided to settle for Spark so it all depends on use case.
 
 HTH
 
 
 
 Dr Mich Talebzadeh
  
 LinkedIn  
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
  
 http://talebzadehmich.wordpress.com
  
 
> On 24 May 2016 at 03:11, ayan guha  wrote:
> Hi
> 
> Thanks for very useful stats. 
> 
> Did you have any benchmark for using Spark as backend engine for Hive vs 
> using Spark thrift server (and run spark code for hive queries)? We are 
> using later but it will be very useful to remove thriftserver, if we can. 
> 
>> On Tue, May 24, 2016 at 9:51 AM, Jörn Franke  
>> wrote:
>> 
>> Hi Mich,
>> 
>> I think these comparisons are useful. One interesting aspect could be 
>> hardware scalability in this context. Additionally different type of 
>> computations. Furthermore, one could compare Spark and Tez+llap as 
>> execution engines. I have the gut feeling that  each one can be 
>> 

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-29 Thread Mich Talebzadeh
Hi Jorn,

I started building apache-tez-0.8.2 but got a few errors. A couple of guys from
the TEZ user group kindly gave a hand but I could not get very far (or maybe I
did not make enough of an effort) making it work.

That TEZ user group is very quiet as well.

My understanding is TEZ is MR with DAG but of course Spark has both plus
in-memory capability.

It would be interesting to see what version of TEZ works as execution
engine with Hive.

Vendors are divided on this (use Hive with TEZ) or use Impala instead of
Hive etc as I am sure you already know.

Cheers,




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 29 May 2016 at 20:19, Jörn Franke  wrote:

> Very interesting do you plan also a test with TEZ?
>
> On 29 May 2016, at 13:40, Mich Talebzadeh 
> wrote:
>
> Hi,
>
> I did another study of Hive using Spark engine compared to Hive with MR.
>
> Basically took the original table imported using Sqoop and created and
> populated a new ORC table partitioned by year and month into 48 partitions
> as follows:
>
> 
> ​
> Connections use JDBC via beeline. Now for each partition using MR it takes
> an average of 17 minutes as seen below for each PARTITION..  Now that is
> just an individual partition and there are 48 partitions.
>
> In contrast doing the same operation with Spark engine took 10 minutes all
> inclusive. I just gave up on MR. You can see the StartTime and FinishTime
> from below
>
> 
>
> This by no means indicates that Spark is much better than MR, but it shows
> that some very good results can be achieved using the Spark engine.
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 24 May 2016 at 08:03, Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> We use Hive as the database and use Spark as an all purpose query tool.
>>
>> Whether Hive is the write database for purpose or one is better off with
>> something like Phoenix on Hbase, well the answer is it depends and your
>> mileage varies.
>>
>> So fit for purpose.
>>
>> Ideally what wants is to use the fastest  method to get the results. How
>> fast we confine it to our SLA agreements in production and that helps us
>> from unnecessary further work as we technologists like to play around.
>>
>> So in short, we use Spark most of the time and use Hive as the backend
>> engine for data storage, mainly ORC tables.
>>
>> We use Hive on Spark and with Hive 2 on Spark 1.3.1 for now we have a
>> combination that works. Granted it helps to use Hive 2 on Spark 1.6.1 but
>> at the moment it is one of my projects.
>>
>> We do not use any vendor's products as it enables us to move away  from
>> being tied down after years of SAP, Oracle and MS dependency to yet another
>> vendor. Besides there is some politics going on with one promoting Tez and
>> another Spark as a backend. That is fine but obviously we prefer an
>> independent assessment ourselves.
>>
>> My gut feeling is that one needs to look at the use case. Recently we had
>> to import a very large table from Oracle to Hive and decided to use Spark
>> 1.6.1 with Hive 2 on Spark 1.3.1 and that worked fine. We just used JDBC
>> connection with temp table and it was good. We could have used sqoop but
>> decided to settle for Spark so it all depends on use case.
>>
>> HTH
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 24 May 2016 at 03:11, ayan guha  wrote:
>>
>>> Hi
>>>
>>> Thanks for very useful stats.
>>>
>>> Did you have any benchmark for using Spark as backend engine for Hive vs
>>> using Spark thrift server (and run spark code for hive queries)? We are
>>> using later but it will be very useful to remove thriftserver, if we can.
>>>
>>> On Tue, May 24, 2016 at 9:51 AM, Jörn Franke 
>>> wrote:
>>>

 Hi Mich,

 I think these comparisons are useful. One interesting aspect could be
 hardware scalability in this context. Additionally different type of
 computations. Furthermore, one could compare Spark and Tez+llap as
 execution engines. I have the gut feeling that  each one can be justified
 by different use cases.
 Nevertheless, there should be always a disclaimer for such comparisons,
 because Spark and Hive are not good for a lot of concurrent lookups of
 single rows. They are not good for frequently write small amounts of data
 (eg sensor data). Here hbase could be more interesting. Other use cases can
 justify graph databases, 

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-29 Thread Jörn Franke
Very interesting, do you also plan a test with TEZ?

> On 29 May 2016, at 13:40, Mich Talebzadeh  wrote:
> 
> Hi,
> 
> I did another study of Hive using Spark engine compared to Hive with MR.
> 
> Basically took the original table imported using Sqoop and created and 
> populated a new ORC table partitioned by year and month into 48 partitions as 
> follows:
> 
> 
> ​ 
> Connections use JDBC via beeline. Now for each partition using MR it takes an 
> average of 17 minutes as seen below for each PARTITION..  Now that is just an 
> individual partition and there are 48 partitions.
> 
> In contrast doing the same operation with Spark engine took 10 minutes all 
> inclusive. I just gave up on MR. You can see the StartTime and FinishTime 
> from below
> 
> 
> 
> This by no means indicates that Spark is much better than MR, but it shows that 
> some very good results can be achieved using the Spark engine.
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 24 May 2016 at 08:03, Mich Talebzadeh  wrote:
>> Hi,
>> 
>> We use Hive as the database and use Spark as an all purpose query tool.
>> 
>> Whether Hive is the write database for purpose or one is better off with 
>> something like Phoenix on Hbase, well the answer is it depends and your 
>> mileage varies. 
>> 
>> So fit for purpose.
>> 
>> Ideally what wants is to use the fastest  method to get the results. How 
>> fast we confine it to our SLA agreements in production and that helps us 
>> from unnecessary further work as we technologists like to play around.
>> 
>> So in short, we use Spark most of the time and use Hive as the backend 
>> engine for data storage, mainly ORC tables.
>> 
>> We use Hive on Spark and with Hive 2 on Spark 1.3.1 for now we have a 
>> combination that works. Granted it helps to use Hive 2 on Spark 1.6.1 but at 
>> the moment it is one of my projects.
>> 
>> We do not use any vendor's products as it enables us to move away  from 
>> being tied down after years of SAP, Oracle and MS dependency to yet another 
>> vendor. Besides there is some politics going on with one promoting Tez and 
>> another Spark as a backend. That is fine but obviously we prefer an 
>> independent assessment ourselves.
>> 
>> My gut feeling is that one needs to look at the use case. Recently we had to 
>> import a very large table from Oracle to Hive and decided to use Spark 1.6.1 
>> with Hive 2 on Spark 1.3.1 and that worked fine. We just used JDBC 
>> connection with temp table and it was good. We could have used sqoop but 
>> decided to settle for Spark so it all depends on use case.
>> 
>> HTH
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>>  
>> 
>>> On 24 May 2016 at 03:11, ayan guha  wrote:
>>> Hi
>>> 
>>> Thanks for very useful stats. 
>>> 
>>> Did you have any benchmark for using Spark as backend engine for Hive vs 
>>> using Spark thrift server (and run spark code for hive queries)? We are 
>>> using later but it will be very useful to remove thriftserver, if we can. 
>>> 
 On Tue, May 24, 2016 at 9:51 AM, Jörn Franke  wrote:
 
 Hi Mich,
 
 I think these comparisons are useful. One interesting aspect could be 
 hardware scalability in this context. Additionally different type of 
 computations. Furthermore, one could compare Spark and Tez+llap as 
 execution engines. I have the gut feeling that  each one can be justified 
 by different use cases.
 Nevertheless, there should be always a disclaimer for such comparisons, 
 because Spark and Hive are not good for a lot of concurrent lookups of 
 single rows. They are not good for frequently write small amounts of data 
 (eg sensor data). Here hbase could be more interesting. Other use cases 
 can justify graph databases, such as Titan, or text analytics/ data 
 matching using Solr on Hadoop.
 Finally, even if you have a lot of data you need to think if you always 
 have to process everything. For instance, I have found valid use cases in 
 practice where we decided to evaluate 10 machine learning models in 
 parallel on only a sample of data and only evaluate the "winning" model of 
 the total of data.
 
 As always it depends :) 
 
 Best regards
 
 P.s.: at least Hortonworks has in their distribution spark 1.5 with hive 
 1.2 and spark 1.6 with hive 1.2. Maybe they have somewhere described how 
 to manage bringing both together. You may check also Apache Bigtop (vendor 
 neutral distribution) on how they managed to bring both together.
 
> On 23 May 2016, at 01:42, Mich Talebzadeh  
> wrote:
> 
> Hi,
>  
> I have done a number of extensive tests using Spark-shell with Hive DB 
> and ORC tables.
>  

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-24 Thread Mich Talebzadeh
Hi,

We use Hive as the database and use Spark as an all purpose query tool.

Whether Hive is the right database for the purpose, or whether one is better
off with something like Phoenix on Hbase, well the answer is it depends and
your mileage varies.

So fit for purpose.

Ideally what one wants is to use the fastest method to get the results. How
fast? We confine it to our SLA agreements in production, and that saves us
from unnecessary further work, as we technologists like to play around.

So in short, we use Spark most of the time and use Hive as the backend
engine for data storage, mainly ORC tables.

We use Hive on Spark and with Hive 2 on Spark 1.3.1 for now we have a
combination that works. Granted it helps to use Hive 2 on Spark 1.6.1 but
at the moment it is one of my projects.

We do not use any vendor's products as it enables us to move away  from
being tied down after years of SAP, Oracle and MS dependency to yet another
vendor. Besides there is some politics going on with one promoting Tez and
another Spark as a backend. That is fine but obviously we prefer an
independent assessment ourselves.

My gut feeling is that one needs to look at the use case. Recently we had
to import a very large table from Oracle to Hive and decided to use Spark
1.6.1 with Hive 2 on Spark 1.3.1, and that worked fine. We just used a JDBC
connection with a temp table and it was good. We could have used Sqoop but
decided to settle for Spark, so it all depends on the use case.
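As a rough sketch of that JDBC plus temp table approach in Spark SQL - the
connection string, credentials and source schema below are made up for
illustration, not our actual setup:

-- Register the Oracle table as a temporary table over JDBC (Spark 1.x syntax),
-- then insert into the target Hive ORC table.
CREATE TEMPORARY TABLE oracle_dummy
USING org.apache.spark.sql.jdbc
OPTIONS (
  url      "jdbc:oracle:thin:@//oracle-host:1521/MYDB",
  dbtable  "SCRATCHPAD.DUMMY",
  user     "scratchpad",
  password "xxxx"
);

INSERT INTO TABLE oraclehadoop.dummy
SELECT * FROM oracle_dummy;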

HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 24 May 2016 at 03:11, ayan guha  wrote:

> Hi
>
> Thanks for very useful stats.
>
> Did you have any benchmark for using Spark as backend engine for Hive vs
> using Spark thrift server (and run spark code for hive queries)? We are
> using later but it will be very useful to remove thriftserver, if we can.
>
> On Tue, May 24, 2016 at 9:51 AM, Jörn Franke  wrote:
>
>>
>> Hi Mich,
>>
>> I think these comparisons are useful. One interesting aspect could be
>> hardware scalability in this context. Additionally different type of
>> computations. Furthermore, one could compare Spark and Tez+llap as
>> execution engines. I have the gut feeling that  each one can be justified
>> by different use cases.
>> Nevertheless, there should be always a disclaimer for such comparisons,
>> because Spark and Hive are not good for a lot of concurrent lookups of
>> single rows. They are not good for frequently write small amounts of data
>> (eg sensor data). Here hbase could be more interesting. Other use cases can
>> justify graph databases, such as Titan, or text analytics/ data matching
>> using Solr on Hadoop.
>> Finally, even if you have a lot of data you need to think if you always
>> have to process everything. For instance, I have found valid use cases in
>> practice where we decided to evaluate 10 machine learning models in
>> parallel on only a sample of data and only evaluate the "winning" model of
>> the total of data.
>>
>> As always it depends :)
>>
>> Best regards
>>
>> P.s.: at least Hortonworks has in their distribution spark 1.5 with hive
>> 1.2 and spark 1.6 with hive 1.2. Maybe they have somewhere described how to
>> manage bringing both together. You may check also Apache Bigtop (vendor
>> neutral distribution) on how they managed to bring both together.
>>
>> On 23 May 2016, at 01:42, Mich Talebzadeh 
>> wrote:
>>
>> Hi,
>>
>>
>>
>> I have done a number of extensive tests using Spark-shell with Hive DB
>> and ORC tables.
>>
>>
>>
>> Now one issue that we typically face is and I quote:
>>
>>
>>
>> Spark is fast as it uses Memory and DAG. Great but when we save data it
>> is not fast enough
>>
>> OK but there is a solution now. If you use Spark with Hive and you are on
>> a descent version of Hive >= 0.14, then you can also deploy Spark as
>> execution engine for Hive. That will make your application run pretty fast
>> as you no longer rely on the old Map-Reduce for Hive engine. In a nutshell
>> what you are gaining speed in both querying and storage.
>>
>>
>>
>> I have made some comparisons on this set-up and I am sure some of you
>> will find it useful.
>>
>>
>>
>> The version of Spark I use for Spark queries (Spark as query tool) is 1.6.
>>
>> The version of Hive I use in Hive 2
>>
>> The version of Spark I use as Hive execution engine is 1.3.1 It works and
>> frankly Spark 1.3.1 as an execution engine is adequate (until we sort out
>> the Hadoop libraries mismatch).
>>
>>
>>
>> An example I am using Hive on Spark engine to find the min and max of IDs
>> for a table with 1 billion rows:
>>
>>
>>
>> 0: jdbc:hive2://rhes564:10010/default>  select min(id), max(id),avg(id),
>> stddev(id) from oraclehadoop.dummy;
>>
>> Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>>
>>
>>
>>
>>
>> Starting Spark Job = 5e092ef

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-23 Thread Ashok Kumar
Hi Dr Mich,
This is very good news. I will be interested to know how Hive engages with 
Spark as an engine. What Spark processes are used to make this work? 
Thanking you 

On Monday, 23 May 2016, 19:01, Mich Talebzadeh  
wrote:
 

 Have a look at this thread
Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 23 May 2016 at 09:10, Mich Talebzadeh  wrote:

Hi Timur and everyone.
I will answer your first question as it is very relevant
1) How to make 2 versions of Spark live together on the same cluster (libraries 
clash, paths, etc.) ? 
Most of the Spark users perform ETL, ML operations on Spark as well. So, we may 
have 3 Spark installations simultaneously

There are two distinct points here.
Using Spark as a query engine. That is BAU and most forum members use it 
every day. You run Spark with either Standalone, Yarn or Mesos as the cluster 
manager. You start the master, which does the management of resources, and you 
start slaves to create workers. 
You deploy Spark either by Spark-shell, Spark-sql or submit jobs through 
spark-submit etc. You may or may not use Hive as your database. You may use 
Hbase via Phoenix etc. If you choose to use Hive as your database, on every host 
of the cluster including your master host, you ensure that Hive APIs are installed 
(meaning Hive is installed). In $SPARK_HOME/conf, you create a soft link to the 
Hive hive-site.xml:
cd $SPARK_HOME/conf
hduser@rhes564: /usr/lib/spark-1.6.1-bin-hadoop2.6/conf> ltr hive-site.xml
lrwxrwxrwx 1 hduser hadoop 32 May  3 17:48 hive-site.xml -> 
/usr/lib/hive/conf/hive-site.xml
Now in hive-site.xml you can define all the parameters needed for Spark 
connectivity. Remember we are making Hive use the Spark 1.3.1 engine. WE ARE NOT 
RUNNING SPARK 1.3.1 AS A QUERY TOOL. We do not need to start the master or workers 
for Spark 1.3.1! It is just an execution engine like mr etc.
Let us look at how we do that in hive-site.xml, noting the settings for 
hive.execution.engine=spark and spark.home=/usr/lib/spark-1.3.1-bin-hadoop2 
below. That tells Hive to use Spark 1.3.1 as the execution engine. You just 
install Spark 1.3.1 on the host (just the binary download); it is in 
/usr/lib/spark-1.3.1-bin-hadoop2.6
In hive-site.xml, you set the properties.
  
<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
  <description>
    Expects one of [mr, tez, spark].
    Chooses execution engine. Options are: mr (Map reduce, default), tez, spark. While MR
    remains the default engine for historical reasons, it is itself a historical engine
    and is deprecated in Hive 2 line. It may be removed without further warning.
  </description>
</property>

<property>
  <name>spark.home</name>
  <value>/usr/lib/spark-1.3.1-bin-hadoop2</value>
  <description>something</description>
</property>

<property>
  <name>hive.merge.sparkfiles</name>
  <value>false</value>
  <description>Merge small files at the end of a Spark DAG Transformation</description>
</property>

<property>
  <name>hive.spark.client.future.timeout</name>
  <value>60s</value>
  <description>
    Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is sec if not specified.
    Timeout for requests from Hive client to remote Spark driver.
  </description>
</property>

<property>
  <name>hive.spark.job.monitor.timeout</name>
  <value>60s</value>
  <description>
    Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is sec if not specified.
    Timeout for job monitor to get Spark job state.
  </description>
</property>

<property>
  <name>hive.spark.client.connect.timeout</name>
  <value>1000ms</value>
  <description>
    Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is msec if not specified.
    Timeout for remote Spark driver in connecting back to Hive client.
  </description>
</property>

<property>
  <name>hive.spark.client.server.connect.timeout</name>
  <value>9ms</value>
  <description>
    Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is msec if not specified.
    Timeout for handshake between Hive client and remote Spark driver. Checked by both processes.
  </description>
</property>

<property>
  <name>hive.spark.client.secret.bits</name>
  <value>256</value>
  <description>Number of bits of randomness in the generated secret for communication between Hive client and remote Spark driver. Rounded down to the nearest multiple of 8.</description>
</property>

<property>
  <name>hive.spark.client.rpc.threads</name>
  <value>8</value>
  <description>Maximum number of threads for remote Spark driver's RPC event loop.</description>
</property>
And other settings as well
That was the Hive stuff for your Spark BAU. So there are two distinct things. 
Now going to Hive itself, you will need to add the correct Spark assembly jar 
file for Hadoop. These are called 
spark-assembly-x.y.z-hadoop2.4.0.jar 
where x.y.z in this case is 1.3.1.
The assembly file is
spark-assembly-1.3.1-hadoop2.4.0.jar
So you add that spark-assembly-1.3.1-hadoop2.4.0.jar to $HIVE_HOME/lib:
ls $HIVE_HOME/lib/spark-assembly-1.3.1-hadoop2.4.0.jar
/usr/lib/hive/lib/spark-assembly-1.3.1-hadoop2.4.0.jar
And you need to compile Spark from source excluding the Hadoop dependencies:
./make-distribution.sh --name "hadoop2-without-hive" --tgz 
"-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"

So Hive uses the Spark engine by default. If you want to use mr instead, you can 
switch the execution engine back per session.
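For example, the engine can be switched per session from beeline, along these
lines (reusing the earlier prompt and table purely for illustration):

0: jdbc:hive2://rhes564:10010/default> set hive.execution.engine=mr;
0: jdbc:hive2://rhes564:10010/default> select count(1) from oraclehadoop.dummy;
0: jdbc:hive2://rhes564:10010/default> set hive.execution.engine=spark;

The last statement switches the session back to the Spark engine.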

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-23 Thread Mich Talebzadeh
Have a look at this thread

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 23 May 2016 at 09:10, Mich Talebzadeh  wrote:

> Hi Timur and everyone.
>
> I will answer your first question as it is very relevant
>
> 1) How to make 2 versions of Spark live together on the same cluster
> (libraries clash, paths, etc.) ?
> Most of the Spark users perform ETL, ML operations on Spark as well. So,
> we may have 3 Spark installations simultaneously
>
> There are two distinct points here.
>
> Using Spark as a  query engine. That is BAU and most forum members use it
> everyday. You run Spark with either Standalone, Yarn or Mesos as Cluster
> managers. You start master that does the management of resources and you
> start slaves to create workers.
>
>  You deploy Spark either by Spark-shell, Spark-sql or submit jobs through
> spark-submit etc. You may or may not use Hive as your database. You may use
> Hbase via Phoenix etc
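
[A minimal illustration of the standalone BAU mode described above; the master host rhes564 and the default port 7077 are assumptions based on the prompts shown elsewhere in this thread:]

$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-slaves.sh      # starts a worker on every host listed in conf/slaves
$SPARK_HOME/bin/spark-sql --master spark://rhes564:7077 -e "select 1"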
> If you choose to use Hive as your database, on every host of cluster
> including your master host, you ensure that Hive APIs are installed
> (meaning Hive installed). In $SPARK_HOME/conf, you create a soft link to
> cd $SPARK_HOME/conf
> hduser@rhes564: /usr/lib/spark-1.6.1-bin-hadoop2.6/conf> ltr hive-site.xml
> lrwxrwxrwx 1 hduser hadoop 32 May  3 17:48 hive-site.xml ->
> /usr/lib/hive/conf/hive-site.xml
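
[A minimal sketch of creating that soft link, assuming the same paths as in the quoted listing:]

cd $SPARK_HOME/conf
ln -s /usr/lib/hive/conf/hive-site.xml hive-site.xml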
> Now in hive-site.xml you can define all the parameters needed for Spark
> connectivity. Remember we are making Hive use spark1.3.1  engine. WE ARE
> NOT RUNNING SPARK 1.3.1 AS A QUERY TOOL. We do not need to start master or
> workers for Spark 1.3.1! It is just an execution engine like mr etc.
>
> Let us look at how we do that in hive-site,xml. Noting the settings for
> hive.execution.engine=spark and spark.home=/usr/lib/spark-1.3.1-bin-hadoop2
> below. That tells Hive to use spark 1.3.1 as the execution engine. You just
> install spark 1.3.1 on the host just the binary download it is
> /usr/lib/spark-1.3.1-bin-hadoop2.6
>
> In hive-site.xml, you set the properties.
>
>   <property>
>     <name>hive.execution.engine</name>
>     <value>spark</value>
>     <description>
>       Expects one of [mr, tez, spark].
>       Chooses execution engine. Options are: mr (Map reduce, default), tez, spark. While MR
>       remains the default engine for historical reasons, it is itself a historical engine
>       and is deprecated in Hive 2 line. It may be removed without further warning.
>     </description>
>   </property>
>
>   <property>
>     <name>spark.home</name>
>     <value>/usr/lib/spark-1.3.1-bin-hadoop2</value>
>     <description>something</description>
>   </property>
>
>   <property>
>     <name>hive.merge.sparkfiles</name>
>     <value>false</value>
>     <description>Merge small files at the end of a Spark DAG Transformation</description>
>   </property>
>
>   <property>
>     <name>hive.spark.client.future.timeout</name>
>     <value>60s</value>
>     <description>
>       Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is sec if not specified.
>       Timeout for requests from Hive client to remote Spark driver.
>     </description>
>   </property>
>
>   <property>
>     <name>hive.spark.job.monitor.timeout</name>
>     <value>60s</value>
>     <description>
>       Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is sec if not specified.
>       Timeout for job monitor to get Spark job state.
>     </description>
>   </property>
>
>   <property>
>     <name>hive.spark.client.connect.timeout</name>
>     <value>1000ms</value>
>     <description>
>       Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is msec if not specified.
>       Timeout for remote Spark driver in connecting back to Hive client.
>     </description>
>   </property>
>
>   <property>
>     <name>hive.spark.client.server.connect.timeout</name>
>     <value>90000ms</value>
>     <description>
>       Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is msec if not specified.
>       Timeout for handshake between Hive client and remote Spark driver. Checked by both processes.
>     </description>
>   </property>
>   <property>
>     <name>hive.spark.client.secret.bits</name>
>     <value>256</value>
>     <description>Number of bits of randomness in the generated secret for communication between Hive client and remote Spark driver. Rounded down to the nearest multiple of 8.</description>
>   </property>
>
>   <property>
>     <name>hive.spark.client.rpc.threads</name>
>     <value>8</value>
>     <description>Maximum number of threads for remote Spark driver's RPC event loop.</description>
>   </property>
>
> And other settings as well
>
> That was the Hive stuff for your Spark BAU. So there are two distinct
> things. Now going to Hive itself, you will need to add the correct assembly
> jar file for Hadoop. These are called
>
> spark-assembly-x.y.z-hadoop2.4.0.jar
>
> Where x.y.z in this case is 1.3.1
>
> The assembly file is
>
> spark-assembly-1.3.1-hadoop2.4.0.jar
>
> So you add that spark-assembly-1.3.1-hadoop2.4.0.jar to $HIVE_HOME/lib
>
> ls $HIVE_HOME/lib/spark-assembly-1.3.1-hadoop2.4.0.jar
> /usr/lib/hive/lib/spark-assembly-1.3.1-hadoop2.4.0.jar
>
> And you need to compile spark from source excluding Hadoop dependencies
>
>
> ./make-distribution.sh --name "hadoop2-without-hive" --tgz
> "-Pyarn,