Re: NPE when reading Parquet using Hive on Tez

2016-02-02 Thread Adam Hunt
Hi Gopal,

With the release of 0.8.2, I thought I would give Tez another shot.
Unfortunately, I got the same NPE. I dug a little deeper and it appears
that the configuration property "columns.types", which is used in
org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(),
is not being set. When I manually set that property in Hive, your example
works fine.

hive> create temporary table x (x int) stored as parquet;
hive> insert into x values(1),(2);
hive> set columns.types=int;
hive> select count(*) from x where x.x > 1;
OK
1
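
The NPE itself is easy to reproduce in isolation: TypeInfoUtils chokes on a
null type string. A minimal sketch (assuming hive-serde is on the classpath;
this demo is mine, not part of the original report):

import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;

public class NpeDemo {
  public static void main(String[] args) {
    // works: a type string is present, as when "columns.types" is set
    System.out.println(TypeInfoUtils.getTypeInfosFromTypeString("int"));
    // throws NullPointerException in TypeInfoParser.tokenize(), mirroring
    // the Tez failure where conf.get("columns.types") returned null
    System.out.println(TypeInfoUtils.getTypeInfosFromTypeString(null));
  }
}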

I also saw that the configuration parameter parquet.column.index.access is
checked in that same method. Setting that property to "true" likewise fixes
my issue.

hive> create temporary table x (x int) stored as parquet;
hive> insert into x values(1),(2);
hive> set parquet.column.index.access=true;
hive> select count(*) from x where x.x > 1;
OK
1

Thanks for your help.

Best,
Adam



On Tue, Jan 5, 2016 at 9:10 AM, Adam Hunt  wrote:

> Hi Gopal,
>
> Spark does offer dynamic allocation, but it doesn't always work as
> advertised. My experience with Tez has been more in line with my
> expectations. I'll bring up my issues with Spark on that list.
>
> I tried your example and got the same NPE. It might be a mapr-hive issue.
> Thanks for your help.
>
> Adam
>
> On Mon, Jan 4, 2016 at 12:58 PM, Gopal Vijayaraghavan 
> wrote:
>
>>
>> > select count(*) from alexa_parquet;
>>
>> > Caused by: java.lang.NullPointerException
>> >   at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.tokenize(TypeInfoUtils.java:274)
>> >   at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.<init>(TypeInfoUtils.java:293)
>> >   at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:764)
>> >   at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.getColumnTypes(DataWritableReadSupport.java:76)
>> >   at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:220)
>> >   at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:256)
>>
>> This might be an NPE triggered by a specific case of the type parser.
>>
>> I tested it out on my current build with simple types and it looks like
>> the issue needs more detail on the column types for a repro.
>>
>> hive> create temporary table x (x int) stored as parquet;
>> hive> insert into x values(1),(2);
>> hive> select count(*) from x where x.x > 1;
>> Status: DAG finished successfully in 0.18 seconds
>> OK
>> 1
>> Time taken: 0.792 seconds, Fetched: 1 row(s)
>> hive>
>>
>> Do you have INT96 in the schema?
>>
>> > I'm currently evaluating Hive on Tez as an alternative to keeping the
>> > SparkSQL Thrift server running all the time, locking up resources.
>>
>> Tez has a tunable value in tez.am.session.min.held-containers (e.g.
>> something small like 10).
>>
>> And HiveServer2 can be made to work similarly, because Spark's
>> HiveThriftServer2.scala is a wrapper around Hive's ThriftBinaryCLIService.
>>
>> Cheers,
>> Gopal
>>
>>
>>
>


Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Mich Talebzadeh
Hi,

 

My understanding is that with the Hive on Spark engine, one gets the Hive
optimizer and the Spark query engine.

 

With Spark using the Hive metastore, Spark does both the optimization and the
query execution. The only value-add is that one can access the underlying Hive
tables from spark-sql etc.

 

 

Is this assessment correct?

 

 

 

Thanks

 

Dr Mich Talebzadeh

 

LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

 

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 15",
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN:
978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume
one out shortly

 

  http://talebzadehmich.wordpress.com

 


 



Re: NPE when reading Parquet using Hive on Tez

2016-02-02 Thread Gopal Vijayaraghavan
> I dug a little deeper and it appears that the configuration property
>"columns.types", which is used in
>org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(),
> is not being set. When I manually set that property in Hive, your
>example works fine.

Good to know more about the NPE. ORC uses the exact same parameter.

ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java:
columnTypeProperty = conf.get(serdeConstants.LIST_COLUMN_TYPES);

But I think this could have a very simple explanation.

Assuming you have a build of Tez, I would recommend adding a couple of
LOG.warn lines in TezGroupedSplitsInputFormat

public RecordReader getRecordReader(InputSplit split, JobConf job,
  Reporter reporter) throws IOException {


In particular, check whether the "this.conf" or the "job" conf object has
"columns.types" set.
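
For example (a sketch only; "LOG" is the class's existing logger, and the
exact field names may differ across Tez versions):

public RecordReader getRecordReader(InputSplit split, JobConf job,
    Reporter reporter) throws IOException {
  // temporary probes: which Configuration object actually carries the types?
  LOG.warn("this.conf columns.types = " + conf.get("columns.types"));
  LOG.warn("job columns.types = " + job.get("columns.types"));
  // ... rest of the method unchanged
}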

My guess is that the "set" command is setting that up in the JobConf, and the
default compiler places it in the this.conf object.

If that is the case, we can fix Parquet to pick it up off the right one.

Cheers,
Gopal


Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Xuefu Zhang
When comparing the performance, you need to do it apples to apples. In
another thread, you mentioned that Hive on Spark is much slower than Spark
SQL. However, you configured Hive such that only two tasks can run in
parallel, and you didn't provide information on how much Spark SQL is
utilizing. Thus, it's hard to tell whether it's just a configuration
problem in your Hive or whether Spark SQL is indeed faster. You should be
able to see the resource usage in the YARN ResourceManager URL.

--Xuefu

On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh  wrote:

> Thanks Jeff.
>
>
>
> Obviously Hive is much more feature rich compared to Spark. Having said
> that in certain areas for example where the SQL feature is available in
> Spark, Spark seems to deliver faster.
>
>
>
> This may be:
>
>
>
> 1.Spark does both the optimisation and execution seamlessly
>
> 2.Hive on Spark has to invoke YARN that adds another layer to the
> process
>
>
>
> Now I did some simple tests on a 100Million rows ORC table available
> through Hive to both.
>
>
>
> *Spark 1.5.2 on Hive 1.2.1 Metastore*
>
>
>
>
>
> spark-sql> select * from dummy where id in (1, 5, 10);
>
> 1   0   0   63
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
> xx
>
> 5   0   4   31
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
> xx
>
> 10  99  999 188
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
> xx
>
> Time taken: 50.805 seconds, Fetched 3 row(s)
>
> spark-sql> select * from dummy where id in (1, 5, 10);
>
> 1   0   0   63
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
> xx
>
> 5   0   4   31
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
> xx
>
> 10  99  999 188
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
> xx
>
> Time taken: 50.358 seconds, Fetched 3 row(s)
>
> spark-sql> select * from dummy where id in (1, 5, 10);
>
> 1   0   0   63
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
> xx
>
> 5   0   4   31
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
> xx
>
> 10  99  999 188
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
> xx
>
> Time taken: 50.563 seconds, Fetched 3 row(s)
>
>
>
> So three runs returning three rows just over 50 seconds
>
>
>
> *Hive 1.2.1 on spark 1.3.1 execution engine*
>
>
>
> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1,
> 5, 10);
>
> INFO  :
>
> Query Hive on Spark job[4] stages:
>
> INFO  : 4
>
> INFO  :
>
> Status: Running (Hive on Spark job[4])
>
> INFO  : Status: Finished successfully in 82.49 seconds
>
>
> +---+--+--+---+-+-++--+
>
> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised
> | dummy.random_string | dummy.small_vc  |
> dummy.padding  |
>
>
> +---+--+--+---+-+-++--+
>
> | 1 | 0| 0| 63|
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |  1  |
> xx |
>
> | 5 | 0| 4| 31|
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |  5  |
> xx |
>
> | 10| 99   | 999  | 188   |
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 10  |
> xx |
>
>
> +---+--+--+---+-+-++--+
>
> 3 rows selected (82.66 seconds)
>
> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1,
> 5, 10);
>
> INFO  : Status: Finished successfully in 76.67 seconds
>
>
> +---+--+--+---+-+-++--+
>
> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised
> | dummy.random_string | dummy.small_vc  |
> dummy.padding  |
>
>
> +---+--+--+---+-+-++--+
>
> | 1 | 0| 0| 63|
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |  1  |
> xx |
>
> | 5 | 0| 4| 31|
> 

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Xuefu Zhang
I think the difference is not only about which side does the optimization but
more about feature parity. Hive on Spark offers all the functional features
that Hive offers, and those features play out faster. However, Spark SQL is
far from offering this parity, as far as I know.

On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh  wrote:

> Hi,
>
>
>
> My understanding is that with Hive on Spark engine, one gets the Hive
> optimizer and Spark query engine
>
>
>
> With spark using Hive metastore, Spark does both the optimization and
> query engine. The only value add is that one can access the underlying Hive
> tables from spark-sql etc
>
>
>
>
>
> Is this assessment correct?
>
>
>
>
>
>
>
> Thanks
>
>
>
> Dr Mich Talebzadeh
>
> http://talebzadehmich.wordpress.com
>
>
>


RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Mich Talebzadeh
Thanks Jeff.

 

Obviously Hive is much more feature-rich compared to Spark. Having said that,
in certain areas, for example where the SQL feature is available in Spark,
Spark seems to deliver faster.

 

This may be:

 

1. Spark does both the optimisation and the execution seamlessly.

2. Hive on Spark has to invoke YARN, which adds another layer to the process.

 

Now I did some simple tests on a 100-million-row ORC table available through
Hive to both.

 

Spark 1.5.2 on Hive 1.2.1 Metastore

 

 

spark-sql> select * from dummy where id in (1, 5, 10);

1   0   0   63  
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
xx

5   0   4   31  
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
xx

10  99  999 188 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
xx

Time taken: 50.805 seconds, Fetched 3 row(s)

spark-sql> select * from dummy where id in (1, 5, 10);

1   0   0   63  
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
xx

5   0   4   31  
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
xx

10  99  999 188 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
xx

Time taken: 50.358 seconds, Fetched 3 row(s)

spark-sql> select * from dummy where id in (1, 5, 10);

1   0   0   63  
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
xx

5   0   4   31  
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
xx

10  99  999 188 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
xx

Time taken: 50.563 seconds, Fetched 3 row(s)

 

So three runs returning three rows just over 50 seconds

 

Hive 1.2.1 on Spark 1.3.1 execution engine

 

0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 
10);

INFO  :

Query Hive on Spark job[4] stages:

INFO  : 4

INFO  :

Status: Running (Hive on Spark job[4])

INFO  : Status: Finished successfully in 82.49 seconds

+---+--+--+---+-+-++--+

| dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  | 
dummy.random_string | dummy.small_vc  | dummy.padding  |

+---+--+--+---+-+-++--+

| 1 | 0| 0| 63| 
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |  1  | 
xx |

| 5 | 0| 4| 31| 
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |  5  | 
xx |

| 10| 99   | 999  | 188   | 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 10  | 
xx |

+---+--+--+---+-+-++--+

3 rows selected (82.66 seconds)

0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 
10);

INFO  : Status: Finished successfully in 76.67 seconds

+---+--+--+---+-+-++--+

| dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  | 
dummy.random_string | dummy.small_vc  | dummy.padding  |

+---+--+--+---+-+-++--+

| 1 | 0| 0| 63| 
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |  1  | 
xx |

| 5 | 0| 4| 31| 
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |  5  | 
xx |

| 10| 99   | 999  | 188   | 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 10  | 
xx |

+---+--+--+---+-+-++--+

3 rows selected (76.835 seconds)

0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 
10);

INFO  : Status: Finished successfully in 80.54 seconds


Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Philip Lee
From my experience, Spark SQL has its own optimizer to support Hive queries
and the metastore. After Spark 1.5.2, its optimizer is named Catalyst.
On Feb 3, 2016 at 12:12 AM, "Xuefu Zhang" wrote:

> I think the diff is not only about which does optimization but more on
> feature parity. Hive on Spark offers all functional features that Hive
> offers and these features play out faster. However, Spark SQL is far from
> offering this parity as far as I know.
>
> On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>>
>>
>> My understanding is that with Hive on Spark engine, one gets the Hive
>> optimizer and Spark query engine
>>
>>
>>
>> With spark using Hive metastore, Spark does both the optimization and
>> query engine. The only value add is that one can access the underlying Hive
>> tables from spark-sql etc
>>
>>
>>
>>
>>
>> Is this assessment correct?
>>
>>
>>
>>
>>
>>
>>
>> Thanks
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>
>


RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Mich Talebzadeh
Hi,

 

Are you referring to spark-shell with Scala, Python and others? 

 

 

Dr Mich Talebzadeh

http://talebzadehmich.wordpress.com

 

From: Koert Kuipers [mailto:ko...@tresata.com] 
Sent: 03 February 2016 00:09
To: user@hive.apache.org
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

 

uuuhm with spark using Hive metastore you actually have a real programming 
environment and you can write real functions, versus just being boxed into some 
version of sql and limited udfs?

 

On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang wrote:

When comparing the performance, you need to do it apple vs apple. In another 
thread, you mentioned that Hive on Spark is much slower than Spark SQL. 
However, you configured Hive such that only two tasks can run in parallel. 
However, you didn't provide information on how much Spark SQL is utilizing. 
Thus, it's hard to tell whether it's just a configuration problem in your Hive 
or Spark SQL is indeed faster. You should be able to see the resource usage in 
YARN resource manage URL.

--Xuefu

 

On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh wrote:

Thanks Jeff.

 

Obviously Hive is much more feature rich compared to Spark. Having said that in 
certain areas for example where the SQL feature is available in Spark, Spark 
seems to deliver faster.

 

This may be:

 

1.Spark does both the optimisation and execution seamlessly

2.Hive on Spark has to invoke YARN that adds another layer to the process

 

Now I did some simple tests on a 100Million rows ORC table available through 
Hive to both.

 

Spark 1.5.2 on Hive 1.2.1 Metastore

 

 

spark-sql> select * from dummy where id in (1, 5, 10);

1   0   0   63  
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
xx

5   0   4   31  
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
xx

10  99  999 188 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
xx

Time taken: 50.805 seconds, Fetched 3 row(s)

spark-sql> select * from dummy where id in (1, 5, 10);

1   0   0   63  
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
xx

5   0   4   31  
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
xx

10  99  999 188 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
xx

Time taken: 50.358 seconds, Fetched 3 row(s)

spark-sql> select * from dummy where id in (1, 5, 10);

1   0   0   63  
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
xx

5   0   4   31  
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
xx

10  99  999 188 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
xx

Time taken: 50.563 seconds, Fetched 3 row(s)

 

So three runs returning three rows just over 50 seconds

 

Hive 1.2.1 on spark 1.3.1 execution engine

 

0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 
10);

INFO  :

Query Hive on Spark job[4] stages:

INFO  : 4

INFO  :

Status: Running (Hive on Spark job[4])

INFO  : Status: Finished successfully in 82.49 seconds

+---+--+--+---+-+-++--+

| dummy.id  | dummy.clustered  | dummy.scattered  |

RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Mich Talebzadeh
Hi Jeff,

 

In below

 

"... You should be able to see the resource usage in the YARN ResourceManager
URL."

 

Just to be clear, are we talking about port 8088/cluster?

 

Dr Mich Talebzadeh

http://talebzadehmich.wordpress.com

 

From: Koert Kuipers [mailto:ko...@tresata.com] 
Sent: 03 February 2016 00:09
To: user@hive.apache.org
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

 

uuuhm with spark using Hive metastore you actually have a real programming 
environment and you can write real functions, versus just being boxed into some 
version of sql and limited udfs?

 

On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang wrote:

When comparing the performance, you need to do it apple vs apple. In another 
thread, you mentioned that Hive on Spark is much slower than Spark SQL. 
However, you configured Hive such that only two tasks can run in parallel. 
However, you didn't provide information on how much Spark SQL is utilizing. 
Thus, it's hard to tell whether it's just a configuration problem in your Hive 
or Spark SQL is indeed faster. You should be able to see the resource usage in 
YARN resource manage URL.

--Xuefu

 

On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh wrote:

Thanks Jeff.

 

Obviously Hive is much more feature rich compared to Spark. Having said that in 
certain areas for example where the SQL feature is available in Spark, Spark 
seems to deliver faster.

 

This may be:

 

1.Spark does both the optimisation and execution seamlessly

2.Hive on Spark has to invoke YARN that adds another layer to the process

 

Now I did some simple tests on a 100Million rows ORC table available through 
Hive to both.

 

Spark 1.5.2 on Hive 1.2.1 Metastore

 

 

spark-sql> select * from dummy where id in (1, 5, 10);

1   0   0   63  
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
xx

5   0   4   31  
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
xx

10  99  999 188 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
xx

Time taken: 50.805 seconds, Fetched 3 row(s)

spark-sql> select * from dummy where id in (1, 5, 10);

1   0   0   63  
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
xx

5   0   4   31  
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
xx

10  99  999 188 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
xx

Time taken: 50.358 seconds, Fetched 3 row(s)

spark-sql> select * from dummy where id in (1, 5, 10);

1   0   0   63  
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
xx

5   0   4   31  
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
xx

10  99  999 188 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
xx

Time taken: 50.563 seconds, Fetched 3 row(s)

 

So three runs returning three rows just over 50 seconds

 

Hive 1.2.1 on spark 1.3.1 execution engine

 

0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 
10);

INFO  :

Query Hive on Spark job[4] stages:

INFO  : 4

INFO  :

Status: Running (Hive on Spark job[4])

INFO  : Status: Finished successfully in 82.49 seconds


Re: GenericUDF

2016-02-02 Thread Jason Dere
- Created once when registering the function to the FunctionRegistry.

- The UDF is copied from the version in the registry during query compilation

- The query plan is serialized, then deserialized by the tasks during query 
execution, which constructs another instance of the UDF.
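
A minimal sketch of such a UDF (a hypothetical null-replacement function, not
the poster's actual class) that logs identity hash codes, so the separate
instances created by each of the three steps above become visible:

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;

public class GenericNullReplacementSketch extends GenericUDF {

  public GenericNullReplacementSketch() {
    // one line per instance: registration, compile-time copy, task-side copy
    System.err.println("constructor, instance " + System.identityHashCode(this));
  }

  @Override
  public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
    System.err.println("initialize, instance " + System.identityHashCode(this));
    return args[0];  // result has the same type as the first argument
  }

  @Override
  public Object evaluate(DeferredObject[] args) throws HiveException {
    // argument type checking omitted for brevity
    Object value = args[0].get();
    return value == null ? args[1].get() : value;  // replace NULL with the default
  }

  @Override
  public String getDisplayString(String[] children) {
    return "null_replace(" + children[0] + ", " + children[1] + ")";
  }
}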




From: Anirudh Paramshetti 
Sent: Tuesday, February 02, 2016 6:29 AM
To: user@hive.apache.org
Subject: GenericUDF

Hi,

I have written a custom UDF in Java extending the GenericUDF class. I have some
print statements in the constructor and the initialize method, so as to
understand the number of calls made to them. From what I have read about
GenericUDF, I was expecting the constructor and initialize method to be called
once per UDF instance. But what I found was that the constructor was called
three times (once while creating the temporary function and twice while using
it in the Hive query) and the initialize method was called twice (while using
it in the Hive query).

UDF output:

hive> create temporary function replace as 
'package.name.GenericNullReplacement';
Inside constructor of GenericNullReplacement

hive> select replace(column_name, 0.01) from dummy_table;
Inside constructor of GenericNullReplacement
Inside constructor of GenericNullReplacement
Inside initialize() method of GenericNullReplacement
Inside initialize() method of GenericNullReplacement
1.23
4.56
4.56
0.01
4.56
9.56

It would be great if someone could explain to me what is happening here.


Thanks and Regards,
Anirudh Paramshetti


Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Koert Kuipers
uuuhm with spark using Hive metastore you actually have a real programming
environment and you can write real functions, versus just being boxed into
some version of sql and limited udfs?

On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang  wrote:

> When comparing the performance, you need to do it apple vs apple. In
> another thread, you mentioned that Hive on Spark is much slower than Spark
> SQL. However, you configured Hive such that only two tasks can run in
> parallel. However, you didn't provide information on how much Spark SQL is
> utilizing. Thus, it's hard to tell whether it's just a configuration
> problem in your Hive or Spark SQL is indeed faster. You should be able to
> see the resource usage in YARN resource manage URL.
>
> --Xuefu
>
> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh 
> wrote:
>
>> Thanks Jeff.
>>
>>
>>
>> Obviously Hive is much more feature rich compared to Spark. Having said
>> that in certain areas for example where the SQL feature is available in
>> Spark, Spark seems to deliver faster.
>>
>>
>>
>> This may be:
>>
>>
>>
>> 1.Spark does both the optimisation and execution seamlessly
>>
>> 2.Hive on Spark has to invoke YARN that adds another layer to the
>> process
>>
>>
>>
>> Now I did some simple tests on a 100Million rows ORC table available
>> through Hive to both.
>>
>>
>>
>> *Spark 1.5.2 on Hive 1.2.1 Metastore*
>>
>>
>>
>>
>>
>> spark-sql> select * from dummy where id in (1, 5, 10);
>>
>> 1   0   0   63
>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
>> xx
>>
>> 5   0   4   31
>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
>> xx
>>
>> 10  99  999 188
>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
>> xx
>>
>> Time taken: 50.805 seconds, Fetched 3 row(s)
>>
>> spark-sql> select * from dummy where id in (1, 5, 10);
>>
>> 1   0   0   63
>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
>> xx
>>
>> 5   0   4   31
>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
>> xx
>>
>> 10  99  999 188
>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
>> xx
>>
>> Time taken: 50.358 seconds, Fetched 3 row(s)
>>
>> spark-sql> select * from dummy where id in (1, 5, 10);
>>
>> 1   0   0   63
>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
>> xx
>>
>> 5   0   4   31
>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
>> xx
>>
>> 10  99  999 188
>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
>> xx
>>
>> Time taken: 50.563 seconds, Fetched 3 row(s)
>>
>>
>>
>> So three runs returning three rows just over 50 seconds
>>
>>
>>
>> *Hive 1.2.1 on spark 1.3.1 execution engine*
>>
>>
>>
>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in
>> (1, 5, 10);
>>
>> INFO  :
>>
>> Query Hive on Spark job[4] stages:
>>
>> INFO  : 4
>>
>> INFO  :
>>
>> Status: Running (Hive on Spark job[4])
>>
>> INFO  : Status: Finished successfully in 82.49 seconds
>>
>>
>> +---+--+--+---+-+-++--+
>>
>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised
>> | dummy.random_string | dummy.small_vc  |
>> dummy.padding  |
>>
>>
>> +---+--+--+---+-+-++--+
>>
>> | 1 | 0| 0| 63|
>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |  1  |
>> xx |
>>
>> | 5 | 0| 4| 31|
>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |  5  |
>> xx |
>>
>> | 10| 99   | 999  | 188   |
>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 10  |
>> xx |
>>
>>
>> +---+--+--+---+-+-++--+
>>
>> 3 rows selected (82.66 seconds)
>>
>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in
>> (1, 5, 10);
>>
>> INFO  : Status: Finished successfully in 76.67 seconds
>>
>>
>> +---+--+--+---+-+-++--+
>>
>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised
>> | dummy.random_string | dummy.small_vc  |
>> dummy.padding 

RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Ryan Harris
https://github.com/myui/hivemall

As long as you are comfortable with Java UDFs, the sky is really the
limit... It's not for everyone, and Spark does have many advantages, but they are
two tools that can complement each other in numerous ways.

I don't know that there is necessarily a universal "better" for how to use 
spark as an execution engine (or if spark is necessarily the *best* execution 
engine for any given hive job).

The reality is that once you start factoring in the numerous tuning parameters
of the systems and jobs, there probably isn't a clear answer. For some queries,
the Catalyst optimizer may do a better job... is it going to do a better job
with ORC-based data? Less likely, IMO.

From: Koert Kuipers [mailto:ko...@tresata.com]
Sent: Tuesday, February 02, 2016 9:50 PM
To: user@hive.apache.org
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

yeah but have you ever seen someone write a real analytical program in hive?
how? where are the basic abstractions to wrap up a large amount of operations
(joins, groupby's) into a single function call? where are the tools to write
nice unit tests for that?
for example in spark i can write a DataFrame => DataFrame that internally does 
many joins, groupBys and complex operations. all unit tested and perfectly 
re-usable. and in hive? copy paste round sql queries? thats just dangerous.

On Tue, Feb 2, 2016 at 8:09 PM, Edward Capriolo 
> wrote:
Hive has numerous extension points, you are not boxed in by a long shot.


On Tuesday, February 2, 2016, Koert Kuipers 
> wrote:
uuuhm with spark using Hive metastore you actually have a real programming 
environment and you can write real functions, versus just being boxed into some 
version of sql and limited udfs?

On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang  wrote:
When comparing the performance, you need to do it apple vs apple. In another 
thread, you mentioned that Hive on Spark is much slower than Spark SQL. 
However, you configured Hive such that only two tasks can run in parallel. 
However, you didn't provide information on how much Spark SQL is utilizing. 
Thus, it's hard to tell whether it's just a configuration problem in your Hive 
or Spark SQL is indeed faster. You should be able to see the resource usage in 
YARN resource manage URL.
--Xuefu

On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh  wrote:
Thanks Jeff.

Obviously Hive is much more feature rich compared to Spark. Having said that in 
certain areas for example where the SQL feature is available in Spark, Spark 
seems to deliver faster.

This may be:


1.Spark does both the optimisation and execution seamlessly

2.Hive on Spark has to invoke YARN that adds another layer to the process

Now I did some simple tests on a 100Million rows ORC table available through 
Hive to both.

Spark 1.5.2 on Hive 1.2.1 Metastore


spark-sql> select * from dummy where id in (1, 5, 10);
1   0   0   63  
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
xx
5   0   4   31  
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
xx
10  99  999 188 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
xx
Time taken: 50.805 seconds, Fetched 3 row(s)
spark-sql> select * from dummy where id in (1, 5, 10);
1   0   0   63  
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
xx
5   0   4   31  
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
xx
10  99  999 188 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
xx
Time taken: 50.358 seconds, Fetched 3 row(s)
spark-sql> select * from dummy where id in (1, 5, 10);
1   0   0   63  
rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
xx
5   0   4   31  
vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
xx
10  99  999 188 
abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
xx
Time taken: 50.563 seconds, Fetched 3 row(s)

So three runs returning three rows just over 50 seconds

Hive 1.2.1 on spark 1.3.1 execution engine

0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 
10);
INFO  :
Query Hive on Spark job[4] stages:
INFO  : 4
INFO  :
Status: Running (Hive on Spark job[4])
INFO  : Status: Finished successfully in 82.49 seconds
+---+--+--+---+-+-++--+
| dummy.id  | dummy.clustered  | dummy.scattered  |
dummy.randomised  | dummy.random_string

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Jörn Franke
Check HiveMall

> On 03 Feb 2016, at 05:49, Koert Kuipers  wrote:
> 
> yeah but have you ever seen somewhat write a real analytical program in hive? 
> how? where are the basic abstractions to wrap up a large amount of operations 
> (joins, groupby's) into a single function call? where are the tools to write 
> nice unit test for that? 
> 
> for example in spark i can write a DataFrame => DataFrame that internally 
> does many joins, groupBys and complex operations. all unit tested and 
> perfectly re-usable. and in hive? copy paste round sql queries? thats just 
> dangerous.
> 
>> On Tue, Feb 2, 2016 at 8:09 PM, Edward Capriolo  
>> wrote:
>> Hive has numerous extension points, you are not boxed in by a long shot.
>> 
>> 
>>> On Tuesday, February 2, 2016, Koert Kuipers  wrote:
>>> uuuhm with spark using Hive metastore you actually have a real programming 
>>> environment and you can write real functions, versus just being boxed into 
>>> some version of sql and limited udfs?
>>> 
 On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang  wrote:
 When comparing the performance, you need to do it apple vs apple. In 
 another thread, you mentioned that Hive on Spark is much slower than Spark 
 SQL. However, you configured Hive such that only two tasks can run in 
 parallel. However, you didn't provide information on how much Spark SQL is 
 utilizing. Thus, it's hard to tell whether it's just a configuration 
 problem in your Hive or Spark SQL is indeed faster. You should be able to 
 see the resource usage in YARN resource manage URL.
 
 --Xuefu
 
> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh  
> wrote:
> Thanks Jeff.
> 
>  
> 
> Obviously Hive is much more feature rich compared to Spark. Having said 
> that in certain areas for example where the SQL feature is available in 
> Spark, Spark seems to deliver faster.
> 
>  
> 
> This may be:
> 
>  
> 
> 1.Spark does both the optimisation and execution seamlessly
> 
> 2.Hive on Spark has to invoke YARN that adds another layer to the 
> process
> 
>  
> 
> Now I did some simple tests on a 100Million rows ORC table available 
> through Hive to both.
> 
>  
> 
> Spark 1.5.2 on Hive 1.2.1 Metastore
> 
>  
> 
>  
> 
> spark-sql> select * from dummy where id in (1, 5, 10);
> 
> 1   0   0   63  
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
> xx
> 
> 5   0   4   31  
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
> xx
> 
> 10  99  999 188 
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
> xx
> 
> Time taken: 50.805 seconds, Fetched 3 row(s)
> 
> spark-sql> select * from dummy where id in (1, 5, 10);
> 
> 1   0   0   63  
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
> xx
> 
> 5   0   4   31  
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
> xx
> 
> 10  99  999 188 
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
> xx
> 
> Time taken: 50.358 seconds, Fetched 3 row(s)
> 
> spark-sql> select * from dummy where id in (1, 5, 10);
> 
> 1   0   0   63  
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1  
> xx
> 
> 5   0   4   31  
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5  
> xx
> 
> 10  99  999 188 
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10  
> xx
> 
> Time taken: 50.563 seconds, Fetched 3 row(s)
> 
>  
> 
> So three runs returning three rows just over 50 seconds
> 
>  
> 
> Hive 1.2.1 on spark 1.3.1 execution engine
> 
>  
> 
> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in 
> (1, 5, 10);
> 
> INFO  :
> 
> Query Hive on Spark job[4] stages:
> 
> INFO  : 4
> 
> INFO  :
> 
> Status: Running (Hive on Spark job[4])
> 
> INFO  : Status: Finished successfully in 82.49 seconds
> 
> +---+--+--+---+-+-++--+
> 
> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  |
> dummy.random_string | dummy.small_vc  |
> 

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Koert Kuipers
yes. the ability to start with sql but when needed expand into more full
blown programming languages, machine learning etc. is a huge plus. after
all this is a cluster, and just querying or extracting data to move it off
the cluster into some other analytics tool is going to be very inefficient
and defeats the purpose to some extent of having a cluster. so you want to
have a capability to do more than queries and etl. and spark is that
ticket. hive is simply not. well not for anything somewhat complex anyhow.


On Tue, Feb 2, 2016 at 8:06 PM, Mich Talebzadeh  wrote:

> Hi,
>
>
>
> Are you referring to spark-shell with Scala, Python and others?
>
>
>
>
>
> Dr Mich Talebzadeh
>
> http://talebzadehmich.wordpress.com
>
>
>
> *From:* Koert Kuipers [mailto:ko...@tresata.com]
> *Sent:* 03 February 2016 00:09
>
> *To:* user@hive.apache.org
> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>
>
>
> uuuhm with spark using Hive metastore you actually have a real
> programming environment and you can write real functions, versus just being
> boxed into some version of sql and limited udfs?
>
>
>
> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang  wrote:
>
> When comparing the performance, you need to do it apple vs apple. In
> another thread, you mentioned that Hive on Spark is much slower than Spark
> SQL. However, you configured Hive such that only two tasks can run in
> parallel. However, you didn't provide information on how much Spark SQL is
> utilizing. Thus, it's hard to tell whether it's just a configuration
> problem in your Hive or Spark SQL is indeed faster. You should be able to
> see the resource usage in YARN resource manage URL.
>
> --Xuefu
>
>
>
> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh 
> wrote:
>
> Thanks Jeff.
>
>
>
> Obviously Hive is much more feature rich compared to Spark. Having said
> that in certain areas for example where the SQL feature is available in
> Spark, Spark seems to deliver faster.
>
>
>
> This may be:
>
>
>
> 1.Spark does both the optimisation and execution seamlessly
>
> 2.Hive on Spark has to invoke YARN that adds another layer to the
> process
>
>
>
> Now I did some simple tests on a 100Million rows ORC table available
> through Hive to both.
>
>
>
> *Spark 1.5.2 on Hive 1.2.1 Metastore*
>
>
>
>
>
> spark-sql> select * from dummy where id in (1, 5, 10);
>
> 1   0   0   63
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
> xx
>
> 5   0   4   31
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
> xx
>
> 10  99  999 188
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
> xx
>
> Time taken: 50.805 seconds, Fetched 3 row(s)
>
> spark-sql> select * from dummy where id in (1, 5, 10);
>
> 1   0   0   63
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
> xx
>
> 5   0   4   31
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
> xx
>
> 10  99  999 188
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
> xx
>
> Time taken: 50.358 seconds, Fetched 3 row(s)
>
> spark-sql> select * from dummy where id in (1, 5, 10);
>
> 1   0   0   63
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
> xx
>
> 5   0   4   31
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
> xx
>
> 10  99  999 188
> 

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Koert Kuipers
yeah but have you ever seen someone write a real analytical program in
hive? how? where are the basic abstractions to wrap up a large amount of
operations (joins, groupby's) into a single function call? where are the
tools to write nice unit tests for that?

for example in spark i can write a DataFrame => DataFrame that internally
does many joins, groupBys and complex operations. all unit tested and
perfectly re-usable. and in hive? copy paste round sql queries? thats just
dangerous.
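
For instance, a sketch in Spark 1.x's Java API (a hypothetical helper of my
own; column names are borrowed from the dummy table in the benchmark above):

import org.apache.spark.sql.DataFrame;
import static org.apache.spark.sql.functions.avg;

public final class Transforms {
  private Transforms() {}

  // A reusable DataFrame => DataFrame transformation: a groupBy plus a join
  // wrapped up in one function call, usable (and unit-testable) on its own.
  public static DataFrame withGroupAvg(DataFrame df) {
    DataFrame avgs = df.groupBy("clustered")
        .agg(avg("randomised").alias("avg_randomised"));
    return df.join(avgs, "clustered");  // join each row back to its group average
  }
}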

On Tue, Feb 2, 2016 at 8:09 PM, Edward Capriolo 
wrote:

> Hive has numerous extension points, you are not boxed in by a long shot.
>
>
> On Tuesday, February 2, 2016, Koert Kuipers  wrote:
>
>> uuuhm with spark using Hive metastore you actually have a real
>> programming environment and you can write real functions, versus just being
>> boxed into some version of sql and limited udfs?
>>
>> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang  wrote:
>>
>>> When comparing the performance, you need to do it apple vs apple. In
>>> another thread, you mentioned that Hive on Spark is much slower than Spark
>>> SQL. However, you configured Hive such that only two tasks can run in
>>> parallel. However, you didn't provide information on how much Spark SQL is
>>> utilizing. Thus, it's hard to tell whether it's just a configuration
>>> problem in your Hive or Spark SQL is indeed faster. You should be able to
>>> see the resource usage in YARN resource manage URL.
>>>
>>> --Xuefu
>>>
>>> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh 
>>> wrote:
>>>
 Thanks Jeff.



 Obviously Hive is much more feature rich compared to Spark. Having said
 that in certain areas for example where the SQL feature is available in
 Spark, Spark seems to deliver faster.



 This may be:



 1.Spark does both the optimisation and execution seamlessly

 2.Hive on Spark has to invoke YARN that adds another layer to the
 process



 Now I did some simple tests on a 100Million rows ORC table available
 through Hive to both.



 *Spark 1.5.2 on Hive 1.2.1 Metastore*





 spark-sql> select * from dummy where id in (1, 5, 10);

 1   0   0   63
 rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
 xx

 5   0   4   31
 vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
 xx

 10  99  999 188
 abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
 xx

 Time taken: 50.805 seconds, Fetched 3 row(s)

 spark-sql> select * from dummy where id in (1, 5, 10);

 1   0   0   63
 rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
 xx

 5   0   4   31
 vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
 xx

 10  99  999 188
 abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
 xx

 Time taken: 50.358 seconds, Fetched 3 row(s)

 spark-sql> select * from dummy where id in (1, 5, 10);

 1   0   0   63
 rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
 xx

 5   0   4   31
 vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
 xx

 10  99  999 188
 abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
 xx

 Time taken: 50.563 seconds, Fetched 3 row(s)



 So three runs returning three rows just over 50 seconds



 *Hive 1.2.1 on spark 1.3.1 execution engine*



 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in
 (1, 5, 10);

 INFO  :

 Query Hive on Spark job[4] stages:

 INFO  : 4

 INFO  :

 Status: Running (Hive on Spark job[4])

 INFO  : Status: Finished successfully in 82.49 seconds


 +---+--+--+---+-+-++--+

 | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised
 | dummy.random_string | dummy.small_vc  |
 dummy.padding  |


 +---+--+--+---+-+-++--+

 | 1 | 0| 0| 63|
 rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |  1  |
 xx |

 | 5 | 0| 4  

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Koert Kuipers
ok i am sure there is some way to do it. i am going to guess snippets of
hive code stuck together with oozie jobs or whatever. the oozie jobs become
the re-usable pieces perhaps? now you got sql and xml, completely lacking
any benefits of a compiler to catch errors. unit tests will be slow if even
available at all. so yeah
yeah i am sure it can be made to *work*. just like you can get a nail into
a wall with a screwdriver if you really want.

On Tue, Feb 2, 2016 at 11:49 PM, Koert Kuipers  wrote:

> yeah but have you ever seen somewhat write a real analytical program in
> hive? how? where are the basic abstractions to wrap up a large amount of
> operations (joins, groupby's) into a single function call? where are the
> tools to write nice unit test for that?
>
> for example in spark i can write a DataFrame => DataFrame that internally
> does many joins, groupBys and complex operations. all unit tested and
> perfectly re-usable. and in hive? copy paste round sql queries? thats just
> dangerous.
>
> On Tue, Feb 2, 2016 at 8:09 PM, Edward Capriolo 
> wrote:
>
>> Hive has numerous extension points, you are not boxed in by a long shot.
>>
>>
>> On Tuesday, February 2, 2016, Koert Kuipers  wrote:
>>
>>> uuuhm with spark using Hive metastore you actually have a real
>>> programming environment and you can write real functions, versus just being
>>> boxed into some version of sql and limited udfs?
>>>
>>> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang  wrote:
>>>
 When comparing the performance, you need to do it apple vs apple. In
 another thread, you mentioned that Hive on Spark is much slower than Spark
 SQL. However, you configured Hive such that only two tasks can run in
 parallel. However, you didn't provide information on how much Spark SQL is
 utilizing. Thus, it's hard to tell whether it's just a configuration
 problem in your Hive or Spark SQL is indeed faster. You should be able to
 see the resource usage in YARN resource manage URL.

 --Xuefu

 On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh 
 wrote:

> Thanks Jeff.
>
>
>
> Obviously Hive is much more feature rich compared to Spark. Having
> said that in certain areas for example where the SQL feature is available
> in Spark, Spark seems to deliver faster.
>
>
>
> This may be:
>
>
>
> 1.Spark does both the optimisation and execution seamlessly
>
> 2.Hive on Spark has to invoke YARN that adds another layer to the
> process
>
>
>
> Now I did some simple tests on a 100Million rows ORC table available
> through Hive to both.
>
>
>
> *Spark 1.5.2 on Hive 1.2.1 Metastore*
>
>
>
>
>
> spark-sql> select * from dummy where id in (1, 5, 10);
>
> 1   0   0   63
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
> xx
>
> 5   0   4   31
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
> xx
>
> 10  99  999 188
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
> xx
>
> Time taken: 50.805 seconds, Fetched 3 row(s)
>
> spark-sql> select * from dummy where id in (1, 5, 10);
>
> 1   0   0   63
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
> xx
>
> 5   0   4   31
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
> xx
>
> 10  99  999 188
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
> xx
>
> Time taken: 50.358 seconds, Fetched 3 row(s)
>
> spark-sql> select * from dummy where id in (1, 5, 10);
>
> 1   0   0   63
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
> xx
>
> 5   0   4   31
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
> xx
>
> 10  99  999 188
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
> xx
>
> Time taken: 50.563 seconds, Fetched 3 row(s)
>
>
>
> So three runs returning three rows just over 50 seconds
>
>
>
> *Hive 1.2.1 on spark 1.3.1 execution engine*
>
>
>
> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in
> (1, 5, 10);
>
> INFO  :
>
> Query Hive on Spark job[4] stages:
>
> INFO  : 4
>
> INFO  :
>
> Status: Running (Hive on Spark job[4])
>
> INFO  : Status: Finished successfully in 82.49 seconds
>
>
> 

Re: GenericUDF

2016-02-02 Thread Anirudh Paramshetti
Thanks Jason for your inputs.

I believe you are talking about the number of instances created, which
explains why the constructor was called three times. But I'm still unclear
about the two calls made to the initialize method when I use the temporary
function in the query. Can you shed some more light on the call flow to the
initialize method?

Regards,
Anirudh Paramshetti



On Wed, Feb 3, 2016 at 6:08 AM, Jason Dere  wrote:

> - Created once when registering the function to the FunctionRegistry.
>
> - The UDF is copied from the version in the registry during query
> compilation
>
> - The query plan is serialized, then deserialized by the tasks during
> query execution, which constructs another instance of the UDF.
>
>
>
> --
> *From:* Anirudh Paramshetti 
> *Sent:* Tuesday, February 02, 2016 6:29 AM
> *To:* user@hive.apache.org
> *Subject:* GenericUDF
>
> Hi,
>
> I have written a custom UDF in Java extending the GenericUDF class. I have
> some print statements in the constructor and initialize method, as to
> understand the number of calls made to them. From what I have read about
> GenericUDF, I was expecting the constructor and initialize method to be
> called once per UDF instance. But what I found out was, the constructor was
> called thrice(once while creating the temporary function and twice while
> using it in the hive query) and the initialize method was called
> twice(while using it in the hive query).
>
> UDF output:
>
> hive> create temporary function replace as
> 'package.name.GenericNullReplacement';
> Inside constructor of GenericNullReplacement
>
> hive> select replace(column_name, 0.01) from dummy_table;
> Inside constructor of GenericNullReplacement
> Inside constructor of GenericNullReplacement
> Inside initialize() method of GenericNullReplacement
> Inside initialize() method of GenericNullReplacement
> 1.23
> 4.56
> 4.56
> 0.01
> 4.56
> 9.56
>
> It would be great if someone could explain me what is happening here?
>
>
> Thanks and Regards,
> Anirudh Paramshetti
>


Hive Query Timeout in hive-jdbc

2016-02-02 Thread Satya Harish Appana
Hi Team,

  I am trying to connect to HiveServer via hive-jdbc.
Can we configure a client-side timeout for each query executed inside each
JDBC connection? (When I looked at the HiveStatement.setQueryTimeout method,
it says the operation is unsupported.)
Is there any other way of timing out, cancelling the connection, and throwing
an exception if it is alive for over a period of 4 minutes or so (configurable
at the client side)?

PS: The queries that I am executing over JDBC are simple DDL statements (Hive
external table CREATE statements and DROP TABLE statements).


Regards,
Satya Harish.


GenericUDF

2016-02-02 Thread Anirudh Paramshetti
Hi,

I have written a custom UDF in Java extending the GenericUDF class. I have
some print statements in the constructor and the initialize method, so as to
understand the number of calls made to them. From what I have read about
GenericUDF, I was expecting the constructor and initialize method to be called
once per UDF instance. But what I found was that the constructor was called
three times (once while creating the temporary function and twice while using
it in the Hive query) and the initialize method was called twice (while using
it in the Hive query).

UDF output:

hive> create temporary function replace as
'package.name.GenericNullReplacement';
Inside constructor of GenericNullReplacement

hive> select replace(column_name, 0.01) from dummy_table;
Inside constructor of GenericNullReplacement
Inside constructor of GenericNullReplacement
Inside initialize() method of GenericNullReplacement
Inside initialize() method of GenericNullReplacement
1.23
4.56
4.56
0.01
4.56
9.56

It would be great if someone could explain to me what is happening here.


Thanks and Regards,
Anirudh Paramshetti


Re: Hive Query Timeout in hive-jdbc

2016-02-02 Thread 董亚军
Hive does not support a timeout on the client side.

Also, I think it is not recommended: if the client exits with a timeout
exception, the HiveServer side may still be running the job, which will
result in an inconsistent state.
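
That said, a client-side deadline can be approximated by hand. A sketch (not
a supported API; it assumes the Hive JDBC driver implements
Statement.cancel(), and, per the caveat above, the server may keep running
the job anyway):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public final class TimedQuery {
  private TimedQuery() {}

  public static void execute(Connection conn, final String sql, long timeoutSec)
      throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    final Statement stmt = conn.createStatement();
    try {
      Future<Boolean> result = pool.submit(new Callable<Boolean>() {
        public Boolean call() throws SQLException {
          return stmt.execute(sql);  // runs on the worker thread
        }
      });
      try {
        result.get(timeoutSec, TimeUnit.SECONDS);
      } catch (TimeoutException e) {
        stmt.cancel();  // best-effort cancel of the server-side operation
        throw new SQLException("query timed out after " + timeoutSec + "s", e);
      }
    } finally {
      stmt.close();
      pool.shutdownNow();
    }
  }
}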

On Tue, Feb 2, 2016 at 4:49 PM, Satya Harish Appana <
satyaharish.app...@gmail.com> wrote:

> Hi Team,
>
>   I am trying to connect to HiveServer via hive-jdbc.
> Can we configure a client-side timeout for each query executed inside each
> JDBC connection? (When I looked at the HiveStatement.setQueryTimeout method,
> it says the operation is unsupported.)
> Is there any other way of timing out, cancelling the connection, and
> throwing an Exception if it is alive for over a period of 4 minutes or so
> (configurable on the client side)?
>
> PS : Queries that I am executing over jdbc are simple ddl statements.
> (hive external table create statements and drop table statements).
>
>
> Regards,
> Satya Harish.
>


Re: Hive Query Timeout in hive-jdbc

2016-02-02 Thread Loïc Chanel
Actually, Hive doesn't support timeouts, but Tez and MapReduce do.
Therefore, you can set a timeout on those tools to kill failed queries.
Hope this helps,

Loïc

Loïc CHANEL
System & virtualization engineer
TO - XaaS Ind - Worldline (Villeurbanne, France)

2016-02-02 11:10 GMT+01:00 董亚军 :

> Hive does not support a timeout on the client side.
>
> And I think it is not recommended anyway: if the client exits with a timeout
> exception, the HiveServer side may still be running the job, which can result
> in an inconsistent state.
>


Re: Hive Query Timeout in hive-jdbc

2016-02-02 Thread Satya Harish Appana
The queries I am running over Hive JDBC are DDL statements (none of them are
SELECT or INSERT statements, which would result in an execution-engine (Tez/MR)
job being launched; all of them are CREATE EXTERNAL TABLE, DROP TABLE, and
ALTER TABLE ADD PARTITION statements).


On Tue, Feb 2, 2016 at 3:54 PM, Loïc Chanel 
wrote:

> Actually, Hive doesn't support timeouts, but Tez and MapReduce do.
> Therefore, you can set a timeout on those tools to kill failed queries.
> Hope this helps,
>
> Loïc
>
> Loïc CHANEL
> System & virtualization engineer
> TO - XaaS Ind - Worldline (Villeurbanne, France)


-- 


Regards,
Satya Harish Appana,
Software Development Engineer II,
Flipkart,Bangalore,
Ph:+91-9538797174.


Re: ORC format

2016-02-02 Thread Lefty Leverenz
Can't resist teasing Mich about this:  "Indeed one often demoralises data
taking advantages of massive parallel processing in Hive."

Surely he meant denormalizes.
Nobody would want to demoralise their data -- performance would suffer.  ;)

-- Lefty


On Mon, Feb 1, 2016 at 10:00 AM, Mich Talebzadeh 
wrote:

> Thanks Alan for this explanation. Interesting to see Primary Key in Hive.
>
>
>
>
>
> Sometimes a comparison is made between the Hive storage index concept in ORC
> and the Oracle Exadata storage index, which uses the same terminology!
>
>
>
> It is a bit of a misnomer to call Oracle Exadata indexes a “storage
> index”, since it appears that Exadata stores data blocks from tables in the
> storage index, usually when they are accessed via a full-table scan.  In
> this context the Exadata storage index is not a “real” index, in the sense
> that the storage index exists only in RAM and must be re-created from
> scratch when the Exadata server is bounced.
>
>
>
> Oracle Exadata and SAP HANA, as far as I know, force serial scans into
> hardware - with HANA, by pushing the bitmaps into the L2 cache on the
> chip - Oracle has special processors on SPARC T5 called D 
> that offload the column bit scan off the CPU onto separate specialized
> HW.  As a result, both rely on massive parallelization.
>
>
>
>
>
> The ORC storage index is neat and different from both Exadata and SAP HANA.
> The way I see ORC storage indexes:
>
>
>
> · They are a combined index and statistics.
>
> · Each index has statistics of min, max, count, and sum for each
> column in the row group of 10,000 rows.
>
> · Crucially, it has the location of the start of each row group,
> so that the query can jump straight to the beginning of the row group.
>
> · The query can do a SARG pushdown that limits which rows are
> required for the query and can avoid reading an entire file, or at least
> sections of it, which is by and large what a conventional RDBMS B-tree
> index does.
>
>
>
>
>
> Cheers,
>
>
>
> Dr Mich Talebzadeh
>
> *From:* Alan Gates [mailto:alanfga...@gmail.com]
> *Sent:* 01 February 2016 17:07
> *To:* user@hive.apache.org
> *Subject:* Re: ORC format
>
>
>
> ORC does not currently expose a primary key to the user, though we have
> talked of having it do that.  As Mich says the indexing on ORC is oriented
> towards statistics that help the optimizer plan the query.  This can be
> very important in split generation (determining which parts of the input
> will be read by which tasks) as well as on the fly input pruning (deciding
> not to read a section of the file because the stats show that no rows in
> that section will match a predicate).  Either of these can help joins.  But
> as there is not a user visible primary key there's no ability to rewrite
> the join as an index based join, which I think is what you were asking
> about in your original email.
>
> Alan.
>
>
> *Philip Lee* 
>
> February 1, 2016 at 7:27
>
> Also,
>
> when making ORC from CSV,
>
> is a key made on each column for indexing, or is a primary key made on the
> table?
>
>
>
> If keys are made on each column in a table, accessing any column in some
> functions like filtering should be faster.
>
>
>
>
>
>
> --
>
> ==
>
> *Hae Joon Lee*
>
>
>
> Now, in Germany,
>
> M.S. Candidate, Interested in Distributed System, Iterative Processing
>
> Dept. of Computer Science, Informatik 
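
To make those row-group statistics concrete, a sketch assuming Hive 1.2, where
the stride and bloom filter are ordinary ORC table properties (the values below
are illustrative):

hive> create table dummy_orc (id int, small_vc string)
    > stored as orc
    > tblproperties ('orc.row.index.stride'='10000',
    >                'orc.bloom.filter.columns'='id');
hive> -- let the reader push SARGs down into the row-group statistics:
hive> set hive.optimize.index.filter=true;
hive> -- row groups whose min/max (or bloom filter) exclude id = 5 are skipped:
hive> select count(*) from dummy_orc where id = 5;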

Re: Hive Query Timeout in hive-jdbc

2016-02-02 Thread Loïc Chanel
Then indeed Tez and MR timeout won't be any help, sorry.
I would be very interested in your solution though.
Regards,

Loïc

Loïc CHANEL
System & virtualization engineer
TO - XaaS Ind - Worldline (Villeurbanne, France)

2016-02-02 11:27 GMT+01:00 Satya Harish Appana :

> Queries I am running over Hive JDBC are ddl statements(none of the queries
> are select or insert. which will result in an execution engine(tez/mr) job
> to be launched.. all the queries are create external table .. and drop
> table .. and alter table add partitions).
>
>
> On Tue, Feb 2, 2016 at 3:54 PM, Loïc Chanel 
> wrote:
>
>> Actually, Hive doesn't support timeout, but Tez and MapReduce does.
>> Therefore, you can set a timeout on these tools to kill failed queries.
>> Hope this helps,
>>
>> Loïc
>>
>> Loïc CHANEL
>> System & virtualization engineer
>> TO - XaaS Ind - Worldline (Villeurbanne, France)
>>
>> 2016-02-02 11:10 GMT+01:00 董亚军 :
>>
>>> hive does not support timeout on the client side.
>>>
>>> and I think it is not recommended that if the client exit with timeout
>>> exception, the hiveserver side may also running the job. this will result
>>> in inconsistent state.
>>>
>>> On Tue, Feb 2, 2016 at 4:49 PM, Satya Harish Appana <
>>> satyaharish.app...@gmail.com> wrote:
>>>
 Hi Team,

   I am trying to connect to hiveServer via hive-jdbc.
 Can we configure client side timeout at each query executed inside each
 jdbc connection. (When I looked at HiveStatement.setQueryTimeout method it
 says operation unsupported).
 Is there any other way of timing out and cancelling the connection and
 throwing Exception, if it alive for over a period of 4 mins or so
 (configurable at client side).

 PS : Queries that I am executing over jdbc are simple ddl statements.
 (hive external table create statements and drop table statements).


 Regards,
 Satya Harish.

>>>
>>>
>>
>
>
> --
>
>
> Regards,
> Satya Harish Appana,
> Software Development Engineer II,
> Flipkart,Bangalore,
> Ph:+91-9538797174.
>
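
Absent driver support, one client-side workaround is to run each statement on
a worker thread and cancel it past a deadline. A minimal sketch, assuming the
hive-jdbc driver (whose HiveStatement implements Statement.cancel() even though
setQueryTimeout is unsupported); the JDBC URL and the 4-minute deadline are
placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedDdl {

  // Runs one DDL statement, giving up (and attempting a server-side cancel)
  // after the given number of seconds. Sketch only: pool reuse and retries
  // are omitted.
  static void executeWithTimeout(Connection conn, String ddl, long seconds)
      throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    try (Statement stmt = conn.createStatement()) {
      Future<?> task = pool.submit(() -> {
        stmt.execute(ddl);
        return null;
      });
      try {
        task.get(seconds, TimeUnit.SECONDS);
      } catch (TimeoutException e) {
        stmt.cancel(); // best effort; the server may still finish the job
        throw new SQLException("Timed out after " + seconds + "s: " + ddl, e);
      }
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
        "jdbc:hive2://localhost:10000/default")) { // placeholder URL
      executeWithTimeout(conn, "DROP TABLE IF EXISTS tmp_foo", 240);
    }
  }
}

Note that the caveat raised earlier in the thread still applies: cancellation
is advisory, so the DDL may complete on the server even after the client has
given up.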


How to run multiple queries from one tool

2016-02-02 Thread Riesland, Zack
I'm sure this is a total rookie question, but I'm months into using Hive and it 
hasn't become obvious to me yet:

When I use a tool like Aqua Data Studio and point it at a MSSQL Server 
database, I can run multiple queries, separated by a semicolon character ';'

So:

select blah from blah where criteria = 1;

select blah from blah where criteria = 2;

I can click 'execute' or 'ctrl + e' and both queries fire.

But in Hive, I can't do this. Even one query will fail if it has a semicolon 
character.

Is there a different delimiter, or is this just a hard limitation of the 
drivers?

Thanks!
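
(The hive-jdbc driver takes exactly one statement per execute() call, so a
client tool has to do the splitting itself. A naive sketch, assuming semicolons
never appear inside string literals or comments -- a real tool needs a proper
tokenizer; Beeline's -e/-f options do this splitting for you:)

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RunScript {
  public static void main(String[] args) throws Exception {
    String script =
        "select blah from blah where criteria = 1;"
      + "select blah from blah where criteria = 2";
    // placeholder URL; one statement per execute() call
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default");
         Statement stmt = conn.createStatement()) {
      for (String sql : script.split(";")) {
        if (!sql.trim().isEmpty()) {
          stmt.execute(sql.trim());
        }
      }
    }
  }
}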




Re: Hive table over S3 bucket with s3a

2016-02-02 Thread Terry Siu
Yeah, that’s what I thought. I found this: 
https://issues.apache.org/jira/browse/HADOOP-3733. Posted a couple of questions 
there, but prior to that, the last comment was over a year ago. Thanks for the 
response!

-Terry

From: Elliot West
Reply-To: "user@hive.apache.org"
Date: Tuesday, February 2, 2016 at 7:57 AM
To: "user@hive.apache.org"
Subject: Re: Hive table over S3 bucket with s3a
Subject: Re: Hive table over S3 bucket with s3a

When I last looked at this it was recommended to simply regenerate the key as 
you suggest.

On 2 February 2016 at 15:52, Terry Siu wrote:
Hi,

I’m wondering if anyone has found a workaround for defining a Hive table over an 
S3 bucket when the secret access key has ‘/‘ characters in it. I’m using Hive 
0.14 in HDP 2.2.4 and the statement that I used is:


CREATE EXTERNAL TABLE IF NOT EXISTS s3_foo (

  key INT, value STRING

)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

LOCATION 's3a://:@/';


The following error is returned:


FAILED: IllegalArgumentException The bucketName parameter must be specified.


A workaround was to set the fs.s3a.access.key and fs.s3a.secret.key 
configuration and then change the location URL to be s3a:///. 
However, this produces the following error:


FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.DDLTask. 
MetaException(message:com.amazonaws.AmazonClientException: Unable to load AWS 
credentials from any provider in the chain)


Has anyone found a way to create a Hive-over-S3 table when the key contains ‘/‘ 
characters, or is it just standard practice to regenerate the keys until IAM 
returns one that doesn’t have the offending characters?


Thanks,

-Terry
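
The usual way around both errors is to keep the credentials out of the LOCATION
URI entirely and define them once in core-site.xml, where a ‘/‘ in the secret
is harmless because the value is never URL-parsed. A sketch with placeholder
values (the property names are the standard s3a ones; whether HDP 2.2.4
propagates session-level SET for these is exactly the open question above):

<!-- core-site.xml on every node running HiveServer2, the metastore and the
     execution engine -->
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_KEY</value>
</property>

The table LOCATION then becomes simply 's3a://bucket/path', with no credentials
to escape.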



RE: ORC format

2016-02-02 Thread Mich Talebzadeh
You are welcome Phil

 

Dr Mich Talebzadeh

 


 

From: Philip Lee [mailto:philjj...@gmail.com] 
Sent: 02 February 2016 16:10
To: user@hive.apache.org
Subject: Re: ORC format

 

I really appreicate what you told me through this emailing-list.

 

Best,

Phil

 

On Tue, Feb 2, 2016 at 12:16 PM, Mich Talebzadeh  > wrote:

Correct :). 

 

Lord knows how these spell checkers work sometimes! Perish the thought of 
demoralising the data.

 

 

Regards,

 

 

Dr Mich Talebzadeh

 


RE: ORC format

2016-02-02 Thread Mich Talebzadeh
Correct :). 

 

Lord knows how these spell checkers work sometimes! Perish the thought of 
demoralising the data.

 

 

Regards,

 

 

Dr Mich Talebzadeh

 


 

From: Lefty Leverenz [mailto:leftylever...@gmail.com] 
Sent: 02 February 2016 10:26
To: user@hive.apache.org
Subject: Re: ORC format

 

Can't resist teasing Mich about this:  "Indeed one often demoralises data 
taking advantages of massive parallel processing in Hive."

 

Surely he meant denormalizes.
Nobody would want to demoralise their data -- performance would suffer.  ;)




-- Lefty

 

 


Hive table over S3 bucket with s3a

2016-02-02 Thread Terry Siu
Hi,

I’m wondering if anyone has found a workaround for defining a Hive table over an 
S3 bucket when the secret access key has ‘/‘ characters in it. I’m using Hive 
0.14 in HDP 2.2.4 and the statement that I used is:


CREATE EXTERNAL TABLE IF NOT EXISTS s3_foo (

  key INT, value STRING

)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

LOCATION 's3a://:@/';


The following error is returned:


FAILED: IllegalArgumentException The bucketName parameter must be specified.


A workaround was to set the fs.s3a.access.key and fs.s3a.secret.key 
configuration and then change the location URL to be s3a:///. 
However, this produces the following error:


FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.DDLTask. 
MetaException(message:com.amazonaws.AmazonClientException: Unable to load AWS 
credentials from any provider in the chain)


Has anyone found a way to create a Hive-over-S3 table when the key contains ‘/‘ 
characters, or is it just standard practice to regenerate the keys until IAM 
returns one that doesn’t have the offending characters?


Thanks,

-Terry


Re: ORC format

2016-02-02 Thread Philip Lee
I really appreciate what you told me through this mailing list.

Best,
Phil

On Tue, Feb 2, 2016 at 12:16 PM, Mich Talebzadeh 
wrote:

> Correct :).
>
>
>
> Lord knows how these spell checkers work sometimes! Perish the thought of
> demoralising the data.
>
>
>
>
>
> Regards,
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
>
> *From:* Lefty Leverenz [mailto:leftylever...@gmail.com]
> *Sent:* 02 February 2016 10:26
>
> *To:* user@hive.apache.org
> *Subject:* Re: ORC format
>
>
>
> Can't resist teasing Mich about this:  "Indeed one often demoralises data
> taking advantages of massive parallel processing in Hive."
>
>
>
> Surely he meant denormalizes.  Nobody would want to
> demoralise their data -- performance would suffer.  ;)
>
>
> -- Lefty
>
>
>
>
>

Re: Hive table over S3 bucket with s3a

2016-02-02 Thread Elliot West
When I last looked at this it was recommended to simply regenerate the key
as you suggest.

On 2 February 2016 at 15:52, Terry Siu  wrote:

> Hi,
>
> I’m wondering if anyone has found a workaround for defining a Hive table
> over an S3 bucket when the secret access key has ‘/‘ characters in it. I’m
> using Hive 0.14 in HDP 2.2.4 and the statement that I used is:
>
>
> CREATE EXTERNAL TABLE IF NOT EXISTS s3_foo (
>
>   key INT, value STRING
>
> )
>
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
>
> LOCATION 's3a://:@/';
>
>
> The following error is returned:
>
>
> FAILED: IllegalArgumentException The bucketName parameter must be
> specified.
>
>
> A workaround was to set the fs.s3a.access.key and fs.s3a.secret.key
> configuration and then change the location URL to be
> s3a:///. However, this produces the following error:
>
>
> FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.DDLTask.
> MetaException(message:com.amazonaws.AmazonClientException: Unable to load
> AWS credentials from any provider in the chain)
>
>
> Has anyone found a way to create a Hive-over-S3 table when the key
> contains ‘/‘ characters, or is it just standard practice to regenerate
> the keys until IAM returns one that doesn’t have the offending characters?
>
>
> Thanks,
>
> -Terry
>


Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Edward Capriolo
Hive has numerous extension points; you are not boxed in by a long shot.

On Tuesday, February 2, 2016, Koert Kuipers  wrote:

> uuuhm with spark using Hive metastore you actually have a real
> programming environment and you can write real functions, versus just being
> boxed into some version of sql and limited udfs?
>
> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang  > wrote:
>
>> When comparing the performance, you need to do it apples to apples. In
>> another thread, you mentioned that Hive on Spark is much slower than Spark
>> SQL. However, you configured Hive such that only two tasks can run in
>> parallel, and you didn't provide information on how much Spark SQL is
>> utilizing. Thus, it's hard to tell whether it's just a configuration
>> problem in your Hive or whether Spark SQL is indeed faster. You should be
>> able to see the resource usage in the YARN resource manager URL.
>>
>> --Xuefu
>>
>> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh > > wrote:
>>
>>> Thanks Jeff.
>>>
>>>
>>>
>>> Obviously Hive is much more feature-rich compared to Spark. Having said
>>> that, in certain areas, for example where the SQL feature is available in
>>> Spark, Spark seems to deliver results faster.
>>>
>>>
>>>
>>> This may be:
>>>
>>>
>>>
>>> 1. Spark does both the optimisation and the execution seamlessly
>>>
>>> 2. Hive on Spark has to invoke YARN, which adds another layer to the
>>> process
>>>
>>>
>>>
>>> Now I did some simple tests on a 100Million rows ORC table available
>>> through Hive to both.
>>>
>>>
>>>
>>> *Spark 1.5.2 on Hive 1.2.1 Metastore*
>>>
>>>
>>>
>>>
>>>
>>> spark-sql> select * from dummy where id in (1, 5, 10);
>>>
>>> 1   0   0   63
>>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
>>> xx
>>>
>>> 5   0   4   31
>>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
>>> xx
>>>
>>> 10  99  999 188
>>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
>>> xx
>>>
>>> Time taken: 50.805 seconds, Fetched 3 row(s)
>>>
>>> spark-sql> select * from dummy where id in (1, 5, 10);
>>>
>>> 1   0   0   63
>>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
>>> xx
>>>
>>> 5   0   4   31
>>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
>>> xx
>>>
>>> 10  99  999 188
>>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
>>> xx
>>>
>>> Time taken: 50.358 seconds, Fetched 3 row(s)
>>>
>>> spark-sql> select * from dummy where id in (1, 5, 10);
>>>
>>> 1   0   0   63
>>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
>>> xx
>>>
>>> 5   0   4   31
>>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
>>> xx
>>>
>>> 10  99  999 188
>>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
>>> xx
>>>
>>> Time taken: 50.563 seconds, Fetched 3 row(s)
>>>
>>>
>>>
>>> So three runs returning three rows just over 50 seconds
>>>
>>>
>>>
>>> *Hive 1.2.1 on spark 1.3.1 execution engine*
>>>
>>>
>>>
>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in
>>> (1, 5, 10);
>>>
>>> INFO  :
>>>
>>> Query Hive on Spark job[4] stages:
>>>
>>> INFO  : 4
>>>
>>> INFO  :
>>>
>>> Status: Running (Hive on Spark job[4])
>>>
>>> INFO  : Status: Finished successfully in 82.49 seconds
>>>
>>>
>>> +---+--+--+---+-+-++--+
>>>
>>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised
>>> | dummy.random_string | dummy.small_vc  |
>>> dummy.padding  |
>>>
>>>
>>> +---+--+--+---+-+-++--+
>>>
>>> | 1 | 0| 0| 63|
>>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |  1  |
>>> xx |
>>>
>>> | 5 | 0| 4| 31|
>>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |  5  |
>>> xx |
>>>
>>> | 10| 99   | 999  | 188   |
>>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 10  |
>>> xx |
>>>
>>>
>>> +---+--+--+---+-+-++--+
>>>
>>> 3 rows selected (82.66 seconds)
>>>
>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in
>>> (1, 
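
On the parallelism point: with Hive on Spark the executor fleet is configured
from the Hive session, so two concurrently running tasks usually just means the
defaults were left in place. A sketch of the relevant knobs with illustrative
values (dynamic allocation additionally needs the Spark shuffle service on the
NodeManagers):

hive> set hive.execution.engine=spark;
hive> -- 8 executors x 4 cores = 32 concurrent task slots:
hive> set spark.executor.instances=8;
hive> set spark.executor.cores=4;
hive> set spark.executor.memory=4g;
hive> -- or let Spark size the fleet itself:
hive> set spark.dynamicAllocation.enabled=true;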

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Xuefu Zhang
Yes, regardless what spark mode you're running in, from Spark AM webui, you
should be able to see how many task are concurrently running. I'm a little
surprised to see that your Hive configuration only allows 2 map tasks to
run in parallel. If your cluster has the capacity, you should parallelize
all the tasks to achieve optimal performance. Since I don't know your Spark
SQL configuration, I cannot tell how much parallelism you have over there.
Thus, I'm not sure if your comparison is valid.

--Xuefu

On Tue, Feb 2, 2016 at 5:08 PM, Mich Talebzadeh  wrote:

> Hi Jeff,
>
>
>
> In below
>
>
>
> …. You should be able to see the resource usage in the YARN resource
> manager URL.
>
>
>
> Just to be clear we are talking about Port 8088/cluster?
>
>
>
> Dr Mich Talebzadeh
>
>
>
>
>
>
> *From:* Koert Kuipers [mailto:ko...@tresata.com]
> *Sent:* 03 February 2016 00:09
> *To:* user@hive.apache.org
> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>
>
>
> uuuhm with spark using Hive metastore you actually have a real
> programming environment and you can write real functions, versus just being
> boxed into some version of sql and limited udfs?
>
>
>
> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang  wrote:
>
> When comparing the performance, you need to do it apples to apples. In
> another thread, you mentioned that Hive on Spark is much slower than Spark
> SQL. However, you configured Hive such that only two tasks can run in
> parallel, and you didn't provide information on how much Spark SQL is
> utilizing. Thus, it's hard to tell whether it's just a configuration
> problem in your Hive or whether Spark SQL is indeed faster. You should be
> able to see the resource usage in the YARN resource manager URL.
>
> --Xuefu
>
>
>
> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh 
> wrote:
>
> Thanks Jeff.
>
>
>
> Obviously Hive is much more feature-rich compared to Spark. Having said
> that, in certain areas, for example where the SQL feature is available in
> Spark, Spark seems to deliver results faster.
>
>
>
> This may be:
>
>
>
> 1. Spark does both the optimisation and the execution seamlessly
>
> 2. Hive on Spark has to invoke YARN, which adds another layer to the
> process
>
>
>
> Now I did some simple tests on a 100Million rows ORC table available
> through Hive to both.
>
>
>
> *Spark 1.5.2 on Hive 1.2.1 Metastore*
>
>
>
>
>
> spark-sql> select * from dummy where id in (1, 5, 10);
>
> 1   0   0   63
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
> xx
>
> 5   0   4   31
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
> xx
>
> 10  99  999 188
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
> xx
>
> Time taken: 50.805 seconds, Fetched 3 row(s)
>
> spark-sql> select * from dummy where id in (1, 5, 10);
>
> 1   0   0   63
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
> xx
>
> 5   0   4   31
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
> xx
>
> 10  99  999 188
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
> xx
>
> Time taken: 50.358 seconds, Fetched 3 row(s)
>
> spark-sql> select * from dummy where id in (1, 5, 10);
>
> 1   0   0   63
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
> xx
>
> 5   0   4   31
>