Re: Hive on TEZ fails starting

2016-01-06 Thread Rajesh Balamohan
Is the job starting and getting stuck in the reducer as you mentioned in the
initial mail, or is the job itself not starting?

~Rajesh.B


RE: Hive on TEZ fails starting

2016-01-06 Thread Mich Talebzadeh
Not starting at all!

 

Dr Mich Talebzadeh

 


 

From: Rajesh Balamohan [mailto:rajesh.balamo...@gmail.com] 
Sent: 06 January 2016 09:18
To: user@hive.apache.org
Subject: Re: Hive on TEZ fails starting

 

Is the job starting and getting stuck in the reducer as you mentioned in the
initial mail, or is the job itself not starting?

 

~Rajesh.B

 

RE: Hive on TEZ fails starting

2016-01-06 Thread Mich Talebzadeh
Apologies, it starts OK and fails at the reducer!

 



VERTICES   STATUS   TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

Map 1      RUNNING      1          0        1        0       3       0
Reducer 2  INITED       1          0        0        1       0       0

App: application_1452068056412_0006  dag: dag_1452068056412_0006_1



16/01/06 15:53:53 [main]: DEBUG rpc.DAGClientRPCImpl: GetVertexStatus via AM 
for app: application_1452068056412_0006 dag: dag_1452068056412_0006_1 vertex: 
Map 1

16/01/06 15:53:53 [IPC Parameter Sending Thread #0]: DEBUG ipc.Client: IPC 
Client (1353540671) connection to rhes564/50.140.197.217:56764 from hduser 
sending #152

16/01/06 15:53:53 [IPC Client (1353540671) connection to 
rhes564/50.140.197.217:56764 from hduser]: DEBUG ipc.Client: IPC Client 
(1353540671) connection to rhes564/50.140.197.217:56764 from hduser got value 
#152

16/01/06 15:53:53 [main]: DEBUG ipc.ProtobufRpcEngine: Call: getVertexStatus 
took 10ms

16/01/06 15:53:53 [IPC Client (1353540671) connection to 
rhes564/50.140.197.217:56764 from hduser]: DEBUG ipc.Client: IPC Client 
(1353540671) connection to rhes564/50.140.197.217:56764 from hduser got value 
#151

16/01/06 15:53:53 [main]: DEBUG rpc.DAGClientRPCImpl: GetVertexStatus via AM 
for app: application_1452068056412_0006 dag: dag_1452068056412_0006_1 vertex: 
Reducer 2

16/01/06 15:53:53 [IPC Parameter Sending Thread #0]: DEBUG ipc.Client: IPC 
Client (1353540671) connection to rhes564/50.140.197.217:56764 from hduser 
sending #153

16/01/06 15:53:53 [IPC Client (1353540671) connection to 
rhes564/50.140.197.217:56764 from hduser]: DEBUG ipc.Client: IPC Client 
(1353540671) connection to rhes564/50.140.197.217:56764 from hduser got value 
#153

16/01/06 15:53:53 [main]: DEBUG ipc.ProtobufRpcEngine: Call: getVertexStatus 
took 2ms

Map 1      FAILED       1          0        0        1       4       0

Reducer 2  KILLED       1          0        0        1       0       0



VERTICES: 00/02  [>>--] 0%   ELAPSED TIME: 13.31 s



16/01/06 15:53:53 [main]: INFO SessionState: Map 1: 0(+0,-4)/1  Reducer 2: 0/1

Status: Failed

16/01/06 15:53:53 [main]: ERROR SessionState: Status: Failed

Vertex failed, vertexName=Map 1, vertexId=vertex_1452068056412_0006_1_00, 
diagnostics=[Task failed, taskId=task_1452068056412_0006_1_00_00, 
diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running 
task:java.lang.RuntimeException: java.lang.UnsatisfiedLinkError: 
org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z

        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:157)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
Caused by: java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
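
[Editor's note: the UnsatisfiedLinkError means the Tez AM and task JVMs cannot
load the native snappy library, even if the client shell can. A sketch of the
usual check and re-run, based on the tez.*.launch.env suggestion from earlier
in the thread; $HADOOP_COMMON_HOME is assumed to point at the Hadoop install,
and hadoop checknative (available on Hadoop 2.x) reports which native codecs
actually load:

  hadoop checknative -a
  beeline --hiveconf tez.am.launch.env="LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_COMMON_HOME/lib/native" \
          --hiveconf tez.task.launch.env="LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_COMMON_HOME/lib/native"
]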

 

Dr Mich Talebzadeh

 


Re: Indexes in Hive

2016-01-06 Thread Alan Gates
The issue with this is that HDFS lacks the ability to co-locate blocks.  
So if you break your columns into one file per column (the more 
traditional column route) you end up in a situation where 2/3 of the 
time only one of your columns is being locally read, which results in a 
significant performance penalty.  That's why ORC and Parquet and RCFile 
all use one file for their "columnar" stores.
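
[Editor's note: to make that concrete, a hedged HiveQL sketch with invented
table and column names — every column of a row group lives in stripes of the
same ORC file, so one local block read serves all projected columns:

  CREATE TABLE sales_orc (cust_id INT, prod_id INT, amount DOUBLE)
  STORED AS ORC
  TBLPROPERTIES ('orc.compress'='SNAPPY');  -- columns stored as compressed streams within one file
]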


Alan.


Mich Talebzadeh 
January 5, 2016 at 22:24
Hi,

Thinking loudly.

Ideally we should consider a totally columnar storage offering in which each
column of a table is stored as a compressed value (I disregard for now how
ORC actually does this, but obviously it is not exactly columnar storage).

So each table can be considered as a loose federation of columnar storage,
and each column is effectively an index?

As columns are far narrower than tables, each index block will be of much
higher density, and all operations like aggregates can be done directly on
the index rather than the table.

This type of table offering will be in the true nature of data warehouse
storage. Of course row operations (get me all rows for this table) will be
slower, but that is the trade-off that we need to consider.

Expecting users to write their own IndexHandler may be technically
interesting but commercially not viable, as Hive needs to be a product on
its own merit, not a development base. Writing your own storage attributes
etc. requires skills that will put off people seeing Hive as an attractive
proposition (requiring considerable investment in skill sets in order to
maintain Hive).

Thus my thinking on this is to offer true columnar storage in Hive to be a
proper data warehouse. In addition, the development tools can be made
available for those interested in tailoring their own specific Hive
solutions.


HTH



Dr Mich Talebzadeh



-----Original Message-----
From: Gopal Vijayaraghavan [mailto:go...@hortonworks.com] On Behalf Of Gopal Vijayaraghavan
Sent: 05 January 2016 23:55
To: user@hive.apache.org
Subject: Re: Is Hive Index officially not recommended?


now?

The builtin indexes - those that write data as smaller tables - are only
useful in a pre-columnar world, where the indexes offer a huge reduction in
IO.
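
[Editor's note: for reference, this is the built-in route being described; a
sketch with invented table/column names — the compact index materializes a
smaller side table that must be rebuilt explicitly:

  CREATE INDEX sales_cust_idx ON TABLE sales (cust_id)
  AS 'COMPACT' WITH DEFERRED REBUILD;
  ALTER INDEX sales_cust_idx ON sales REBUILD;
]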

Part #1 of using hive indexes effectively is to write your own
HiveIndexHandler, with usesIndexTable=false;

And then write an IndexPredicateAnalyzer, which lets you map arbitrary
lookups into other range conditions.

Not coincidentally - we're adding a "ANALYZE TABLE ... CACHE METADATA"
which consolidates the "internal" index into an external store (HBase).

Some of the index data now lives in the HBase metastore, so that the
inclusion/exclusion of whole partitions can be done off the consolidated
index.

https://issues.apache.org/jira/browse/HIVE-11676


The experience from BI workloads run by customers is that in general, the
lookup to the right "slice" of data is more of a problem than the actual
aggregate.

And that for a workhorse data warehouse, this has to survive even if there's
a non-stop stream of updates into it.

Cheers,
Gopal



Re: Indexes in Hive

2016-01-06 Thread Jörn Franke
I am not sure how much performance one could gain in comparison to ORC or
Parquet. They work pretty well once you know how to use them. However,
there are still ways to optimize them. For instance, sorting of data is a
key factor for these formats to be efficient. Nevertheless, if you have a
lot of columns then sorting each column individually does not make sense.
Here one could explore a sorting algorithm that, for instance, identifies
certain groups of values that are often queried together and co-locates
them. Alternatively, you can create for each row a hash sum over often
queried columns and do pruning only based on this hash sum (= one column).
This can already be done in Hive. Another alternative is to create
redundant tables, each of them sorted differently. This may be
implemented in Hive automatically depending on query patterns.
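
[Editor's note: a sketch of that hash-sum idea in HiveQL; table and column
names are invented, and hash() is Hive's built-in hash UDF. Sorting by the
hash lets ORC's min/max stripe statistics prune on one narrow column:

  CREATE TABLE sales_hashed STORED AS ORC AS
  SELECT s.*, hash(cust_id, prod_id) AS probe_hash
  FROM sales s
  SORT BY probe_hash;                 -- cluster rows into narrow hash ranges

  SELECT *
  FROM sales_hashed
  WHERE probe_hash = hash(1234, 567)  -- prune stripes on the single hash column
    AND cust_id = 1234 AND prod_id = 567;
]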

I think there is sometimes also a wrong perception of what is possible with
Big Data. If you query a petabyte of data and the query processes the whole
amount, then you need a lot of nodes or must simply live with the fact that
it takes longer. However, in many cases users query more data than they
need. There are also many cases where they could just work with samples
from the table to define their models and later let them evaluate over the
whole set of data overnight. Additionally, there is usually not one user,
but many.

Given this, column-orientation does not solve everything anyway. It is just
one little part of the big picture. For example, graph structures, such as
those provided by TitanDB with interactive Gremlin queries, can execute
certain scenarios much faster than a column store.
Interactive in-memory technologies, such as Apache Ignite, can speed up
Hive or even Spark if you have a lot of users or processes that share data.
I think TEZ+LLAP also show some interesting features for Hive related to
this.


In my blog you can find some discussion on how to optimize for big data
technologies in general.

RE: Hive on TEZ fails starting

2016-01-06 Thread Mich Talebzadeh
Hi,

 

Thanks for your help. I downloaded and installed the snappy libraries, as
they were missing.

 

Setting the Hive execution engine to tez and running a simple query, Hive gets stuck:

 

> set hive.execution.engine=tez;

16/01/06 08:20:22 [main]: DEBUG parse.VariableSubstitution: Substitution is on: 
tez

 

 

 

In debug mode I get

 

16/01/06 08:22:07 [IPC Client (2116259755) connection to 
localhost/127.0.0.1:8032 from hduser]: DEBUG ipc.Client: IPC Client 
(2116259755) connection to localhost/127.0.0.1:8032 from hduser got value #188

16/01/06 08:22:07 [main]: DEBUG ipc.ProtobufRpcEngine: Call: 
getApplicationReport took 1ms

16/01/06 08:22:07 [LeaseRenewer:hduser@rhes564:9000]: DEBUG hdfs.LeaseRenewer: 
Lease renewer daemon for [] with renew id 1 executed

 

 

It does not progress. Sounds like the same issue I saw with Spark.

 

Can I do something like this?

 

set tez.master=yarn-client;

 

 

Thanks

 

 

Dr Mich Talebzadeh

 


 

From: Rajesh Balamohan [mailto:rajesh.balamo...@gmail.com] 
Sent: 05 January 2016 11:46
To: user@hive.apache.org
Subject: Re: Hive on TEZ fails starting

 

Try:

  beeline --hiveconf tez.task.launch.env="LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_COMMON_HOME/lib/native" \
          --hiveconf tez.am.launch.env="LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_COMMON_HOME/lib/native"

Please check that you have the lib*.so files available in the native folder
(or point it to the folder which contains the .so files).
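
[Editor's note: a quick way to verify that (a sketch; $HADOOP_COMMON_HOME is
assumed to point at the Hadoop install):

  ls $HADOOP_COMMON_HOME/lib/native/libsnappy*.so*
]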

 

~Rajesh.B

 

 

On Tue, Jan 5, 2016 at 4:00 PM, Mich Talebzadeh wrote:

Hi,

 

I have added the following to the LD_LIBRARY_PATH and JAVA_LIBRARY_PATH

 

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_COMMON_HOME/lib/native

export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:$HADOOP_COMMON_HOME/lib/native

 

Trying to use TEZ, I still get the same error

 

0: jdbc:hive2://rhes564:10010/default> set hive.execution.engine=tez;
No rows affected (0.002 seconds)
0: jdbc:hive2://rhes564:10010/default> use oraclehadoop;
No rows affected (0.019 seconds)
0: jdbc:hive2://rhes564:10010/default> select count(1) from sales;
INFO  : Tez session hasn't been created yet. Opening session
INFO  :
INFO  : Status: Running (Executing on YARN cluster with App id application_1451986680090_0002)
INFO  : Map 1: -/-  Reducer 2: 0/1
INFO  : Map 1: 0/1  Reducer 2: 0/1
INFO  : Map 1: 0(+1)/1  Reducer 2: 0/1
INFO  : Map 1: 0(+1,-1)/1   Reducer 2: 0/1
INFO  : Map 1: 0(+1,-1)/1   Reducer 2: 0/1
INFO  : Map 1: 0(+1,-2)/1   Reducer 2: 0/1
INFO  : Map 1: 0(+1,-2)/1   Reducer 2: 0/1
INFO  : Map 1: 0(+1,-3)/1   Reducer 2: 0/1
INFO  : Map 1: 0(+1,-3)/1   Reducer 2: 0/1

ERROR : Vertex failed, vertexName=Map 1, vertexId=vertex_1451986680090_0002_1_00, diagnostics=[Task failed, taskId=task_1451986680090_0002_1_00_00, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z

 

 

 

Dr Mich Talebzadeh

 



last_modified_time and transient_lastDdlTime - what is transient_lastDdlTime for.

2016-01-06 Thread Ophir Etzion
I want to know, for each of my tables, the last time it was modified. Some of
my tables don't have last_modified_time in the table parameters, but all
have transient_lastDdlTime.
transient_lastDdlTime seems to be the same as last_modified_time in some of
the tables I randomly checked.

What is the time in transient_lastDdlTime? If it is also the modified time,
why is there also last_modified_time?

Thanks,
Ophir


RE: last_modified_time and transient_lastDdlTime - what is transient_lastDdlTime for.

2016-01-06 Thread Mich Talebzadeh
 

When the table is created, it is the timestamp of table creation. When any
DDL is run against the table, it is updated to the last DDL time, it seems:

 

0: jdbc:hive2://rhes564:10010/default> create table test (col1 int, col2 string);
No rows affected (0.168 seconds)
0: jdbc:hive2://rhes564:10010/default> show create table test;
+-------------------------------------------------------------------+--+
|                          createtab_stmt                           |
+-------------------------------------------------------------------+--+
| CREATE TABLE `test`(                                              |
|   `col1` int,                                                     |
|   `col2` string)                                                  |
| ROW FORMAT SERDE                                                  |
|   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'            |
| STORED AS INPUTFORMAT                                             |
|   'org.apache.hadoop.mapred.TextInputFormat'                      |
| OUTPUTFORMAT                                                      |
|   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'    |
| LOCATION                                                          |
|   'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/test'  |
| TBLPROPERTIES (                                                   |
|   'transient_lastDdlTime'='1452120468')                           |
+-------------------------------------------------------------------+--+
13 rows selected (0.06 seconds)

0: jdbc:hive2://rhes564:10010/default> insert into test values(1,'a');
INFO  : Table oraclehadoop.test stats: [numFiles=1, numRows=1, totalSize=4, rawDataSize=3]
No rows affected (1.517 seconds)

0: jdbc:hive2://rhes564:10010/default> show create table test;
+-------------------------------------------------------------------+--+
|                          createtab_stmt                           |
+-------------------------------------------------------------------+--+
| CREATE TABLE `test`(                                              |
|   `col1` int,                                                     |
|   `col2` string)                                                  |
| ROW FORMAT SERDE                                                  |
|   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'            |
| STORED AS INPUTFORMAT                                             |
|   'org.apache.hadoop.mapred.TextInputFormat'                      |
| OUTPUTFORMAT                                                      |
|   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'    |
| LOCATION                                                          |
|   'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/test'  |
| TBLPROPERTIES (                                                   |
|   'COLUMN_STATS_ACCURATE'='true',                                 |
|   'numFiles'='1',                                                 |
|   'numRows'='1',                                                  |
|   'rawDataSize'='3',                                              |
|   'totalSize'='4',                                                |
|   'transient_lastDdlTime'='1452120510')                           |
+-------------------------------------------------------------------+--+

0: jdbc:hive2://rhes564:10010/default> select cast(from_unixtime(1452120468) AS timestamp);
+------------------------+--+
|          _c0           |
+------------------------+--+
| 2016-01-06 22:47:48.0  |
+------------------------+--+

0: jdbc:hive2://rhes564:10010/default> select cast(from_unixtime(1452120510) AS timestamp);
+------------------------+--+
|          _c0           |
+------------------------+--+
| 2016-01-06 22:48:30.0  |
+------------------------+--+
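
[Editor's note: if the goal is to list the last-DDL time for every table at
once, one option is to read it from the metastore database directly. A sketch
only, assuming a MySQL-backed metastore with the stock TBLS/TABLE_PARAMS
schema:

  SELECT t.TBL_NAME,
         FROM_UNIXTIME(CAST(p.PARAM_VALUE AS UNSIGNED)) AS last_ddl_time
  FROM   TBLS t
  JOIN   TABLE_PARAMS p ON p.TBL_ID = t.TBL_ID
  WHERE  p.PARAM_KEY = 'transient_lastDdlTime';  -- run against the metastore DB, not Hive
]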

 

 

Dr Mich Talebzadeh

 


RE: Indexes in Hive

2016-01-06 Thread Mich Talebzadeh
Thanks guys

 

A typical columnar database stores data by breaking the rows of a table into
individual columns and storing the successive values in an indexed and
compressed form in data blocks. The nth row of the table can be
reconstituted by taking the nth element from each column heap.

 

So data is broken into individual columns. Every column is stored as an
index, the type varying based on the native data type and cardinality (the
number of distinct values) of the underlying column. Further, since each
column occupies its own data blocks, those blocks can be compressed, again
based on the data type and index it is stored in. The Row ID (a block number
and offset) threads all of the bits of data that comprise a row together
without having to maintain any physical co-location at all. That is very
important.

 

The above essentially means that data blocks for each column have to be
contiguous. This may be challenging in HDFS, because by definition a
distributed file system like HDFS cannot maintain that strict ordering of
blocks. However, can this be achieved without compromising the redundancy?
Maybe the location of these contiguous blocks can be maintained in the
NameNode in some efficient way. If the optimiser becomes aware of this
storage ordering then column operations should be very efficient.
Additionally one can create indexes associated with these columns. It is
important to remember these additional indexes will be optimized for
"single columns only" and in some cases they do not even need to store the
underlying data value.

 

The drawback would be that queries requiring full row operations, along with
update operations, will by definition be inefficient. However, I think that
if this is achieved it will be a great plus for Hive.

 

Cheers,

 

 

Dr Mich Talebzadeh

 


 

From: Alan Gates [mailto:alanfga...@gmail.com] 
Sent: 06 January 2016 18:19
To: user@hive.apache.org
Subject: Re: Indexes in Hive

 

The issue with this is that HDFS lacks the ability to co-locate blocks.  So
if you break your columns into one file per column (the more traditional
column route) you end up in a situation where 2/3 of the time only one of
your columns is being locally read, which results in a significant
performance penalty.  That's why ORC and Parquet and RCFile all use one file
for their "columnar" stores.

Alan.





