RE: Indexes in Hive

2016-01-05 Thread Mich Talebzadeh
I believe so, Jörn.

I am not sure how much it differs from ORC file storage, though.

Cheers,

Dr Mich Talebzadeh

http://talebzadehmich.wordpress.com


-----Original Message-----
From: Jörn Franke [mailto:jornfra...@gmail.com] 
Sent: 06 January 2016 07:49
To: user@hive.apache.org
Subject: Re: Indexes in Hive

If I understand you correctly this could be just another Hive storage
format.


Re: Indexes in Hive

2016-01-05 Thread Jörn Franke
If I understand you correctly this could be just another Hive storage format.


Indexes in Hive

2016-01-05 Thread Mich Talebzadeh
Hi,

Thinking loudly.

Ideally we should consider a totally columnar storage offering in which each
column of a table is stored as compressed values (I disregard for now how
ORC actually does this, but obviously it is not exactly columnar storage).

So each table can be considered as a loose federation of columnar storage,
and each column is effectively an index?

As columns are far narrower than tables, each index block will have much
higher density, and all operations like aggregates can be done directly on
the index rather than the table.

This type of table offering will be in the true nature of data warehouse
storage. Of course row operations (get me all rows for this table) will be
slower, but that is the trade-off that we need to consider.

Expecting users to write their own IndexHandler may be technically
interesting but commercially not viable, as Hive needs to be a product on its
own merit, not a development base. Writing your own storage attributes etc.
requires skills that will put off people seeing Hive as an attractive
proposition (requiring considerable investment in skill sets in order to
maintain Hive).

Thus my thinking on this is to offer true columnar storage in Hive to be a
proper data warehouse. In addition, the development tools can be made
available for those interested in tailoring their own specific Hive
solutions.


HTH



Dr Mich Talebzadeh

http://talebzadehmich.wordpress.com




Re: Is Hive Index officially not recommended?

2016-01-05 Thread Lefty Leverenz
I'd like to revise the Indexing and IndexDev docs in the wiki to include
this information (as well as information from a previous thread, if I can
find it) so people won't be misled into using indexes inappropriately.

But it might be more efficient for Gopal or another expert to do the
revisions.  Otherwise I would need careful reviews to make sure I don't
garble things.

-- Lefty




Re: Is Hive Index officially not recommended?

2016-01-05 Thread Gopal Vijayaraghavan

>So in a nutshell in Hive, if "external" indexes are not used for improving
>query response, what value do they add, and can we forget them for now?

The builtin indexes - those that write data as smaller tables - are only
useful in a pre-columnar world, where the indexes offer a huge reduction
in IO.

Part #1 of using hive indexes effectively is to write your own
HiveIndexHandler, with usesIndexTable=false;

And then write an IndexPredicateAnalyzer, which lets you map arbitrary
lookups into other range conditions.
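
As an illustration of that wiring (the handler class below is a hypothetical
placeholder, not a shipped implementation):

-- Sketch only: 'com.example.MyIndexHandler' stands in for your own
-- implementation of org.apache.hadoop.hive.ql.index.HiveIndexHandler.
-- If its usesIndexTable() returns false, Hive does not materialize a
-- separate index table for it.
CREATE INDEX t_custom_idx ON TABLE t (object_id)
AS 'com.example.MyIndexHandler'
WITH DEFERRED REBUILD;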

Not coincidentally - we're adding an "ANALYZE TABLE ... CACHE METADATA"
statement which consolidates the "internal" index into an external store (HBase).

Some of the index data now lives in the HBase metastore, so that the
inclusion/exclusion of whole partitions can be done off the consolidated
index. 

https://issues.apache.org/jira/browse/HIVE-11676
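
For reference, the statement shape being added there is along these lines
(this is tied to the HBase-backed metastore work, so availability and exact
syntax depend on the build):

-- consolidate the table's "internal" index into the HBase metastore
ANALYZE TABLE t CACHE METADATA;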


The experience from BI workloads run by customers is that in general, the
lookup to the right "slice" of data is more of a problem than the actual
aggregate.

And that for a workhorse data warehouse, this has to survive even if
there's a non-stop stream of updates into it.

Cheers,
Gopal




RE: Hive on TEZ fails starting

2016-01-05 Thread Artem Ervits
Check if you have conflicting java versions

RE: Hive on TEZ fails starting

2016-01-05 Thread Mich Talebzadeh
Hi Rajesh,

 

This is what I have under $HADOOP_COMMON_HOME/lib/native:

 

cd $HADOOP_COMMON_HOME/lib/native

hduser@rhes564::/home/hduser/hadoop-2.6.0/lib/native> ls -ltr

total 4936

-rwxr-xr-x 1 hduser hadoop  278622 Nov 13  2014 libhdfs.so.0.0.0

-rw-r-xr-x 1 hduser hadoop  440498 Nov 13  2014 libhdfs.a

-rw-r-xr-x 1 hduser hadoop  47 Nov 13  2014 libhadooputils.a

-rw-r-xr-x 1 hduser hadoop 1634592 Nov 13  2014 libhadooppipes.a

-rwxr-xr-x 1 hduser hadoop  805999 Nov 13  2014 libhadoop.so.1.0.0

-rw-r-xr-x 1 hduser hadoop 1380212 Nov 13  2014 libhadoop.a

lrwxrwxrwx 1 hduser hadoop  16 Feb  8  2015 libhdfs.so -> libhdfs.so.0.0.0

lrwxrwxrwx 1 hduser hadoop  18 Feb  8  2015 libhadoop.so -> libhadoop.so.1.0.0

 

 

And if I do a search for snappy I get:

 

hduser@rhes564::/home/hduser/hadoop-2.6.0> find ./ -name '*snappy*'

./share/hadoop/common/lib/snappy-java-1.0.4.1.jar

./share/hadoop/kms/tomcat/webapps/kms/WEB-INF/lib/snappy-java-1.0.4.1.jar

./share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/snappy-java-1.0.4.1.jar

./share/hadoop/mapreduce/lib/snappy-java-1.0.4.1.jar

./share/hadoop/tools/lib/snappy-java-1.0.4.1.jar

 

 

Thanks,

 

Dr Mich Talebzadeh

http://talebzadehmich.wordpress.com

 

From: Rajesh Balamohan [mailto:rajesh.balamo...@gmail.com] 
Sent: 05 January 2016 11:46
To: user@hive.apache.org
Subject: Re: Hive on TEZ fails starting

 

Try ' beeline --hiveconf tez.task.launch.env="LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_COMMON_HOME/lib/native" --hiveconf tez.am.launch.env="LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_COMMON_HOME/lib/native" '.
Please check that you have the lib*.so files available in the native folder (or
point it to the folder which contains the .so files).

 

~Rajesh.B

 

 

On Tue, Jan 5, 2016 at 4:00 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

Hi,

 

I have added the following to the LD_LIBRARY_PATH and JAVA_LIBRARY_PATH

 

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_COMMON_HOME/lib/native

export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:$HADOOP_COMMON_HOME/lib/native

 

Trying to use TEZ, I still get the same error

 

0: jdbc:hive2://rhes564:10010/default> set hive.execution.engine=tez;

No rows affected (0.002 seconds)

0: jdbc:hive2://rhes564:10010/default> use oraclehadoop;

No rows affected (0.019 seconds)

0: jdbc:hive2://rhes564:10010/default> select count(1) from sales;

INFO  : Tez session hasn't been created yet. Opening session

INFO  :

 

INFO  : Status: Running (Executing on YARN cluster with App id 
application_1451986680090_0002)

 

INFO  : Map 1: -/-  Reducer 2: 0/1

INFO  : Map 1: 0/1  Reducer 2: 0/1

INFO  : Map 1: 0(+1)/1  Reducer 2: 0/1

INFO  : Map 1: 0(+1,-1)/1   Reducer 2: 0/1

INFO  : Map 1: 0(+1,-1)/1   Reducer 2: 0/1

INFO  : Map 1: 0(+1,-2)/1   Reducer 2: 0/1

INFO  : Map 1: 0(+1,-2)/1   Reducer 2: 0/1

INFO  : Map 1: 0(+1,-3)/1   Reducer 2: 0/1

INFO  : Map 1: 0(+1,-3)/1   Reducer 2: 0/1

 

ERROR : Vertex failed, vertexName=Map 1, 
vertexId=vertex_1451986680090_0002_1_00, diagnostics=[Task failed, 
taskId=task_1451986680090_0002_1_00_00, diagnostics=[TaskAttempt 0 failed, 
info=[Error: Failure while running task:java.lang.RuntimeException: 
java.lang.UnsatisfiedLinkError: 
org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z

 

 

 


RE: Is Hive Index officially not recommended?

2016-01-05 Thread Mich Talebzadeh
Thanks Gopal for a very valuable insight.

So in a nutshell in Hive, if "external" indexes are not used for improving
query response, what value do they add, and can we forget them for now?

Regards,

Dr Mich Talebzadeh

http://talebzadehmich.wordpress.com








Re: Is Hive Index officially not recommended?

2016-01-05 Thread Gopal Vijayaraghavan

 
 
> I am going to run the same query in Hive. However, I only see a table
>scan below and no mention of that index. Maybe I am missing something
>here?

Hive Indexes are an incomplete feature, because they are not maintained
over ACID storage & demand FileSystem access to check for validity.

I'm almost sure there's a better implementation, which never made it to
Apache (read HIVE-417 & comments about HBase).


So far, in all my prod cases, they've slowed down queries more often than
speeding them up.

By default, the indexes are *not* used to answer queries.

In fact, the slowness was mostly attributed to the time spent making sure
the index was invalid.

You can flip those on if you want mostly up-to-date results.

set hive.optimize.index.filter=true;
set hive.optimize.index.groupby=true;
set hive.index.compact.query.max.size=-1;
set hive.optimize.index.filter.compact.minsize=-1;
set hive.index.compact.query.max.entries=-1;
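
As a usage sketch (the flags come from the list above; the table and
predicate reuse the example from elsewhere in this thread), the settings are
session-level, so they go in before the query that should consult the index:

-- lift the size/entry caps so the rebuilt compact index may answer the query
set hive.optimize.index.filter=true;
set hive.index.compact.query.max.size=-1;
set hive.index.compact.query.max.entries=-1;
select count(1) from t where object_id < 100;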

Things are going to change in Hive-2.0 though. The addition of isolated
transactions brings new light into the world of indexes.

I'll be chasing that down after LLAP, since the txn model offers
serializability markers and the LockManager + compactions offer a great
way to purge/update them per-partition. And the metastore-2.0 removes a
large number of scalability problems associated with metadata.

 
Cheers,
Gopal







RE: Is Hive Index officially not recommended?

2016-01-05 Thread Mich Talebzadeh
Hi,

 

Your point below:

The "traditional" indexes can still make sense for data not in Orc or parquet
format.

Kindly consider the below, please:

 

A traditional index in an RDBMS is normally a B-tree index with a value for
that column and a pointer (row ID) to the row in the data block that holds
the data.

 

 

In an RDBMS I create a unique index on column OBJECT_ID of table 't' below
and run a simple query that can be covered by the index without touching the
base table:

 

1> select count(1) from t where OBJECT_ID < 100

2> go

 

QUERY PLAN FOR STATEMENT 1 (at line 1).

STEP 1
    The type of query is EXECUTE.
    Executing a newly cached statement (SSQL_ID = 312036659).

Total estimated I/O cost for statement 1 (at line 1): 0.

QUERY PLAN FOR STATEMENT 1 (at line 0).

STEP 1
    The type of query is DECLARE.

Total estimated I/O cost for statement 1 (at line 0): 0.

QUERY PLAN FOR STATEMENT 2 (at line 1).
Optimized using Parallel Mode

STEP 1
    The type of query is SELECT.

    3 operator(s) under root

    |ROOT:EMIT Operator (VA = 3)
    |
    |   |SCALAR AGGREGATE Operator (VA = 2)
    |   |  Evaluate Ungrouped COUNT AGGREGATE.
    |   |
    |   |   |RESTRICT Operator (VA = 1)(3)(0)(0)(0)(0)
    |   |   |
    |   |   |   |SCAN Operator (VA = 0)
    |   |   |   |  FROM TABLE
    |   |   |   |  t
    |   |   |   |  Using Clustered Index.
    |   |   |   |  Index : t_ui
    |   |   |   |  Forward Scan.
    |   |   |   |  Positioning by key.
    |   |   |   |  Index contains all needed columns. Base table will not be read.
    |   |   |   |  Keys are:
    |   |   |   |    OBJECT_ID ASC
    |   |   |   |  Using I/O Size 64 Kbytes for index leaf pages.
    |   |   |   |  With LRU Buffer Replacement Strategy for index leaf pages.

Total estimated I/O cost for statement 2 (at line 1): 322792.

 

 

 

OK, so no base table is touched.

Let us do a similar thing by creating an index on OBJECT_ID in the table 't'
imported from the said table and created in Hive:

 

 

create index t_ui on table t (object_id) as 'COMPACT' WITH DEFERRED REBUILD;
alter index t_ui on t rebuild;
analyze table t compute statistics;
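
To confirm the index exists, independently of whether the optimizer uses it,
you can list it; a COMPACT index is also materialized as its own table (the
generated name in the comment follows the usual db__table_index__ pattern,
worth verifying with SHOW TABLES):

SHOW FORMATTED INDEX ON t;
-- the backing table for the compact index typically appears as default__t_t_ui__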

I am going to run the same query in Hive. However, I only see a table scan
below and no mention of that index. Maybe I am missing something here?

0: jdbc:hive2://rhes564:10010/default> explain select count(1) from t where OBJECT_ID < 100;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (GROUP, 1)
      DagName: hduser_20160105203204_8d987e9a-415a-476a-8bad-b9a5010e36bf:54
      Vertices:
        Map 1
          Map Operator Tree:
              TableScan
                alias: t
                Statistics: Num rows: 2074897 Data size: 64438212 Basic stats: COMPLETE Column stats: NONE
                Filter Operator
                  predicate: (object_id < 1

Re: Is Hive Index officially not recommended?

2016-01-05 Thread Ting(Goden) Yao
Yes, we tried MR and it works fine, so it's more likely a Tez issue.
Thanks for your comments.


Re: Is Hive Index officially not recommended?

2016-01-05 Thread Jörn Franke
Btw, this is not Hive-specific; it also applies to other relational database
systems, such as Oracle Exadata.


Re: Is Hive Index officially not recommended?

2016-01-05 Thread Jörn Franke
You can still use the MR execution engine for maintaining the index. Indeed,
with the ORC or Parquet formats there are min/max indexes and bloom filters,
but you need to sort your data appropriately to benefit from them.
Alternatively, you can create redundant tables sorted in different orders.
The "traditional" indexes can still make sense for data not in ORC or Parquet
format.
Keep in mind that for warehouse scenarios there are many other optimization
methods in Hive.
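
As a sketch of the sorting point (the table name and columns are illustrative
only), the sort order is declared when the table is created, so that ORC's
min/max statistics become selective on the sorted column:

-- Illustrative: keep rows sorted on object_id within each bucket so that
-- ORC min/max stripe/row-group statistics can skip data for range predicates.
CREATE TABLE t_sorted (object_id INT, object_name STRING)
CLUSTERED BY (object_id) SORTED BY (object_id ASC) INTO 32 BUCKETS
STORED AS ORC;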


RE: Is Hive Index officially not recommended?

2016-01-05 Thread Mich Talebzadeh
I don’t think Index on hive (as a separate entity) adds any value  although you 
can create one 

 

You can create an ORC table which will have characteristics that can simulate 
index like behaviour

 

CLUSTERED BY (object_id) INTO 256 BUCKETS

STORED AS ORC

TBLPROPERTIES ( "orc.compress"="SNAPPY",

"orc.create.index"="true",

"orc.bloom.filter.columns"="object_id",

 

That improves query response
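
As a usage sketch against such a table (the lookup value is illustrative;
hive.optimize.index.filter turns on the predicate pushdown that consults the
ORC indexes and bloom filter):

set hive.optimize.index.filter=true;
-- an equality lookup on the bloom-filtered column can skip whole stripes
-- and row groups instead of scanning the full table
select count(1) from t where object_id = 123456;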

 

 

HTH

 

Dr Mich Talebzadeh

http://talebzadehmich.wordpress.com

 


Is Hive Index officially not recommended?

2016-01-05 Thread Ting(Goden) Yao
Hi,

We hit an issue when doing Hive testing to rebuild an index on Tez.
We were told by our Hadoop distro vendor that it's not recommended (or should
be avoided) to use indexes with Hive.

But I don't see an official message on the Hive wiki or in the documentation.
Can someone confirm that, so we'll ask our users to avoid indexing?

Thanks.
-Goden

==Exceptions (if you're interested in details) ==

Exception:

2015-12-08 22:55:30,263 FATAL [AsyncDispatcher event handler]
event.AsyncDispatcher: Error in dispatcher thread
org.apache.tez.dag.api.TezUncheckedException: Unable to instantiate
class with 1 arguments:
org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator
at org.apache.tez.common.ReflectionUtils.getNewInstance(ReflectionUtils.java:80)
at org.apache.tez.common.ReflectionUtils.createClazzInstance(ReflectionUtils.java:98)
at org.apache.tez.dag.app.dag.RootInputInitializerManager.createInitializer(RootInputInitializerManager.java:137)
at org.apache.tez.dag.app.dag.RootInputInitializerManager.runInputInitializers(RootInputInitializerManager.java:114)
at org.apache.tez.dag.app.dag.impl.VertexImpl.setupInputInitializerManager(VertexImpl.java:3943)
at org.apache.tez.dag.app.dag.impl.VertexImpl.access$3900(VertexImpl.java:180)
at org.apache.tez.dag.app.dag.impl.VertexImpl$InitTransition.handleInitEvent(VertexImpl.java:2956)
at org.apache.tez.dag.app.dag.impl.VertexImpl$InitTransition.transition(VertexImpl.java:2906)
at org.apache.tez.dag.app.dag.impl.VertexImpl$InitTransition.transition(VertexImpl.java:2887)
at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:57)
at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1556)
at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:179)
at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1764)
at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1750)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.tez.common.ReflectionUtils.getNewInstance(ReflectionUtils.java:69)
... 20 more
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.exec.tez.DynamicPartitionPruner.initialize(DynamicPartitionPruner.java:154)
at org.apache.hadoop.hive.ql.exec.tez.DynamicPartitionPruner.<init>(DynamicPartitionPruner.java:110)
at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.<init>(HiveSplitGenerator.java:95)
... 25 more
2015-12-08 22:55:30,266 ERROR [AsyncDispatcher event handler]
impl.VertexImpl: Can't handle Invalid event V_START on vertex Map 1
with vertexId vertex_1449613300943_0002_1_00 at current state NEW
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid
event: V_START at NEW
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:57)
at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1556)
at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:179)
at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1764)
at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1750)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
at java.lang.Thread.run(Thread.java:745)
2015-12-08 22:55:30,267 ERROR [AsyncDispatcher event handler]
impl.VertexImpl: Invalid event V_INTERNAL_ERROR on Vert

Re: NPE when reading Parquet using Hive on Tez

2016-01-05 Thread Adam Hunt
Hi Gopal,

Spark does offer dynamic allocation, but it doesn't always work as
advertised. My experience with Tez has been more in line with my
expectations. I'll bring up my issues with Spark on that list.

I tried your example and got the same NPE. It might be a mapr-hive issue.
Thanks for your help.

Adam

On Mon, Jan 4, 2016 at 12:58 PM, Gopal Vijayaraghavan 
wrote:

>
> > select count(*) from alexa_parquet;
>
> > Caused by: java.lang.NullPointerException
> >at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.tokenize(TypeInfoUtils.java:274)
> >at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.<init>(TypeInfoUtils.java:293)
> >at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:764)
> >at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.getColumnTypes(DataWritableReadSupport.java:76)
> >at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:220)
> >at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:256)
>
> This might be an NPE triggered off by a specific case of the type parser.
>
> I tested it out on my current build with simple types and it looks like
> the issue needs more detail on the column types for a repro.
>
> hive> create temporary table x (x int) stored as parquet;
> hive> insert into x values(1),(2);
> hive> select count(*) from x where x.x > 1;
> Status: DAG finished successfully in 0.18 seconds
> OK
> 1
> Time taken: 0.792 seconds, Fetched: 1 row(s)
> hive>
>
> Do you have INT96 in the schema?
>
> > I'm currently evaluating Hive on Tez as an alternative to keeping the
> >SparkSQL thrift sever running all the time locking up resources.
>
> Tez has a tunable value in tez.am.session.min.held-containers (i.e., set it
> to something small like 10).
>
> And HiveServer2 can be made to work similarly, because Spark's
> HiveThriftServer2.scala is a wrapper around Hive's ThriftBinaryCLIService.
>
> Cheers,
> Gopal
>
>
>
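
A minimal sketch of the session tuning Gopal mentions, assuming you want the
Tez session to keep a small pool of containers alive between queries (the
value 10 is illustrative, not a recommendation):

set hive.execution.engine=tez;
set tez.am.session.min.held-containers=10;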


RE: Deleting empty rows from hive table through java

2016-01-05 Thread Mich Talebzadeh
Agreed.

 

Empty rows in any database have no intrinsic value. If we think of ELT, then in 
theory we need to get the Web data into the Hive table, empty rows included, and 
then do the clean-up to get rid of them. This is time consuming, and whatever 
engine we use it is not going to be efficient. I have a shell script that 
generates a simple table with an ID column and a description column filled with 
random text. Pretty simple code. However, it inserts one normal row followed by 
a blank row into a Hive table that the script itself creates:

 

#!/bin/ksh
function genrandom
{
l=$1
[ "$l" == "" ] && l=50
tr -dc A-Za-z0-9_ < /dev/urandom | head -c ${l} | xargs
}
#
# Main Section
#
FILE_NAME=`basename $0 .ksh`
#
IN_FILE="/var/tmp/test.hql"
[ -f ${IN_FILE} ] && rm -f ${IN_FILE}
LOG_FILE="/var/tmp/test.log"
[ -f ${LOG_FILE} ] && rm -f ${LOG_FILE}
cat >> ${IN_FILE} << !
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions.pernode=1000;
set hive.enforce.bucketing = true;
use test;  -- that is a test database; change it to whatever you like
DROP TABLE IF EXISTS txtest;
CREATE TABLE txtest (
  id int
, description string
)
CLUSTERED BY (id) INTO 256 BUCKETS
STORED AS ORC TBLPROPERTIES('transactional'='true')
;
INSERT INTO TABLE txtest VALUES
!
ROWS=20
integer ROWCOUNT=1
while ((ROWCOUNT <= ROWS))
do
   NEW_UUID=`genrandom 50`  ## generate a 50-character random string
   if ((ROWCOUNT < ROWS))
   then
      COMMA=","
   else
      COMMA=""
   fi
   echo "(${ROWCOUNT},'${NEW_UUID}') ${COMMA}" >> ${IN_FILE}
   if ((ROWCOUNT < ROWS))
   then
      ## generate an empty row after every normal one
      echo "('','') ${COMMA}" >> ${IN_FILE}
   fi
   ((ROWCOUNT = ROWCOUNT + 1))
done
#
cat >> ${IN_FILE} << !
;
select * from txtest;
!
exit

 

Now run that test.hql script against your Hive instance and then try to delete the empty rows.
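
One way to run the generated script (a sketch; the JDBC URL matches the test
environment shown below, so adjust host, port and credentials to yours):

beeline -u jdbc:hive2://rhes564:10010/default -f /var/tmp/test.hql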

 

0: jdbc:hive2://rhes564:10010/default> select count(1) from test.txtest;
INFO  : Status: Finished successfully in 69.32 seconds
+--+--+
| _c0  |
+--+--+
| 39   |
+--+--+
1 row selected (73.368 seconds)

0: jdbc:hive2://rhes564:10010/default> select count(1) from test.txtest where id is null;
INFO  :
INFO  : Status: Finished successfully in 7.04 seconds
+--+--+
| _c0  |
+--+--+
| 19   |
+--+--+
1 row selected (7.151 seconds)

0: jdbc:hive2://rhes564:10010/default> delete from test.txtest where id is null;
INFO  :
Query Hive on Spark job[2] stages:
INFO  : 5
INFO  : 4
INFO  :
Status: Running (Hive on Spark job[2])
INFO  : Status: Finished successfully in 15.08 seconds
INFO  : Loading data to table test.txtest from hdfs://rhes564:9000/user/hive/warehouse/test.db/txtest/.hive-staging_hive_2016-01-05_16-08-56_808_8837275301497077898-13/-ext-1
INFO  : Table test.txtest stats: [numFiles=257, numRows=20, totalSize=65884, rawDataSize=0]
No rows affected (15.696 seconds)

 

 

OK, so it took 15.6 seconds to delete those 19 empty rows. One can just as 
easily do that by removing the empty lines at OS level before putting the data 
into the Hive table:

cat test.hql | grep -v "('','')" > tmp.$$
mv -f tmp.$$ test.hql

 

 

So, like most things, there is no clear-cut answer on whether to do this 
outside of Hive or inside Hive.
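
For completeness, a sketch of a non-ACID route inside Hive, assuming you are
happy to rebuild the table rather than delete in place (CTAS produces a plain
table here; to keep the bucketed transactional layout you would instead use a
separate CREATE TABLE plus INSERT ... SELECT):

CREATE TABLE txtest_clean AS
SELECT id, description FROM txtest WHERE id IS NOT NULL;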

 

HTH

 

 

Dr Mich Talebzadeh

 


 

From: Vikas Parashar [mailto:para.vi...@gmail.com] 
Sent: 05 January 2016 11:40
To: user@hive.apache.org
Subject: Re: Deleting empty rows from hive table through java

 

Well said Mich,

 

I had gone through from the same scenario in which we had done ETL out side the 
hive. Once the transformation is done 

Re: Hive on TEZ fails starting

2016-01-05 Thread Rajesh Balamohan
Try:

beeline --hiveconf tez.task.launch.env="LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_COMMON_HOME/lib/native" --hiveconf tez.am.launch.env="LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_COMMON_HOME/lib/native"

Please check that you have the lib*.so files available in the native folder
(or point the variable at whichever folder contains the .so files).
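
Equivalently, a sketch of making this permanent in tez-site.xml (the native
library path is environment-specific):

<property>
  <name>tez.task.launch.env</name>
  <value>LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_COMMON_HOME/lib/native</value>
</property>
<property>
  <name>tez.am.launch.env</name>
  <value>LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_COMMON_HOME/lib/native</value>
</property>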

~Rajesh.B


On Tue, Jan 5, 2016 at 4:00 PM, Mich Talebzadeh  wrote:

> Hi,
>
>
>
> I have added the following to the LD_LIBRARY_PATH and JAVA_LIBRARY_PATH
>
>
>
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_COMMON_HOME/lib/native
>
> export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:$HADOOP_COMMON_HOME/lib/native
>
>
>
> Trying to use TEZ, I still get the same error
>
>
>
> 0: jdbc:hive2://rhes564:10010/default> set hive.execution.engine=tez;
>
> No rows affected (0.002 seconds)
>
> 0: jdbc:hive2://rhes564:10010/default> use oraclehadoop;
>
> No rows affected (0.019 seconds)
>
> 0: jdbc:hive2://rhes564:10010/default> select count(1) from sales;
>
> INFO  : Tez session hasn't been created yet. Opening session
>
> INFO  :
>
>
>
> INFO  : Status: Running (Executing on YARN cluster with App id
> application_1451986680090_0002)
>
>
>
> INFO  : Map 1: -/-  Reducer 2: 0/1
>
> INFO  : Map 1: 0/1  Reducer 2: 0/1
>
> INFO  : Map 1: 0(+1)/1  Reducer 2: 0/1
>
> INFO  : Map 1: 0(+1,-1)/1   Reducer 2: 0/1
>
> INFO  : Map 1: 0(+1,-1)/1   Reducer 2: 0/1
>
> INFO  : Map 1: 0(+1,-2)/1   Reducer 2: 0/1
>
> INFO  : Map 1: 0(+1,-2)/1   Reducer 2: 0/1
>
> INFO  : Map 1: 0(+1,-3)/1   Reducer 2: 0/1
>
> INFO  : Map 1: 0(+1,-3)/1   Reducer 2: 0/1
>
>
>
> ERROR : Vertex failed, vertexName=Map 1,
> vertexId=vertex_1451986680090_0002_1_00, diagnostics=[Task failed,
> taskId=task_1451986680090_0002_1_00_00, diagnostics=[TaskAttempt 0
> failed, info=[Error: Failure while running task:java.lang.RuntimeException:
> java.lang.UnsatisfiedLinkError:
> org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
>
> Dr Mich Talebzadeh
>
> *From:* Rajesh Balamohan [mailto:rajesh.balamo...@gmail.com]
> *Sent:* 05 January 2016 00:35
> *To:* user@hive.apache.org
> *Subject:* Re: Hive on TEZ fails starting
>
>
>
> By default it should add "LD_LIBRARY_PATH" in the container (ref:
> https://github.com/apache/tez/blob/abfc8bfb0a8620d31697a31ad516674a8d3f9f7c/tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java#L358)
> . In case your snappy native libs are present elsewhere in the cluster
> deployment, you can override using "tez.task.launch.env" and
> "tez.am.launch.env" (refer:
> https://github.com/apache/tez/blob/abfc8bfb0a8620d31697a31ad516674a8d3f9f7c/tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java#L412
> )
>
>
>
> ~Rajesh.B
>
>
>
>
>



-- 
~Rajesh.B


Re: Deleting empty rows from hive table through java

2016-01-05 Thread Vikas Parashar
Well said Mich,

I have been through the same scenario, in which we did the ETL outside of
Hive. Once the transformation was done, we loaded all the data into the Hive
warehouse. I think that is the best practice, and we should follow it.

Regards,
Vikas Parashar

On Tue, Jan 5, 2016 at 5:02 PM, Mich Talebzadeh  wrote:

> It would be interesting to do the ETL outside of Hive by getting the data
> from the Web page into an intermediate file, pruning the empty rows and
> loading the final CSV file into the Hive destination table.
>
> I am pretty sure this clean-up outside of Hive would be faster than doing
> the same thing in Hive.
>
>
>
> Dr Mich Talebzadeh
>
> *From:* Mich Talebzadeh [mailto:m...@peridale.co.uk]
> *Sent:* 05 January 2016 08:55
> *To:* user@hive.apache.org
> *Subject:* RE: Deleting empty rows from hive table through java
>
>
>
> Hi Sateesh,
>
>
>
> You can do the clean-up in Hive by creating a staging table, feeding your
> CSV data into it and then inserting into the main table only those rows
> where COL1 is NOT NULL.
>
>
>
> Alternatively, you can create your Hive table as transactional and delete
> the empty rows in place, although I would say the staging table is better,
> as you will keep a full record of your CSV data at all times.
>
>
>
> You can of course do the pruning of data outside of Hive using a simple
> shell script with sed and awk (if you are familiar with those tools).
>
>
>
> cat CSV_FILE | sed -e '/^$/d'
>
>
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
> *From:* Sateesh Karuturi [mailto:sateesh.karutu...@gmail.com
> ]
> *Sent:* 05 January 2016 06:59
> *To:* user@hive.apache.org
> *Subject:* Deleting empty rows from hive table through java
>
>
>
> Hello...
>
> Anyone please help me how to delete empty rows from hive table through
> java?
>
> Thanks in advance
>


RE: Deleting empty rows from hive table through java

2016-01-05 Thread Mich Talebzadeh
It would be interesting to do the ETL outside of Hive by getting the data from 
the Web page into an intermediate file, pruning the empty rows and loading the 
final CSV file into the Hive destination table.

I am pretty sure this clean-up outside of Hive would be faster than doing the 
same thing in Hive.

 

Dr Mich Talebzadeh

 


 

From: Mich Talebzadeh [mailto:m...@peridale.co.uk] 
Sent: 05 January 2016 08:55
To: user@hive.apache.org
Subject: RE: Deleting empty rows from hive table through java

 

Hi Sateesh,

 

You can do the clean-up in Hive by creating a staging table, feeding your CSV 
data into it and then inserting into the main table only those rows where COL1 
is NOT NULL.

 

Alternatively, you can create your Hive table as transactional and delete the 
empty rows in place, although I would say the staging table is better, as you 
will keep a full record of your CSV data at all times.

 

You can of course do the pruning of data outside of Hive using a simple shell 
script with sed and awk (if you are familiar with those tools).

 

cat CSV_FILE | sed -e '/^$/d'

 

HTH

 

Dr Mich Talebzadeh

 


 

From: Sateesh Karuturi [mailto:sateesh.karutu...@gmail.com] 
Sent: 05 January 2016 06:59
To: user@hive.apache.org  
Subject: Deleting empty rows from hive table through java

 

Hello...

Anyone please help me how to delete empty rows from hive table through java?

Thanks in advance



RE: Hive on TEZ fails starting

2016-01-05 Thread Mich Talebzadeh
Hi,

 

I have added the following to the LD_LIBRARY_PATH and JAVA_LIBRARY_PATH

 

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_COMMON_HOME/lib/native

export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:$HADOOP_COMMON_HOME/lib/native
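
As a quick diagnostic, you can check whether Hadoop actually sees the native
snappy library (a sketch; hadoop checknative is standard Hadoop, but the
native path below is environment-specific):

hadoop checknative -a
ls -l $HADOOP_COMMON_HOME/lib/native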

 

Trying to use TEZ, I still get the same error

 

0: jdbc:hive2://rhes564:10010/default> set hive.execution.engine=tez;
No rows affected (0.002 seconds)
0: jdbc:hive2://rhes564:10010/default> use oraclehadoop;
No rows affected (0.019 seconds)
0: jdbc:hive2://rhes564:10010/default> select count(1) from sales;
INFO  : Tez session hasn't been created yet. Opening session
INFO  :
INFO  : Status: Running (Executing on YARN cluster with App id application_1451986680090_0002)
INFO  : Map 1: -/-  Reducer 2: 0/1
INFO  : Map 1: 0/1  Reducer 2: 0/1
INFO  : Map 1: 0(+1)/1  Reducer 2: 0/1
INFO  : Map 1: 0(+1,-1)/1   Reducer 2: 0/1
INFO  : Map 1: 0(+1,-1)/1   Reducer 2: 0/1
INFO  : Map 1: 0(+1,-2)/1   Reducer 2: 0/1
INFO  : Map 1: 0(+1,-2)/1   Reducer 2: 0/1
INFO  : Map 1: 0(+1,-3)/1   Reducer 2: 0/1
INFO  : Map 1: 0(+1,-3)/1   Reducer 2: 0/1

ERROR : Vertex failed, vertexName=Map 1,
vertexId=vertex_1451986680090_0002_1_00, diagnostics=[Task failed,
taskId=task_1451986680090_0002_1_00_00, diagnostics=[TaskAttempt 0 failed,
info=[Error: Failure while running task:java.lang.RuntimeException:
java.lang.UnsatisfiedLinkError:
org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z

 

 

 

Dr Mich Talebzadeh

 


 

From: Rajesh Balamohan [mailto:rajesh.balamo...@gmail.com] 
Sent: 05 January 2016 00:35
To: user@hive.apache.org
Subject: Re: Hive on TEZ fails starting

 

By default it should add "LD_LIBRARY_PATH" in the container (ref: 
https://github.com/apache/tez/blob/abfc8bfb0a8620d31697a31ad516674a8d3f9f7c/tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java#L358)
 . In case your snappy native libs are present elsewhere in the cluster 
deployment, you can override using "tez.task.launch.env" and 
"tez.am.launch.env" (refer: 
https://github.com/apache/tez/blob/abfc8bfb0a8620d31697a31ad516674a8d3f9f7c/tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java#L412)

 

~Rajesh.B

 

 



RE: Deleting empty rows from hive table through java

2016-01-05 Thread Mich Talebzadeh
Hi Sateesh,

 

You can do the clean-up in Hive by creating a staging table, feeding your CSV 
data into it and then inserting into the main table only those rows where COL1 
is NOT NULL.
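
A minimal sketch of that staging approach, assuming a comma-delimited CSV and
hypothetical table, column and path names:

CREATE TABLE staging_csv (col1 string, col2 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA INPATH '/tmp/webdata.csv' INTO TABLE staging_csv;

INSERT INTO TABLE main_table
SELECT * FROM staging_csv
WHERE col1 IS NOT NULL AND col1 <> '';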

 

Alternatively, you can create your Hive table as transactional and delete the 
empty rows in place, although I would say the staging table is better, as you 
will keep a full record of your CSV data at all times.

 

You can of course do the pruning of data outside of Hive using a simple shell 
script with sed and awk (if you are familiar with those tools).

 

cat CSV_FILE | sed -e '/^$/d'

 

HTH

 

Dr Mich Talebzadeh

 


 

From: Sateesh Karuturi [mailto:sateesh.karutu...@gmail.com] 
Sent: 05 January 2016 06:59
To: user@hive.apache.org
Subject: Deleting empty rows from hive table through java

 

Hello...

Anyone please help me how to delete empty rows from hive table through java?

Thanks in advance



Re: Deleting empty rows from hive table through java

2016-01-05 Thread Vikas Parashar
If the data is not huge, then please export it to CSV, do all the
transformation on the CSV, and point your table at it.
Would you mind telling me how you are loading your data into Hive?
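
A minimal sketch of that approach, assuming the cleaned CSV has been placed in
HDFS (path and columns are hypothetical):

CREATE EXTERNAL TABLE web_data (id int, description string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/clean_csv/';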



Regards,
Vikas Parashar

On Tue, Jan 5, 2016 at 1:46 PM, Sateesh Karuturi <
sateesh.karutu...@gmail.com> wrote:

> Thank you for your quick response...
> Directly loading the data from webpage to hive
>
> On Tue, Jan 5, 2016 at 1:44 PM, Vikas Parashar 
> wrote:
>
>> What is the backend of your table?
>> Is it csv, orc or anything else!
>>
>>
>> Regards,
>> Vikas Parashar
>>
>>
>> On Tue, Jan 5, 2016 at 12:28 PM, Sateesh Karuturi <
>> sateesh.karutu...@gmail.com> wrote:
>>
>>> Hello...
>>> Anyone please help me how to delete empty rows from hive table through
>>> java?
>>> Thanks in advance
>>>
>>
>>
>


Re: Deleting empty rows from hive table through java

2016-01-05 Thread Sateesh Karuturi
Thank you for your quick response...
We are loading the data directly from the web page into Hive.

On Tue, Jan 5, 2016 at 1:44 PM, Vikas Parashar  wrote:

> What is the backend of your table?
> Is it csv, orc or anything else!
>
>
> Regards,
> Vikas Parashar
>
>
> On Tue, Jan 5, 2016 at 12:28 PM, Sateesh Karuturi <
> sateesh.karutu...@gmail.com> wrote:
>
>> Hello...
>> Anyone please help me how to delete empty rows from hive table through
>> java?
>> Thanks in advance
>>
>
>


Re: Deleting empty rows from hive table through java

2016-01-05 Thread Vikas Parashar
What is the backend of your table?
Is it CSV, ORC or something else?


Regards,
Vikas Parashar


On Tue, Jan 5, 2016 at 12:28 PM, Sateesh Karuturi <
sateesh.karutu...@gmail.com> wrote:

> Hello...
> Anyone please help me how to delete empty rows from hive table through
> java?
> Thanks in advance
>