RE: make best use of VCore in Hive

2016-03-28 Thread Ryan Harris
In my opinion, this ultimately becomes a resource balance issue that you'll 
need to test.

You have a fixed amount of memory (although you haven't said what it is). As 
you increase the number of tasks, the available memory per task will decrease.  
If the tasks run out of memory, they will either page to disk or fail.

If they fail, you'll be forced to increase the available memory until you 
figure out how much memory each task needs to succeed; that will become the 
limiting factor on how many tasks you can run.

Hive is generally good about falling back and not failing, but performance may 
suffer.

But given the situation you describe, I'd generally recommend increasing the 
number of tasks and monitoring to confirm that this actually improves 
performance.
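
For illustration, a starting point might look like the following (these are 
standard Hive/Tez properties, but the values are placeholders you would need 
to tune against your own cluster and workload):

set hive.execution.engine=tez;
-- smaller containers allow more concurrent tasks per node, but give each
-- task less memory to work with
set hive.tez.container.size=4096;                    -- MB per Tez task
-- smaller split groups produce more map tasks
set tez.grouping.min-size=134217728;                 -- 128 MB
set tez.grouping.max-size=268435456;                 -- 256 MB
-- fewer bytes per reducer produces more reduce tasks
set hive.exec.reducers.bytes.per.reducer=268435456;  -- 256 MB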

-Original Message-
From: mahender bigdata [mailto:mahender.bigd...@outlook.com] 
Sent: Monday, March 28, 2016 6:04 PM
To: user@hive.apache.org
Subject: make best use of VCore in Hive

Hi,

Currently we are joining 2-3 big tables, plus a couple of left joins. We 
are running on a 40-node cluster. During query execution, we can see that 
all the memory is utilized completely (100%), which is perfect, but the 
number of VCores used is less than 50%. Is there a way to increase core 
usage to almost 90-100%? Does setting the number of tasks for the mappers 
and reducers with Tez execution make effective use of VCores? Any guidance 
on maximizing VCore utilization would be appreciated.




RE: Best way of Unpivoting of Hive table data. Any Analytic function for unpivoting

2016-03-28 Thread Ryan Harris
collect_list(col) will give you an array with all of the data from that 
column. However, the scalability of this approach will have limits.
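
As a concrete sketch (the table and column names here are invented for 
illustration), the built-in stack() UDTF is a common way to turn columns 
into rows:

-- stack(3, ...) emits 3 rows per input row; the value columns must
-- share a compatible type
SELECT id, t.metric, t.value
FROM wide_table
LATERAL VIEW stack(3,
    'col_a', col_a,
    'col_b', col_b,
    'col_c', col_c) t AS metric, value;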

-Original Message-
From: mahender bigdata [mailto:mahender.bigd...@outlook.com] 
Sent: Monday, March 28, 2016 5:47 PM
To: user@hive.apache.org
Subject: Best way of Unpivoting of Hive table data. Any Analytic function for 
unpivoting

Hi,

Has anyone implemented unpivoting of Hive external table data? We would 
like to convert columns into multiple rows. We have an external table which 
holds almost 2 GB of data. Is there a good, quick way of converting columns 
into rows? Are any analytic functions available in Hive for unpivoting?






Re: Automatic Update statistics on ORC tables in Hive

2016-03-28 Thread Andrew Sears
It would be useful to have a script that could be scheduled as a low-priority 
background job to update stats, at least where none are available, along with 
a report in the Hive GUI on stats per table.


Encountered a Tez out-of-memory issue recently, due to the lack of 
auto-updated stats.
Cheers, Andrew
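
Something along these lines could be cron-scheduled (the JDBC URL and table 
names below are placeholders; a real job would probably read the table list 
from the metastore rather than hard-coding it):

#!/bin/bash
# Hypothetical nightly stats refresh driven through beeline.
TABLES="mydb.fact_sales mydb.dim_customer"
for t in $TABLES; do
  beeline -u "jdbc:hive2://hiveserver:10000" \
          -e "ANALYZE TABLE $t COMPUTE STATISTICS;"
done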

On Mon, Mar 28, 2016 at 2:27 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
Hi Alan,
Thanks for the clarification. I gather you are referring to the following notes 
in Jira
"Given the work that's going on in HIVE-11160 
[https://issues.apache.org/jira/browse/HIVE-11160] and HIVE-12763 
[https://issues.apache.org/jira/browse/HIVE-12763] I don't think it makes sense 
to continue down this path. These JIRAs will lay the groundwork for 
auto-gathering stats on data as it is inserted rather than having a background 
process do the work."
I concur; I am not a fan of automatic update statistics either, although many 
RDBMS vendors touted it in the early days. The whole thing turned out to be a 
hindrance, as UPDATE STATISTICS would fire in the middle of the business day, 
adding to the workload by taking resources away.

Most vendors base the need to update/gather stats on the number of rows 
changed, relying on some function, say datachange(). When the datachange() 
function indicates that rows have changed by 10%, it is time for update stats 
to run. Again, in my opinion, this is rather arbitrary and devoid of any 
scientific basis. For Hive the important operation is inserts; for 
transactional tables one will have updates and deletes as well. My 
understanding is that the classical approach is to report how many "row 
change operations", say inserts, have been performed since the last time any 
kind of analyze statistics was run.

This came to mind as I was using Spark to load CSV files and create and 
insert into Hive ORC tables. The problem I have is that analyzing statistics 
through Spark fails. This is not a show-stopper, as the load shell script 
invokes beeline to log in to Hive and analyze statistics on the newly created 
table. Although some proponents might argue for saving the data from Spark as 
Parquet files, when one has millions and millions of rows, stats matter, and 
that is where ORC adds its value.
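
For reference, the beeline step described above would presumably run 
statements of this form (the table name is hypothetical):

-- table-level stats, then column-level stats for the optimizer
ANALYZE TABLE orc_sales COMPUTE STATISTICS;
ANALYZE TABLE orc_sales COMPUTE STATISTICS FOR COLUMNS;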




Cheers,

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 28 March 2016 at 18:43, Alan Gates <alanfga...@gmail.com> wrote:
I resolved that as Won’t Fix. See the last comment on the JIRA for my rationale.

Alan.

> On Mar 28, 2016, at 03:53, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Thanks. This does not seem to be implemented although the Jira says resolved.
> It also mentions the timestamp of the last update stats. I do not see it yet.
>
> Regards,
>
> Mich
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
>
> On 28 March 2016 at 06:19, Gopal Vijayaraghavan <gop...@apache.org> wrote:
>
> > This might be a bit far fetched but is there any plan for background
> > ANALYZE STATISTICS to be performed on ORC tables
>
> https://issues.apache.org/jira/browse/HIVE-12669
>
> Cheers,
> Gopal




TRYING TO CONNECT TO METASTORE...ASSISTANCE REQUESTED

2016-03-28 Thread JOHN MILLER
localhost:/usr/local/hive# bin/hive --service metastore &
[5] 27950
root@localhost:/usr/local/hive# Starting Hive Metastore Server
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/usr/local/hive/lib/hive-jdbc-2.0.0-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/usr/local/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/opt/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.
SLF4J: Actual binding is of type
[org.apache.logging.slf4j.Log4jLoggerFactory]
java.lang.IncompatibleClassChangeError: Implementing class
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:191)
at
org.apache.hadoop.hive.shims.ShimLoader.createShim(ShimLoader.java:139)
at
org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:134)
at
org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:93)
at
org.apache.hadoop.hive.metastore.ObjectStore.getDataSourceProps(ObjectStore.java:377)
at
org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:268)
at
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at
org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:55)
at
org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:64)
at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:516)
at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:481)
at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:543)
at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:370)
at
org.apache.hadoop.hive.metastore.RetryingHMSHandler.(RetryingHMSHandler.java:78)
at
org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:84)
at
org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5743)
at
org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5738)
at
org.apache.hadoop.hive.metastore.HiveMetaStore.startMetaStore(HiveMetaStore.java:5995)
at
org.apache.hadoop.hive.metastore.HiveMetaStore.main(HiveMetaStore.java:5922)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Exception in thread "main" java.lang.IncompatibleClassChangeError:
Implementing class
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at 


Re: What's the advised way to do groupby 2 attributes from a table with 1000 columns?

2016-03-28 Thread Gopal Vijayaraghavan
> I only need to query 3 columns,
...
> The source table is about 1PB.


The format of this table is extremely critical.

A columnar data format like ORC is recommended, so that reading 3 columns 
out of 1000 does not require reading the other 997.
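
For example, a minimal conversion sketch (table names invented):

-- CTAS rewrite of a non-columnar table into ORC; afterwards a query
-- touching 3 columns reads only those 3 column streams
CREATE TABLE big_table_orc STORED AS ORC
AS SELECT * FROM big_table_text;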

> Will it be advised to do a subquery first, and then send it to the
>aggregation of group by, so that we have smaller files sending to
>groupby? Not sure it Hive automatically takes care of this.

Hive does column projection after the first scan, so this should not be 
necessary. If you run an explain logical on the query, you will see:

hive> explain logical select l_shipmode, l_shipdate, sum(l_quantity) from
lineitem group by l_shipmode, l_shipdate;



LOGICAL PLAN:
lineitem 
  TableScan (TS_0)
alias: lineitem
Select Operator (SEL_1)
  expressions: l_shipdate (type: string), l_shipmode (type: string),
l_quantity (type: double)
  outputColumnNames: l_shipdate, l_shipmode, l_quantity
  Group By Operator (GBY_2)
aggregations: sum(l_quantity)
keys: l_shipdate (type: string), l_shipmode (type: string)
mode: hash
outputColumnNames: _col0, _col1, _col2


The SEL_1 operator shows the projection of the 3 columns out of all the 
columns in lineitem.

Cheers,
Gopal