Re: Lag function in Hive
Does your table have a column called rownum? I think, from Philip's mail, it was just an example.

On Wed, Apr 11, 2012 at 11:13 AM, karanveer.si...@barclays.com wrote:

When I try using rownum in my Hive QL query, I get: Invalid column reference rownum. Am I missing something here?

Regards, Karan

-----Original Message-----
From: David Kulp [mailto:dk...@fiksu.com]
Sent: 10 April 2012 20:15
To: user@hive.apache.org
Subject: Re: Lag function in Hive

New here. Hello all. Could you try a self-join, possibly also restricted to partitions? E.g.

SELECT t2.value - t1.value
FROM mytable t1, mytable t2
WHERE t1.rownum = t2.rownum + 1
  AND t1.partition = 'foo' AND t2.partition = 'bar'

If your data is clustered by rownum, then this join should, in theory, be relatively fast -- especially if it makes sense to exploit partitions. -d

On Apr 10, 2012, at 10:37 AM, karanveer.si...@barclays.com wrote:

Makes sense, but the distribution across nodes is not for a chunk of records in that order. If Hive cannot help me do this, is there another way I can do this? I tried generating an identifier using a Perl script invoked from Hive, but it does not seem to work reliably. While the standalone script works fine, when the record is created in Hive using the stdout from Perl, I see 2 records for some of the unique identifiers. I explored the possibility of default data type changes, but that does not solve the problem.

Regards, Karan

-----Original Message-----
From: Philip Tromans [mailto:philip.j.trom...@gmail.com]
Sent: 10 April 2012 19:48
To: user@hive.apache.org
Subject: Re: Lag function in Hive

Hi Karan, To the best of my knowledge, there isn't one. It's also unlikely to happen because it's hard to parallelise in a map-reduce way (it requires knowing where you are in a result set and who your neighbours are, and they in turn need to be present on the same node as you, which is difficult to guarantee). Cheers, Phil.
On 10 April 2012 14:44, karanveer.si...@barclays.com wrote:

Hi, Is there something like a 'lag' function in HIVE? The requirement is to calculate the difference in the same column between every 2 subsequent records. For example:

Row, Column A, Column B
1, 10, 100
2, 20, 200
3, 30, 300

The result that I need should be like:

Row, Column A, Column B, Result
1, 10, 100, NULL
2, 20, 200, 100 (200-100)
3, 30, 300, 100 (300-200)

Rgds, Karan

This e-mail and any attachments are confidential and intended solely for the addressee and may also be privileged or exempt from disclosure under applicable law. If you are not the addressee, or have received this e-mail in error, please notify the sender immediately, delete it from your system and do not copy, disclose or otherwise act upon any part of this e-mail or its attachments. Internet communications are not guaranteed to be secure or virus-free. The Barclays Group does not accept responsibility for any loss arising from unauthorised access to, or interference with, any Internet communications by any third party, or from the transmission of any viruses. Replies to this e-mail may be monitored by the Barclays Group for operational or business reasons. Any opinion or other information in this e-mail or its attachments that does not relate to the business of the Barclays Group is personal to the sender and is not given or endorsed by the Barclays Group. Barclays Bank PLC. Registered in England and Wales (registered no. 1026167). Registered Office: 1 Churchill Place, London, E14 5HP, United Kingdom. Barclays Bank PLC is authorised and regulated by the Financial Services Authority.

-- Nitin Pawar
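[Editor's note: David Kulp's self-join idea from the thread can be sanity-checked outside Hive. The following is an illustrative Python model, not Hive code; the data is made up to match Karan's example, and "rownum" is assumed to be a materialized column.]

def self_join_lag(rows):
    """Emulate: SELECT t2.value - t1.value FROM t t1, t t2
    WHERE t1.rownum = t2.rownum + 1 -- pairing each row with its predecessor."""
    by_rownum = {r["rownum"]: r["value"] for r in rows}
    return [(n, v - by_rownum[n - 1])
            for n, v in sorted(by_rownum.items())
            if n - 1 in by_rownum]

# Column B from Karan's example (rows 1..3 with values 100, 200, 300)
rows = [{"rownum": 1, "value": 100},
        {"rownum": 2, "value": 200},
        {"rownum": 3, "value": 300}]
print(self_join_lag(rows))  # [(2, 100), (3, 100)] -- row 1 has no predecessor

Row 1 produces no output here, matching the NULL in Karan's expected result; in the real self-join a LEFT OUTER JOIN would be needed to keep it.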
RE: Lag function in Hive
Rob n all - I tried the below and created the jar file. For adding the jar to the classpath, I do the following:

hive> add jar /users/unix/singhka/Analytics.jar;

The above seems to have worked fine, as I see the resource added, but when I go ahead and create a function, I get the following error. Any ideas what the issue can be?

hive> create temporary function lag as 'com.example.hive.udf.Lag';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask

Regards,

-----Original Message-----
From: Hamilton, Robert (Austin) [mailto:robert.hamil...@hp.com]
Sent: 10 April 2012 20:32
To: user@hive.apache.org
Subject: RE: Lag function in Hive

You can write a custom UDF - here is one that I have played around with, along with some test SQL. It comes with no warranty :) Sorry I can't really share the test data, but hopefully you get the idea. To run, compile the Lag class, jar it up into Analytics.jar, put the jar on the CLASSPATH (you may need to deploy it to all the nodes on the cluster) and run the hive command below. Note the distribute by and sort by are critical. Also, the sub-select is just an artifice to make sure the UDF is running in the reducer (so that its input is sorted). Maybe the hive experts can suggest a better way for that to work...
# use live clickstream test data from 2012-01-12

hive -e "
add jar Analytics.jar;
create temporary function lag as 'com.example.hive.udf.Lag';
select session_id, hit_datetime_gmt, lag(hit_datetime_gmt, session_id)
from (
  select session_id, hit_datetime_gmt
  from omni2
  where visit_day = '2012-01-12' and session_id is not null
  distribute by session_id
  sort by session_id, hit_datetime_gmt
) X
distribute by session_id
limit 1000
"

Contents of Lag.java:

package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

public final class Lag extends UDF {
    private String last_key;
    private String lastGroup;
    private String return_value = "";

    public String evaluate(String key, String groupKey) {
        if (groupKey == null) {
            this.last_key = null;
        } else if (!groupKey.equalsIgnoreCase(this.lastGroup)) {
            // group changed: no previous key within this group
            this.last_key = null;
        }
        return_value = this.last_key;
        this.last_key = key;
        this.lastGroup = groupKey;
        return return_value;
    }
}

Result of test run:

1326326437-26270601625187049522752846106448274394 2012-01-12 00:00:37 NULL
1326326437-26270601625187049522752846106448274394 2012-01-12 00:00:59 2012-01-12 00:00:37
1326326437-26270601625187049522752846106448274394 2012-01-12 00:01:05 2012-01-12 00:00:59
1326326437-26270601625187049522752846106448274394 2012-01-12 00:01:07 2012-01-12 00:01:05
1326326437-26270601625187049522752846106448274394 2012-01-12 00:01:11 2012-01-12 00:01:07
1326326437-26270601625187049522752846106448274394 2012-01-12 00:01:12 2012-01-12 00:01:11
1326326437-26270601625187049522752846106448274394 2012-01-12 00:01:24 2012-01-12 00:01:12
1326326437-26270601625187049522752846106448274394 2012-01-12 00:01:32 2012-01-12 00:01:24
1326326437-26270601625187049522752846106448274394 2012-01-12 00:01:45 2012-01-12 00:01:32
1326326437-26270601625187049522752846106448274394 2012-01-12 00:01:48 2012-01-12 00:01:45

-----Original Message-----
From: Philip Tromans [mailto:philip.j.trom...@gmail.com]
Sent: Tuesday, April 10, 2012 9:18 AM
To: user@hive.apache.org
Subject: Re: Lag function in Hive
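[Editor's note: the per-group state in Robert's UDF can be modeled outside Hive. A rough Python sketch follows -- illustrative only, with made-up session data; it assumes, as the distribute by / sort by in the query guarantee, that each group's rows arrive together and in order at a single worker.]

class Lag:
    """Model of the stateful UDF: return the previous key within a group,
    or None at the start of each new group."""
    def __init__(self):
        self.last_key = None
        self.last_group = None

    def evaluate(self, key, group_key):
        # Reset state when the group changes (case handling simplified vs. the Java).
        if group_key is None or group_key != self.last_group:
            self.last_key = None
        result = self.last_key
        self.last_key = key
        self.last_group = group_key
        return result

lag = Lag()
rows = [("s1", "00:00:37"), ("s1", "00:00:59"), ("s2", "00:01:05")]
print([lag.evaluate(t, s) for s, t in rows])  # [None, '00:00:37', None]

This also shows why the sub-select matters: the state only makes sense if all rows for a session pass through one instance in sorted order.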
RE: Lag function in Hive
That's the whole problem, right: I am unable to create a unique column for my record rows within Hive. If that were there, I could get the lag functionality to work for me. I was hoping that ROWNUM would act like a pseudo-column in Hive.

Regards,

From: Nitin Pawar [mailto:nitinpawar...@gmail.com]
Sent: 11 April 2012 12:15
To: user@hive.apache.org
Subject: Re: Lag function in Hive
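[Editor's note: one way to materialize a row identifier is a streaming transform script -- the approach Karan attempted in Perl. A hypothetical Python equivalent is sketched below; the tab-separated format and the single-column numbering are assumptions, not from the thread. It also illustrates why Karan saw duplicates: with Hive's TRANSFORM, each task runs its own copy of the script, so counters are only unique per task unless the data is forced through a single reducer.]

import sys

def number_rows(lines, start=1):
    """Prepend a 1-based counter to each input row.
    CAVEAT: when run under Hive TRANSFORM, every mapper/reducer task runs
    its own copy, so the counter restarts per task -- which would produce
    the duplicate identifiers described above."""
    for i, line in enumerate(lines, start):
        yield "%d\t%s" % (i, line.rstrip("\n"))

if __name__ == "__main__":
    for row in number_rows(sys.stdin):
        print(row)

Forcing a single reducer (e.g. via a dummy cluster/distribute key) would make the numbering global, at the cost of all parallelism.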
Does Hive support the EXISTS keyword in a select query?
Hello all,

I want to run a query like the one below in Hive:

Select a.*
FROM tblA a JOIN tblB b ON a.field1 = b.field1
where (a.field2 is null
   or not exists (select field2 from tblB where field2 is not null))

But I think Hive doesn't support the EXISTS keyword, so how can I overcome this issue? Please suggest a solution. I have a situation where I need to implement something like EXISTS/NOT EXISTS.

--
Thanks and Regards,
Bhavesh Shah
Re: Does Hive support the EXISTS keyword in a select query?
Hi,

Hive supports EXISTS-style queries via LEFT SEMI JOIN. Have a look at:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins

Cheers,
Phil.

On 11 April 2012 13:59, Bhavesh Shah bhavesh25s...@gmail.com wrote:
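[Editor's note: as a sanity check on the semantics Phil refers to, here is a small Python model of LEFT SEMI JOIN -- each row of the left table is kept when any matching key exists on the right, and right-side columns are never emitted. The tables and column names are made up for illustration.]

def left_semi_join(a_rows, b_rows, key):
    """Keep each row of A whose join key appears in B -- the EXISTS pattern.
    Rows of A are emitted at most once, regardless of duplicates in B."""
    b_keys = {row[key] for row in b_rows}
    return [row for row in a_rows if row[key] in b_keys]

tblA = [{"field1": 1, "v": "x"}, {"field1": 2, "v": "y"}]
tblB = [{"field1": 1}, {"field1": 1}]  # duplicate keys do not duplicate output
print(left_semi_join(tblA, tblB, "field1"))  # [{'field1': 1, 'v': 'x'}]

The no-duplication property is what distinguishes a semi join from an inner join against a table with repeated keys.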
Re: Lag function in Hive
Hi Karan,

The error you mentioned on creating the temporary function typically happens when there is a typo in the class name (com.example.hive.udf.Lag, in this case). Can you ensure that the jar was properly built and contains the Lag class in the com.example.hive.udf package?

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation
www: oanda.com www: fxtrade.com e: mgro...@oanda.com
Best Trading Platform - World Finance's Forex Awards 2009.
The One to Watch - Treasury Today's Adam Smith Awards 2009.

----- Original Message -----
From: karanveer singh karanveer.si...@barclays.com
To: user@hive.apache.org
Sent: Wednesday, April 11, 2012 4:15:59 AM
Subject: RE: Lag function in Hive
RE: Does Hive support the EXISTS keyword in a select query?
LEFT SEMI JOIN only covers EXISTS. Maybe you can try this:

Select a.*
FROM tblA a left outer JOIN tblB b ON a.field1 = b.field1
where a.field2 is null or b.field2 is null

Best regards,
Ransom.

-----Original Message-----
From: Philip Tromans [mailto:philip.j.trom...@gmail.com]
Sent: Wednesday, April 11, 2012 9:02 PM
To: user@hive.apache.org
Subject: Re: Does Hive supports EXISTS keyword in select query?
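[Editor's note: the NOT EXISTS half can be emulated with a left outer join plus a null test on the right side, as Ransom suggests. A small Python model with hypothetical data, to make the semantics concrete:]

def not_exists(a_rows, b_rows, key):
    """Rows of A with no match in B: the LEFT OUTER JOIN + 'b.key is null'
    anti-join pattern."""
    b_keys = {row[key] for row in b_rows}
    return [row for row in a_rows if row[key] not in b_keys]

tblA = [{"field1": 1}, {"field1": 2}]
tblB = [{"field1": 1}]
print(not_exists(tblA, tblB, "field1"))  # [{'field1': 2}]

One caveat with the real SQL version: if B's join column itself contains NULLs, the outer-join emulation and a true NOT EXISTS/NOT IN can disagree, so the null-handling deserves a test on real data.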
Thrift service + zklocks + time = death
Hey all,

Running 0.7.1. We are using a combination of hive-thrift + zookeeper for locking. All works well for a couple of days until eventually we hit this condition:

HiveServerException(message:Query returned non-zero code: 10, cause: FAILED: Error in acquiring locks: locks on the underlying objects cannot be acquired. retry after some time, errorCode:10, SQLState:42000)
 at org.apache.hadoop.hive.service.ThriftHive$execute_result.read(ThriftHive.java:1494)
 at org.apache.hadoop.hive.service.ThriftHive$Client.recv_execute(ThriftHive.java:120)
 at org.apache.hadoop.hive.service.ThriftHive$Client.execute(ThriftHive.java:95)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.codehaus.groovy.runtime.callsite.PojoMetaMethodSite$PojoCachedMethodSiteNoUnwrapNoCoerce.invoke(PojoMetaMethodSite.java:229)
 at org.codehaus.groovy.runtime.callsite.PojoMetaMethodSite.call(PojoMetaMethodSite.java:52)
 at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:40)
 at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:116)
 at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:124)
 XXX XXX XXX
 at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallCurrent(CallSiteArray.java:44)
 at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:141)
 at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:145)
 XXX XXX
 at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallCurrent(CallSiteArray.java:44)
 at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:141)
 at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:149)
 at core.ScriptTemplate.run(ScriptTemplate.groovy:28)
 at core.ScriptTemplate$run.call(Unknown Source)
 at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:40)
 at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:116)
 at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:124)
 at core.ScriptTemplateRunner.main(ScriptTemplateRunner.groovy:19)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:88)
 at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:233)
 at groovy.lang.MetaClassImpl.invokeStaticMethod(MetaClassImpl.java:1302)
 at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:767)
 at groovy.lang.GroovyShell.runScriptOrMainOrTestOrRunnable(GroovyShell.java:273)
 at groovy.lang.GroovyShell.run(GroovyShell.java:229)
 at groovy.lang.GroovyShell.run(GroovyShell.java:159)
 at groovy.ui.GroovyMain.processOnce(GroovyMain.java:496)
 at groovy.ui.GroovyMain.run(GroovyMain.java:311)
 at groovy.ui.GroovyMain.process(GroovyMain.java:297)
 at groovy.ui.GroovyMain.processArgs(GroovyMain.java:112)
 at groovy.ui.GroovyMain.main(GroovyMain.java:93)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.codehaus.groovy.tools.GroovyStarter.rootLoader(GroovyStarter.java:108)
 at org.codehaus.groovy.tools.GroovyStarter.main(GroovyStarter.java:130)

It seems that eventually the hive-thrift connection to ZK goes stale, and then all queries fail with this message. Delta to event might be 3 days of 100 sessions.

Any ideas?

Thanks,
Edward
Re: Thrift service + zklocks + time = death
BTW the problem is not tables being locked. On the server side, acquireLocks is getting a null pointer:

32432495-OK
32432498:Error in acquireLocks: java.lang.NullPointerException
32432552-FAILED: Error in acquiring locks: locks on the underlying objects cannot be acquired. retry after some time
32432660-Hive history file=/tmp/hadoop/hive_job_log_hadoop_201204111032_691199436.txt
32432737-Hive history file=/tmp/hadoop/hive_job_log_hadoop_201204111032_1845073466.txt

On Wed, Apr 11, 2012 at 10:29 AM, Edward Capriolo edlinuxg...@gmail.com wrote:
Re: Thrift service + zklocks + time = death
Premise of the locking is that in well-behaved queries you will obtain a lock at the start of the query and then relinquish it at the end of query. Since, locks are lease-based, in badly-behaving queries after lease expires, lock will be relinquished. Are you are NPE for same repeated query after couple of days it executing fine or you have many different queries of which some acquire locks while other dont? Do you see the failure on same kind of query every time or its a different query every time. It will be useful to find out which queries are resulting in this to debug this further. Ashutosh On Wed, Apr 11, 2012 at 07:43, Edward Capriolo edlinuxg...@gmail.comwrote: BTW the problem is not tables being locked. On the server side acquireLocks is getting a null pointer. 32432495-OK 32432498:Error in acquireLocks: java.lang.NullPointerException 32432552-FAILED: Error in acquiring locks: locks on the underlying objects cannot be acquired. retry after some time 32432660-Hive history file=/tmp/hadoop/hive_job_log_hadoop_201204111032_691199436.txt 32432737-Hive history file=/tmp/hadoop/hive_job_log_hadoop_201204111032_1845073466.txt On Wed, Apr 11, 2012 at 10:29 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Hey all, Running 0.7.1 We are using a combination of hive-thrift+ zookeeper for locking. All works well for a couple days until eventually we hit this condition. HiveServerException(message:Query returned non-zero code: 10, cause: FAILED: Error in acquiring locks: locks on the underlying objects cannot be acquired. 
retry after some time, errorCode:10, SQLState:42000)
    at org.apache.hadoop.hive.service.ThriftHive$execute_result.read(ThriftHive.java:1494)
    at org.apache.hadoop.hive.service.ThriftHive$Client.recv_execute(ThriftHive.java:120)
    at org.apache.hadoop.hive.service.ThriftHive$Client.execute(ThriftHive.java:95)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.codehaus.groovy.runtime.callsite.PojoMetaMethodSite$PojoCachedMethodSiteNoUnwrapNoCoerce.invoke(PojoMetaMethodSite.java:229)
    at org.codehaus.groovy.runtime.callsite.PojoMetaMethodSite.call(PojoMetaMethodSite.java:52)
    at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:40)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:116)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:124)
    XXX XXX XXX
    at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallCurrent(CallSiteArray.java:44)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:141)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:145)
    XXX XXX
    at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallCurrent(CallSiteArray.java:44)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:141)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:149)
    at core.ScriptTemplate.run(ScriptTemplate.groovy:28)
    at core.ScriptTemplate$run.call(Unknown Source)
    at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:40)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:116)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:124)
    at core.ScriptTemplateRunner.main(ScriptTemplateRunner.groovy:19)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:88)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:233)
    at groovy.lang.MetaClassImpl.invokeStaticMethod(MetaClassImpl.java:1302)
    at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:767)
    at groovy.lang.GroovyShell.runScriptOrMainOrTestOrRunnable(GroovyShell.java:273)
    at groovy.lang.GroovyShell.run(GroovyShell.java:229)
    at groovy.lang.GroovyShell.run(GroovyShell.java:159)
    at groovy.ui.GroovyMain.processOnce(GroovyMain.java:496)
    at groovy.ui.GroovyMain.run(GroovyMain.java:311)
    at
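For anyone following along: the hive-thrift plus ZooKeeper locking setup Edward describes is normally switched on in hive-site.xml with properties along these lines. This is a sketch using the standard Hive property names; the quorum host and port are placeholders for your environment.

```xml
<!-- Enable Hive concurrency support backed by ZooKeeper locks -->
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <name>hive.lock.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager</value>
</property>
<property>
  <!-- placeholder: your ZooKeeper ensemble -->
  <name>hive.zookeeper.quorum</name>
  <value>zk1.example.com:2181</value>
</property>
```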
Re: Lag function in Hive
Hey Harish, Awesome work on SQL Windowing. Judging from participation on this thread, it seems windowing is of sizable interest to the Hive community. Would you consider contributing your work upstream in Hive? If it's in Hive contrib, it will be accessible to a lot of folks using Hive out of the box. Thanks, Ashutosh On Tue, Apr 10, 2012 at 08:10, Butani, Harish harish.but...@sap.com wrote: Hi Karan, SQL Windowing with Hive (https://github.com/hbutani/SQLWindowing/wiki) may be a good fit for your use case. We have a lag function, and you can say something like: From table Partition by col1, col2... Order by col1, col2,... Select colX, colX - lag(colX, 1) (there is a lag example on the wiki, and other time-series examples based on the NPath table function). You can control the partitioning with the partition and order clauses. Partitions can be arbitrarily large (so you could partition by a dummy column and have all rows in one partition), but it works best when there are natural partitions in your data and you are OK with not calculating across partitions. Regards, Harish. -Original Message- From: karanveer.si...@barclays.com [mailto:karanveer.si...@barclays.com] Sent: Tuesday, April 10, 2012 7:52 AM To: user@hive.apache.org Subject: Re: Lag function in Hive Thanks - I will check this out. Meanwhile, would default clustering happen using rownum? How can I check how clustering is happening in our environment? Rgds - Original Message - From: David Kulp dk...@fiksu.com To: user@hive.apache.org Sent: Tue Apr 10 15:45:25 2012 Subject: Re: Lag function in Hive New here. Hello all. Could you try a self-join, possibly also restricted to partitions? E.g. SELECT t2.value - t1.value FROM mytable t1, mytable t2 WHERE t1.rownum = t2.rownum+1 AND t1.partition=foo AND t2.partition=bar If your data is clustered by rownum, then this join should, in theory, be relatively fast -- especially if it makes sense to exploit partitions.
-d On Apr 10, 2012, at 10:37 AM, karanveer.si...@barclays.com karanveer.si...@barclays.com wrote: Makes sense but is not the distribution across nodes for a chunk of records in that order. If Hive cannot help me do this, is there another way I can do this? I tried generating an identifier using the perl script invoked using Hive but it does not seem to work fine. While the stand alone script works fine, when the record is created in hive using std output from perl - I see 2 records for some of the unique identifiers. I explored the possibility of default data type changes but that does not solve the problem. Regards, Karan -Original Message- From: Philip Tromans [mailto:philip.j.trom...@gmail.com] Sent: 10 April 2012 19:48 To: user@hive.apache.org Subject: Re: Lag function in Hive Hi Karan, To the best of my knowledge, there isn't one. It's also unlikely to happen because it's hard to parallelise in a map-reduce way (it requires knowing where you are in a result set, and who your neighbours are and they in turn need to be present on the same node as you which is difficult to guarantee). Cheers, Phil. On 10 April 2012 14:44, karanveer.si...@barclays.com wrote: Hi, Is there something like a 'lag' function in HIVE? The requirement is to calculate difference for the same column for every 2 subsequent records. For example. Row, Column A, Column B 1, 10, 100 2, 20, 200 3, 30, 300 The result that I need should be like: Row, Column A, Column B, Result 1, 10, 100, NULL 2, 20, 200, 100 (200-100) 3, 30, 300, 100 (300-200) Rgds, Karan This e-mail and any attachments are confidential and intended solely for the addressee and may also be privileged or exempt from disclosure under applicable law. If you are not the addressee, or have received this e-mail in error, please notify the sender immediately, delete it from your system and do not copy, disclose or otherwise act upon any part of this e-mail or its attachments. 
Internet communications are not guaranteed to be secure or virus-free. The Barclays Group does not accept responsibility for any loss arising from unauthorised access to, or interference with, any Internet communications by any third party, or from the transmission of any viruses. Replies to this e-mail may be monitored by the Barclays Group for operational or business reasons. Any opinion or other information in this e-mail or its attachments that does not relate to the business of the Barclays Group is personal to the sender and is not given or endorsed by the Barclays Group. Barclays Bank PLC. Registered in England and Wales (registered no. 1026167). Registered Office: 1 Churchill Place, London, E14 5HP, United Kingdom. Barclays Bank PLC is authorised and regulated by the Financial Services Authority.
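Putting David's self-join suggestion together with Karan's sample data, a concrete version might look like the following. This is a sketch, assuming a hypothetical table mytable(rownum INT, a INT, b INT); the previous row's value is precomputed in a subquery so the join key is a plain column equality, and the LEFT OUTER JOIN makes the first row's result NULL, matching the desired output.

```sql
-- Difference of Column B between each row and the previous row.
-- rownum + 1 is shifted inside the subquery so the ON clause stays a
-- simple column equality.
SELECT t1.rownum,
       t1.a,
       t1.b,
       t1.b - t2.prev_b AS result   -- NULL for the first row (no match)
FROM mytable t1
LEFT OUTER JOIN
  (SELECT rownum + 1 AS next_rownum, b AS prev_b FROM mytable) t2
ON (t1.rownum = t2.next_rownum);
```

On the sample data this should yield NULL, 100, 100 for rows 1 through 3.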
Case Studies for 'Programming Hive' book from O'Reilly
Dear Hive User, We want your interesting case study for our upcoming book, 'Programming Hive', from O'Reilly. Tell us how you use Hive; both high-level descriptions and low-level code details are encouraged! Feel free to reach out with a brief abstract. Regards, Jason Rutherglen
Re: Graph/Plotting in Hive/Hadoop
Hey, Hive is a query language for NoSQL; by default I don't think it can be used for plotting graphs. Rather, you can get the data out through Hive, format it into the required shape, and then plot it using some charting API. ∞ Shashwat Shriparv
Re: RE: Re: cannot start the thrift server
Hey, check this, and download the JDBC jars for your version and follow the steps; it will help for sure. It is written in the context of Amazon, but it works for Hive running on a local PC: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_Hive.html On Wed, Apr 11, 2012 at 10:33 AM, ylyy-1985 ylyy-1...@163.com wrote: thanks to all ~ I will try! 2012-04-11 -- Best Regards Andes Email: ylyy-1...@163.com -- From: binhnt22 Sent: 2012-04-11 12:03 Subject: RE: Re: cannot start the thrift server To: user@hive.apache.org Cc: By default, Hive will use Derby as an embedded database for the metastore. Your command fails because of an incorrect metastore configuration. I highly recommend you use MySQL as the metastore. Detailed information can be found at: http://www.mazsoft.com/blog/post/2010/02/01/Setting-up-HadoopHive-to-use-MySQL-as-metastore.aspx It's pretty simple; I'll summarize as follows:
- Install MySQL server on your CentOS VM
- Create a user for Hive and grant all privileges
- Modify (or create) the hive-site.xml file in $HIVE_PATH/conf/ with the MySQL connection settings
- Download, then copy mysql-connector-java-*.jar to $HIVE_PATH/lib
When everything is done, "show tables;" will work. Best regards, Nguyen Thanh Binh (Mr) Cell phone: (+84)98.226.0622 From: ylyy-1985 [mailto:ylyy-1...@163.com] Sent: Wednesday, April 11, 2012 10:37 AM To: user Subject: Re: Re: cannot start the thrift server thank you very much. Now that it's running, I use it to test the sample code first. According to my operation (with the server running), I type ./hive to enter the hive console, but when I use "show tables;" the console prints out some info: FAILED: Error in metadata: javax.jdo.JDOFatalDataStoreException: Failed to start database 'metastore_db', see the next exception for details. NestedThrowables: java.sql.SQLException: Failed to start database 'metastore_db', see the next exception for details.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask Someone says the metastore is locked, so... it doesn't matter even if I can't access the CLI, but things would be much better if I could check the tables with the CLI while the server is running. Or can I check the database using some other method? 2012-04-11 -- Best Regards Andes Email: ylyy-1...@163.com -- From: binhnt22 Sent: 2012-04-11 11:22 Subject: Re: cannot start the thrift server To: user@hive.apache.org Cc: IMHO, you should use bridged networking, so that any other computer in the LAN can connect to 192.168.164.128 without problems. When the thrift server is running, it's not that the hive command is disabled. The console will only produce output when you have a job running through the thrift server. If you want to use the CLI, just run $HIVE_PATH/bin/hive. Best regards, Nguyen Thanh Binh (Mr) Cell phone: (+84)98.226.0622 From: ylyy-1985 [mailto:ylyy-1...@163.com] Sent: Wednesday, April 11, 2012 9:49 AM To: user Subject: Re: RE: RE: Re: cannot start the thrift server CentOS 6 in a VM with NAT networking (the default). Here is what I do. I see there are some different ways to form the connection string, like: Connection con = DriverManager.getConnection("jdbc:hive://IP:1/default", "", ""); Connection con = DriverManager.getConnection("jdbc:hive://"); so I use the second one. I don't know what the difference is, but it does connect me to Hive. By the way, when the thrift server is running, it seems the hive command line is disabled? Is it because the metadata is locked? 2012-04-11 -- Andes -- From: binhnt22 Sent: 2012-04-11 10:39 Subject: RE: RE: Re: cannot start the thrift server To: user@hive.apache.org Cc: Hi ylyy, What OS is in your VM? Which network configuration is used in your VM: NAT, bridged, host-only, or LAN segment? Answering this may give a clue to solving "No route to host".
BTW: the default username and password are empty: Connection con = DriverManager.getConnection("jdbc:hive://hiveserverip:1/default", "", ""); Best regards, Nguyen Thanh Binh (Mr) Cell phone: (+84)98.226.0622 From: ylyy-1985 [mailto:ylyy-1...@163.com] Sent: Wednesday, April 11, 2012 9:29 AM To: user Subject: Re: RE: Re: cannot start the thrift server yeah, great, thanks binhnt22. What a fool I am. Hah!! But then I get this: Unable to create log directory /tmp/Administrator 2012-04-11 10:23:15,359 ERROR [main]
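The hive-site.xml changes binhnt22 summarizes earlier in the thread usually boil down to the four standard javax.jdo connection properties. A sketch, with the MySQL host, database name, user, and password as placeholders:

```xml
<!-- Point the Hive metastore at an external MySQL database
     instead of the embedded Derby database. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>
```

Because Derby's embedded mode permits only one process to open metastore_db at a time, moving the metastore to MySQL also lets the CLI and the thrift server run concurrently.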
Re: Graph/Plotting in Hive/Hadoop
The histogram UDF can be used to create data for gnuplot. Here is something I do: Hive's default output is tab-delimited, and there are JavaScript graphing libraries like amCharts. I produce tab files and let simple HTML pages be the front end for amCharts. This allows me to create raw data with Hive and quickly display it in a web browser without the need for server-side coding. Edward On Wed, Apr 11, 2012 at 3:46 PM, shashwat shriparv dwivedishash...@gmail.com wrote: Hey, Hive is a query language for NoSQL; by default I don't think it can be used for plotting graphs. Rather, you can get the data out through Hive, format it into the required shape, and then plot it using some charting API. ∞ Shashwat Shriparv
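As a sketch of what Edward describes, using Hive's built-in histogram_numeric UDAF (the table name sales and column name price below are hypothetical):

```sql
-- Build a 20-bucket histogram of a numeric column, then flatten the
-- array<struct<x:double,y:double>> result into one (bin center, height)
-- row per bucket. Redirecting the CLI output to a file gives
-- tab-delimited data ready for gnuplot or a JavaScript charting library.
SELECT h.x, h.y
FROM (
  SELECT histogram_numeric(price, 20) AS hist
  FROM sales
) t
LATERAL VIEW explode(t.hist) exploded AS h;
```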
RE: Lag function in Hive
Hi Ashutosh, Thanks for taking a look. Yes, definitely open to contributing back to Hive. I had a conversation with Carl Steinbach last week about this. Will send you a follow-up message. Regards, Harish From: Ashutosh Chauhan [mailto:hashut...@apache.org] Sent: Wednesday, April 11, 2012 7:55 AM To: user@hive.apache.org; Butani, Harish Subject: Re: Lag function in Hive Hey Harish, Awesome work on SQL Windowing. Judging from participation on this thread, it seems windowing is of sizable interest to the Hive community. Would you consider contributing your work upstream in Hive? If it's in Hive contrib, it will be accessible to a lot of folks using Hive out of the box. Thanks, Ashutosh On Tue, Apr 10, 2012 at 08:10, Butani, Harish harish.but...@sap.com wrote: Hi Karan, SQL Windowing with Hive (https://github.com/hbutani/SQLWindowing/wiki) may be a good fit for your use case. We have a lag function, and you can say something like: From table Partition by col1, col2... Order by col1, col2,... Select colX, colX - lag(colX, 1) (there is a lag example on the wiki, and other time-series examples based on the NPath table function). You can control the partitioning with the partition and order clauses. Partitions can be arbitrarily large (so you could partition by a dummy column and have all rows in one partition), but it works best when there are natural partitions in your data and you are OK with not calculating across partitions. Regards, Harish. -Original Message- From: karanveer.si...@barclays.com [mailto:karanveer.si...@barclays.com] Sent: Tuesday, April 10, 2012 7:52 AM To: user@hive.apache.org Subject: Re: Lag function in Hive Thanks - I will check this out. Meanwhile, would default clustering happen using rownum? How can I check how clustering is happening in our environment?
Rgds - Original Message - From: David Kulp dk...@fiksu.com To: user@hive.apache.org Sent: Tue Apr 10 15:45:25 2012 Subject: Re: Lag function in Hive New here. Hello all. Could you try a self-join, possibly also restricted to partitions? E.g. SELECT t2.value - t1.value FROM mytable t1, mytable t2 WHERE t1.rownum = t2.rownum+1 AND t1.partition=foo AND t2.partition=bar If your data is clustered by rownum, then this join should, in theory, be relatively fast -- especially if it makes sense to exploit partitions. -d On Apr 10, 2012, at 10:37 AM, karanveer.si...@barclays.com wrote: Makes sense but is not the distribution across nodes for a chunk of records in that order. If Hive cannot help me do this, is there another way I can do this? I tried generating an identifier using the perl script invoked using Hive but it does not seem to work fine. While the stand alone script works fine, when the record is created in hive using std output from perl - I see 2 records for some of the unique identifiers. I explored the possibility of default data type changes but that does not solve the problem. Regards, Karan -Original Message- From: Philip Tromans [mailto:philip.j.trom...@gmail.com] Sent: 10 April 2012 19:48 To: user@hive.apache.org Subject: Re: Lag function in Hive Hi Karan, To the best of my knowledge, there isn't one. It's also unlikely to happen because it's hard to parallelise in a map-reduce way (it requires knowing where you are in a result set, and who your neighbours are and they in turn need to be present on the same node as you which is difficult to guarantee). Cheers, Phil.
On 10 April 2012 14:44, karanveer.si...@barclays.com wrote: Hi, Is there something like a 'lag' function in HIVE? The requirement is to calculate difference for the same column for every 2 subsequent records. For example. Row, Column A, Column B 1, 10, 100 2, 20, 200 3, 30, 300 The result that I need should be like: Row, Column A, Column B, Result 1, 10, 100, NULL 2, 20, 200, 100 (200-100) 3, 30, 300, 100 (300-200) Rgds, Karan