Re: Lag function in Hive

2012-04-11 Thread Nitin Pawar
does your table have column called rownum?

I think From Philip's mail, it was just an example

On Wed, Apr 11, 2012 at 11:13 AM, karanveer.si...@barclays.com wrote:


 When I try using rownum in my Hive QL query, I get: Invalid column
 reference rownum. Am I missing something here?

 Regards,
 Karan


 -Original Message-
 From: David Kulp [mailto:dk...@fiksu.com]
 Sent: 10 April 2012 20:15
 To: user@hive.apache.org
 Subject: Re: Lag function in Hive

 New here.  Hello all.

 Could you try a self-join, possibly also restricted to partitions?

 E.g. SELECT t2.value - t1.value FROM mytable t1, mytable t2 WHERE
 t1.rownum = t2.rownum+1 AND t1.partition=foo AND t2.partition=bar

 If your data is clustered by rownum, then this join should, in theory, be
 relatively fast -- especially if it makes sense to exploit partitions.

 -d

 On Apr 10, 2012, at 10:37 AM, karanveer.si...@barclays.com 
 karanveer.si...@barclays.com wrote:

  Makes sense but is not the distribution across nodes for a chunk of
 records in that order.
 
  If Hive cannot help me do this, is there another way I can do this? I
 tried generating an identifier using the perl script invoked using Hive but
 it does not seem to work fine. While the stand alone script works fine,
 when the record is created in hive using std output from perl - I see 2
 records for some of the unique identifiers. I explored the possibility of
 default data type changes but that does not solve the problem.
 
  Regards,
  Karan
 
 
  -Original Message-
  From: Philip Tromans [mailto:philip.j.trom...@gmail.com]
  Sent: 10 April 2012 19:48
  To: user@hive.apache.org
  Subject: Re: Lag function in Hive
 
  Hi Karan,
 
  To the best of my knowledge, there isn't one. It's also unlikely to
  happen because it's hard to parallelise in a map-reduce way (it
  requires knowing where you are in a result set, and who your
  neighbours are and they in turn need to be present on the same node as
  you which is difficult to guarantee).
 
  Cheers,
 
  Phil.
 
  On 10 April 2012 14:44,  karanveer.si...@barclays.com wrote:
  Hi,
 
  Is there something like a 'lag' function in HIVE? The requirement is to
  calculate difference for the same column for every 2 subsequent records.
 
  For example.
 
  Row, Column A, Column B
  1, 10, 100
  2, 20, 200
  3, 30, 300
 
 
  The result that I need should be like:
 
  Row, Column A, Column B, Result
  1, 10, 100, NULL
  2, 20, 200, 100 (200-100)
  3, 30, 300, 100 (300-200)
 
  Rgds,
  Karan
 
 
 
 
 
  This e-mail and any attachments are confidential and intended solely
 for the
  addressee and may also be privileged or exempt from disclosure under
  applicable law. If you are not the addressee, or have received this
 e-mail
  in error, please notify the sender immediately, delete it from your
 system
  and do not copy, disclose or otherwise act upon any part of this e-mail
 or
  its attachments.
 
  Internet communications are not guaranteed to be secure or virus-free.
  The Barclays Group does not accept responsibility for any loss arising
 from
  unauthorised access to, or interference with, any Internet
 communications by
  any third party, or from the transmission of any viruses. Replies to
 this
  e-mail may be monitored by the Barclays Group for operational or
 business
  reasons.
 
  Any opinion or other information in this e-mail or its attachments that
 does
  not relate to the business of the Barclays Group is personal to the
 sender
  and is not given or endorsed by the Barclays Group.
 
  Barclays Bank PLC.Registered in England and Wales (registered no.
 1026167).
  Registered Office: 1 Churchill Place, London, E14 5HP, United Kingdom.
 
  Barclays Bank PLC is authorised and regulated by the Financial Services
  Authority.




-- 
Nitin Pawar


RE: Lag function in Hive

2012-04-11 Thread karanveer.singh
Rob n all - 

I tried below and created the jar file. For adding jar to class path, I do 
following:

hive add jar /users/unix/singhka/Analytics.jar;

The above seems to have worked fine as I see the resource added but when I go 
ahead and create a function, I get the following error. Any ideas what the 
issue can be?

hive create temporary function lag as 'com.example.hive.udf.Lag';
FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.FunctionTask


Regards,


-Original Message-
From: Hamilton, Robert (Austin) [mailto:robert.hamil...@hp.com] 
Sent: 10 April 2012 20:32
To: user@hive.apache.org
Subject: RE: Lag function in Hive

You can write a custom UDF - 

Here is one that I have played around with, along with some test SQL. It comes 
with no warrantee :) 

Sorry I can't really share the test data, but hopefully you get the idea.  To 
run, compile the Lag class, jar it up into Analytics.jar, put the jar on the 
CLASSPATH (you may need to deploy to all the nodes on the cluster) and run the 
hive command below.

Note the distribute by and sort by  are critical.  Also the sub-select is 
just an artifice to make sure the UDF is running in the reducer (so that it is 
sorted).  Maybe the hive experts can suggest a better way for that to work...

#
# use live clickstream test data from 2012-01-12
#
hive -e add jar Analytics.jar;

create temporary function lag as 'com.example.hive.udf.Lag';
select session_id,hit_datetime_gmt,lag(hit_datetime_gmt,session_id) 
from (select session_id,hit_datetime_gmt from omni2 where 
visit_day='2012-01-12' and session_id is not null 
distribute by session_id 
sort by session_id,hit_datetime_gmt ) X 
distribute by session_id limit 1000


 Contents of Lag.java 
-
package com.example.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;

public final class Lag extends UDF{
private int  counter;
private String last_key;
private String lastGroup;
private String return_value=;

public String evaluate(String key, String groupKey){
if(groupKey==null){
this.last_key=null;
}else
  if ( !groupKey.equalsIgnoreCase(this.lastGroup )) {
this.last_key=null;
}
 return_value=this.last_key;
 this.last_key = key;
 this.lastGroup=groupKey;
 return return_value;
}
}

Result of test run:

1326326437-26270601625187049522752846106448274394   2012-01-12 00:00:37 
NULL
1326326437-26270601625187049522752846106448274394   2012-01-12 00:00:59 
2012-01-12 00:00:37
1326326437-26270601625187049522752846106448274394   2012-01-12 00:01:05 
2012-01-12 00:00:59
1326326437-26270601625187049522752846106448274394   2012-01-12 00:01:07 
2012-01-12 00:01:05
1326326437-26270601625187049522752846106448274394   2012-01-12 00:01:11 
2012-01-12 00:01:07
1326326437-26270601625187049522752846106448274394   2012-01-12 00:01:12 
2012-01-12 00:01:11
1326326437-26270601625187049522752846106448274394   2012-01-12 00:01:24 
2012-01-12 00:01:12
1326326437-26270601625187049522752846106448274394   2012-01-12 00:01:32 
2012-01-12 00:01:24
1326326437-26270601625187049522752846106448274394   2012-01-12 00:01:45 
2012-01-12 00:01:32
1326326437-26270601625187049522752846106448274394   2012-01-12 00:01:48 
2012-01-12 00:01:45

-Original Message-
From: Philip Tromans [mailto:philip.j.trom...@gmail.com] 
Sent: Tuesday, April 10, 2012 9:18 AM
To: user@hive.apache.org
Subject: Re: Lag function in Hive

Hi Karan,

To the best of my knowledge, there isn't one. It's also unlikely to happen 
because it's hard to parallelise in a map-reduce way (it requires knowing where 
you are in a result set, and who your neighbours are and they in turn need to 
be present on the same node as you which is difficult to guarantee).

Cheers,

Phil.

On 10 April 2012 14:44,  karanveer.si...@barclays.com wrote:
 Hi,

 Is there something like a 'lag' function in HIVE? The requirement is 
 to calculate difference for the same column for every 2 subsequent records.

 For example.

 Row, Column A, Column B
 1, 10, 100
 2, 20, 200
 3, 30, 300


 The result that I need should be like:

 Row, Column A, Column B, Result
 1, 10, 100, NULL
 2, 20, 200, 100 (200-100)
 3, 30, 300, 100 (300-200)

 Rgds,
 Karan





 This e-mail and any attachments are confidential and intended solely 
 for the addressee and may also be privileged or exempt from disclosure 
 under applicable law. If you are not the addressee, or have received 
 this e-mail in error, please notify the sender immediately, delete it 
 from your system and do not copy, disclose or otherwise act upon any 
 part of this e-mail or its attachments.

 Internet communications are not guaranteed to be secure or virus-free.
 The Barclays Group does not accept responsibility for any loss arising 
 from unauthorised access 

RE: Lag function in Hive

2012-04-11 Thread karanveer.singh
That's the whole problem rite, I am unable to create a unique column for my 
record rows within Hive. If that's there, I can get the lag functionality to 
work for me.
I was hoping that ROWNUM will act like a pseudo column in Hive.


Regards,


From: Nitin Pawar [mailto:nitinpawar...@gmail.com]
Sent: 11 April 2012 12:15
To: user@hive.apache.org
Subject: Re: Lag function in Hive

does your table have column called rownum?

I think From Philip's mail, it was just an example
On Wed, Apr 11, 2012 at 11:13 AM, 
karanveer.si...@barclays.commailto:karanveer.si...@barclays.com wrote:

When I try using rownum in my Hive QL query, I get: Invalid column reference 
rownum. Am I missing something here?

Regards,
Karan


-Original Message-
From: David Kulp [mailto:dk...@fiksu.commailto:dk...@fiksu.com]
Sent: 10 April 2012 20:15
To: user@hive.apache.orgmailto:user@hive.apache.org
Subject: Re: Lag function in Hive

New here.  Hello all.

Could you try a self-join, possibly also restricted to partitions?

E.g. SELECT t2.value - t1.value FROM mytable t1, mytable t2 WHERE t1.rownum = 
t2.rownum+1 AND t1.partition=foo AND t2.partition=bar

If your data is clustered by rownum, then this join should, in theory, be 
relatively fast -- especially if it makes sense to exploit partitions.

-d

On Apr 10, 2012, at 10:37 AM, 
karanveer.si...@barclays.commailto:karanveer.si...@barclays.com 
karanveer.si...@barclays.commailto:karanveer.si...@barclays.com wrote:

 Makes sense but is not the distribution across nodes for a chunk of records 
 in that order.

 If Hive cannot help me do this, is there another way I can do this? I tried 
 generating an identifier using the perl script invoked using Hive but it does 
 not seem to work fine. While the stand alone script works fine, when the 
 record is created in hive using std output from perl - I see 2 records for 
 some of the unique identifiers. I explored the possibility of default data 
 type changes but that does not solve the problem.

 Regards,
 Karan


 -Original Message-
 From: Philip Tromans 
 [mailto:philip.j.trom...@gmail.commailto:philip.j.trom...@gmail.com]
 Sent: 10 April 2012 19:48
 To: user@hive.apache.orgmailto:user@hive.apache.org
 Subject: Re: Lag function in Hive

 Hi Karan,

 To the best of my knowledge, there isn't one. It's also unlikely to
 happen because it's hard to parallelise in a map-reduce way (it
 requires knowing where you are in a result set, and who your
 neighbours are and they in turn need to be present on the same node as
 you which is difficult to guarantee).

 Cheers,

 Phil.

 On 10 April 2012 14:44,  
 karanveer.si...@barclays.commailto:karanveer.si...@barclays.com wrote:
 Hi,

 Is there something like a 'lag' function in HIVE? The requirement is to
 calculate difference for the same column for every 2 subsequent records.

 For example.

 Row, Column A, Column B
 1, 10, 100
 2, 20, 200
 3, 30, 300


 The result that I need should be like:

 Row, Column A, Column B, Result
 1, 10, 100, NULL
 2, 20, 200, 100 (200-100)
 3, 30, 300, 100 (300-200)

 Rgds,
 Karan





 This e-mail and any attachments are confidential and intended solely for the
 addressee and may also be privileged or exempt from disclosure under
 applicable law. If you are not the addressee, or have received this e-mail
 in error, please notify the sender immediately, delete it from your system
 and do not copy, disclose or otherwise act upon any part of this e-mail or
 its attachments.

 Internet communications are not guaranteed to be secure or virus-free.
 The Barclays Group does not accept responsibility for any loss arising from
 unauthorised access to, or interference with, any Internet communications by
 any third party, or from the transmission of any viruses. Replies to this
 e-mail may be monitored by the Barclays Group for operational or business
 reasons.

 Any opinion or other information in this e-mail or its attachments that does
 not relate to the business of the Barclays Group is personal to the sender
 and is not given or endorsed by the Barclays Group.

 Barclays Bank PLC.Registered in England and Wales (registered no. 1026167).
 Registered Office: 1 Churchill Place, London, E14 5HP, United Kingdom.

 Barclays Bank PLC is authorised and regulated by the Financial Services
 Authority.



--
Nitin Pawar


This e-mail and any attachments are confidential and intended
solely for the addressee and may also be privileged or exempt from
disclosure under applicable law. If you are not the addressee, or
have received this e-mail in error, please notify the sender
immediately, delete it from your system and do not copy, disclose
or otherwise act upon any part of this e-mail or its attachments.

Internet communications are not guaranteed to be secure or
virus-free.
The Barclays Group does not accept responsibility for any loss
arising from unauthorised access to, or interference with, any
Internet communications by any third party, 

Does Hive supports EXISTS keyword in select query?

2012-04-11 Thread Bhavesh Shah
Hello all,
I want to query like below in Hive:
Select a.* FROM tblA a JOIN tblB b ON a.field1 = b.field1
where (a.field2 is null or not exists(select field2 from tblB where filed2
is not null)

But I think Hive doesn't supports EXISTS keyword so how can I overcome this
issue?
Pls suggest me some solution to this. I just got this kind of situation
where I need to implement some thing like EXISTS/NOT EXISTS



-- 
Thanks and Regards,
Bhavesh Shah


Re: Does Hive supports EXISTS keyword in select query?

2012-04-11 Thread Philip Tromans
Hi,

Hive supports EXISTS via SEMI JOIN. Have a look at:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins

Cheers,

Phil.

On 11 April 2012 13:59, Bhavesh Shah bhavesh25s...@gmail.com wrote:
 Hello all,
 I want to query like below in Hive:
 Select a.* FROM tblA a JOIN tblB b ON a.field1 = b.field1
 where (a.field2 is null or not exists(select field2 from tblB where filed2
 is not null)

 But I think Hive doesn't supports EXISTS keyword so how can I overcome this
 issue?
 Pls suggest me some solution to this. I just got this kind of situation
 where I need to implement some thing like EXISTS/NOT EXISTS



 --
 Thanks and Regards,
 Bhavesh Shah



Re: Lag function in Hive

2012-04-11 Thread Mark Grover
Hi Karan,
The error you mentioned you get on creating the temporary function typically 
happens when there is a typo in the class name (com.example.hive.udf.Lag, in 
this case).

Can you ensure that the jar was properly built and contains the Lag class in 
the com.example.hive.udf package?

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation 

www: oanda.com www: fxtrade.com 
e: mgro...@oanda.com 

Best Trading Platform - World Finance's Forex Awards 2009. 
The One to Watch - Treasury Today's Adam Smith Awards 2009. 


- Original Message -
From: karanveer singh karanveer.si...@barclays.com
To: user@hive.apache.org
Sent: Wednesday, April 11, 2012 4:15:59 AM
Subject: RE: Lag function in Hive

Rob n all - 

I tried below and created the jar file. For adding jar to class path, I do 
following:

hive add jar /users/unix/singhka/Analytics.jar;

The above seems to have worked fine as I see the resource added but when I go 
ahead and create a function, I get the following error. Any ideas what the 
issue can be?

hive create temporary function lag as 'com.example.hive.udf.Lag';
FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.FunctionTask


Regards,


-Original Message-
From: Hamilton, Robert (Austin) [mailto:robert.hamil...@hp.com] 
Sent: 10 April 2012 20:32
To: user@hive.apache.org
Subject: RE: Lag function in Hive

You can write a custom UDF - 

Here is one that I have played around with, along with some test SQL. It comes 
with no warrantee :) 

Sorry I can't really share the test data, but hopefully you get the idea.  To 
run, compile the Lag class, jar it up into Analytics.jar, put the jar on the 
CLASSPATH (you may need to deploy to all the nodes on the cluster) and run the 
hive command below.

Note the distribute by and sort by  are critical.  Also the sub-select is 
just an artifice to make sure the UDF is running in the reducer (so that it is 
sorted).  Maybe the hive experts can suggest a better way for that to work...

#
# use live clickstream test data from 2012-01-12
#
hive -e add jar Analytics.jar;

create temporary function lag as 'com.example.hive.udf.Lag';
select session_id,hit_datetime_gmt,lag(hit_datetime_gmt,session_id) 
from (select session_id,hit_datetime_gmt from omni2 where 
visit_day='2012-01-12' and session_id is not null 
distribute by session_id 
sort by session_id,hit_datetime_gmt ) X 
distribute by session_id limit 1000


 Contents of Lag.java 
-
package com.example.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;

public final class Lag extends UDF{
private int  counter;
private String last_key;
private String lastGroup;
private String return_value=;

public String evaluate(String key, String groupKey){
if(groupKey==null){
this.last_key=null;
}else
  if ( !groupKey.equalsIgnoreCase(this.lastGroup )) {
this.last_key=null;
}
 return_value=this.last_key;
 this.last_key = key;
 this.lastGroup=groupKey;
 return return_value;
}
}

Result of test run:

1326326437-26270601625187049522752846106448274394   2012-01-12 00:00:37 
NULL
1326326437-26270601625187049522752846106448274394   2012-01-12 00:00:59 
2012-01-12 00:00:37
1326326437-26270601625187049522752846106448274394   2012-01-12 00:01:05 
2012-01-12 00:00:59
1326326437-26270601625187049522752846106448274394   2012-01-12 00:01:07 
2012-01-12 00:01:05
1326326437-26270601625187049522752846106448274394   2012-01-12 00:01:11 
2012-01-12 00:01:07
1326326437-26270601625187049522752846106448274394   2012-01-12 00:01:12 
2012-01-12 00:01:11
1326326437-26270601625187049522752846106448274394   2012-01-12 00:01:24 
2012-01-12 00:01:12
1326326437-26270601625187049522752846106448274394   2012-01-12 00:01:32 
2012-01-12 00:01:24
1326326437-26270601625187049522752846106448274394   2012-01-12 00:01:45 
2012-01-12 00:01:32
1326326437-26270601625187049522752846106448274394   2012-01-12 00:01:48 
2012-01-12 00:01:45

-Original Message-
From: Philip Tromans [mailto:philip.j.trom...@gmail.com] 
Sent: Tuesday, April 10, 2012 9:18 AM
To: user@hive.apache.org
Subject: Re: Lag function in Hive

Hi Karan,

To the best of my knowledge, there isn't one. It's also unlikely to happen 
because it's hard to parallelise in a map-reduce way (it requires knowing where 
you are in a result set, and who your neighbours are and they in turn need to 
be present on the same node as you which is difficult to guarantee).

Cheers,

Phil.

On 10 April 2012 14:44,  karanveer.si...@barclays.com wrote:
 Hi,

 Is there something like a 'lag' function in HIVE? The requirement is 
 to calculate difference for the same column for every 2 subsequent records.

 For example.

 Row, Column A, Column B
 1, 10, 100
 2, 20, 200
 3, 30, 300


 The result 

RE: Does Hive supports EXISTS keyword in select query?

2012-04-11 Thread Hezhiqiang (Ransom)
SEMI is only for exist.
Maybe you can try this 
 Select a.* FROM tblA a left outer JOIN tblB b ON a.field1 = b.field1 where 
a.field2 is null or b.fild2 is null


Best regards
Ransom.

-Original Message-
From: Philip Tromans [mailto:philip.j.trom...@gmail.com] 
Sent: Wednesday, April 11, 2012 9:02 PM
To: user@hive.apache.org
Subject: Re: Does Hive supports EXISTS keyword in select query?

Hi,

Hive supports EXISTS via SEMI JOIN. Have a look at:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins

Cheers,

Phil.

On 11 April 2012 13:59, Bhavesh Shah bhavesh25s...@gmail.com wrote:
 Hello all,
 I want to query like below in Hive:
 Select a.* FROM tblA a JOIN tblB b ON a.field1 = b.field1
 where (a.field2 is null or not exists(select field2 from tblB where filed2
 is not null)

 But I think Hive doesn't supports EXISTS keyword so how can I overcome this
 issue?
 Pls suggest me some solution to this. I just got this kind of situation
 where I need to implement some thing like EXISTS/NOT EXISTS



 --
 Thanks and Regards,
 Bhavesh Shah



Thrift service + zklocks + time = death

2012-04-11 Thread Edward Capriolo
Hey all,

Running 0.7.1
We are using a combination of hive-thrift+ zookeeper for locking. All
works well for a couple days until eventually we hit this condition.

HiveServerException(message:Query returned non-zero code: 10, cause:
FAILED: Error in acquiring locks: locks on the underlying objects
cannot be acquired. retry after some time, errorCode:10,
SQLState:42000)
at 
org.apache.hadoop.hive.service.ThriftHive$execute_result.read(ThriftHive.java:1494)
at 
org.apache.hadoop.hive.service.ThriftHive$Client.recv_execute(ThriftHive.java:120)
at 
org.apache.hadoop.hive.service.ThriftHive$Client.execute(ThriftHive.java:95)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.codehaus.groovy.runtime.callsite.PojoMetaMethodSite$PojoCachedMethodSiteNoUnwrapNoCoerce.invoke(PojoMetaMethodSite.java:229)
at 
org.codehaus.groovy.runtime.callsite.PojoMetaMethodSite.call(PojoMetaMethodSite.java:52)
at 
org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:40)
at 
org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:116)
at 
org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:124)
   XXX
   XXX
   XXX
at 
org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallCurrent(CallSiteArray.java:44)
at 
org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:141)
at 
org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:145)
   XXX
   XXX
at 
org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallCurrent(CallSiteArray.java:44)
at 
org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:141)
at 
org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:149)
at core.ScriptTemplate.run(ScriptTemplate.groovy:28)
at core.ScriptTemplate$run.call(Unknown Source)
at 
org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:40)
at 
org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:116)
at 
org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:124)
at core.ScriptTemplateRunner.main(ScriptTemplateRunner.groovy:19)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:88)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:233)
at groovy.lang.MetaClassImpl.invokeStaticMethod(MetaClassImpl.java:1302)
at 
org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:767)
at 
groovy.lang.GroovyShell.runScriptOrMainOrTestOrRunnable(GroovyShell.java:273)
at groovy.lang.GroovyShell.run(GroovyShell.java:229)
at groovy.lang.GroovyShell.run(GroovyShell.java:159)
at groovy.ui.GroovyMain.processOnce(GroovyMain.java:496)
at groovy.ui.GroovyMain.run(GroovyMain.java:311)
at groovy.ui.GroovyMain.process(GroovyMain.java:297)
at groovy.ui.GroovyMain.processArgs(GroovyMain.java:112)
at groovy.ui.GroovyMain.main(GroovyMain.java:93)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.codehaus.groovy.tools.GroovyStarter.rootLoader(GroovyStarter.java:108)
at org.codehaus.groovy.tools.GroovyStarter.main(GroovyStarter.java:130)
 }

It seems that eventually the hivethrifts connection to ZK goes stale
and then all queries fail with this message. Delta to event might be 3
days of 100 session. Any ideas?

Thanks,
Edward


Re: Thrift service + zklocks + time = death

2012-04-11 Thread Edward Capriolo
BTW the problem is not tables being locked. On the server side
acquireLocks is getting a null pointer.

32432495-OK
32432498:Error in acquireLocks: java.lang.NullPointerException
32432552-FAILED: Error in acquiring locks: locks on the underlying
objects cannot be acquired. retry after some time
32432660-Hive history
file=/tmp/hadoop/hive_job_log_hadoop_201204111032_691199436.txt
32432737-Hive history
file=/tmp/hadoop/hive_job_log_hadoop_201204111032_1845073466.txt


On Wed, Apr 11, 2012 at 10:29 AM, Edward Capriolo edlinuxg...@gmail.com wrote:
 Hey all,

 Running 0.7.1
 We are using a combination of hive-thrift+ zookeeper for locking. All
 works well for a couple days until eventually we hit this condition.

 HiveServerException(message:Query returned non-zero code: 10, cause:
 FAILED: Error in acquiring locks: locks on the underlying objects
 cannot be acquired. retry after some time, errorCode:10,
 SQLState:42000)
        at 
 org.apache.hadoop.hive.service.ThriftHive$execute_result.read(ThriftHive.java:1494)
        at 
 org.apache.hadoop.hive.service.ThriftHive$Client.recv_execute(ThriftHive.java:120)
        at 
 org.apache.hadoop.hive.service.ThriftHive$Client.execute(ThriftHive.java:95)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at 
 org.codehaus.groovy.runtime.callsite.PojoMetaMethodSite$PojoCachedMethodSiteNoUnwrapNoCoerce.invoke(PojoMetaMethodSite.java:229)
        at 
 org.codehaus.groovy.runtime.callsite.PojoMetaMethodSite.call(PojoMetaMethodSite.java:52)
        at 
 org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:40)
        at 
 org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:116)
        at 
 org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:124)
       XXX
       XXX
       XXX
        at 
 org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallCurrent(CallSiteArray.java:44)
        at 
 org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:141)
        at 
 org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:145)
       XXX
       XXX
        at 
 org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallCurrent(CallSiteArray.java:44)
        at 
 org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:141)
        at 
 org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:149)
        at core.ScriptTemplate.run(ScriptTemplate.groovy:28)
        at core.ScriptTemplate$run.call(Unknown Source)
        at 
 org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:40)
        at 
 org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:116)
        at 
 org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:124)
        at core.ScriptTemplateRunner.main(ScriptTemplateRunner.groovy:19)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at 
 org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:88)
        at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:233)
        at 
 groovy.lang.MetaClassImpl.invokeStaticMethod(MetaClassImpl.java:1302)
        at 
 org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:767)
        at 
 groovy.lang.GroovyShell.runScriptOrMainOrTestOrRunnable(GroovyShell.java:273)
        at groovy.lang.GroovyShell.run(GroovyShell.java:229)
        at groovy.lang.GroovyShell.run(GroovyShell.java:159)
        at groovy.ui.GroovyMain.processOnce(GroovyMain.java:496)
        at groovy.ui.GroovyMain.run(GroovyMain.java:311)
        at groovy.ui.GroovyMain.process(GroovyMain.java:297)
        at groovy.ui.GroovyMain.processArgs(GroovyMain.java:112)
        at groovy.ui.GroovyMain.main(GroovyMain.java:93)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at 
 org.codehaus.groovy.tools.GroovyStarter.rootLoader(GroovyStarter.java:108)
        at org.codehaus.groovy.tools.GroovyStarter.main(GroovyStarter.java:130)
  }

 It seems that eventually the hivethrifts connection to ZK goes stale
 and then 

Re: Thrift service + zklocks + time = death

2012-04-11 Thread Ashutosh Chauhan
 Premise of the locking is that in well-behaved queries you will obtain a
lock at the start of the query and then relinquish it at the end of query.
Since, locks are lease-based, in badly-behaving queries after lease
expires, lock will be relinquished.
Are you are  NPE for same repeated query after couple of days it executing
fine or you have many different queries of which some acquire locks while
other dont? Do you see the failure on same kind of query every time or its
a different query every time. It will be useful to find out which queries
are resulting in this to debug this further.

Ashutosh
On Wed, Apr 11, 2012 at 07:43, Edward Capriolo edlinuxg...@gmail.comwrote:

 BTW the problem is not tables being locked. On the server side
 acquireLocks is getting a null pointer.

 32432495-OK
 32432498:Error in acquireLocks: java.lang.NullPointerException
 32432552-FAILED: Error in acquiring locks: locks on the underlying
 objects cannot be acquired. retry after some time
 32432660-Hive history
 file=/tmp/hadoop/hive_job_log_hadoop_201204111032_691199436.txt
 32432737-Hive history
 file=/tmp/hadoop/hive_job_log_hadoop_201204111032_1845073466.txt


 On Wed, Apr 11, 2012 at 10:29 AM, Edward Capriolo edlinuxg...@gmail.com
 wrote:
  Hey all,
 
  Running 0.7.1
  We are using a combination of hive-thrift+ zookeeper for locking. All
  works well for a couple days until eventually we hit this condition.
 
  HiveServerException(message:Query returned non-zero code: 10, cause:
  FAILED: Error in acquiring locks: locks on the underlying objects
  cannot be acquired. retry after some time, errorCode:10,
  SQLState:42000)
 at
 org.apache.hadoop.hive.service.ThriftHive$execute_result.read(ThriftHive.java:1494)
 at
 org.apache.hadoop.hive.service.ThriftHive$Client.recv_execute(ThriftHive.java:120)
 at
 org.apache.hadoop.hive.service.ThriftHive$Client.execute(ThriftHive.java:95)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at
 org.codehaus.groovy.runtime.callsite.PojoMetaMethodSite$PojoCachedMethodSiteNoUnwrapNoCoerce.invoke(PojoMetaMethodSite.java:229)
 at
 org.codehaus.groovy.runtime.callsite.PojoMetaMethodSite.call(PojoMetaMethodSite.java:52)
 at
 org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:40)
 at
 org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:116)
 at
 org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:124)
XXX
XXX
XXX
 at
 org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallCurrent(CallSiteArray.java:44)
 at
 org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:141)
 at
 org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:145)
XXX
XXX
 at
 org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallCurrent(CallSiteArray.java:44)
 at
 org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:141)
 at
 org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:149)
 at core.ScriptTemplate.run(ScriptTemplate.groovy:28)
 at core.ScriptTemplate$run.call(Unknown Source)
 at
 org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:40)
 at
 org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:116)
 at
 org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:124)
 at core.ScriptTemplateRunner.main(ScriptTemplateRunner.groovy:19)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at
 org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:88)
 at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:233)
 at
 groovy.lang.MetaClassImpl.invokeStaticMethod(MetaClassImpl.java:1302)
 at
 org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:767)
 at
 groovy.lang.GroovyShell.runScriptOrMainOrTestOrRunnable(GroovyShell.java:273)
 at groovy.lang.GroovyShell.run(GroovyShell.java:229)
 at groovy.lang.GroovyShell.run(GroovyShell.java:159)
 at groovy.ui.GroovyMain.processOnce(GroovyMain.java:496)
 at groovy.ui.GroovyMain.run(GroovyMain.java:311)
 at 

Re: Lag function in Hive

2012-04-11 Thread Ashutosh Chauhan
Hey Harish,

Awesome work on SQL Windowing. Judging from participation on this thread,
it seems windowing is of sizable interest to Hive community. Would you
consider contributing your work upstream in Hive? If its in Hive contrib,
it will be accessible to lot of folks using Hive out of box.

Thanks,
Ashutosh

On Tue, Apr 10, 2012 at 08:10, Butani, Harish harish.but...@sap.com wrote:

 Hi Karan,

 SQL Windowing with Hive(https://github.com/hbutani/SQLWindowing/wiki)
 maybe a good fit for your use case.

 We have a lag function and you can say something like

 From table
 Partition by col1, col2...
 Order by col1, col2,...
 Select colX, colX - lag(colX, 1)

 (there is a lag example on the wiki, and other time series egs based on
 the NPath table function)

 You can control the partitioning by the partitioning and order clauses.
 Partitions could be arbitrarily large (so you could partition by a dummy
 column and have all rows in 1 partition) but works best when there are
 natural partitions in your data and you are ok with not needing to
 calculate across partitions.


 Regards,
 Harish.

 -Original Message-
 From: karanveer.si...@barclays.com [mailto:karanveer.si...@barclays.com]
 Sent: Tuesday, April 10, 2012 7:52 AM
 To: user@hive.apache.org
 Subject: Re: Lag function in Hive

 Thanks - I will check this out.

  Meanwhile, would default clustering happen using rownum? How can I check
 on how is clustering happening in our environment?

 Rgds

 - Original Message -
 From: David Kulp dk...@fiksu.com
 To: user@hive.apache.org user@hive.apache.org
 Sent: Tue Apr 10 15:45:25 2012
 Subject: Re: Lag function in Hive

 New here.  Hello all.

 Could you try a self-join, possibly also restricted to partitions?

 E.g. SELECT t2.value - t1.value FROM mytable t1, mytable t2 WHERE
 t1.rownum = t2.rownum+1 AND t1.partition=foo AND t2.partition=bar

 If your data is clustered by rownum, then this join should, in theory, be
 relatively fast -- especially if it makes sense to exploit partitions.

 -d

 On Apr 10, 2012, at 10:37 AM, karanveer.si...@barclays.com 
 karanveer.si...@barclays.com wrote:

  Makes sense but is not the distribution across nodes for a chunk of
 records in that order.
 
  If Hive cannot help me do this, is there another way I can do this? I
 tried generating an identifier using the perl script invoked using Hive but
 it does not seem to work fine. While the stand alone script works fine,
 when the record is created in hive using std output from perl - I see 2
 records for some of the unique identifiers. I explored the possibility of
 default data type changes but that does not solve the problem.
 
  Regards,
  Karan
 
 
  -Original Message-
  From: Philip Tromans [mailto:philip.j.trom...@gmail.com]
  Sent: 10 April 2012 19:48
  To: user@hive.apache.org
  Subject: Re: Lag function in Hive
 
  Hi Karan,
 
  To the best of my knowledge, there isn't one. It's also unlikely to
  happen because it's hard to parallelise in a map-reduce way (it
  requires knowing where you are in a result set, and who your
  neighbours are and they in turn need to be present on the same node as
  you which is difficult to guarantee).
 
  Cheers,
 
  Phil.
 
  On 10 April 2012 14:44,  karanveer.si...@barclays.com wrote:
  Hi,
 
  Is there something like a 'lag' function in HIVE? The requirement is to
  calculate difference for the same column for every 2 subsequent records.
 
  For example.
 
  Row, Column A, Column B
  1, 10, 100
  2, 20, 200
  3, 30, 300
 
 
  The result that I need should be like:
 
  Row, Column A, Column B, Result
  1, 10, 100, NULL
  2, 20, 200, 100 (200-100)
  3, 30, 300, 100 (300-200)
 
  Rgds,
  Karan
 
 
 
 
 
  This e-mail and any attachments are confidential and intended solely
 for the
  addressee and may also be privileged or exempt from disclosure under
  applicable law. If you are not the addressee, or have received this
 e-mail
  in error, please notify the sender immediately, delete it from your
 system
  and do not copy, disclose or otherwise act upon any part of this e-mail
 or
  its attachments.
 
  Internet communications are not guaranteed to be secure or virus-free.
  The Barclays Group does not accept responsibility for any loss arising
 from
  unauthorised access to, or interference with, any Internet
 communications by
  any third party, or from the transmission of any viruses. Replies to
 this
  e-mail may be monitored by the Barclays Group for operational or
 business
  reasons.
 
  Any opinion or other information in this e-mail or its attachments that
 does
  not relate to the business of the Barclays Group is personal to the
 sender
  and is not given or endorsed by the Barclays Group.
 
  Barclays Bank PLC.Registered in England and Wales (registered no.
 1026167).
  Registered Office: 1 Churchill Place, London, E14 5HP, United Kingdom.
 
  Barclays Bank PLC is authorised and regulated by the Financial Services
  Authority.


 This e-mail and any 

Case Studies for 'Programming Hive' book from O'Reilly

2012-04-11 Thread Jason Rutherglen
Dear Hive User,

We want your interesting case study for our upcoming book titled
'Programming Hive' from O'Reilly.

How you use Hive, either high level or low level code details are both
encouraged!

Feel free to reach out with a brief abstract.

Regards,

Jason Rutherglen


Re: Graph/Plotting in Hive/Hadoop

2012-04-11 Thread shashwat shriparv
Hey hive is qury language for no sql, by default i dont think it can be
used for plotting graphs rather you can get data through and format it in
required shape and the plot using some api, provided

∞
Shashwat Shriparv


Re: RE: Re: cannot start the thrift server

2012-04-11 Thread shashwat shriparv
Hey check this and according to your version download the jdbc jars and
follow the steps it will help for sure, as this is given in context to
amazon but it works for hive running at local pc

http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_Hive.html



On Wed, Apr 11, 2012 at 10:33 AM, ylyy-1985 ylyy-1...@163.com wrote:

 ** ** **
 thanks to all~I will try!

 2012-04-11
  --
  **
 Best Regards
 Andes

 Email:ylyy-1...@163.com
 **
  --
  *发件人:*binhnt22
 *发送时间:*2012-04-11 12:03
 *主题:*RE: Re: cannot start the thrift server
 *收件人:*useruser@hive.apache.org
 *抄送:*


 By default, hive will use derby as a embedded database for metastore. Your
 command fail because of incorrect metastore configuration.

 I highly recommend you use mysql as metastore. The detail information will
 be found at:
 http://www.mazsoft.com/blog/post/2010/02/01/Setting-up-HadoopHive-to-use-MySQL-as-metastore.aspx
 

 It’s pretty simple, I’ll summarize as following: 

 **-  **Install mysql server on your VM Centos

 **-  **Create an user for hive, grant all privilege

 **-  **Modify (or create new) hive-site.xml file in
 $HIVE_PATH/conf/ with 

 **-  **Download then copy msql-connector-java-*.jar to
 $HIVE_PATH/lib

 When everything done, “show tables;” will work

 *Best regards*

 Nguyen Thanh Binh (Mr)

 Cell phone: (+84)98.226.0622

 *From:* ylyy-1985 [mailto:ylyy-1...@163.com]
 *Sent:* Wednesday, April 11, 2012 10:37 AM
 *To:* user
 *Subject:* 回复: Re: cannot start the thrift server

 thank you very much. now that it's running, I use it to test the sample
 code first. According to my operation,(with the server is running) I type
 ./hive to enter hive console, but when I use show tables; the console
 print out some info: 

 

 FAILED: Error in metadata: javax.jdo.JDOFatalDataStoreException: Failed to
 start database 'metastore_db', see the next exception for details.
 NestedThrowables:
 java.sql.SQLException: Failed to start database 'metastore_db', see the
 next exception for details.
 FAILED: Execution Error, return code 1 from
 org.apache.hadoop.hive.ql.exec.DDLTask

 

 someone says the metadata is lock, so...it doesn't matter even if I can't
 access the CLI, but things will be much better if I can check the tables
 with the CLI and the server running. Or can I check the database using
 other method?

 2012-04-11
  --

 Best Regards

 Andes

 

 Email:ylyy-1...@163.com
  --

 *发件人:*binhnt22

 *发送时间:*2012-04-11 11:22

 *主**题:*Re: cannot start the thrift server

 *收件人:*useruser@hive.apache.org

 *抄送:*

 

 IMHO, you should use bridge, so that any other computer in LAN can connect
 to 192.168.164.128 without problem

 When the thrift server is running, it’s not like the hive command disable.
 The console will only produce output when you have a job running through
 thrift server.

 If you want to use CLI, just run $HIVE_PATH/bin/hive

 *Best regards*

 Nguyen Thanh Binh (Mr)

 Cell phone: (+84)98.226.0622

 *From:* ylyy-1985 [mailto:ylyy-1...@163.com]
 *Sent:* Wednesday, April 11, 2012 9:49 AM
 *To:* user
 *Subject:* 回复: RE: RE: Re: cannot start the thrift server

 centos 6 in VM and NAT network(by default). 

  here is what I do.

 I see there are some different ways to form the connection string like:***
 *

 Connection con = DriverManager.getConnection(jdbc:hive://IP:1/default, 
 , );

 Connection con = DriverManager.getConnection(jdbc:hive://);

  

 so I use the second one, I don't know what is the difference but it does 
 connect me to hive.

  

 btw, when the thrift server is running, is seems the hive command line is 
 disabled? is it because the metadata is locked?

 2012-04-11
  --

 Andes
  --

 *发件人:*binhnt22

 *发送时间:*2012-04-11 10:39

 *主**题:*RE: RE: Re: cannot start the thrift server

 *收件人:*useruser@hive.apache.org

 *抄送:*

 Hi ylyy,

 What’s OS in your VM? Which network configuration is used in your VM: NAT,
 bridge, host-only or LAN segment?

 Answer this may give clue to solve: No route to host.

 BTW: default username and password is “empty”

 Connection con = DriverManager.*getConnection*(
 jdbc:hive://hiveserverip:1/default, , );

 *Best regards*

 Nguyen Thanh Binh (Mr)

 Cell phone: (+84)98.226.0622

 *From:* ylyy-1985 [mailto:ylyy-1...@163.com]
 *Sent:* Wednesday, April 11, 2012 9:29 AM
 *To:* user
 *Subject:* 回复: RE: Re: cannot start the thrift server

 yeah, great,thanks binhnt22. what a fool I am. Hah!! but I get this then.
 

 Unable to create log directory /tmp/Administrator
 2012-04-11 10:23:15,359 ERROR [main] 

Re: Graph/Plotting in Hive/Hadoop

2012-04-11 Thread Edward Capriolo
The historgram UDF can be used to create data from GNU-plot.

Here is something I do:

Hive's default output is tab delimited. There are javascript graphing
libraries like amcharts. I produce tab files and let simple HTML pages
be the front end for AM charts.

This allows me to create raw data with hive and quickly display this
in a web browser without the need for server side coding.

Edward




On Wed, Apr 11, 2012 at 3:46 PM, shashwat shriparv
dwivedishash...@gmail.com wrote:
 Hey hive is qury language for no sql, by default i dont think it can be used
 for plotting graphs rather you can get data through and format it in
 required shape and the plot using some api, provided

 ∞

 Shashwat Shriparv




RE: Lag function in Hive

2012-04-11 Thread Butani, Harish
Hi Ashutosh,

Thanks for taking a look. Yes definitely open to contributing back to Hive.
Had a conversation with Carl Steinbach last week about this.
Will send you a follow up message.

Regards,
Harish

From: Ashutosh Chauhan [mailto:hashut...@apache.org]
Sent: Wednesday, April 11, 2012 7:55 AM
To: user@hive.apache.org; Butani, Harish
Subject: Re: Lag function in Hive

Hey Harish,

Awesome work on SQL Windowing. Judging from participation on this thread, it 
seems windowing is of sizable interest to Hive community. Would you consider 
contributing your work upstream in Hive? If its in Hive contrib, it will be 
accessible to lot of folks using Hive out of box.

Thanks,
Ashutosh
On Tue, Apr 10, 2012 at 08:10, Butani, Harish 
harish.but...@sap.commailto:harish.but...@sap.com wrote:
Hi Karan,

SQL Windowing with Hive(https://github.com/hbutani/SQLWindowing/wiki) maybe a 
good fit for your use case.

We have a lag function and you can say something like

From table
Partition by col1, col2...
Order by col1, col2,...
Select colX, colX - lag(colX, 1)

(there is a lag example on the wiki, and other time series egs based on the 
NPath table function)

You can control the partitioning by the partitioning and order clauses.
Partitions could be arbitrarily large (so you could partition by a dummy column 
and have all rows in 1 partition) but works best when there are natural 
partitions in your data and you are ok with not needing to calculate across 
partitions.


Regards,
Harish.

-Original Message-
From: karanveer.si...@barclays.commailto:karanveer.si...@barclays.com 
[mailto:karanveer.si...@barclays.commailto:karanveer.si...@barclays.com]
Sent: Tuesday, April 10, 2012 7:52 AM
To: user@hive.apache.orgmailto:user@hive.apache.org
Subject: Re: Lag function in Hive

Thanks - I will check this out.

 Meanwhile, would default clustering happen using rownum? How can I check on 
how is clustering happening in our environment?

Rgds

- Original Message -
From: David Kulp dk...@fiksu.commailto:dk...@fiksu.com
To: user@hive.apache.orgmailto:user@hive.apache.org 
user@hive.apache.orgmailto:user@hive.apache.org
Sent: Tue Apr 10 15:45:25 2012
Subject: Re: Lag function in Hive

New here.  Hello all.

Could you try a self-join, possibly also restricted to partitions?

E.g. SELECT t2.value - t1.value FROM mytable t1, mytable t2 WHERE t1.rownum = 
t2.rownum+1 AND t1.partition=foo AND t2.partition=bar

If your data is clustered by rownum, then this join should, in theory, be 
relatively fast -- especially if it makes sense to exploit partitions.

-d

On Apr 10, 2012, at 10:37 AM, 
karanveer.si...@barclays.commailto:karanveer.si...@barclays.com 
karanveer.si...@barclays.commailto:karanveer.si...@barclays.com wrote:

 Makes sense but is not the distribution across nodes for a chunk of records 
 in that order.

 If Hive cannot help me do this, is there another way I can do this? I tried 
 generating an identifier using the perl script invoked using Hive but it does 
 not seem to work fine. While the stand alone script works fine, when the 
 record is created in hive using std output from perl - I see 2 records for 
 some of the unique identifiers. I explored the possibility of default data 
 type changes but that does not solve the problem.

 Regards,
 Karan


 -Original Message-
 From: Philip Tromans 
 [mailto:philip.j.trom...@gmail.commailto:philip.j.trom...@gmail.com]
 Sent: 10 April 2012 19:48
 To: user@hive.apache.orgmailto:user@hive.apache.org
 Subject: Re: Lag function in Hive

 Hi Karan,

 To the best of my knowledge, there isn't one. It's also unlikely to
 happen because it's hard to parallelise in a map-reduce way (it
 requires knowing where you are in a result set, and who your
 neighbours are and they in turn need to be present on the same node as
 you which is difficult to guarantee).

 Cheers,

 Phil.

 On 10 April 2012 14:44,  
 karanveer.si...@barclays.commailto:karanveer.si...@barclays.com wrote:
 Hi,

 Is there something like a 'lag' function in HIVE? The requirement is to
 calculate difference for the same column for every 2 subsequent records.

 For example.

 Row, Column A, Column B
 1, 10, 100
 2, 20, 200
 3, 30, 300


 The result that I need should be like:

 Row, Column A, Column B, Result
 1, 10, 100, NULL
 2, 20, 200, 100 (200-100)
 3, 30, 300, 100 (300-200)

 Rgds,
 Karan





 This e-mail and any attachments are confidential and intended solely for the
 addressee and may also be privileged or exempt from disclosure under
 applicable law. If you are not the addressee, or have received this e-mail
 in error, please notify the sender immediately, delete it from your system
 and do not copy, disclose or otherwise act upon any part of this e-mail or
 its attachments.

 Internet communications are not guaranteed to be secure or virus-free.
 The Barclays Group does not accept responsibility for any loss arising from
 unauthorised access to, or interference with, any Internet communications