RE: Lag function in Hive

karanveer.singh Wed, 11 Apr 2012 01:16:43 -0700

Rob and all,

I tried the code below and created the jar file. To add the jar to the 
classpath, I did the following:


hive> add jar /users/unix/singhka/Analytics.jar;

The above seems to have worked, as I see the resource added, but when I go 
ahead and create the function, I get the following error. Any ideas what the 
issue could be?

hive> create temporary function lag as 'com.example.hive.udf.Lag';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask


Regards,


-----Original Message-----
From: Hamilton, Robert (Austin) [mailto:robert.hamil...@hp.com] 
Sent: 10 April 2012 20:32
To: user@hive.apache.org
Subject: RE: Lag function in Hive

You can write a custom UDF.

Here is one that I have played around with, along with some test SQL. It comes 
with no warranty :) 

Sorry I can't really share the test data, but hopefully you get the idea.  To 
run, compile the Lag class, jar it up into Analytics.jar, put the jar on the 
CLASSPATH (you may need to deploy to all the nodes on the cluster) and run the 
hive command below.

Note that the "distribute by" and "sort by" clauses are critical: the UDF 
remembers the previous row, so all rows for a session must reach the same 
reducer in sorted order. The sub-select is just an artifice to make sure the 
UDF runs in the reducer, after the sort. Maybe the Hive experts can suggest a 
better way for that to work...
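
To illustrate why both clauses matter, here is a minimal plain-Java sketch 
that involves no Hive classes at all. It only models the idea: rows are 
bucketed by session id (standing in for "distribute by"), each bucket is 
sorted (standing in for "sort by"), and the lag is then taken in one pass. The 
names DistributeSortSketch, Row, distribute and sortAndLag are hypothetical, 
and the hash bucketing is an assumption for illustration, not a description of 
how Hive actually assigns rows to reducers.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class DistributeSortSketch {

    /** One input row: a session id and a hit timestamp (hypothetical shape). */
    static final class Row {
        final String sessionId;
        final String hitTime;

        Row(String sessionId, String hitTime) {
            this.sessionId = sessionId;
            this.hitTime = hitTime;
        }
    }

    /** Models "distribute by session_id": equal keys always share a bucket. */
    static List<List<Row>> distribute(List<Row> rows, int reducers) {
        List<List<Row>> buckets = new ArrayList<>();
        for (int i = 0; i < reducers; i++) {
            buckets.add(new ArrayList<Row>());
        }
        for (Row r : rows) {
            buckets.get(Math.floorMod(r.sessionId.hashCode(), reducers)).add(r);
        }
        return buckets;
    }

    /** Models "sort by session_id, hit_datetime_gmt" plus a one-pass lag. */
    static List<String> sortAndLag(List<Row> bucket) {
        List<Row> sorted = new ArrayList<>(bucket);
        sorted.sort(Comparator.comparing((Row r) -> r.sessionId)
                              .thenComparing(r -> r.hitTime));
        List<String> out = new ArrayList<>();
        Row prev = null;
        for (Row r : sorted) {
            // Only lag from the previous row when it belongs to the same session.
            boolean sameSession = prev != null && prev.sessionId.equals(r.sessionId);
            out.add(r.sessionId + "\t" + r.hitTime + "\t"
                    + (sameSession ? prev.hitTime : "NULL"));
            prev = r;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("s2", "2012-01-12 00:00:05"));
        rows.add(new Row("s1", "2012-01-12 00:00:59"));
        rows.add(new Row("s1", "2012-01-12 00:00:37"));
        rows.add(new Row("s2", "2012-01-12 00:00:01"));
        for (List<Row> bucket : distribute(rows, 2)) {
            for (String line : sortAndLag(bucket)) {
                System.out.println(line);
            }
        }
    }
}
```

Without the bucketing step, rows for one session could be split across 
workers; without the sort, the "previous" row would be arbitrary.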

#
# use live clickstream test data from 2012-01-12
#
hive -e "add jar Analytics.jar;

create temporary function lag as 'com.example.hive.udf.Lag';
select session_id, hit_datetime_gmt, lag(hit_datetime_gmt, session_id)
from (select session_id, hit_datetime_gmt
      from omni2
      where visit_day='2012-01-12' and session_id is not null
      distribute by session_id
      sort by session_id, hit_datetime_gmt) X
distribute by session_id limit 1000
"

------------------------ Contents of Lag.java ------------------------
package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

/**
 * Returns the value passed on the previous call, or null on the first row
 * of each group. Input must arrive grouped and sorted by the caller.
 */
public final class Lag extends UDF {
    private String lastKey;    // value from the previous row
    private String lastGroup;  // group key from the previous row

    public String evaluate(String key, String groupKey) {
        // A null or changed group key starts a new group, so drop the old value.
        // Note: group keys are compared case-insensitively.
        if (groupKey == null || !groupKey.equalsIgnoreCase(lastGroup)) {
            lastKey = null;
        }
        String previous = lastKey;
        lastKey = key;
        lastGroup = groupKey;
        return previous;
    }
}
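
For anyone who wants to check the state handling without a cluster, here is a 
rough stand-alone sketch of the same evaluate logic with the Hive base class 
removed, so it runs with a plain JDK. The class name LagSketch and the run 
helper are hypothetical. It feeds the same rows twice, once grouped and once 
interleaved, to show what happens when the input is not sorted by group.

```java
import java.util.ArrayList;
import java.util.List;

public class LagSketch {
    private String lastKey;    // value seen on the previous row
    private String lastGroup;  // group key seen on the previous row

    /** Same logic as Lag.evaluate above, minus the Hive UDF base class. */
    public String evaluate(String key, String groupKey) {
        if (groupKey == null || !groupKey.equalsIgnoreCase(lastGroup)) {
            lastKey = null; // new (or null) group: nothing to lag from
        }
        String previous = lastKey;
        lastKey = key;
        lastGroup = groupKey;
        return previous;
    }

    /** Runs a fresh instance over {value, group} pairs in the given order. */
    static List<String> run(String[][] rows) {
        LagSketch lag = new LagSketch();
        List<String> out = new ArrayList<>();
        for (String[] row : rows) {
            out.add(lag.evaluate(row[0], row[1]));
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] grouped = {
            {"00:00:37", "s1"}, {"00:00:59", "s1"}, {"00:00:01", "s2"}};
        String[][] interleaved = {
            {"00:00:37", "s1"}, {"00:00:01", "s2"}, {"00:00:59", "s1"}};
        System.out.println(run(grouped));     // [null, 00:00:37, null]
        System.out.println(run(interleaved)); // [null, null, null]
    }
}
```

In the interleaved case every row looks like the start of a new group, so the 
lag is always null.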

Result of test run:

1326326437-26270601625187049522752846106448274394       2012-01-12 00:00:37     NULL
1326326437-26270601625187049522752846106448274394       2012-01-12 00:00:59     2012-01-12 00:00:37
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:05     2012-01-12 00:00:59
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:07     2012-01-12 00:01:05
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:11     2012-01-12 00:01:07
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:12     2012-01-12 00:01:11
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:24     2012-01-12 00:01:12
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:32     2012-01-12 00:01:24
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:45     2012-01-12 00:01:32
1326326437-26270601625187049522752846106448274394       2012-01-12 00:01:48     2012-01-12 00:01:45

-----Original Message-----
From: Philip Tromans [mailto:philip.j.trom...@gmail.com] 
Sent: Tuesday, April 10, 2012 9:18 AM
To: user@hive.apache.org
Subject: Re: Lag function in Hive

Hi Karan,

To the best of my knowledge, there isn't one. It's also unlikely to happen, 
because it's hard to parallelise in a map-reduce way: it requires knowing where 
you are in a result set and who your neighbours are, and they in turn need to 
be present on the same node as you, which is difficult to guarantee.

Cheers,

Phil.

On 10 April 2012 14:44,  <karanveer.si...@barclays.com> wrote:
> Hi,
>
> Is there something like a 'lag' function in Hive? The requirement is 
> to calculate the difference in the same column between every two 
> consecutive records.
>
> For example:
>
> Row, Column A, Column B
> 1, 10, 100
> 2, 20, 200
> 3, 30, 300
>
>
> The result that I need should be like:
>
> Row, Column A, Column B, Result
> 1, 10, 100, NULL
> 2, 20, 200, 100 (200-100)
> 3, 30, 300, 100 (300-200)
>
> Rgds,
> Karan
