Re: Lag function in Hive

Philip Tromans Tue, 10 Apr 2012 08:03:26 -0700

I think you want something more like:

SELECT t2.value - t1.value
FROM mytable t1
JOIN mytable t2 ON (t1.rownum = t2.rownum + 1 AND t2.partition=bar)
WHERE t1.partition=foo;


This should be faster as partition selection will happen earlier.

This is still going to involve an awful lot of I/O, and not going to be fast.

Phil.


On 10 April 2012 15:56, David Kulp <dk...@fiksu.com> wrote:
> You have to explicitly request it in CREATE TABLE.  And you should generally 
> let hive perform the clustering -- i.e. don't use an external table with data 
> that is generated by some other process because it's hard to get the hash and 
> notation right.
> Check your table with "DESCRIBE FORMATTED tablename".
>
> On Apr 10, 2012, at 10:51 AM, <karanveer.si...@barclays.com> 
> <karanveer.si...@barclays.com> wrote:
>
>> Thanks - I will check this out.
>>
>> Meanwhile, would default clustering happen using rownum? How can I check on 
>> how is clustering happening in our environment?
>>
>> Rgds
>>
>> ----- Original Message -----
>> From: David Kulp <dk...@fiksu.com>
>> To: user@hive.apache.org <user@hive.apache.org>
>> Sent: Tue Apr 10 15:45:25 2012
>> Subject: Re: Lag function in Hive
>>
>> New here.  Hello all.
>>
>> Could you try a self-join, possibly also restricted to partitions?
>>
>> E.g. SELECT t2.value - t1.value FROM mytable t1, mytable t2 WHERE t1.rownum 
>> = t2.rownum+1 AND t1.partition=foo AND t2.partition=bar
>>
>> If your data is clustered by rownum, then this join should, in theory, be 
>> relatively fast -- especially if it makes sense to exploit partitions.
>>
>> -d
>>
>> On Apr 10, 2012, at 10:37 AM, <karanveer.si...@barclays.com> 
>> <karanveer.si...@barclays.com> wrote:
>>
>>> Makes sense but is not the distribution across nodes for a chunk of records 
>>> in that order.
>>>
>>> If Hive cannot help me do this, is there another way I can do this? I tried 
>>> generating an identifier using the perl script invoked using Hive but it 
>>> does not seem to work fine. While the stand alone script works fine, when 
>>> the record is created in hive using std output from perl - I see 2 records 
>>> for some of the unique identifiers. I explored the possibility of default 
>>> data type changes but that does not solve the problem.
>>>
>>> Regards,
>>> Karan
>>>
>>>
>>> -----Original Message-----
>>> From: Philip Tromans [mailto:philip.j.trom...@gmail.com]
>>> Sent: 10 April 2012 19:48
>>> To: user@hive.apache.org
>>> Subject: Re: Lag function in Hive
>>>
>>> Hi Karan,
>>>
>>> To the best of my knowledge, there isn't one. It's also unlikely to
>>> happen because it's hard to parallelise in a map-reduce way (it
>>> requires knowing where you are in a result set, and who your
>>> neighbours are and they in turn need to be present on the same node as
>>> you which is difficult to guarantee).
>>>
>>> Cheers,
>>>
>>> Phil.
>>>
>>> On 10 April 2012 14:44,  <karanveer.si...@barclays.com> wrote:
>>>> Hi,
>>>>
>>>> Is there something like a 'lag' function in HIVE? The requirement is to
>>>> calculate difference for the same column for every 2 subsequent records.
>>>>
>>>> For example.
>>>>
>>>> Row, Column A, Column B
>>>> 1, 10, 100
>>>> 2, 20, 200
>>>> 3, 30, 300
>>>>
>>>>
>>>> The result that I need should be like:
>>>>
>>>> Row, Column A, Column B, Result
>>>> 1, 10, 100, NULL
>>>> 2, 20, 200, 100 (200-100)
>>>> 3, 30, 300, 100 (300-200)
>>>>
>>>> Rgds,
>>>> Karan
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> This e-mail and any attachments are confidential and intended solely for 
>>>> the
>>>> addressee and may also be privileged or exempt from disclosure under
>>>> applicable law. If you are not the addressee, or have received this e-mail
>>>> in error, please notify the sender immediately, delete it from your system
>>>> and do not copy, disclose or otherwise act upon any part of this e-mail or
>>>> its attachments.
>>>>
>>>> Internet communications are not guaranteed to be secure or virus-free.
>>>> The Barclays Group does not accept responsibility for any loss arising from
>>>> unauthorised access to, or interference with, any Internet communications 
>>>> by
>>>> any third party, or from the transmission of any viruses. Replies to this
>>>> e-mail may be monitored by the Barclays Group for operational or business
>>>> reasons.
>>>>
>>>> Any opinion or other information in this e-mail or its attachments that 
>>>> does
>>>> not relate to the business of the Barclays Group is personal to the sender
>>>> and is not given or endorsed by the Barclays Group.
>>>>
>>>> Barclays Bank PLC.Registered in England and Wales (registered no. 1026167).
>>>> Registered Office: 1 Churchill Place, London, E14 5HP, United Kingdom.
>>>>
>>>> Barclays Bank PLC is authorised and regulated by the Financial Services
>>>> Authority.
>>
>>
>> This e-mail and any attachments are confidential and intended
>> solely for the addressee and may also be privileged or exempt from
>> disclosure under applicable law. If you are not the addressee, or
>> have received this e-mail in error, please notify the sender
>> immediately, delete it from your system and do not copy, disclose
>> or otherwise act upon any part of this e-mail or its attachments.
>>
>> Internet communications are not guaranteed to be secure or
>> virus-free.
>> The Barclays Group does not accept responsibility for any loss
>> arising from unauthorised access to, or interference with, any
>> Internet communications by any third party, or from the
>> transmission of any viruses. Replies to this e-mail may be
>> monitored by the Barclays Group for operational or business
>> reasons.
>>
>> Any opinion or other information in this e-mail or its attachments
>> that does not relate to the business of the Barclays Group is
>> personal to the sender and is not given or endorsed by the Barclays
>> Group.
>>
>> Barclays Bank PLC. Registered in England and Wales (registered no.
>> 1026167).
>> Registered Office: 1 Churchill Place, London, E14 5HP, United
>> Kingdom.
>>
>> Barclays Bank PLC is authorised and regulated by the Financial
>> Services Authority.
>

Re: Lag function in Hive

Reply via email to