I think you want something more like: SELECT t2.value - t1.value FROM mytable t1 JOIN mytable t2 ON (t1.rownum = t2.rownum + 1 AND t2.partition=bar) WHERE t1.partition=foo;
This should be faster as partition selection will happen earlier. This is still going to involve an awful lot of I/O, and not going to be fast. Phil. On 10 April 2012 15:56, David Kulp <dk...@fiksu.com> wrote: > You have to explicitly request it in CREATE TABLE. And you should generally > let hive perform the clustering -- i.e. don't use an external table with data > that is generated by some other process because it's hard to get the hash and > notation right. > Check your table with "DESCRIBE FORMATTED tablename". > > On Apr 10, 2012, at 10:51 AM, <karanveer.si...@barclays.com> > <karanveer.si...@barclays.com> wrote: > >> Thanks - I will check this out. >> >> Meanwhile, would default clustering happen using rownum? How can I check on >> how is clustering happening in our environment? >> >> Rgds >> >> ----- Original Message ----- >> From: David Kulp <dk...@fiksu.com> >> To: user@hive.apache.org <user@hive.apache.org> >> Sent: Tue Apr 10 15:45:25 2012 >> Subject: Re: Lag function in Hive >> >> New here. Hello all. >> >> Could you try a self-join, possibly also restricted to partitions? >> >> E.g. SELECT t2.value - t1.value FROM mytable t1, mytable t2 WHERE t1.rownum >> = t2.rownum+1 AND t1.partition=foo AND t2.partition=bar >> >> If your data is clustered by rownum, then this join should, in theory, be >> relatively fast -- especially if it makes sense to exploit partitions. >> >> -d >> >> On Apr 10, 2012, at 10:37 AM, <karanveer.si...@barclays.com> >> <karanveer.si...@barclays.com> wrote: >> >>> Makes sense but is not the distribution across nodes for a chunk of records >>> in that order. >>> >>> If Hive cannot help me do this, is there another way I can do this? I tried >>> generating an identifier using the perl script invoked using Hive but it >>> does not seem to work fine. While the stand alone script works fine, when >>> the record is created in hive using std output from perl - I see 2 records >>> for some of the unique identifiers. I explored the possibility of default >>> data type changes but that does not solve the problem. >>> >>> Regards, >>> Karan >>> >>> >>> -----Original Message----- >>> From: Philip Tromans [mailto:philip.j.trom...@gmail.com] >>> Sent: 10 April 2012 19:48 >>> To: user@hive.apache.org >>> Subject: Re: Lag function in Hive >>> >>> Hi Karan, >>> >>> To the best of my knowledge, there isn't one. It's also unlikely to >>> happen because it's hard to parallelise in a map-reduce way (it >>> requires knowing where you are in a result set, and who your >>> neighbours are and they in turn need to be present on the same node as >>> you which is difficult to guarantee). >>> >>> Cheers, >>> >>> Phil. >>> >>> On 10 April 2012 14:44, <karanveer.si...@barclays.com> wrote: >>>> Hi, >>>> >>>> Is there something like a 'lag' function in HIVE? The requirement is to >>>> calculate difference for the same column for every 2 subsequent records. >>>> >>>> For example. >>>> >>>> Row, Column A, Column B >>>> 1, 10, 100 >>>> 2, 20, 200 >>>> 3, 30, 300 >>>> >>>> >>>> The result that I need should be like: >>>> >>>> Row, Column A, Column B, Result >>>> 1, 10, 100, NULL >>>> 2, 20, 200, 100 (200-100) >>>> 3, 30, 300, 100 (300-200) >>>> >>>> Rgds, >>>> Karan >>>> >>>> >>>> >>>> >>>> >>>> This e-mail and any attachments are confidential and intended solely for >>>> the >>>> addressee and may also be privileged or exempt from disclosure under >>>> applicable law. If you are not the addressee, or have received this e-mail >>>> in error, please notify the sender immediately, delete it from your system >>>> and do not copy, disclose or otherwise act upon any part of this e-mail or >>>> its attachments. >>>> >>>> Internet communications are not guaranteed to be secure or virus-free. >>>> The Barclays Group does not accept responsibility for any loss arising from >>>> unauthorised access to, or interference with, any Internet communications >>>> by >>>> any third party, or from the transmission of any viruses. Replies to this >>>> e-mail may be monitored by the Barclays Group for operational or business >>>> reasons. >>>> >>>> Any opinion or other information in this e-mail or its attachments that >>>> does >>>> not relate to the business of the Barclays Group is personal to the sender >>>> and is not given or endorsed by the Barclays Group. >>>> >>>> Barclays Bank PLC.Registered in England and Wales (registered no. 1026167). >>>> Registered Office: 1 Churchill Place, London, E14 5HP, United Kingdom. >>>> >>>> Barclays Bank PLC is authorised and regulated by the Financial Services >>>> Authority. >> >> >> This e-mail and any attachments are confidential and intended >> solely for the addressee and may also be privileged or exempt from >> disclosure under applicable law. If you are not the addressee, or >> have received this e-mail in error, please notify the sender >> immediately, delete it from your system and do not copy, disclose >> or otherwise act upon any part of this e-mail or its attachments. >> >> Internet communications are not guaranteed to be secure or >> virus-free. >> The Barclays Group does not accept responsibility for any loss >> arising from unauthorised access to, or interference with, any >> Internet communications by any third party, or from the >> transmission of any viruses. Replies to this e-mail may be >> monitored by the Barclays Group for operational or business >> reasons. >> >> Any opinion or other information in this e-mail or its attachments >> that does not relate to the business of the Barclays Group is >> personal to the sender and is not given or endorsed by the Barclays >> Group. >> >> Barclays Bank PLC. Registered in England and Wales (registered no. >> 1026167). >> Registered Office: 1 Churchill Place, London, E14 5HP, United >> Kingdom. >> >> Barclays Bank PLC is authorised and regulated by the Financial >> Services Authority. >