Hi Igor, I am new to HiveQL world. Don't know that much basically. Currently I have my Rank UDF function like this-
*public final class Rank extends UDF{* * private int counter;* * private String last_key;* * public int evaluate(final String key){* * if ( !key.equalsIgnoreCase(this.last_key) ) {* * this.counter = 0;* * this.last_key = key;* * }* * return this.counter++;* * }* *}* * * And I tried that query after removing pid from distribute by and sort by clause, but I got the below output which is wrong again- *1345653 330760137950 0* *1345653 330760137950 1* *1345653 330760137950 2* But I need output something like this- *1345653 330760137950 2012-07-09 21:43:29* *1345653 330760137950 2012-07-09 21:42:29* *1345653 330760137950 2012-07-09 21:41:29* 1345653 110909316904 2012-07-09 21:29:06 1345653 221065796761 2012-07-09 19:31:48 Any help will be appreciated. On Thu, Jul 19, 2012 at 4:00 PM, Igor Tatarinov <i...@decide.com> wrote: > Actually, never mind. Looks like you need to partition by both bid and > pid. In that case, your problem is that rank() has to handle a combined > bid+pid key. So first you need to create a combined key, partition by that > key and pass it to your rank() function (assuming rank() knows to reset on > a new key). You can cast bid and pid to string and concatenate them with a > separator (bid_pid) to get a single partitioning key. Hope this makes sense. > > On Thu, Jul 19, 2012 at 3:57 PM, Igor Tatarinov <i...@decide.com> wrote: > >> Sorry, just pid needs to be dropped from both DISTRIBUTE and SORT clauses. >> Your very first query was correct except for the nested subquery part. >> (You don't need a double-nested subquery.) >> >> On Thu, Jul 19, 2012 at 3:48 PM, comptech geeky >> <comptechge...@gmail.com>wrote: >> >>> Hi Igor, >>> >>> I am not sure what I have to remove from Distribute By as in distribute >>> by we have bid, pid and you said remove bid and time from distribute by and >>> it doesn't have time >>> >>> *SELECT bid, pid, rank FROM * >>> *(SELECT bid, pid, rank(bid) rank, time, UNIX_TIMESTAMP(time) >>> FROM >>> * >>> * >>> ( SELECT bid, pid, time FROM table1* >>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) = >>> '2012-07-09' * >>> * DISTRIBUTE BY bid,pid* >>> * SORT BY bid,pid, time desc) A >>> * >>> * >>> ) B >>> * >>> *WHERE rank < 3;* >>> >>> >>> And also I tried running the above query as it is. I am not getting >>> expected output instead of that I am getting output like this which is >>> wrong If you compare my expected output with the below output- >>> >>> *1345653 110909316904 0* >>> *1345653 221065796761 1* >>> *1345653 330760137950 2* >>> >>> >>> On Thu, Jul 19, 2012 at 3:43 PM, Igor Tatarinov <i...@decide.com> wrote: >>> >>>> Remove pid,time from DISTRIBUTE BY. >>>> >>>> On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky < >>>> comptechge...@gmail.com> wrote: >>>> >>>>> Modified Query that I wrote and its not working as expected output is. >>>>> >>>>> * >>>>> * >>>>> *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)* >>>>> *FROM (* >>>>> * SELECT bid, pid, time* >>>>> * FROM table1* >>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) = >>>>> '2012-07-09'* >>>>> * DISTRIBUTE BY bid,pid,time* >>>>> * SORT BY bid, time desc* >>>>> *) a* >>>>> *WHERE rank(bid) < 3;* >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky < >>>>> comptechge...@gmail.com> wrote: >>>>> >>>>>> I wrote this query after modifying it- >>>>>> >>>>>> *SELECT buyer_id, item_id, rank(buyer_id), created_time, >>>>>> UNIX_TIMESTAMP(created_time)* >>>>>> *FROM (* >>>>>> * SELECT buyer_id, item_id, created_time* >>>>>> * FROM testingtable1* >>>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as >>>>>> int))) = '2012-07-09'* >>>>>> * DISTRIBUTE BY buyer_id,item_id* >>>>>> * SORT BY buyer_id, created_time desc* >>>>>> *) a* >>>>>> *WHERE rank(buyer_id) < 3;* >>>>>> >>>>>> And the output I got is which is sligtly wrong as it is missing two >>>>>> rows- >>>>>> >>>>>> *1345653 330760137950 2012-07-09 21:43:29* >>>>>> *1345653 330760137950 2012-07-09 21:42:29* >>>>>> *1345653 330760137950 2012-07-09 21:41:29* >>>>>> >>>>>> These two rows are missing- >>>>>> >>>>>> 1345653 110909316904 2012-07-09 21:29:06 >>>>>> 1345653 221065796761 2012-07-09 19:31:48 >>>>>> >>>>>> So full output should be like this- >>>>>> >>>>>> *1345653 330760137950 2012-07-09 21:43:29* >>>>>> *1345653 330760137950 2012-07-09 21:42:29* >>>>>> *1345653 330760137950 2012-07-09 21:41:29* >>>>>> 1345653 110909316904 2012-07-09 21:29:06 >>>>>> 1345653 221065796761 2012-07-09 19:31:48 >>>>>> >>>>>> >>>>>> On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky < >>>>>> comptechge...@gmail.com> wrote: >>>>>> >>>>>>> Can you show me the exact query that I need to do for this >>>>>>> particular problem consideing my scenario? It will be of great help to >>>>>>> me. >>>>>>> As I am new to HiveQL. >>>>>>> >>>>>>> I need TOP 3 for those if BID and PID gets matched but with >>>>>>> different timestamp. >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans < >>>>>>> philip.j.trom...@gmail.com> wrote: >>>>>>> >>>>>>>> Your rank() is being evaluated map side. Put your distribute by and >>>>>>>> sort by in an inner query, and then evaluate your rank() in an outer >>>>>>>> query. >>>>>>>> >>>>>>>> Phil. >>>>>>>> On Jul 19, 2012 9:00 PM, "comptech geeky" <comptechge...@gmail.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> This is the below data in my Table1 >>>>>>>>> >>>>>>>>> >>>>>>>>> BID PID TIME >>>>>>>>> >>>>>>>>> --------------+-------------------------+-------------------------------- >>>>>>>>> 1345653 330760137950 2012-07-09 21:42:29 >>>>>>>>> 1345653 330760137950 2012-07-09 21:43:29 >>>>>>>>> 1345653 330760137950 2012-07-09 21:40:29 >>>>>>>>> 1345653 330760137950 2012-07-09 21:41:29 >>>>>>>>> 1345653 110909316904 2012-07-09 21:29:06 >>>>>>>>> 1345653 221065796761 2012-07-09 19:31:48 >>>>>>>>> >>>>>>>>> So If I need to clarify the above scenario- I have data in above >>>>>>>>> table like this- >>>>>>>>> For USER *`1345653` *I have this PID `*330760137950` *four times >>>>>>>>> but with different timestamps in red color. So I need the output >>>>>>>>> something >>>>>>>>> like this- >>>>>>>>> >>>>>>>>> Output that I need:- >>>>>>>>> >>>>>>>>> *1345653 330760137950 2012-07-09 21:43:29 * >>>>>>>>> *1345653 330760137950 2012-07-09 21:42:29 * >>>>>>>>> *1345653 330760137950 2012-07-09 21:41:29* >>>>>>>>> 1345653 110909316904 2012-07-09 21:29:06 >>>>>>>>> 1345653 221065796761 2012-07-09 19:31:48 >>>>>>>>> >>>>>>>>> So Basically If BID and PID are same but with different >>>>>>>>> timestamps, then I need TOP 3 sorted with TIME in descending order >>>>>>>>> >>>>>>>>> And for this I created rank UDF (User Defined Function). And I >>>>>>>>> wrote the below query but its not working for me. Can anyone help me >>>>>>>>> on >>>>>>>>> this? >>>>>>>>> * >>>>>>>>> * >>>>>>>>> * >>>>>>>>> * >>>>>>>>> *SELECT buyer_id, item_id, created_time* >>>>>>>>> * FROM table1* >>>>>>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) >>>>>>>>> as int))) = '2012-07-09'* >>>>>>>>> * DISTRIBUTE BY buyer_id* >>>>>>>>> * SORT BY buyer_id, created_time desc* >>>>>>>>> >>>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> >