In that case, wouldn't this work:
SELECT buyer_id, item_id, rank(buyer_id), created_time
FROM (
SELECT buyer_id, item_id, created_time
FROM testingtable1
DISTRIBUTE BY buyer_id
SORT BY buyer_id, created_time desc
) a
WHERE rank(buyer_id) < 10;
On Tue, Jul 10, 2012 at 4:21 PM, Raihan Jamal <[email protected]> wrote:
> So the OUTPUT should be like this everything sorted in descending order with
> time and only top 10 for each BUYER_ID-
>
> Sample Output.
>
> BUYER_ID | ITEM_ID | CREATED_TIME
>
>
> ------------+------------------+-----------------------
>
>
>
> 34512201 2412180494 2012-07-09 22:21:11
>
>
> 34512201 1875744030 2012-07-09 19:40:39
>
>
> 34512201 597245693 2012-07-09 16:20:21
>
>
> 34512201 8071787728 2012-07-09 15:19:59
>
>
> 34512201 959188449 2012-07-09 08:48:47
>
>
> 34512201 5868222883 2012-07-09 08:23:17
>
>
> 34512201 5656232360 2012-07-09 07:28:47
>
>
> 34512201 5639158173 2012-07-09 06:58:47
>
>
> 34512201 4645350592 2012-07-09 06:38:47
>
>
> 34512201 2422054205 2012-07-09 06:09:56
>
>
> 1015826235 210002448035 2012-07-09 22:21:11
>
>
> 1015826235 260003553382 2012-07-09 19:40:39
>
>
> 1015826235 220003038067 2012-07-09 19:40:21
>
>
> 1015826235 300003861266 2012-07-09 18:19:59
>
>
> 1015826235 140002997245 2012-07-09 09:23:17
>
>
> 1015826235 260003553385 2012-07-09 08:48:47
>
>
> 1015826235 260003553384 2012-07-09 07:28:47
>
>
> 1015826235 260003553381 2012-07-09 07:09:56
>
>
> 1015826235 260003553383 2012-07-09 06:58:47
>
>
> 1015826235 260003553389 2012-07-09 06:54:37
>
>
>
>
>
> Raihan Jamal
>
>
>
> On Tue, Jul 10, 2012 at 4:18 PM, Raihan Jamal <[email protected]> wrote:
>>
>> Thanks Vijay for reply. But it doesn't works out the way I needed. I am
>> getting full data back for each BUYER_ID. Let me explain you more.
>>
>> This is the below data in the table and you can see in the below data this
>> BUYER_ID - 1015826235 appears 13 times and this BUYER_ID -34512201 appears
>> 12 times, And I need TOP 10 for each of the BUYER_ID BASIS on time-
>>
>> BUYER_ID | ITEM_ID | CREATED_TIME
>>
>>
>>
>> ------------+------------------+-----------------------
>>
>>
>>
>> 1015826235 220003038067 2012-07-09 19:40:21,
>>
>>
>>
>> 1015826235 300003861266 2012-07-09 18:19:59,
>>
>>
>>
>> 1015826235 140002997245 2012-07-09 09:23:17,
>>
>>
>>
>> 1015826235 210002448035 2012-07-09 22:21:11,
>>
>>
>>
>> 1015826235 260003553381 2012-07-09 07:09:56,
>>
>>
>>
>> 1015826235 260003553382 2012-07-09 19:40:39,
>>
>>
>>
>> 1015826235 260003553383 2012-07-09 06:58:47,
>>
>>
>>
>> 1015826235 260003553384 2012-07-09 07:28:47,
>>
>>
>>
>> 1015826235 260003553385 2012-07-09 08:48:47,
>>
>>
>>
>> 1015826235 260003553386 2012-07-09 06:38:47,
>>
>>
>>
>> 1015826235 260003553387 2012-07-09 05:38:47,
>>
>>
>>
>> 1015826235 260003553388 2012-07-09 04:55:47,
>>
>>
>>
>> 1015826235 260003553389 2012-07-09 06:54:37,
>>
>>
>>
>> 34512201 597245693 2012-07-09 16:20:21,
>>
>>
>>
>> 34512201 8071787728 2012-07-09 15:19:59,
>>
>>
>>
>> 34512201 5868222883 2012-07-09 08:23:17,
>>
>>
>>
>> 34512201 2412180494 2012-07-09 22:21:11,
>>
>>
>>
>> 34512201 2422054205 2012-07-09 06:09:56,
>>
>>
>>
>> 34512201 1875744030 2012-07-09 19:40:39,
>>
>>
>>
>> 34512201 5639158173 2012-07-09 06:58:47,
>>
>>
>>
>> 34512201 5656232360 2012-07-09 07:28:47,
>>
>>
>>
>> 34512201 959188449 2012-07-09 08:48:47,
>>
>>
>>
>> 34512201 4645350592 2012-07-09 06:38:47,
>>
>>
>>
>> 34512201 5657320532 2012-07-09 05:38:47,
>>
>>
>>
>> 34512201 290419656539 2012-07-09 04:55:47,
>>
>>
>> So basically if I do it like this-
>>
>> SELECT * FROM TestingTable1 ORDER BY buyer_id, created_time DESC;
>>
>> everything will get sorted in descending order for each BUYER_ID and then
>> I need to pick up TOP 10 rows for each BUYER_ID.
>>
>> And with your query that you just said, it's not working. I am getting
>> full data back like this. Below is the result I got from your query-
>>
>> 34512201 597245693 0 2012-07-09 16:20:21
>> 34512201 959188449 0 2012-07-09 08:48:47
>> 34512201 1875744030 0 2012-07-09 19:40:39
>> 34512201 2412180494 0 2012-07-09 22:21:11
>> 34512201 2422054205 0 2012-07-09 06:09:56
>> 34512201 4645350592 0 2012-07-09 06:38:47
>> 34512201 5639158173 0 2012-07-09 06:58:47
>> 34512201 5656232360 0 2012-07-09 07:28:47
>> 34512201 5657320532 0 2012-07-09 05:38:47
>> 34512201 5868222883 0 2012-07-09 08:23:17
>> 34512201 8071787728 0 2012-07-09 15:19:59
>> 34512201 290419656539 0 2012-07-09 04:55:47
>> 1015826235 140002997245 0 2012-07-09 09:23:17
>> 1015826235 210002448035 0 2012-07-09 22:21:11
>> 1015826235 220003038067 0 2012-07-09 19:40:21
>> 1015826235 260003553381 0 2012-07-09 07:09:56
>> 1015826235 260003553382 0 2012-07-09 19:40:39
>> 1015826235 260003553383 0 2012-07-09 06:58:47
>> 1015826235 260003553384 0 2012-07-09 07:28:47
>> 1015826235 260003553385 0 2012-07-09 08:48:47
>> 1015826235 260003553386 0 2012-07-09 06:38:47
>> 1015826235 260003553387 0 2012-07-09 05:38:47
>> 1015826235 260003553388 0 2012-07-09 04:55:47
>> 1015826235 260003553389 0 2012-07-09 06:54:37
>> 1015826235 300003861266 0 2012-07-09 18:19:59
>>
>>
>> This is my Rank function that I have created-
>>
>> package com.example.hive.udf;
>> import org.apache.hadoop.hive.ql.exec.UDF;
>>
>> public final class Rank extends UDF{
>> private int counter;
>> private String last_key;
>> public int evaluate(final String key){
>> if ( !key.equalsIgnoreCase(this.last_key) ) {
>> this.counter = 0;
>> this.last_key = key;
>> }
>> return this.counter++;
>> }
>> }
>>
>>
>> Raihan Jamal
>>
>>
>>
>> On Tue, Jul 10, 2012 at 4:01 PM, Vijay <[email protected]> wrote:
>>>
>>> This is a little tricky but this is how it works:
>>>
>>> SELECT buyer_id, item_id, rank(item_id), created_time
>>> FROM (
>>> SELECT buyer_id, item_id, created_time
>>> FROM testingtable1
>>> DISTRIBUTE BY buyer_id, item_id
>>> SORT BY buyer_id, item_id, created_time desc
>>> ) a
>>> WHERE rank(item_id) < 10;
>>>
>>>
>>> On Tue, Jul 10, 2012 at 12:31 PM, Raihan Jamal <[email protected]>
>>> wrote:
>>> > Still it's not working with the use of my rank UDF. Below is the query
>>> > I am
>>> > using
>>> >
>>> > Can anyone help me, what changes I need to make in my below sql query?
>>> >
>>> >
>>> > CREATE TABLE IF NOT EXISTS TestingTable1
>>> >
>>> >
>>> > (
>>> >
>>> >
>>> > BUYER_ID BIGINT,
>>> >
>>> >
>>> > ITEM_ID BIGINT,
>>> >
>>> >
>>> > CREATED_TIME STRING
>>> > )
>>> >
>>> >
>>> > Find TOP 10 latest data (ITEM_ID, CREATED_TIME) for each BUYER_ID for
>>> > yesterday's date by sorting the created_time in descending order.
>>> >
>>> >
>>> > So what I was thinking is that, with the use of this below query, I
>>> > will be
>>> > getting-
>>> >
>>> >
>>> > SELECT * FROM TestingTable1 WHERE ORDER BY buyer_id, created_time DESC;
>>> >
>>> >
>>> > All the BUYER_ID and its ITEM_ID corresponding to CREATED_TIME in
>>> > Descending
>>> > order, But how I can pick TOP 10 for each BUYER_ID.?
>>> >
>>> > This is my SQL Query that I am using with the use of rank UDF.
>>> >
>>> > SELECT buyer_id, item_id, created_time, rk
>>> > FROM (
>>> > SELECT buyer_id, item_id, rank(item_id) as rk, created_time
>>> > FROM testingtable1
>>> > DISTRIBUTE BY buyer_id, item_id
>>> > SORT BY buyer_id, item_id, created_time desc
>>> > ) a
>>> > WHERE rk < 10
>>> > ORDER BY buyer_id, created_time, rk;
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > Raihan Jamal
>>> >
>>> >
>>> >
>>> > On Tue, Jul 10, 2012 at 12:16 AM, Jasper Knulst
>>> > <[email protected]>
>>> > wrote:
>>> >>
>>> >> Hi Raihan,
>>> >>
>>> >> You should use 'rank(buyer_id)' in the order by clause on line 9 in
>>> >> stead
>>> >> of the alias 'rk'. I had the same problem, strangely, the alias is not
>>> >> resolved when it is in the order by clause.
>>> >>
>>> >> Other thing, I had some issues when I used this exact same set up for
>>> >> ranking results, that the rank UDF was already implemented at the map
>>> >> phase.
>>> >> Then you get very strange results. You have to introduce an extra
>>> >> subquery
>>> >> for it to work.
>>> >>
>>> >> Jasper
>>> >>
>>> >> Op 10 jul. 2012 09:01 schreef "Raihan Jamal" <[email protected]>
>>> >> het
>>> >> volgende:
>>> >>
>>> >>> I was not able to understand, This is my below qiuery that I am using
>>> >>> currently-
>>> >>>
>>> >>> SELECT buyer_id, item_id, created_time
>>> >>> FROM (
>>> >>> SELECT buyer_id, item_id, rank(buyer_id) as rank, created_time
>>> >>> FROM testingtable1
>>> >>> DISTRIBUTE BY buyer_id, item_id
>>> >>> SORT BY buyer_id, item_id, created_time desc
>>> >>> ) a
>>> >>> WHERE rank < 10
>>> >>> ORDER BY buyer_id, created_time, rank;
>>> >>>
>>> >>>
>>> >>> What changes I need to make?
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>> Raihan Jamal
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Mon, Jul 9, 2012 at 11:52 PM, Nitin Pawar
>>> >>> <[email protected]>
>>> >>> wrote:
>>> >>>>
>>> >>>> try rk in upper select statement as well
>>> >>>>
>>> >>>>
>>> >>>> On Tue, Jul 10, 2012 at 12:12 PM, Raihan Jamal
>>> >>>> <[email protected]>
>>> >>>> wrote:
>>> >>>>>
>>> >>>>> Thanks for commenting out. Yes I figured that out, its a UDF. So
>>> >>>>> now I
>>> >>>>> have created a new UDF Rank and added to classpath also. But when I
>>> >>>>> am again
>>> >>>>> running the below query-
>>> >>>>>
>>> >>>>> SELECT buyer_id, item_id, created_time
>>> >>>>> FROM (
>>> >>>>> SELECT buyer_id, item_id, Rank(buyer_id) as rk, created_time
>>> >>>>> FROM testingtable1
>>> >>>>> DISTRIBUTE BY buyer_id, item_id
>>> >>>>> SORT BY buyer_id, item_id, created_time desc
>>> >>>>> ) a
>>> >>>>> WHERE rk < 10
>>> >>>>> ORDER BY buyer_id, rk;
>>> >>>>>
>>> >>>>> I am getting this below error-
>>> >>>>>
>>> >>>>> FAILED: Error in semantic analysis: line 9:19 Invalid Table Alias
>>> >>>>> or
>>> >>>>> Column Reference rk
>>> >>>>>
>>> >>>>>
>>> >>>>> Why is it so? Any suggestions?
>>> >>>>>
>>> >>>>>
>>> >>>>> Raihan Jamal
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> On Mon, Jul 9, 2012 at 10:51 PM, Vijay <[email protected]> wrote:
>>> >>>>>>
>>> >>>>>> hive has no built-in rank function. you'd need to use a
>>> >>>>>> user-defined
>>> >>>>>> function (UDF) to simulate it. there are a few custom
>>> >>>>>> implementations
>>> >>>>>> on the net that you can leverage.
>>> >>>>>>
>>> >>>>>> On Mon, Jul 9, 2012 at 10:40 PM, Raihan Jamal
>>> >>>>>> <[email protected]>
>>> >>>>>> wrote:
>>> >>>>>> > What's wrong with the below query.
>>> >>>>>> >
>>> >>>>>> >
>>> >>>>>> > SELECT buyer_id, item_id, created_time
>>> >>>>>> > FROM (
>>> >>>>>> > SELECT buyer_id, item_id, rank(buyer_id) as rank,
>>> >>>>>> > created_time
>>> >>>>>> > FROM testingtable1
>>> >>>>>> > DISTRIBUTE BY buyer_id, item_id
>>> >>>>>> > SORT BY buyer_id, item_id, created_time desc
>>> >>>>>> > ) a
>>> >>>>>> > WHERE rank < 10
>>> >>>>>> > ORDER BY buyer_id, rank
>>> >>>>>> >
>>> >>>>>> >
>>> >>>>>> > I am always getting as -
>>> >>>>>> >
>>> >>>>>> > FAILED: Error in semantic analysis: line 3:30 Invalid Function
>>> >>>>>> > rank
>>> >>>>>> >
>>> >>>>>> >
>>> >>>>>> > I am using Hive 0.6 I guess.
>>> >>>>>> >
>>> >>>>>> >
>>> >>>>>> >
>>> >>>>>> > Raihan Jamal
>>> >>>>>> >
>>> >>>>>
>>> >>>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> --
>>> >>>> Nitin Pawar
>>> >>>>
>>> >>>
>>> >
>>
>>
>