In that case, wouldn't this work: SELECT buyer_id, item_id, rank(buyer_id), created_time FROM ( SELECT buyer_id, item_id, created_time FROM testingtable1 DISTRIBUTE BY buyer_id SORT BY buyer_id, created_time desc ) a WHERE rank(buyer_id) < 10;
On Tue, Jul 10, 2012 at 4:21 PM, Raihan Jamal <jamalrai...@gmail.com> wrote: > So the OUTPUT should be like this everything sorted in descending order with > time and only top 10 for each BUYER_ID- > > Sample Output. > > BUYER_ID | ITEM_ID | CREATED_TIME > > > ------------+------------------+----------------------- > > > > 34512201 2412180494 2012-07-09 22:21:11 > > > 34512201 1875744030 2012-07-09 19:40:39 > > > 34512201 597245693 2012-07-09 16:20:21 > > > 34512201 8071787728 2012-07-09 15:19:59 > > > 34512201 959188449 2012-07-09 08:48:47 > > > 34512201 5868222883 2012-07-09 08:23:17 > > > 34512201 5656232360 2012-07-09 07:28:47 > > > 34512201 5639158173 2012-07-09 06:58:47 > > > 34512201 4645350592 2012-07-09 06:38:47 > > > 34512201 2422054205 2012-07-09 06:09:56 > > > 1015826235 210002448035 2012-07-09 22:21:11 > > > 1015826235 260003553382 2012-07-09 19:40:39 > > > 1015826235 220003038067 2012-07-09 19:40:21 > > > 1015826235 300003861266 2012-07-09 18:19:59 > > > 1015826235 140002997245 2012-07-09 09:23:17 > > > 1015826235 260003553385 2012-07-09 08:48:47 > > > 1015826235 260003553384 2012-07-09 07:28:47 > > > 1015826235 260003553381 2012-07-09 07:09:56 > > > 1015826235 260003553383 2012-07-09 06:58:47 > > > 1015826235 260003553389 2012-07-09 06:54:37 > > > > > > Raihan Jamal > > > > On Tue, Jul 10, 2012 at 4:18 PM, Raihan Jamal <jamalrai...@gmail.com> wrote: >> >> Thanks Vijay for reply. But it doesn't works out the way I needed. I am >> getting full data back for each BUYER_ID. Let me explain you more. >> >> This is the below data in the table and you can see in the below data this >> BUYER_ID - 1015826235 appears 13 times and this BUYER_ID -34512201 appears >> 12 times, And I need TOP 10 for each of the BUYER_ID BASIS on time- >> >> BUYER_ID | ITEM_ID | CREATED_TIME >> >> >> >> ------------+------------------+----------------------- >> >> >> >> 1015826235 220003038067 2012-07-09 19:40:21, >> >> >> >> 1015826235 300003861266 2012-07-09 18:19:59, >> >> >> >> 1015826235 140002997245 2012-07-09 09:23:17, >> >> >> >> 1015826235 210002448035 2012-07-09 22:21:11, >> >> >> >> 1015826235 260003553381 2012-07-09 07:09:56, >> >> >> >> 1015826235 260003553382 2012-07-09 19:40:39, >> >> >> >> 1015826235 260003553383 2012-07-09 06:58:47, >> >> >> >> 1015826235 260003553384 2012-07-09 07:28:47, >> >> >> >> 1015826235 260003553385 2012-07-09 08:48:47, >> >> >> >> 1015826235 260003553386 2012-07-09 06:38:47, >> >> >> >> 1015826235 260003553387 2012-07-09 05:38:47, >> >> >> >> 1015826235 260003553388 2012-07-09 04:55:47, >> >> >> >> 1015826235 260003553389 2012-07-09 06:54:37, >> >> >> >> 34512201 597245693 2012-07-09 16:20:21, >> >> >> >> 34512201 8071787728 2012-07-09 15:19:59, >> >> >> >> 34512201 5868222883 2012-07-09 08:23:17, >> >> >> >> 34512201 2412180494 2012-07-09 22:21:11, >> >> >> >> 34512201 2422054205 2012-07-09 06:09:56, >> >> >> >> 34512201 1875744030 2012-07-09 19:40:39, >> >> >> >> 34512201 5639158173 2012-07-09 06:58:47, >> >> >> >> 34512201 5656232360 2012-07-09 07:28:47, >> >> >> >> 34512201 959188449 2012-07-09 08:48:47, >> >> >> >> 34512201 4645350592 2012-07-09 06:38:47, >> >> >> >> 34512201 5657320532 2012-07-09 05:38:47, >> >> >> >> 34512201 290419656539 2012-07-09 04:55:47, >> >> >> So basically if I do it like this- >> >> SELECT * FROM TestingTable1 ORDER BY buyer_id, created_time DESC; >> >> everything will get sorted in descending order for each BUYER_ID and then >> I need to pick up TOP 10 rows for each BUYER_ID. >> >> And with your query that you just said, it's not working. I am getting >> full data back like this. Below is the result I got from your query- >> >> 34512201 597245693 0 2012-07-09 16:20:21 >> 34512201 959188449 0 2012-07-09 08:48:47 >> 34512201 1875744030 0 2012-07-09 19:40:39 >> 34512201 2412180494 0 2012-07-09 22:21:11 >> 34512201 2422054205 0 2012-07-09 06:09:56 >> 34512201 4645350592 0 2012-07-09 06:38:47 >> 34512201 5639158173 0 2012-07-09 06:58:47 >> 34512201 5656232360 0 2012-07-09 07:28:47 >> 34512201 5657320532 0 2012-07-09 05:38:47 >> 34512201 5868222883 0 2012-07-09 08:23:17 >> 34512201 8071787728 0 2012-07-09 15:19:59 >> 34512201 290419656539 0 2012-07-09 04:55:47 >> 1015826235 140002997245 0 2012-07-09 09:23:17 >> 1015826235 210002448035 0 2012-07-09 22:21:11 >> 1015826235 220003038067 0 2012-07-09 19:40:21 >> 1015826235 260003553381 0 2012-07-09 07:09:56 >> 1015826235 260003553382 0 2012-07-09 19:40:39 >> 1015826235 260003553383 0 2012-07-09 06:58:47 >> 1015826235 260003553384 0 2012-07-09 07:28:47 >> 1015826235 260003553385 0 2012-07-09 08:48:47 >> 1015826235 260003553386 0 2012-07-09 06:38:47 >> 1015826235 260003553387 0 2012-07-09 05:38:47 >> 1015826235 260003553388 0 2012-07-09 04:55:47 >> 1015826235 260003553389 0 2012-07-09 06:54:37 >> 1015826235 300003861266 0 2012-07-09 18:19:59 >> >> >> This is my Rank function that I have created- >> >> package com.example.hive.udf; >> import org.apache.hadoop.hive.ql.exec.UDF; >> >> public final class Rank extends UDF{ >> private int counter; >> private String last_key; >> public int evaluate(final String key){ >> if ( !key.equalsIgnoreCase(this.last_key) ) { >> this.counter = 0; >> this.last_key = key; >> } >> return this.counter++; >> } >> } >> >> >> Raihan Jamal >> >> >> >> On Tue, Jul 10, 2012 at 4:01 PM, Vijay <tec...@gmail.com> wrote: >>> >>> This is a little tricky but this is how it works: >>> >>> SELECT buyer_id, item_id, rank(item_id), created_time >>> FROM ( >>> SELECT buyer_id, item_id, created_time >>> FROM testingtable1 >>> DISTRIBUTE BY buyer_id, item_id >>> SORT BY buyer_id, item_id, created_time desc >>> ) a >>> WHERE rank(item_id) < 10; >>> >>> >>> On Tue, Jul 10, 2012 at 12:31 PM, Raihan Jamal <jamalrai...@gmail.com> >>> wrote: >>> > Still it's not working with the use of my rank UDF. Below is the query >>> > I am >>> > using >>> > >>> > Can anyone help me, what changes I need to make in my below sql query? >>> > >>> > >>> > CREATE TABLE IF NOT EXISTS TestingTable1 >>> > >>> > >>> > ( >>> > >>> > >>> > BUYER_ID BIGINT, >>> > >>> > >>> > ITEM_ID BIGINT, >>> > >>> > >>> > CREATED_TIME STRING >>> > ) >>> > >>> > >>> > Find TOP 10 latest data (ITEM_ID, CREATED_TIME) for each BUYER_ID for >>> > yesterday's date by sorting the created_time in descending order. >>> > >>> > >>> > So what I was thinking is that, with the use of this below query, I >>> > will be >>> > getting- >>> > >>> > >>> > SELECT * FROM TestingTable1 WHERE ORDER BY buyer_id, created_time DESC; >>> > >>> > >>> > All the BUYER_ID and its ITEM_ID corresponding to CREATED_TIME in >>> > Descending >>> > order, But how I can pick TOP 10 for each BUYER_ID.? >>> > >>> > This is my SQL Query that I am using with the use of rank UDF. >>> > >>> > SELECT buyer_id, item_id, created_time, rk >>> > FROM ( >>> > SELECT buyer_id, item_id, rank(item_id) as rk, created_time >>> > FROM testingtable1 >>> > DISTRIBUTE BY buyer_id, item_id >>> > SORT BY buyer_id, item_id, created_time desc >>> > ) a >>> > WHERE rk < 10 >>> > ORDER BY buyer_id, created_time, rk; >>> > >>> > >>> > >>> > >>> > >>> > Raihan Jamal >>> > >>> > >>> > >>> > On Tue, Jul 10, 2012 at 12:16 AM, Jasper Knulst >>> > <jasper.knu...@incentro.com> >>> > wrote: >>> >> >>> >> Hi Raihan, >>> >> >>> >> You should use 'rank(buyer_id)' in the order by clause on line 9 in >>> >> stead >>> >> of the alias 'rk'. I had the same problem, strangely, the alias is not >>> >> resolved when it is in the order by clause. >>> >> >>> >> Other thing, I had some issues when I used this exact same set up for >>> >> ranking results, that the rank UDF was already implemented at the map >>> >> phase. >>> >> Then you get very strange results. You have to introduce an extra >>> >> subquery >>> >> for it to work. >>> >> >>> >> Jasper >>> >> >>> >> Op 10 jul. 2012 09:01 schreef "Raihan Jamal" <jamalrai...@gmail.com> >>> >> het >>> >> volgende: >>> >> >>> >>> I was not able to understand, This is my below qiuery that I am using >>> >>> currently- >>> >>> >>> >>> SELECT buyer_id, item_id, created_time >>> >>> FROM ( >>> >>> SELECT buyer_id, item_id, rank(buyer_id) as rank, created_time >>> >>> FROM testingtable1 >>> >>> DISTRIBUTE BY buyer_id, item_id >>> >>> SORT BY buyer_id, item_id, created_time desc >>> >>> ) a >>> >>> WHERE rank < 10 >>> >>> ORDER BY buyer_id, created_time, rank; >>> >>> >>> >>> >>> >>> What changes I need to make? >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> Raihan Jamal >>> >>> >>> >>> >>> >>> >>> >>> On Mon, Jul 9, 2012 at 11:52 PM, Nitin Pawar >>> >>> <nitinpawar...@gmail.com> >>> >>> wrote: >>> >>>> >>> >>>> try rk in upper select statement as well >>> >>>> >>> >>>> >>> >>>> On Tue, Jul 10, 2012 at 12:12 PM, Raihan Jamal >>> >>>> <jamalrai...@gmail.com> >>> >>>> wrote: >>> >>>>> >>> >>>>> Thanks for commenting out. Yes I figured that out, its a UDF. So >>> >>>>> now I >>> >>>>> have created a new UDF Rank and added to classpath also. But when I >>> >>>>> am again >>> >>>>> running the below query- >>> >>>>> >>> >>>>> SELECT buyer_id, item_id, created_time >>> >>>>> FROM ( >>> >>>>> SELECT buyer_id, item_id, Rank(buyer_id) as rk, created_time >>> >>>>> FROM testingtable1 >>> >>>>> DISTRIBUTE BY buyer_id, item_id >>> >>>>> SORT BY buyer_id, item_id, created_time desc >>> >>>>> ) a >>> >>>>> WHERE rk < 10 >>> >>>>> ORDER BY buyer_id, rk; >>> >>>>> >>> >>>>> I am getting this below error- >>> >>>>> >>> >>>>> FAILED: Error in semantic analysis: line 9:19 Invalid Table Alias >>> >>>>> or >>> >>>>> Column Reference rk >>> >>>>> >>> >>>>> >>> >>>>> Why is it so? Any suggestions? >>> >>>>> >>> >>>>> >>> >>>>> Raihan Jamal >>> >>>>> >>> >>>>> >>> >>>>> >>> >>>>> On Mon, Jul 9, 2012 at 10:51 PM, Vijay <tec...@gmail.com> wrote: >>> >>>>>> >>> >>>>>> hive has no built-in rank function. you'd need to use a >>> >>>>>> user-defined >>> >>>>>> function (UDF) to simulate it. there are a few custom >>> >>>>>> implementations >>> >>>>>> on the net that you can leverage. >>> >>>>>> >>> >>>>>> On Mon, Jul 9, 2012 at 10:40 PM, Raihan Jamal >>> >>>>>> <jamalrai...@gmail.com> >>> >>>>>> wrote: >>> >>>>>> > What's wrong with the below query. >>> >>>>>> > >>> >>>>>> > >>> >>>>>> > SELECT buyer_id, item_id, created_time >>> >>>>>> > FROM ( >>> >>>>>> > SELECT buyer_id, item_id, rank(buyer_id) as rank, >>> >>>>>> > created_time >>> >>>>>> > FROM testingtable1 >>> >>>>>> > DISTRIBUTE BY buyer_id, item_id >>> >>>>>> > SORT BY buyer_id, item_id, created_time desc >>> >>>>>> > ) a >>> >>>>>> > WHERE rank < 10 >>> >>>>>> > ORDER BY buyer_id, rank >>> >>>>>> > >>> >>>>>> > >>> >>>>>> > I am always getting as - >>> >>>>>> > >>> >>>>>> > FAILED: Error in semantic analysis: line 3:30 Invalid Function >>> >>>>>> > rank >>> >>>>>> > >>> >>>>>> > >>> >>>>>> > I am using Hive 0.6 I guess. >>> >>>>>> > >>> >>>>>> > >>> >>>>>> > >>> >>>>>> > Raihan Jamal >>> >>>>>> > >>> >>>>> >>> >>>>> >>> >>>> >>> >>>> >>> >>>> >>> >>>> -- >>> >>>> Nitin Pawar >>> >>>> >>> >>> >>> > >> >> >