Hi Kevin,

A window function is what you need, like below:

    val hivetable = hc.sql("select * from house_sale_pv_location")
    val overLocation = Window.partitionBy(hivetable.col("lp_location_id"))
    val sortedDF = hivetable
      .withColumn("rowNumber", row_number().over(overLocation))
      .filter("rowNumber <= 50")

Here I add a column as rowNumber, partition the data, and take the top 50
rows per partition.
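
One caveat with the snippet above: without an orderBy in the window spec, row_number assigns positions within each partition in an arbitrary order, so the 50 rows kept are arbitrary too. A minimal sketch with an ordering added; the price column here is a hypothetical stand-in for whatever defines "top" in the table:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{row_number, desc}

    // "price" is a hypothetical ordering column; substitute the real one.
    val orderedWindow = Window
      .partitionBy(hivetable.col("lp_location_id"))
      .orderBy(desc("price"))

    val top50PerLocation = hivetable
      .withColumn("rowNumber", row_number().over(orderedWindow))
      .filter("rowNumber <= 50")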


--------------------------------

 

Thanks & Best regards!
San.Luo

----- Original Message -----
From: Mich Talebzadeh <mich.talebza...@gmail.com>
To: "user @spark" <user@spark.apache.org>
Subject: Re: Selecting the top 100 records per group by?
Date: 11 September 2016, 22:20

You can of course do this using FP.
    val wSpec = Window.partitionBy('price).orderBy(desc("price"))

    df2.filter('security > " ")
      .select(dense_rank().over(wSpec).as("rank"), 'TIMECREATED, 'SECURITY,
        substring('PRICE, 1, 7))
      .filter('rank <= 10)
      .show


HTH



Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com



On 11 September 2016 at 07:15, Mich Talebzadeh <mich.talebza...@gmail.com> 
wrote:
DENSE_RANK will give you the ordering and sequence within a particular column. This is Hive:

     var sqltext = """
       SELECT rank, timecreated, security, price
       FROM (
             SELECT timecreated, security, price,
                    DENSE_RANK() OVER (ORDER BY price DESC) AS rank
             FROM test.prices
            ) tmp
       WHERE rank <= 10
       """
     sql(sqltext).collect.foreach(println)

[1,2016-09-09 16:55:44,Esso,99.995]
[1,2016-09-09 21:22:52,AVIVA,99.995]
[1,2016-09-09 21:22:52,Barclays,99.995]
[1,2016-09-09 21:24:28,JPM,99.995]
[1,2016-09-09 21:30:38,Microsoft,99.995]
[1,2016-09-09 21:31:12,UNILEVER,99.995]
[2,2016-09-09 16:54:14,BP,99.99]
[2,2016-09-09 16:54:36,Tate & Lyle,99.99]
[2,2016-09-09 16:56:28,EASYJET,99.99]
[2,2016-09-09 16:59:28,IBM,99.99]
[2,2016-09-09 20:16:10,EXPERIAN,99.99]
[2,2016-09-09 22:25:20,Microsoft,99.99]
[2,2016-09-09 22:53:49,Tate & Lyle,99.99]
[3,2016-09-09 15:31:06,UNILEVER,99.985]
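
For the per-group top-N that Kevin asked about, the same query only needs a PARTITION BY inside the OVER clause. A sketch only, reusing the posts/user_id names from his question; post_id and created are hypothetical columns, and ROW_NUMBER is used instead of DENSE_RANK so that ties cannot push a group past 100 rows:

     var topPerUser = """
       SELECT user_id, post_id, created
       FROM (
             SELECT user_id, post_id, created,
                    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created DESC) AS rn
             FROM posts
            ) tmp
       WHERE rn <= 100
       """
     sql(topPerUser).collect.foreach(println)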

HTH







Dr Mich Talebzadeh



On 11 September 2016 at 04:32, Kevin Burton <bur...@spinn3r.com> wrote:
Looks like you can do it with dense_rank functions.
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

I set up some basic records and it seems like it did the right thing.
Now time to throw 50 TB and 100 Spark nodes at this problem and see what happens
:)
On Sat, Sep 10, 2016 at 7:42 PM, Kevin Burton <bur...@spinn3r.com> wrote:
Ah.. might actually. I'll have to mess around with that.
On Sat, Sep 10, 2016 at 6:06 PM, Karl Higley <kmhig...@gmail.com> wrote:
Would `topByKey` help?

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L42

Best,
Karl
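
For reference, a minimal sketch of how topByKey could apply to the original question; postsRdd, userId, and score are hypothetical names standing in for however the posts are represented as an RDD:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.rdd.MLPairRDDFunctions._

    // Hypothetical pairing: key by user, value by whatever defines "top".
    val pairs: RDD[(Long, Double)] = postsRdd.map(p => (p.userId, p.score))

    // Keeps the 100 largest values per key (per the implicit Ordering[Double]),
    // using a bounded priority queue rather than sorting each whole group.
    val top100: RDD[(Long, Array[Double])] = pairs.topByKey(100)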
On Sat, Sep 10, 2016 at 9:04 PM Kevin Burton <bur...@spinn3r.com> wrote:
I'm trying to figure out a way to group by and return the top 100 records in 
that group.
Something like:
SELECT TOP(100, user_id) FROM posts GROUP BY user_id;
But I can't really figure out the best way to do this... 
There are FIRST and LAST aggregate functions, but these only return one column.
I could do something like:
SELECT * FROM posts WHERE user_id IN ( /* select top users here */ ) LIMIT 100;
But that LIMIT is applied across ALL the records, not per individual user.
The only other thing I can think of is to do a manual map reduce and then have 
the reducer only return the top 100 each time... 
Would LOVE some advice here... 
-- 
We’re hiring if you know of any awesome Java Devops or Linux Operations 
Engineers!

Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com… or check out my Google+ profile










