Hi,

I don't understand why you need to add a row_number column when you could use
rank or dense_rank.

Why can one not use rank or dense_rank here?
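For reference, a minimal sketch (hypothetical DataFrame df with columns
group_col and price) showing how the three behave on tied prices:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, rank, dense_rank, row_number}

// On a tie: rank() gives 1,1,3 (gaps), dense_rank() gives 1,1,2 (no gaps),
// row_number() gives 1,2,3 (unique), so only row_number caps a group at exactly N rows.
val w = Window.partitionBy(col("group_col")).orderBy(desc("price"))
df.select(col("group_col"), col("price"),
  rank().over(w).as("rank"),
  dense_rank().over(w).as("dense_rank"),
  row_number().over(w).as("row_number")).show()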

Thanks

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 12 September 2016 at 15:37, <luohui20...@sina.com> wrote:

> Hi Kevin,
> A window function is what you need, like below (row_number() needs an ordered
> window, so the ordering column shown here is illustrative):
>     import org.apache.spark.sql.expressions.Window
>     import org.apache.spark.sql.functions.row_number
>     val hivetable = hc.sql("select * from house_sale_pv_location")
>     val overLocation = Window.partitionBy(hivetable.col("lp_location_id"))
>                              .orderBy(hivetable.col("pv").desc)   // illustrative ordering column
>     val sortedDF = hivetable
>       .withColumn("rowNumber", row_number().over(overLocation))
>       .filter("rowNumber <= 50")
>
> Here I add a rowNumber column, partition the data by location, and keep the
> top 50 rows within each partition.
>
>
>
> --------------------------------
>
> Thanks & Best regards!
> San.Luo
>
> ----- Original Message -----
> From: Mich Talebzadeh <mich.talebza...@gmail.com>
> To: "user @spark" <user@spark.apache.org>
> Subject: Re: Selecting the top 100 records per group by?
> Date: 11 September 2016, 22:20
>
> You can of course do this using FP (the DataFrame API):
>
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions.{dense_rank, desc, substring}
>
> val wSpec = Window.partitionBy('price).orderBy(desc("price"))
> df2.filter('security > " ")
>    .select(dense_rank().over(wSpec).as("rank"), 'TIMECREATED, 'SECURITY, substring('PRICE, 1, 7))
>    .filter('rank <= 10)
>    .show
>
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 11 September 2016 at 07:15, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> DENSE_RANK will give you an ordering and sequence over a particular column.
> This example is in Hive SQL:
>
>  var sqltext = """
>      | SELECT RANK, timecreated,security, price
>      |      FROM (
>      |            SELECT timecreated,security, price,
>      |       DENSE_RANK() OVER (ORDER BY price DESC ) AS RANK
>      |      FROM test.prices
>      |           ) tmp
>      |      WHERE rank <= 10
>      | """
> sql(sqltext).collect.foreach(println)
>
> [1,2016-09-09 16:55:44,Esso,99.995]
> [1,2016-09-09 21:22:52,AVIVA,99.995]
> [1,2016-09-09 21:22:52,Barclays,99.995]
> [1,2016-09-09 21:24:28,JPM,99.995]
> [1,2016-09-09 21:30:38,Microsoft,99.995]
> [1,2016-09-09 21:31:12,UNILEVER,99.995]
> [2,2016-09-09 16:54:14,BP,99.99]
> [2,2016-09-09 16:54:36,Tate & Lyle,99.99]
> [2,2016-09-09 16:56:28,EASYJET,99.99]
> [2,2016-09-09 16:59:28,IBM,99.99]
> [2,2016-09-09 20:16:10,EXPERIAN,99.99]
> [2,2016-09-09 22:25:20,Microsoft,99.99]
> [2,2016-09-09 22:53:49,Tate & Lyle,99.99]
> [3,2016-09-09 15:31:06,UNILEVER,99.985]
>
> HTH
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 11 September 2016 at 04:32, Kevin Burton <bur...@spinn3r.com> wrote:
>
> Looks like you can do it with dense_rank functions.
>
> https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
>
> I set up some basic records and it seems like it did the right thing.
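>
> Roughly along these lines (a sketch; the user_id and score column names are
> illustrative):
>
>     import org.apache.spark.sql.expressions.Window
>     import org.apache.spark.sql.functions.{dense_rank, desc}
>
>     // rank posts within each user and keep the top 100 per user
>     val w = Window.partitionBy("user_id").orderBy(desc("score"))
>     posts.withColumn("rank", dense_rank().over(w)).filter("rank <= 100").drop("rank")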
>
> Now time to throw 50TB and 100 spark nodes at this problem and see what
> happens :)
>
> On Sat, Sep 10, 2016 at 7:42 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>
> Ah.. might actually. I'll have to mess around with that.
>
> On Sat, Sep 10, 2016 at 6:06 PM, Karl Higley <kmhig...@gmail.com> wrote:
>
> Would `topByKey` help?
>
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L42
>
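> Something along these lines, roughly (a sketch; scoresByUser is assumed to be
> an RDD of (userId, score) pairs, with an Ordering available on the value type):
>
>     import org.apache.spark.mllib.rdd.MLPairRDDFunctions._
>     // topByKey(n) returns an RDD[(K, Array[V])] holding the n largest values per key
>     val top100PerUser = scoresByUser.topByKey(100)
>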
> Best,
> Karl
>
> On Sat, Sep 10, 2016 at 9:04 PM Kevin Burton <bur...@spinn3r.com> wrote:
>
> I'm trying to figure out a way to group by and return the top 100 records
> in that group.
>
> Something like:
>
> SELECT TOP(100, user_id) FROM posts GROUP BY user_id;
>
> But I can't really figure out the best way to do this...
>
> There are FIRST and LAST aggregate functions, but these only return a single
> value per group.
>
> I could do something like:
>
> SELECT * FROM posts WHERE user_id IN ( /* select top users here */ ) LIMIT
> 100;
>
> But that limit is applied across ALL the records, not to each individual user.
>
> The only other thing I can think of is to do a manual map reduce and then
> have the reducer only return the top 100 each time...
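>
> Roughly something like this, maybe (untested sketch; postsByUser, Post, and
> its score field are hypothetical):
>
>     // keep the 100 highest-scoring posts per user from an RDD[(Long, Post)]
>     val top100PerUser = postsByUser.aggregateByKey(List.empty[Post])(
>       (acc, post) => (post :: acc).sortBy(-_.score).take(100),
>       (a, b) => (a ++ b).sortBy(-_.score).take(100)
>     )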
>
> Would LOVE some advice here...
>
> --
>
> We’re hiring if you know of any awesome Java Devops or Linux Operations
> Engineers!
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> <https://plus.google.com/102718274791889610666/posts>
>
>
>
>
> --
>
> We’re hiring if you know of any awesome Java Devops or Linux Operations
> Engineers!
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> <https://plus.google.com/102718274791889610666/posts>
>
>
>
>
> --
>
> We’re hiring if you know of any awesome Java Devops or Linux Operations
> Engineers!
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> <https://plus.google.com/102718274791889610666/posts>
>
>
>
>
