Re: Re: Selecting the top 100 records per group by?

2016-09-29 Thread Mariano Semelman
lumn("rowNumber", >> row_number().over(overLocation)).filter("rowNumber<=50") >> >> here I add a column as rownumber, get all data partitioned and get the >> top 50 rows. >> >> >> >> >> >> Th

Re: Re: Selecting the top 100 records per group by?

2016-09-12 Thread Mich Talebzadeh
...here I add a column as rowNumber, get all the data partitioned, and get the top 50 rows. Thanks and best regards! San.Luo ----- Original Message ----- From: Mich Talebzadeh <mich.talebza...@gmail.com> To: "user @spark...

Re: Re: Selecting the top 100 records per group by?

2016-09-12 Thread luohui20001
...withColumn("rowNumber", row_number().over(overLocation)).filter("rowNumber<=50"): here I add a column as rowNumber, get all the data partitioned, and get the top 50 rows. Thanks and best regards! San.Luo ----- Original Message ----- From: Mich Talebzadeh <mich.talebza...@gmail.com> To: "user @spark...

Re: Selecting the top 100 records per group by?

2016-09-11 Thread Mich Talebzadeh
You can of course do this using FP.

val wSpec = Window.partitionBy('price).orderBy(desc("price"))

df2.filter('security > " ")
   .select(dense_rank().over(wSpec).as("rank"), 'TIMECREATED, 'SECURITY, substring('PRICE, 1, 7))
   .filter('rank <= 10)
   .show

HTH, Dr Mich Talebzadeh
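A self-contained sketch of the same dense_rank() pattern; df2 and its data are illustrative stand-ins, and partitioning the window by SECURITY is an assumption added here (the snippet above partitions by price), since the thread is about a per-group top N:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank, desc, substring}

val spark = SparkSession.builder().appName("DenseRankTopN").getOrCreate()
import spark.implicits._

// Illustrative stand-in for df2: (TIMECREATED, SECURITY, PRICE) as strings.
val df2 = Seq(
  ("2016-09-11 10:00", "IBM",  "152.430001"),
  ("2016-09-11 10:01", "IBM",  "153.120000"),
  ("2016-09-11 10:00", "MSFT", "57.890000"),
  ("2016-09-11 10:01", "MSFT", "58.010000")
).toDF("TIMECREATED", "SECURITY", "PRICE")

// Partitioning by SECURITY gives a rank per security rather than one global ranking.
val wSpec = Window.partitionBy(col("SECURITY")).orderBy(desc("PRICE"))

df2.filter(col("SECURITY") > " ")
   .select(dense_rank().over(wSpec).as("rank"),
           col("TIMECREATED"), col("SECURITY"), substring(col("PRICE"), 1, 7))
   .filter(col("rank") <= 10)
   .show()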

RE: Selecting the top 100 records per group by?

2016-09-11 Thread Mendelson, Assaf
...From: burtonator2...@gmail.com [mailto:burtonator2...@gmail.com] On Behalf Of Kevin Burton Sent: Sunday, September 11, 2016 6:33 AM To: Karl Higley Cc: user@spark.apache.org Subject: Re: Selecting the top 100 records per group by? Looks like you can do it w...

Re: Selecting the top 100 records per group by?

2016-09-11 Thread Mich Talebzadeh
DENSE_RANK will give you ordering and sequence within a particular column. This is Hive:

var sqltext = """
  | SELECT RANK, timecreated, security, price
  | FROM (
  |   SELECT timecreated, security, price,
  |          DENSE_RANK() OVER (ORDER BY price DESC) AS RANK
  |...
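A sketch of how the truncated query might be completed as a per-group top N; the PARTITION BY security clause, the source table name "prices", the RANK <= 10 cut-off, and the SparkSession `spark` with Hive support are assumptions, not part of the original post:

// Assumption: a Hive table named "prices" and a SparkSession `spark` with Hive support in scope.
val sqltext =
  """
    |SELECT RANK, timecreated, security, price
    |FROM (
    |  SELECT timecreated, security, price,
    |         DENSE_RANK() OVER (PARTITION BY security ORDER BY price DESC) AS RANK
    |  FROM prices
    |) ranked
    |WHERE RANK <= 10
  """.stripMargin

spark.sql(sqltext).show()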

Re: Selecting the top 100 records per group by?

2016-09-10 Thread Kevin Burton
Looks like you can do it with the dense_rank function. https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html I set up some basic records and it seems like it did the right thing. Now time to throw 50 TB and 100 Spark nodes at this problem and see what happens :) On Sat,...

Re: Selecting the top 100 records per group by?

2016-09-10 Thread Kevin Burton
Ah, it might, actually. I'll have to mess around with that. On Sat, Sep 10, 2016 at 6:06 PM, Karl Higley wrote: > Would `topByKey` help? > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L42 > Best,

Re: Selecting the top 100 records per group by?

2016-09-10 Thread Karl Higley
Would `topByKey` help? https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L42 Best, Karl On Sat, Sep 10, 2016 at 9:04 PM Kevin Burton wrote: > I'm trying to figure out a way to group by and return the top
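For reference, a minimal sketch of the topByKey approach from MLPairRDDFunctions; the sample data and the top-2 cut-off are illustrative, and an active SparkContext `sc` is assumed:

import org.apache.spark.mllib.rdd.MLPairRDDFunctions._

// Illustrative (group, score) pairs.
val pairs = sc.parallelize(Seq(
  ("a", 3.0), ("a", 9.0), ("a", 1.0),
  ("b", 7.0), ("b", 2.0)))

// topByKey(n) keeps the n largest values per key (by the implicit ordering), sorted descending.
val top2 = pairs.topByKey(2)
top2.collect().foreach { case (k, vs) => println(s"$k -> ${vs.mkString(", ")}") }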