hColumn("rowNumber",
>> row_number().over(overLocation)).filter("rowNumber<=50")
>>
>> here I add a column as rownumber, get all data partitioned and get the
>> top 50 rows.
>>
>>
>>
>>
>>
>> Thanks&Best regards!
>> San.Luo
>&
re I add a column as rownumber, get all data partitioned and get the
> top 50 rows.
>
>
>
>
>
> Thanks&Best regards!
> San.Luo
>
> - 原始邮件 -
> 发件人:Mich Talebzadeh
> 收件人:"user @spark"
> 主题:Re: Selecting the top 100 recor
withColumn("rowNumber", row_number().over(overLocation)).filter("rowNumber<=50")

Here I add a column as rowNumber, partition all the data, and take the top 50 rows of each partition.

Thanks & Best regards!
San.Luo
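
A minimal self-contained sketch of this row_number approach. The message above does not show how `overLocation` is defined, so the partition and order columns below (`location`, `price`) and the DataFrame `df` are assumptions:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, desc, row_number}

    // Hypothetical window spec: the original message does not show how
    // overLocation was built, so these column names are assumptions.
    val overLocation = Window.partitionBy(col("location")).orderBy(desc("price"))

    val top50 = df
      .withColumn("rowNumber", row_number().over(overLocation))
      .filter(col("rowNumber") <= 50)   // keep the top 50 rows per partition

Because row_number assigns a unique sequence within each partition, this returns at most 50 rows per group even when values tie.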
- Original Message -
From: Mich Talebzadeh
To: "user @spark"
Subject: Re: Selecting the top 100 records per group by?
You can of course do this using FP.

    // In spark-shell; in an application, also import spark.implicits._
    // for the 'symbol column syntax.
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{dense_rank, desc, substring}

    val wSpec = Window.partitionBy('price).orderBy(desc("price"))
    df2.filter('security > " ")
      .select(dense_rank().over(wSpec).as("rank"), 'TIMECREATED, 'SECURITY, substring('PRICE, 1, 7))
      .filter('rank <= 10)
      .show
HTH
Dr Mich Talebzadeh
From: burtonator2...@gmail.com [mailto:burtonator2...@gmail.com] On Behalf Of Kevin Burton
Sent: Sunday, September 11, 2016 6:33 AM
To: Karl Higley
Cc: user@spark.apache.org
Subject: Re: Selecting the top 100 records per group by?
DENSE_RANK will give you ordering and sequence within a particular column.
This is Hive:

    var sqltext = """
      | SELECT RANK, timecreated, security, price
      | FROM (
      |   SELECT timecreated, security, price,
      |          DENSE_RANK() OVER (ORDER BY price DESC) AS RANK
      |   FROM prices  -- table name assumed; the original message is truncated here
      | ) tmp
      | WHERE RANK <= 10
      |""".stripMargin
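
A hedged sketch of executing this statement; the temp table name `prices` and the DataFrame `df2` are assumptions, since the original message is cut off:

    // Spark 1.x style, matching the era of this thread; Spark 2.x would use
    // df2.createOrReplaceTempView("prices") and spark.sql instead.
    df2.registerTempTable("prices")
    sqlContext.sql(sqltext).show(10)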
Looks like you can do it with dense_rank functions.

https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

I set up some basic records and it seems like it did the right thing.

Now time to throw 50TB and 100 spark nodes at this problem and see what
happens :)
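
A minimal sketch of the per-group variant this thread is after, with hypothetical names (`group_id`, `score`, `df`) since the test records mentioned above are not shown:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, dense_rank, desc}

    // Rank rows within each group by descending score, then keep the top 100.
    // group_id, score, and df are hypothetical names.
    val byGroup = Window.partitionBy(col("group_id")).orderBy(desc("score"))

    val top100 = df
      .withColumn("rank", dense_rank().over(byGroup))
      .filter(col("rank") <= 100)

Note that dense_rank can return more than 100 rows per group when scores tie; row_number instead caps each group at its first 100 rows.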
Ah.. might actually. I'll have to mess around with that.
On Sat, Sep 10, 2016 at 6:06 PM, Karl Higley wrote:
Would `topByKey` help?
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L42
Best,
Karl
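
A hedged sketch of what using topByKey could look like; the pair RDD contents are invented for illustration:

    import org.apache.spark.mllib.rdd.MLPairRDDFunctions.fromPairRDD

    // Hypothetical (key, score) pairs; any RDD[(K, V)] with an Ordering[V] works.
    val pairs = sc.parallelize(Seq(("a", 5.0), ("a", 1.0), ("b", 3.0), ("a", 2.0)))

    // Top 100 values per key, returned as RDD[(String, Array[Double])]. topByKey
    // keeps a bounded priority queue per key during aggregation, so whole groups
    // are never materialized on one executor.
    val top100PerKey = pairs.topByKey(100)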
On Sat, Sep 10, 2016 at 9:04 PM Kevin Burton wrote:
> I'm trying to figure out a way to group by and return the top 100 records
> in that group.