Would `topByKey` help? https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L42
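
In case it helps, here is a minimal sketch of how that might look for the posts example. The `Post` case class, its `score` field (used for ranking), and the sample rows are all hypothetical, since the original schema isn't given; it also assumes the spark-mllib artifact is on the classpath.

```scala
import org.apache.spark.mllib.rdd.MLPairRDDFunctions._ // implicit conversion that adds topByKey to pair RDDs
import org.apache.spark.sql.SparkSession

object TopPostsPerUser {
  // Hypothetical record type; the real posts schema is not shown in the thread.
  case class Post(userId: Long, postId: Long, score: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("top-by-key-sketch")
      .master("local[*]") // local mode, for illustration only
      .getOrCreate()

    // Made-up sample data.
    val posts = spark.sparkContext.parallelize(Seq(
      Post(1L, 10L, 0.5),
      Post(1L, 11L, 0.9),
      Post(2L, 20L, 0.1)
    ))

    // topByKey needs an implicit Ordering on the value type;
    // here "top" is assumed to mean "highest score".
    implicit val byScore: Ordering[Post] = Ordering.by[Post, Double](_.score)

    // Keep at most 100 posts per user, largest first under the ordering above.
    val topPerUser = posts
      .map(p => (p.userId, p))
      .topByKey(100)

    topPerUser.collect().foreach { case (userId, top) =>
      println(s"$userId -> ${top.map(_.postId).mkString(", ")}")
    }

    spark.stop()
  }
}
```

Each key ends up with an `Array` of at most 100 values, so the limit applies per user rather than to the whole result set.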
Best,
Karl

On Sat, Sep 10, 2016 at 9:04 PM Kevin Burton <bur...@spinn3r.com> wrote:

> I'm trying to figure out a way to group by and return the top 100 records
> in that group.
>
> Something like:
>
> SELECT TOP(100, user_id) FROM posts GROUP BY user_id;
>
> But I can't really figure out the best way to do this...
>
> There is a FIRST and LAST aggregate function but this only returns one
> column.
>
> I could do something like:
>
> SELECT * FROM posts WHERE user_id IN ( /* select top users here */ ) LIMIT
> 100;
>
> But that limit is applied for ALL the records. Not each individual user.
>
> The only other thing I can think of is to do a manual map reduce and then
> have the reducer only return the top 100 each time...
>
> Would LOVE some advice here...
>
> --
>
> We’re hiring if you know of any awesome Java Devops or Linux Operations
> Engineers!
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> <https://plus.google.com/102718274791889610666/posts>