Thanks Sean,
the important part of your answer for me is that orderBy + limit does
only a "partial sort" because of the optimizer. That's what I was missing. I will
give it a try...
J.D.
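
One way to check the "partial sort" claim is to print the physical plan for orderBy followed by limit. The sketch below assumes a local SparkSession and a tiny made-up Dataset; the object name is hypothetical. In the Spark 2.x plans I am aware of, this shows a TakeOrderedAndProject node rather than a full Sort followed by a limit, but treat the exact node name as an assumption to verify on your version.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object PartialSortPlan {
  // Builds the orderBy + limit query discussed in the thread.
  def topByOrderLimit(ds: Dataset[Int], n: Int): Dataset[Int] =
    ds.orderBy(ds("value").desc).limit(n)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("partial-sort-plan")
      .getOrCreate()
    import spark.implicits._

    // A Dataset[Int] built from a Seq has a single column named "value".
    val ds = Seq(5, 1, 9, 3, 7, 8).toDS()
    val top3 = topByOrderLimit(ds, 3)

    top3.explain() // prints the physical plan to stdout
    top3.show()

    spark.stop()
  }
}
```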
On Mon, Sep 5, 2016 at 2:26 PM, Sean Owen wrote:
No,
I'm not advising you to use .rdd, just saying it is possible.
Although I'd only use RDDs if you had a good reason to, given Datasets
now, they are not gone or even deprecated.
You do not need to order the whole data set to get the top element.
That isn't what top does though. You might
Thanks Sean,
I was under the impression that the Spark creators are trying to persuade the user
community not to use the RDD API directly. The Spark Summit I attended was full of
this. So I am a bit surprised to hear "use the RDD API" as advice from
you. But if this is the way, then I have a second question. For
You can always call .rdd.top(n) of course. Although it's slightly
clunky, you can also .orderBy($"value".desc).take(n). Maybe there's an
easier way.
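
For reference, the two options above might look like this side by side. It is only a sketch: the local SparkSession, the sample data and the object and method names are assumptions made for illustration.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object TopNOnDataset {
  // Option 1: drop to the underlying RDD and use its top method.
  def viaRdd(ds: Dataset[Int], n: Int): Array[Int] =
    ds.rdd.top(n)

  // Option 2: stay in the Dataset API, order descending and take n rows.
  def viaOrderBy(ds: Dataset[Int], n: Int): Array[Int] =
    ds.orderBy(ds("value").desc).take(n)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("top-n-on-dataset")
      .getOrCreate()
    import spark.implicits._

    // A Dataset[Int] built from a Seq has a single column named "value".
    val ds = Seq(5, 1, 9, 3, 7, 8).toDS()

    println(viaRdd(ds, 3).mkString(", "))
    println(viaOrderBy(ds, 3).mkString(", "))

    spark.stop()
  }
}
```

Both calls return the n largest values in descending order as a local array on the driver, so n should stay small.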
I don't think there's a strong reason other than it wasn't worth it
to write this and many other utility wrappers that a) already exist on
the
Hey all,
in the RDD API there is a very useful method called top. It finds the top n records
according to a certain ordering without sorting all records. Very useful!
There is no top method or similar functionality in the Dataset API. Does
anybody have any clue why? Is there any specific reason for this?
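
To illustrate what "top n without sorting all records" means, here is a minimal plain-Scala sketch of the general bounded-heap technique. It is not Spark's actual code, and the names TopN and topN are hypothetical. It keeps at most n elements in memory and makes a single pass over the input.

```scala
import scala.collection.mutable

object TopN {
  /** Returns the n largest elements in descending order,
    * holding at most n elements in the heap at any time. */
  def topN[T](items: Iterator[T], n: Int)(implicit ord: Ordering[T]): List[T] =
    if (n <= 0) Nil
    else {
      // With the reversed ordering, heap.head is the smallest element kept so far.
      val heap = mutable.PriorityQueue.empty[T](ord.reverse)
      items.foreach { x =>
        if (heap.size < n) {
          heap.enqueue(x)
        } else if (ord.gt(x, heap.head)) {
          // x beats the current smallest survivor, so replace it.
          heap.dequeue()
          heap.enqueue(x)
        }
      }
      // Only the n survivors are sorted, not the whole input.
      heap.toList.sorted(ord.reverse)
    }
}
```

For example, TopN.topN(Iterator(5, 1, 9, 3, 7, 8), 3) yields List(9, 8, 7). The cost is roughly O(m log n) for m input records, compared with O(m log m) for a full sort.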
Any