Re: Why there is no top method in dataset api

2016-09-13 Thread Jakub Dubovsky
Thanks Sean, the important part of your answer for me is that orderBy + limit does only a "partial sort" because of the optimizer. That's what I was missing. I will give it a try... J.D. On Mon, Sep 5, 2016 at 2:26 PM, Sean Owen wrote: > No, I'm not advising you to use

Re: Why there is no top method in dataset api

2016-09-05 Thread Sean Owen
No, I'm not advising you to use .rdd, just saying it is possible. Although I'd only use RDDs if you had a good reason to, given Datasets now, they are not gone or even deprecated. You do not need to order the whole data set to get the top element. That isn't what top does though. You might

Re: Why there is no top method in dataset api

2016-09-05 Thread Jakub Dubovsky
Thanks Sean, I was under the impression that the Spark creators are trying to persuade the user community not to use the RDD API directly. The Spark Summit I attended was full of this. So I am a bit surprised to hear use-rdd-api as advice from you. But if this is the way, then I have a second question. For

Re: Why there is no top method in dataset api

2016-09-01 Thread Sean Owen
You can always call .rdd.top(n) of course. Although it's slightly clunky, you can also .orderBy($"value".desc).take(n). Maybe there's an easier way. I don't think there's a strong reason other than it wasn't worth it to write this and many other utility wrappers that a) already exist on the

Why there is no top method in dataset api

2016-09-01 Thread Jakub Dubovsky
Hey all, in the RDD API there is a very useful method called top. It finds the top n records according to a certain ordering without sorting all records. Very useful! There is no top method nor similar functionality in the Dataset API. Has anybody any clue why? Is there any specific reason for this? Any
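
For readers unfamiliar with the idea behind the question, here is a minimal sketch in plain Scala (no Spark dependency) of picking the top n elements without sorting everything. It scans the input once and keeps a min-heap of at most n candidates. The names TopNSketch and topN are hypothetical, chosen only for illustration, and this is not presented as Spark's implementation of RDD.top, just one common way to get the same kind of result.

```scala
import scala.collection.mutable

// Hypothetical helper for illustration; not part of Spark.
object TopNSketch {

  /** Returns the n largest elements in descending order,
    * holding at most n elements in memory at any time. */
  def topN[T](items: Iterator[T], n: Int)(implicit ord: Ordering[T]): List[T] = {
    if (n <= 0) return Nil

    // PriorityQueue exposes the greatest element under its ordering as head,
    // so with the reversed ordering the head is the smallest element kept so far.
    val heap = mutable.PriorityQueue.empty[T](ord.reverse)

    items.foreach { x =>
      if (heap.size < n) {
        heap.enqueue(x)
      } else if (ord.gt(x, heap.head)) {
        // x beats the weakest candidate: evict it and keep x instead.
        heap.dequeue()
        heap.enqueue(x)
      }
    }

    // Only the (at most n) survivors are sorted, not the whole input.
    heap.toList.sorted(ord.reverse)
  }

  def main(args: Array[String]): Unit = {
    println(topN(Iterator(5, 1, 9, 3, 7), 3)) // prints List(9, 7, 5)
  }
}
```

For m input records this does on the order of m log n comparisons instead of the m log m needed for a full sort, which is the saving the question refers to.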