Thanks Sean, the important part of your answer for me is that orderBy + limit is doing only "partial sort" because of optimizer. That's what I was missing. I will give it a try...
J.D. On Mon, Sep 5, 2016 at 2:26 PM, Sean Owen <so...@cloudera.com> wrote: > No, > I'm not advising you to use .rdd, just saying it is possible. > Although I'd only use RDDs if you had a good reason to, given Datasets > now, they are not gone or even deprecated. > > You do not need to order the whole data set to get the top eleme > nt. That isn't what top does though. You might be interested to look at > the source code. Nor is it what orderBy does if the optimizer is any good. > > Computing .rdd doesn't materialize an RDD. It involves some non-zero > overhead in creating a plan, which should be minor compared to execution. > So would any computation of "top N" on a Dataset, so I don't think this is > relevant. > > > orderBy + take is already the way to accomplish "Dataset.top". It works > on Datasets, and therefore DataFrames too, for the reason you give. I'm not > sure what you're asking there. > > > On Mon, Sep 5, 2016, 13:01 Jakub Dubovsky <spark.dubovsky.ja...@gmail.com> > wrote: > >> Thanks Sean, >> >> I was under impression that spark creators are trying to persuade user >> community not to use RDD api directly. Spark summit I attended was full of >> this. So I am a bit surprised that I hear use-rdd-api as an advice from >> you. But if this is a way then I have a second question. For conversion >> from dataset to rdd I would use Dataset.rdd lazy val. Since it is a lazy >> val it suggests there is some computation going on to create rdd as a copy. >> The question is how much computationally expansive is this conversion? If >> there is a significant overhead then it is clear why one would want to have >> top method directly on Dataset class. >> >> Ordering whole dataset only to take first 10 or so top records is not >> really an acceptable option for us. Comparison function can be expansive >> and the size of dataset is (unsurprisingly) big. >> >> To be honest I do not really understand what do you mean by b). Since >> DataFrame is now only an alias for Dataset[Row] what do you mean by >> "DataFrame-like counterpart"? >> >> Thanks >> >> On Thu, Sep 1, 2016 at 2:31 PM, Sean Owen <so...@cloudera.com> wrote: >> >>> You can always call .rdd.top(n) of course. Although it's slightly >>> clunky, you can also .orderBy($"value".desc).take(n). Maybe there's an >>> easier way. >>> >>> I don't think if there's a strong reason other than it wasn't worth it >>> to write this and many other utility wrappers that a) already exist on >>> the underlying RDD API if you want them, and b) have a DataFrame-like >>> counterpart already that doesn't really need wrapping in a different >>> API. >>> >>> On Thu, Sep 1, 2016 at 12:53 PM, Jakub Dubovsky >>> <spark.dubovsky.ja...@gmail.com> wrote: >>> > Hey all, >>> > >>> > in RDD api there is very usefull method called top. It finds top n >>> records >>> > in according to certain ordering without sorting all records. Very >>> usefull! >>> > >>> > There is no top method nor similar functionality in Dataset api. Has >>> anybody >>> > any clue why? Is there any specific reason for this? >>> > >>> > Any thoughts? >>> > >>> > thanks >>> > >>> > Jakub D. >>> >> >>