https://github.com/apache/spark/pull/39134
On Tue, Dec 20, 2022 at 22:42, Oliver Ruebenacker <oliv...@broadinstitute.org> wrote:

> Thank you for the suggestion. This would, however, involve converting my
> DataFrame to an RDD (and back later), which involves additional costs.
>
> On Tue, Dec 20, 2022 at 7:30 AM Raghavendra Ganesh <raghavendr...@gmail.com> wrote:
>
>> You can groupBy(country) and use the mapPartitions method, in which you can
>> iterate over all rows, keeping two variables for maxPopulationSoFar and the
>> corresponding city. Then return the city with the max population.
>> I think, as others suggested, it may be possible to use bucketing. It
>> would give a friendlier, SQL-like way of doing it, but it would not be the
>> best in performance, as it needs to order/sort.
>> --
>> Raghavendra
>>
>> On Mon, Dec 19, 2022 at 8:57 PM Oliver Ruebenacker <oliv...@broadinstitute.org> wrote:
>>
>>> Hello,
>>>
>>> How can I retain from each group only the row for which one value is
>>> the maximum of the group? For example, imagine a DataFrame containing all
>>> major cities in the world, with three columns: (1) city name, (2) country,
>>> (3) population. How would I get a DataFrame that only contains the largest
>>> city in each country? Thanks!
>>>
>>> Best, Oliver
>>>
>>> --
>>> Oliver Ruebenacker, Ph.D. (he)
>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>>> <http://www.broadinstitute.org/>
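
For illustration, the single-pass idea from Raghavendra's message (iterate over the rows, remembering the largest population seen so far and its city) might look roughly like the plain-Python sketch below. It is not code from the thread: the function name, the (city, country, population) tuple layout, and the placeholder rows are all assumptions. In Spark, a function of this shape could be passed to RDD.mapPartitions, but a country's rows may be spread over several partitions, so the per-partition results would still need to be merged in a second step unless the data is already partitioned by country.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical row layout: (city, country, population)
Row = Tuple[str, str, int]


def largest_city_per_country(rows: Iterable[Row]) -> Iterator[Row]:
    """Single pass over the rows, keeping the largest city seen so far per country."""
    best: Dict[str, Row] = {}
    for city, country, population in rows:
        current = best.get(country)
        # Replace the remembered row only when a strictly larger population shows up.
        if current is None or population > current[2]:
            best[country] = (city, country, population)
    return iter(best.values())


# Placeholder data, not real cities or populations.
rows = [
    ("A1", "A", 100),
    ("A2", "A", 300),
    ("B1", "B", 50),
    ("A3", "A", 200),
]

result = sorted(largest_city_per_country(rows))
print(result)
```

No sorting by population is needed here, which is the performance point made in the thread; the cost is one dictionary entry per country seen in the partition.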