You can groupBy(country) and use the mapPartitions method, iterating over all rows while keeping two variables: maxPopulationSoFar and the corresponding city. Then return the city with the maximum population. As others suggested, it may also be possible to use bucketing. That would give a friendlier, more SQL-like way of doing it, but it may not be the best in performance because it needs to order/sort the data.

-- Raghavendra
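
A minimal sketch of the per-partition scan described above, written as a plain Python function so it can be tried without a cluster. The function name and the (city, country, population) row layout are assumptions for illustration, not something from the original question. It also assumes that all rows for a given country land in the same partition; otherwise the per-partition winners would need one more reduction step.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical row layout: (city, country, population).
Row = Tuple[str, str, int]


def largest_city_per_country(rows: Iterable[Row]) -> Iterator[Row]:
    """Single pass over the rows, keeping the max-population row per country."""
    best: Dict[str, Row] = {}  # country -> row with the largest population so far
    for row in rows:
        _, country, population = row
        current = best.get(country)
        # On a tie, the row seen first is kept.
        if current is None or population > current[2]:
            best[country] = row
    return iter(best.values())


# Possible PySpark usage (not executed here; assumes a DataFrame `df` whose
# columns are city, country, population in that order):
#   winners = df.repartition("country").rdd.mapPartitions(largest_city_per_country)
```

The scan keeps one entry per country in memory and never sorts, which is where it could beat an order-based approach; the trade-off is dropping down to the RDD API and rebuilding the DataFrame afterwards.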
On Mon, Dec 19, 2022 at 8:57 PM Oliver Ruebenacker <oliv...@broadinstitute.org> wrote:

> Hello,
>
> How can I retain from each group only the row for which one value is the
> maximum of the group? For example, imagine a DataFrame containing all major
> cities in the world, with three columns: (1) City name (2) Country (3)
> population. How would I get a DataFrame that only contains the largest city
> in each country? Thanks!
>
> Best, Oliver
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
> <http://www.broadinstitute.org/>