Re: [PySpark] Getting the best row from each group

Bjørn Jørgensen Tue, 20 Dec 2022 14:45:22 -0800

https://github.com/apache/spark/pull/39134


On Tue, 20 Dec 2022 at 22:42, Oliver Ruebenacker <
[email protected]> wrote:

> Thank you for the suggestion. This would, however, involve converting my
> DataFrame to an RDD (and back later), which incurs additional costs.
>
> On Tue, Dec 20, 2022 at 7:30 AM Raghavendra Ganesh <
> [email protected]> wrote:
>
>> You can groupBy(country) and then use the mapPartitions method, in which
>> you iterate over all rows while keeping two variables: maxPopulationSoFar
>> and the corresponding city. Then return the city with the max population.
>> As others suggested, it may also be possible to use bucketing. It would
>> give a friendlier, SQL-like way of doing this, but it would not be the
>> best in performance, as it needs to order/sort.
>> --
>> Raghavendra
>>
>>
>> On Mon, Dec 19, 2022 at 8:57 PM Oliver Ruebenacker <
>> [email protected]> wrote:
>>
>>>
>>>      Hello,
>>>
>>>   How can I retain from each group only the row for which one value is
>>> the maximum of the group? For example, imagine a DataFrame containing all
>>> major cities in the world, with three columns: (1) City name (2) Country
>>> (3) population. How would I get a DataFrame that only contains the largest
>>> city in each country? Thanks!
>>>
>>>      Best, Oliver
>>>
>>> --
>>> Oliver Ruebenacker, Ph.D. (he)
>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, 
>>> Flannick
>>> Lab <http://www.flannicklab.org/>, Broad Institute
>>> <http://www.broadinstitute.org/>
>>>
>>
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, 
> Flannick
> Lab <http://www.flannicklab.org/>, Broad Institute
> <http://www.broadinstitute.org/>
>
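
The single-pass idea described in the thread (iterate over the rows while remembering the largest population seen so far and its city) can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions: the row layout (city, country, population), the function name best_per_country, the two stand-in "partitions", and the population figures are all hypothetical placeholders, not taken from the thread. In Spark, a function of this shape is the kind of thing one might pass to mapPartitions, followed by a second pass that merges the per-partition winners.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical row layout: (city, country, population).
Row = Tuple[str, str, int]


def best_per_country(rows: Iterable[Row]) -> Iterator[Row]:
    """Return, for each country, the row with the largest population seen."""
    best: Dict[str, Row] = {}  # country -> row with the max population so far
    for row in rows:
        _, country, population = row
        current = best.get(country)
        # Keep the new row only if it beats the current maximum for its country.
        if current is None or population > current[2]:
            best[country] = row
    return iter(best.values())


# Two stand-in "partitions"; the population values are placeholders.
partitions = [
    [("Oslo", "Norway", 700), ("Boston", "USA", 650)],
    [("Bergen", "Norway", 290), ("New York", "USA", 8500)],
]

# First pass: one winner per country within each partition.
partial = [row for part in partitions for row in best_per_country(part)]

# Second pass: merge the per-partition winners into the overall result.
result = sorted(best_per_country(partial), key=lambda r: r[1])
```

Because taking a maximum is associative, the same function can be applied first within each partition and then once more to the combined winners. No sort of the full data is needed, only one comparison per row.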
