As Mich says, isn't this just max by population partitioned by country in a
window function?

On Mon, Dec 19, 2022, 9:45 AM Oliver Ruebenacker <oliv...@broadinstitute.org>
wrote:

>
>      Hello,
>
>   Thank you for the response!
>
>   I can think of two ways to get the largest city by country, but both
> seem to be inefficient:
>
>   (1) I could group by country, sort each group by population, add the row
> number within each group, and then retain only cities with a row number
> equal to 1. But it seems wasteful to sort everything when I only want the
> largest of each country
>
>   (2) I could group by country, get the maximum city population for each
> country, join that with the original data frame, and then retain only
> cities with population equal to the maximum population in the country. But
> that seems also expensive because I need to join.
>
>   Am I missing something?
>
>   Thanks!
>
>      Best, Oliver
>
> On Mon, Dec 19, 2022 at 10:59 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> In spark you can use windowing function
>> <https://sparkbyexamples.com/spark/spark-sql-window-functions/>s to
>> achieve this
>>
>> HTH
>>
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Mon, 19 Dec 2022 at 15:28, Oliver Ruebenacker <
>> oliv...@broadinstitute.org> wrote:
>>
>>>
>>>      Hello,
>>>
>>>   How can I retain from each group only the row for which one value is
>>> the maximum of the group? For example, imagine a DataFrame containing all
>>> major cities in the world, with three columns: (1) City name (2) Country
>>> (3) population. How would I get a DataFrame that only contains the largest
>>> city in each country? Thanks!
>>>
>>>      Best, Oliver
>>>
>>> --
>>> Oliver Ruebenacker, Ph.D. (he)
>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, 
>>> Flannick
>>> Lab <http://www.flannicklab.org/>, Broad Institute
>>> <http://www.broadinstitute.org/>
>>>
>>
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, 
> Flannick
> Lab <http://www.flannicklab.org/>, Broad Institute
> <http://www.broadinstitute.org/>
>

Reply via email to