Re: [PySpark] Getting the best row from each group

Oliver Ruebenacker Tue, 20 Dec 2022 12:35:54 -0800

     Hello,

  Let's say the data is like this:


+---------------+-------------------+------------+
| country       | city              | population |
+---------------+-------------------+------------+
| Germany       | Berlin            | 3520031    |
| Germany       | Hamburg           | 1787408    |
| Germany       | Munich            | 1450381    |
| Turkey        | Ankara            | 4587558    |
| Turkey        | Istanbul          | 14025646   |
| Turkey        | Izmir             | 2847691    |
| United States | Chicago, IL       | 2670406    |
| United States | Los Angeles, CA   | 4085014    |
| United States | New York City, NY | 8622357    |
+---------------+-------------------+------------+

I want to get the largest city in each country:

+---------------+-------------------+
| country       | city              |
+---------------+-------------------+
| Germany       | Berlin            |
| Turkey        | Istanbul          |
| United States | New York City, NY |
+---------------+-------------------+
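
For reference, the intended result can be sketched in plain Python (not Spark) as a single pass that keeps the best row seen so far for each country, with no sort and no join. The function name `largest_city_per_country` and the in-memory `rows` list are hypothetical, introduced only for illustration. In Spark the same idea would presumably be a grouped aggregation, such as taking the maximum of a (population, city) struct per country, but that mapping is an assumption and is not stated in this thread.

```python
# Plain-Python sketch (not Spark): for each country, keep the row with the
# largest population in one pass over the data, without sorting or joining.
rows = [
    ("Germany", "Berlin", 3520031),
    ("Germany", "Hamburg", 1787408),
    ("Germany", "Munich", 1450381),
    ("Turkey", "Ankara", 4587558),
    ("Turkey", "Istanbul", 14025646),
    ("Turkey", "Izmir", 2847691),
    ("United States", "Chicago, IL", 2670406),
    ("United States", "Los Angeles, CA", 4085014),
    ("United States", "New York City, NY", 8622357),
]


def largest_city_per_country(rows):
    """Return {country: city} for the most populous city in each country."""
    best = {}  # country -> (population, city) of the best row seen so far
    for country, city, population in rows:
        current = best.get(country)
        # Strict '>' means the first row seen wins when populations tie.
        if current is None or population > current[0]:
            best[country] = (population, city)
    return {country: city for country, (_, city) in best.items()}


result = largest_city_per_country(rows)
for country in sorted(result):
    print(f"{country}: {result[country]}")
```

Each row is compared only against the current best for its country, so the work is linear in the number of rows. On a tie the first row encountered is kept, and a real job would need an explicit tie-breaking rule.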

Thanks!

     Best, Oliver

On Tue, Dec 20, 2022 at 5:52 AM Mich Talebzadeh <[email protected]>
wrote:

> Hi,
>
> Windowing functions were invented to avoid lengthy GROUP BY queries and
> the like.
>
> As usual, there is a lot of heat but little light.
>
> Please provide:
>
>
>    1. Sample input. I gather this data is stored in some CSV, TSV, or
>    table format.
>    2. The output that you would like to see.
>
>
> Have a look at this article of mine: Technical Analysis of the latest UK
> House Price Index, Deploying Modern tools
> <https://www.linkedin.com/pulse/technical-analysis-latest-uk-house-price-index-modern-mich/>
>
>
> The PySpark code and windowing functions are here
> <https://github.com/michTalebzadeh/DataScience/blob/datascience/src/workoutYearlyAveragePricesAll.py>
>
>
> HTH
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 19 Dec 2022 at 16:44, Oliver Ruebenacker <
> [email protected]> wrote:
>
>>
>>      Hello,
>>
>>   Thank you for the response!
>>
>>   I can think of two ways to get the largest city by country, but both
>> seem to be inefficient:
>>
>>   (1) I could group by country, sort each group by population, add the
>> row number within each group, and then retain only cities with a row number
>> equal to 1. But it seems wasteful to sort everything when I only want the
>> largest city of each country.
>>
>>   (2) I could group by country, get the maximum city population for each
>> country, join that with the original data frame, and then retain only
>> cities with population equal to the maximum population in the country. But
>> that also seems expensive because I need to join.
>>
>>   Am I missing something?
>>
>>   Thanks!
>>
>>      Best, Oliver
>>
>> On Mon, Dec 19, 2022 at 10:59 AM Mich Talebzadeh <
>> [email protected]> wrote:
>>
>>> In Spark you can use windowing functions
>>> <https://sparkbyexamples.com/spark/spark-sql-window-functions/> to
>>> achieve this.
>>>
>>> HTH
>>>
>>> On Mon, 19 Dec 2022 at 15:28, Oliver Ruebenacker <
>>> [email protected]> wrote:
>>>
>>>>
>>>>      Hello,
>>>>
>>>>   How can I retain from each group only the row for which one value is
>>>> the maximum of the group? For example, imagine a DataFrame containing all
>>>> major cities in the world, with three columns: (1) city name, (2) country,
>>>> and (3) population. How would I get a DataFrame that only contains the
>>>> largest city in each country? Thanks!
>>>>
>>>>      Best, Oliver
>>>>
>>>> --
>>>> Oliver Ruebenacker, Ph.D. (he)
>>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>>>> <http://www.broadinstitute.org/>
>>>>
>>>
>>
>>
>

-- 
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
Flannick Lab <http://www.flannicklab.org/>, Broad Institute
<http://www.broadinstitute.org/>
