Re: [PySpark] Getting the best row from each group

Bjørn Jørgensen Mon, 19 Dec 2022 11:42:37 -0800

Post an example dataframe and how you will have the result.

man. 19. des. 2022 kl. 20:36 skrev Oliver Ruebenacker <
oliv...@broadinstitute.org>:


> Thank you, that is an interesting idea. Instead of finding the maximum
> population, we are finding the maximum (population, city name) tuple.
>
> On Mon, Dec 19, 2022 at 2:10 PM Bjørn Jørgensen <bjornjorgen...@gmail.com>
> wrote:
>
>> We have pandas API on spark
>> <https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_ps.html>
>> which is very good.
>>
>> from pyspark import pandas as ps
>>
>> You can use pdf = df.pandas_api()
>> Where df is your pyspark dataframe.
>>
>>
>> [image: image.png]
>>
>> Does this help you?
>>
>> df.groupby(['Country'])[['Population', 'City']].max()
>>
>> man. 19. des. 2022 kl. 18:22 skrev Patrick Tucci <patrick.tu...@gmail.com
>> >:
>>
>>> Window functions don't work like traditional GROUP BYs. They allow you
>>> to partition data and pull any relevant column, whether it's used in the
>>> partition or not.
>>>
>>> I'm not sure what the syntax is for PySpark, but the standard SQL would
>>> be something like this:
>>>
>>> WITH InputData AS
>>> (
>>>   SELECT 'USA' Country, 'New York' City, 9000000 Population
>>>   UNION
>>>   SELECT 'USA' Country, 'Miami', 6200000 Population
>>>   UNION
>>>   SELECT 'Ukraine' Country, 'Kyiv', 3000000 Population
>>>   UNION
>>>   SELECT 'Ukraine' Country, 'Kharkiv', 1400000 Population
>>> )
>>>
>>>  SELECT *, ROW_NUMBER() OVER(PARTITION BY Country ORDER BY Population
>>> DESC) PopulationRank
>>>  FROM InputData;
>>>
>>> Results would be something like this:
>>>
>>> Country    City       Population     PopulationRank
>>> Ukraine    Kyiv       3000000        1
>>> Ukraine    Kharkiv    1400000        2
>>> USA        New York   9000000        1
>>> USA        Miami      6200000        2
>>>
>>> Which you could further filter in another CTE or subquery where
>>> PopulationRank = 1.
>>>
>>> As I mentioned, I'm not sure how this translates into PySpark, but
>>> that's the general concept in SQL.
>>>
>>> On Mon, Dec 19, 2022 at 12:01 PM Oliver Ruebenacker <
>>> oliv...@broadinstitute.org> wrote:
>>>
>>>> If we only wanted to know the biggest population, max function would
>>>> suffice. The problem is I also want the name of the city with the biggest
>>>> population.
>>>>
>>>> On Mon, Dec 19, 2022 at 11:58 AM Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> As Mich says, isn't this just max by population partitioned by country
>>>>> in a window function?
>>>>>
>>>>> On Mon, Dec 19, 2022, 9:45 AM Oliver Ruebenacker <
>>>>> oliv...@broadinstitute.org> wrote:
>>>>>
>>>>>>
>>>>>>      Hello,
>>>>>>
>>>>>>   Thank you for the response!
>>>>>>
>>>>>>   I can think of two ways to get the largest city by country, but
>>>>>> both seem to be inefficient:
>>>>>>
>>>>>>   (1) I could group by country, sort each group by population, add
>>>>>> the row number within each group, and then retain only cities with a row
>>>>>> number equal to 1. But it seems wasteful to sort everything when I only
>>>>>> want the largest of each country
>>>>>>
>>>>>>   (2) I could group by country, get the maximum city population for
>>>>>> each country, join that with the original data frame, and then retain 
>>>>>> only
>>>>>> cities with population equal to the maximum population in the country. 
>>>>>> But
>>>>>> that seems also expensive because I need to join.
>>>>>>
>>>>>>   Am I missing something?
>>>>>>
>>>>>>   Thanks!
>>>>>>
>>>>>>      Best, Oliver
>>>>>>
>>>>>> On Mon, Dec 19, 2022 at 10:59 AM Mich Talebzadeh <
>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> In spark you can use windowing function
>>>>>>> <https://sparkbyexamples.com/spark/spark-sql-window-functions/>s to
>>>>>>> achieve this
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>>
>>>>>>>    view my Linkedin profile
>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>
>>>>>>>
>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>>> may
>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>> disclaimed. The author will in no case be liable for any monetary 
>>>>>>> damages
>>>>>>> arising from such loss, damage or destruction.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, 19 Dec 2022 at 15:28, Oliver Ruebenacker <
>>>>>>> oliv...@broadinstitute.org> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>      Hello,
>>>>>>>>
>>>>>>>>   How can I retain from each group only the row for which one value
>>>>>>>> is the maximum of the group? For example, imagine a DataFrame 
>>>>>>>> containing
>>>>>>>> all major cities in the world, with three columns: (1) City name (2)
>>>>>>>> Country (3) population. How would I get a DataFrame that only contains 
>>>>>>>> the
>>>>>>>> largest city in each country? Thanks!
>>>>>>>>
>>>>>>>>      Best, Oliver
>>>>>>>>
>>>>>>>> --
>>>>>>>> Oliver Ruebenacker, Ph.D. (he)
>>>>>>>> Senior Software Engineer, Knowledge Portal Network
>>>>>>>> <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad
>>>>>>>> Institute <http://www.broadinstitute.org/>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Oliver Ruebenacker, Ph.D. (he)
>>>>>> Senior Software Engineer, Knowledge Portal Network
>>>>>> <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad
>>>>>> Institute <http://www.broadinstitute.org/>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Oliver Ruebenacker, Ph.D. (he)
>>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>>>> <http://www.broadinstitute.org/>
>>>>
>>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, 
> Flannick
> Lab <http://www.flannicklab.org/>, Broad Institute
> <http://www.broadinstitute.org/>
>


-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297

Re: [PySpark] Getting the best row from each group

Reply via email to