Re: [PySpark] Getting the best row from each group

Bjørn Jørgensen Mon, 19 Dec 2022 11:11:04 -0800

We have pandas API on spark
<https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_ps.html>
which is very good.


from pyspark import pandas as ps

You can use pdf = df.pandas_api()
Where df is your pyspark dataframe.


[image: image.png]

Does this help you?

df.groupby(['Country'])[['Population', 'City']].max()

man. 19. des. 2022 kl. 18:22 skrev Patrick Tucci <[email protected]>:

> Window functions don't work like traditional GROUP BYs. They allow you to
> partition data and pull any relevant column, whether it's used in the
> partition or not.
>
> I'm not sure what the syntax is for PySpark, but the standard SQL would be
> something like this:
>
> WITH InputData AS
> (
>   SELECT 'USA' Country, 'New York' City, 9000000 Population
>   UNION
>   SELECT 'USA' Country, 'Miami', 6200000 Population
>   UNION
>   SELECT 'Ukraine' Country, 'Kyiv', 3000000 Population
>   UNION
>   SELECT 'Ukraine' Country, 'Kharkiv', 1400000 Population
> )
>
>  SELECT *, ROW_NUMBER() OVER(PARTITION BY Country ORDER BY Population
> DESC) PopulationRank
>  FROM InputData;
>
> Results would be something like this:
>
> Country    City       Population     PopulationRank
> Ukraine    Kyiv       3000000        1
> Ukraine    Kharkiv    1400000        2
> USA        New York   9000000        1
> USA        Miami      6200000        2
>
> Which you could further filter in another CTE or subquery where
> PopulationRank = 1.
>
> As I mentioned, I'm not sure how this translates into PySpark, but that's
> the general concept in SQL.
>
> On Mon, Dec 19, 2022 at 12:01 PM Oliver Ruebenacker <
> [email protected]> wrote:
>
>> If we only wanted to know the biggest population, max function would
>> suffice. The problem is I also want the name of the city with the biggest
>> population.
>>
>> On Mon, Dec 19, 2022 at 11:58 AM Sean Owen <[email protected]> wrote:
>>
>>> As Mich says, isn't this just max by population partitioned by country
>>> in a window function?
>>>
>>> On Mon, Dec 19, 2022, 9:45 AM Oliver Ruebenacker <
>>> [email protected]> wrote:
>>>
>>>>
>>>>      Hello,
>>>>
>>>>   Thank you for the response!
>>>>
>>>>   I can think of two ways to get the largest city by country, but both
>>>> seem to be inefficient:
>>>>
>>>>   (1) I could group by country, sort each group by population, add the
>>>> row number within each group, and then retain only cities with a row number
>>>> equal to 1. But it seems wasteful to sort everything when I only want the
>>>> largest of each country
>>>>
>>>>   (2) I could group by country, get the maximum city population for
>>>> each country, join that with the original data frame, and then retain only
>>>> cities with population equal to the maximum population in the country. But
>>>> that seems also expensive because I need to join.
>>>>
>>>>   Am I missing something?
>>>>
>>>>   Thanks!
>>>>
>>>>      Best, Oliver
>>>>
>>>> On Mon, Dec 19, 2022 at 10:59 AM Mich Talebzadeh <
>>>> [email protected]> wrote:
>>>>
>>>>> In spark you can use windowing function
>>>>> <https://sparkbyexamples.com/spark/spark-sql-window-functions/>s to
>>>>> achieve this
>>>>>
>>>>> HTH
>>>>>
>>>>>
>>>>>    view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, 19 Dec 2022 at 15:28, Oliver Ruebenacker <
>>>>> [email protected]> wrote:
>>>>>
>>>>>>
>>>>>>      Hello,
>>>>>>
>>>>>>   How can I retain from each group only the row for which one value
>>>>>> is the maximum of the group? For example, imagine a DataFrame containing
>>>>>> all major cities in the world, with three columns: (1) City name (2)
>>>>>> Country (3) population. How would I get a DataFrame that only contains 
>>>>>> the
>>>>>> largest city in each country? Thanks!
>>>>>>
>>>>>>      Best, Oliver
>>>>>>
>>>>>> --
>>>>>> Oliver Ruebenacker, Ph.D. (he)
>>>>>> Senior Software Engineer, Knowledge Portal Network
>>>>>> <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad
>>>>>> Institute <http://www.broadinstitute.org/>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Oliver Ruebenacker, Ph.D. (he)
>>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>>>> <http://www.broadinstitute.org/>
>>>>
>>>
>>
>> --
>> Oliver Ruebenacker, Ph.D. (he)
>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, 
>> Flannick
>> Lab <http://www.flannicklab.org/>, Broad Institute
>> <http://www.broadinstitute.org/>
>>
>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297

Re: [PySpark] Getting the best row from each group

Reply via email to