We have pandas API on spark <https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_ps.html> which is very good.
from pyspark import pandas as ps You can use pdf = df.pandas_api() Where df is your pyspark dataframe. [image: image.png] Does this help you? df.groupby(['Country'])[['Population', 'City']].max() man. 19. des. 2022 kl. 18:22 skrev Patrick Tucci <patrick.tu...@gmail.com>: > Window functions don't work like traditional GROUP BYs. They allow you to > partition data and pull any relevant column, whether it's used in the > partition or not. > > I'm not sure what the syntax is for PySpark, but the standard SQL would be > something like this: > > WITH InputData AS > ( > SELECT 'USA' Country, 'New York' City, 9000000 Population > UNION > SELECT 'USA' Country, 'Miami', 6200000 Population > UNION > SELECT 'Ukraine' Country, 'Kyiv', 3000000 Population > UNION > SELECT 'Ukraine' Country, 'Kharkiv', 1400000 Population > ) > > SELECT *, ROW_NUMBER() OVER(PARTITION BY Country ORDER BY Population > DESC) PopulationRank > FROM InputData; > > Results would be something like this: > > Country City Population PopulationRank > Ukraine Kyiv 3000000 1 > Ukraine Kharkiv 1400000 2 > USA New York 9000000 1 > USA Miami 6200000 2 > > Which you could further filter in another CTE or subquery where > PopulationRank = 1. > > As I mentioned, I'm not sure how this translates into PySpark, but that's > the general concept in SQL. > > On Mon, Dec 19, 2022 at 12:01 PM Oliver Ruebenacker < > oliv...@broadinstitute.org> wrote: > >> If we only wanted to know the biggest population, max function would >> suffice. The problem is I also want the name of the city with the biggest >> population. >> >> On Mon, Dec 19, 2022 at 11:58 AM Sean Owen <sro...@gmail.com> wrote: >> >>> As Mich says, isn't this just max by population partitioned by country >>> in a window function? >>> >>> On Mon, Dec 19, 2022, 9:45 AM Oliver Ruebenacker < >>> oliv...@broadinstitute.org> wrote: >>> >>>> >>>> Hello, >>>> >>>> Thank you for the response! >>>> >>>> I can think of two ways to get the largest city by country, but both >>>> seem to be inefficient: >>>> >>>> (1) I could group by country, sort each group by population, add the >>>> row number within each group, and then retain only cities with a row number >>>> equal to 1. But it seems wasteful to sort everything when I only want the >>>> largest of each country >>>> >>>> (2) I could group by country, get the maximum city population for >>>> each country, join that with the original data frame, and then retain only >>>> cities with population equal to the maximum population in the country. But >>>> that seems also expensive because I need to join. >>>> >>>> Am I missing something? >>>> >>>> Thanks! >>>> >>>> Best, Oliver >>>> >>>> On Mon, Dec 19, 2022 at 10:59 AM Mich Talebzadeh < >>>> mich.talebza...@gmail.com> wrote: >>>> >>>>> In spark you can use windowing function >>>>> <https://sparkbyexamples.com/spark/spark-sql-window-functions/>s to >>>>> achieve this >>>>> >>>>> HTH >>>>> >>>>> >>>>> view my Linkedin profile >>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> >>>>> >>>>> >>>>> https://en.everybodywiki.com/Mich_Talebzadeh >>>>> >>>>> >>>>> >>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for >>>>> any loss, damage or destruction of data or any other property which may >>>>> arise from relying on this email's technical content is explicitly >>>>> disclaimed. The author will in no case be liable for any monetary damages >>>>> arising from such loss, damage or destruction. >>>>> >>>>> >>>>> >>>>> >>>>> On Mon, 19 Dec 2022 at 15:28, Oliver Ruebenacker < >>>>> oliv...@broadinstitute.org> wrote: >>>>> >>>>>> >>>>>> Hello, >>>>>> >>>>>> How can I retain from each group only the row for which one value >>>>>> is the maximum of the group? For example, imagine a DataFrame containing >>>>>> all major cities in the world, with three columns: (1) City name (2) >>>>>> Country (3) population. How would I get a DataFrame that only contains >>>>>> the >>>>>> largest city in each country? Thanks! >>>>>> >>>>>> Best, Oliver >>>>>> >>>>>> -- >>>>>> Oliver Ruebenacker, Ph.D. (he) >>>>>> Senior Software Engineer, Knowledge Portal Network >>>>>> <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad >>>>>> Institute <http://www.broadinstitute.org/> >>>>>> >>>>> >>>> >>>> -- >>>> Oliver Ruebenacker, Ph.D. (he) >>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, >>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute >>>> <http://www.broadinstitute.org/> >>>> >>> >> >> -- >> Oliver Ruebenacker, Ph.D. (he) >> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, >> Flannick >> Lab <http://www.flannicklab.org/>, Broad Institute >> <http://www.broadinstitute.org/> >> > -- Bjørn Jørgensen Vestre Aspehaug 4, 6010 Ålesund Norge +47 480 94 297