Post an example dataframe and how you will have the result. man. 19. des. 2022 kl. 20:36 skrev Oliver Ruebenacker < oliv...@broadinstitute.org>:
> Thank you, that is an interesting idea. Instead of finding the maximum > population, we are finding the maximum (population, city name) tuple. > > On Mon, Dec 19, 2022 at 2:10 PM Bjørn Jørgensen <bjornjorgen...@gmail.com> > wrote: > >> We have pandas API on spark >> <https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_ps.html> >> which is very good. >> >> from pyspark import pandas as ps >> >> You can use pdf = df.pandas_api() >> Where df is your pyspark dataframe. >> >> >> [image: image.png] >> >> Does this help you? >> >> df.groupby(['Country'])[['Population', 'City']].max() >> >> man. 19. des. 2022 kl. 18:22 skrev Patrick Tucci <patrick.tu...@gmail.com >> >: >> >>> Window functions don't work like traditional GROUP BYs. They allow you >>> to partition data and pull any relevant column, whether it's used in the >>> partition or not. >>> >>> I'm not sure what the syntax is for PySpark, but the standard SQL would >>> be something like this: >>> >>> WITH InputData AS >>> ( >>> SELECT 'USA' Country, 'New York' City, 9000000 Population >>> UNION >>> SELECT 'USA' Country, 'Miami', 6200000 Population >>> UNION >>> SELECT 'Ukraine' Country, 'Kyiv', 3000000 Population >>> UNION >>> SELECT 'Ukraine' Country, 'Kharkiv', 1400000 Population >>> ) >>> >>> SELECT *, ROW_NUMBER() OVER(PARTITION BY Country ORDER BY Population >>> DESC) PopulationRank >>> FROM InputData; >>> >>> Results would be something like this: >>> >>> Country City Population PopulationRank >>> Ukraine Kyiv 3000000 1 >>> Ukraine Kharkiv 1400000 2 >>> USA New York 9000000 1 >>> USA Miami 6200000 2 >>> >>> Which you could further filter in another CTE or subquery where >>> PopulationRank = 1. >>> >>> As I mentioned, I'm not sure how this translates into PySpark, but >>> that's the general concept in SQL. >>> >>> On Mon, Dec 19, 2022 at 12:01 PM Oliver Ruebenacker < >>> oliv...@broadinstitute.org> wrote: >>> >>>> If we only wanted to know the biggest population, max function would >>>> suffice. The problem is I also want the name of the city with the biggest >>>> population. >>>> >>>> On Mon, Dec 19, 2022 at 11:58 AM Sean Owen <sro...@gmail.com> wrote: >>>> >>>>> As Mich says, isn't this just max by population partitioned by country >>>>> in a window function? >>>>> >>>>> On Mon, Dec 19, 2022, 9:45 AM Oliver Ruebenacker < >>>>> oliv...@broadinstitute.org> wrote: >>>>> >>>>>> >>>>>> Hello, >>>>>> >>>>>> Thank you for the response! >>>>>> >>>>>> I can think of two ways to get the largest city by country, but >>>>>> both seem to be inefficient: >>>>>> >>>>>> (1) I could group by country, sort each group by population, add >>>>>> the row number within each group, and then retain only cities with a row >>>>>> number equal to 1. But it seems wasteful to sort everything when I only >>>>>> want the largest of each country >>>>>> >>>>>> (2) I could group by country, get the maximum city population for >>>>>> each country, join that with the original data frame, and then retain >>>>>> only >>>>>> cities with population equal to the maximum population in the country. >>>>>> But >>>>>> that seems also expensive because I need to join. >>>>>> >>>>>> Am I missing something? >>>>>> >>>>>> Thanks! >>>>>> >>>>>> Best, Oliver >>>>>> >>>>>> On Mon, Dec 19, 2022 at 10:59 AM Mich Talebzadeh < >>>>>> mich.talebza...@gmail.com> wrote: >>>>>> >>>>>>> In spark you can use windowing function >>>>>>> <https://sparkbyexamples.com/spark/spark-sql-window-functions/>s to >>>>>>> achieve this >>>>>>> >>>>>>> HTH >>>>>>> >>>>>>> >>>>>>> view my Linkedin profile >>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> >>>>>>> >>>>>>> >>>>>>> https://en.everybodywiki.com/Mich_Talebzadeh >>>>>>> >>>>>>> >>>>>>> >>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility >>>>>>> for any loss, damage or destruction of data or any other property which >>>>>>> may >>>>>>> arise from relying on this email's technical content is explicitly >>>>>>> disclaimed. The author will in no case be liable for any monetary >>>>>>> damages >>>>>>> arising from such loss, damage or destruction. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Mon, 19 Dec 2022 at 15:28, Oliver Ruebenacker < >>>>>>> oliv...@broadinstitute.org> wrote: >>>>>>> >>>>>>>> >>>>>>>> Hello, >>>>>>>> >>>>>>>> How can I retain from each group only the row for which one value >>>>>>>> is the maximum of the group? For example, imagine a DataFrame >>>>>>>> containing >>>>>>>> all major cities in the world, with three columns: (1) City name (2) >>>>>>>> Country (3) population. How would I get a DataFrame that only contains >>>>>>>> the >>>>>>>> largest city in each country? Thanks! >>>>>>>> >>>>>>>> Best, Oliver >>>>>>>> >>>>>>>> -- >>>>>>>> Oliver Ruebenacker, Ph.D. (he) >>>>>>>> Senior Software Engineer, Knowledge Portal Network >>>>>>>> <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad >>>>>>>> Institute <http://www.broadinstitute.org/> >>>>>>>> >>>>>>> >>>>>> >>>>>> -- >>>>>> Oliver Ruebenacker, Ph.D. (he) >>>>>> Senior Software Engineer, Knowledge Portal Network >>>>>> <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad >>>>>> Institute <http://www.broadinstitute.org/> >>>>>> >>>>> >>>> >>>> -- >>>> Oliver Ruebenacker, Ph.D. (he) >>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, >>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute >>>> <http://www.broadinstitute.org/> >>>> >>> >> >> -- >> Bjørn Jørgensen >> Vestre Aspehaug 4, 6010 Ålesund >> Norge >> >> +47 480 94 297 >> > > > -- > Oliver Ruebenacker, Ph.D. (he) > Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, > Flannick > Lab <http://www.flannicklab.org/>, Broad Institute > <http://www.broadinstitute.org/> > -- Bjørn Jørgensen Vestre Aspehaug 4, 6010 Ålesund Norge +47 480 94 297