Hello, Let's say the data is like this:
+---------------+-------------------+------------+ | country | city | population | +---------------+-------------------+------------+ | Germany | Berlin | 3520031 | | Germany | Hamburg | 1787408 | | Germany | Munich | 1450381 | | Turkey | Ankara | 4587558 | | Turkey | Istanbul | 14025646 | | Turkey | Izmir | 2847691 | | United States | Chicago, IL | 2670406 | | United States | Los Angeles, CA | 4085014 | | United States | New York City, NY | 8622357 | +---------------+-------------------+------------+ I want to get the largest city in each country: +---------------+-------------------+ | country | city | +---------------+-------------------+ | Germany | Berlin | | Turkey | Istanbul | | United States | New York City, NY | +---------------+-------------------+ Thanks! Best, Oliver On Tue, Dec 20, 2022 at 5:52 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > Hi, > > Windowing functions were invented to avoid doing lengthy group by etc. > > As usual there is a lot of heat but little light > > Please provide: > > > 1. Sample input. I gather this data is stored in some csv, tsv, table > format > 2. The output that you would like to see. > > > Have a look at this article of mine Technical Analysis of the latest UK > House Price Index, Deploying Modern tools > <https://www.linkedin.com/pulse/technical-analysis-latest-uk-house-price-index-modern-mich/> > > > The PySpark code and windowing functions are here > <https://github.com/michTalebzadeh/DataScience/blob/datascience/src/workoutYearlyAveragePricesAll.py> > > > HTH > > > view my Linkedin profile > <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> > > > https://en.everybodywiki.com/Mich_Talebzadeh > > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > > On Mon, 19 Dec 2022 at 16:44, Oliver Ruebenacker < > oliv...@broadinstitute.org> wrote: > >> >> Hello, >> >> Thank you for the response! >> >> I can think of two ways to get the largest city by country, but both >> seem to be inefficient: >> >> (1) I could group by country, sort each group by population, add the >> row number within each group, and then retain only cities with a row number >> equal to 1. But it seems wasteful to sort everything when I only want the >> largest of each country >> >> (2) I could group by country, get the maximum city population for each >> country, join that with the original data frame, and then retain only >> cities with population equal to the maximum population in the country. But >> that seems also expensive because I need to join. >> >> Am I missing something? >> >> Thanks! >> >> Best, Oliver >> >> On Mon, Dec 19, 2022 at 10:59 AM Mich Talebzadeh < >> mich.talebza...@gmail.com> wrote: >> >>> In spark you can use windowing function >>> <https://sparkbyexamples.com/spark/spark-sql-window-functions/>s to >>> achieve this >>> >>> HTH >>> >>> >>> view my Linkedin profile >>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> >>> >>> >>> https://en.everybodywiki.com/Mich_Talebzadeh >>> >>> >>> >>> *Disclaimer:* Use it at your own risk. Any and all responsibility for >>> any loss, damage or destruction of data or any other property which may >>> arise from relying on this email's technical content is explicitly >>> disclaimed. The author will in no case be liable for any monetary damages >>> arising from such loss, damage or destruction. >>> >>> >>> >>> >>> On Mon, 19 Dec 2022 at 15:28, Oliver Ruebenacker < >>> oliv...@broadinstitute.org> wrote: >>> >>>> >>>> Hello, >>>> >>>> How can I retain from each group only the row for which one value is >>>> the maximum of the group? For example, imagine a DataFrame containing all >>>> major cities in the world, with three columns: (1) City name (2) Country >>>> (3) population. How would I get a DataFrame that only contains the largest >>>> city in each country? Thanks! >>>> >>>> Best, Oliver >>>> >>>> -- >>>> Oliver Ruebenacker, Ph.D. (he) >>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, >>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute >>>> <http://www.broadinstitute.org/> >>>> >>> >> >> -- >> Oliver Ruebenacker, Ph.D. (he) >> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, >> Flannick >> Lab <http://www.flannicklab.org/>, Broad Institute >> <http://www.broadinstitute.org/> >> > -- Oliver Ruebenacker, Ph.D. (he) Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad Institute <http://www.broadinstitute.org/>