If we only wanted to know the biggest population, max function would suffice. The problem is I also want the name of the city with the biggest population.
On Mon, Dec 19, 2022 at 11:58 AM Sean Owen <sro...@gmail.com> wrote: > As Mich says, isn't this just max by population partitioned by country in > a window function? > > On Mon, Dec 19, 2022, 9:45 AM Oliver Ruebenacker < > oliv...@broadinstitute.org> wrote: > >> >> Hello, >> >> Thank you for the response! >> >> I can think of two ways to get the largest city by country, but both >> seem to be inefficient: >> >> (1) I could group by country, sort each group by population, add the >> row number within each group, and then retain only cities with a row number >> equal to 1. But it seems wasteful to sort everything when I only want the >> largest of each country >> >> (2) I could group by country, get the maximum city population for each >> country, join that with the original data frame, and then retain only >> cities with population equal to the maximum population in the country. But >> that seems also expensive because I need to join. >> >> Am I missing something? >> >> Thanks! >> >> Best, Oliver >> >> On Mon, Dec 19, 2022 at 10:59 AM Mich Talebzadeh < >> mich.talebza...@gmail.com> wrote: >> >>> In spark you can use windowing function >>> <https://sparkbyexamples.com/spark/spark-sql-window-functions/>s to >>> achieve this >>> >>> HTH >>> >>> >>> view my Linkedin profile >>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> >>> >>> >>> https://en.everybodywiki.com/Mich_Talebzadeh >>> >>> >>> >>> *Disclaimer:* Use it at your own risk. Any and all responsibility for >>> any loss, damage or destruction of data or any other property which may >>> arise from relying on this email's technical content is explicitly >>> disclaimed. The author will in no case be liable for any monetary damages >>> arising from such loss, damage or destruction. >>> >>> >>> >>> >>> On Mon, 19 Dec 2022 at 15:28, Oliver Ruebenacker < >>> oliv...@broadinstitute.org> wrote: >>> >>>> >>>> Hello, >>>> >>>> How can I retain from each group only the row for which one value is >>>> the maximum of the group? For example, imagine a DataFrame containing all >>>> major cities in the world, with three columns: (1) City name (2) Country >>>> (3) population. How would I get a DataFrame that only contains the largest >>>> city in each country? Thanks! >>>> >>>> Best, Oliver >>>> >>>> -- >>>> Oliver Ruebenacker, Ph.D. (he) >>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, >>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute >>>> <http://www.broadinstitute.org/> >>>> >>> >> >> -- >> Oliver Ruebenacker, Ph.D. (he) >> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, >> Flannick >> Lab <http://www.flannicklab.org/>, Broad Institute >> <http://www.broadinstitute.org/> >> > -- Oliver Ruebenacker, Ph.D. (he) Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad Institute <http://www.broadinstitute.org/>