As Mich says, isn't this just max by population partitioned by country in a window function?

On Mon, Dec 19, 2022, 9:45 AM Oliver Ruebenacker <oliv...@broadinstitute.org> wrote:

>
> Hello,
>
> Thank you for the response!
>
> I can think of two ways to get the largest city by country, but both
> seem to be inefficient:
>
> (1) I could group by country, sort each group by population, add the row
> number within each group, and then retain only cities with a row number
> equal to 1. But it seems wasteful to sort everything when I only want the
> largest of each country
>
> (2) I could group by country, get the maximum city population for each
> country, join that with the original data frame, and then retain only
> cities with population equal to the maximum population in the country. But
> that seems also expensive because I need to join.
>
> Am I missing something?
>
> Thanks!
>
> Best, Oliver
>
> On Mon, Dec 19, 2022 at 10:59 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> In spark you can use windowing functions
>> <https://sparkbyexamples.com/spark/spark-sql-window-functions/> to
>> achieve this
>>
>> HTH
>>
>>
>> view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Mon, 19 Dec 2022 at 15:28, Oliver Ruebenacker <
>> oliv...@broadinstitute.org> wrote:
>>
>>>
>>> Hello,
>>>
>>> How can I retain from each group only the row for which one value is
>>> the maximum of the group? For example, imagine a DataFrame containing all
>>> major cities in the world, with three columns: (1) City name (2) Country
>>> (3) population. How would I get a DataFrame that only contains the largest
>>> city in each country? Thanks!
>>>
>>> Best, Oliver
>>>
>>> --
>>> Oliver Ruebenacker, Ph.D. (he)
>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>>> Flannick
>>> Lab <http://www.flannicklab.org/>, Broad Institute
>>> <http://www.broadinstitute.org/>
>>>
>>
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
> Flannick
> Lab <http://www.flannicklab.org/>, Broad Institute
> <http://www.broadinstitute.org/>
>