Re: [PySpark] Getting the best row from each group

Artemis User Tue, 20 Dec 2022 20:34:11 -0800

Try this one: "select country, city, max(population) from your_tablegroup by country"

Please note this returns a table of three columns, instead of two. Thisis a standard SQL query, and supported by Spark as well.


On 12/20/22 3:35 PM, Oliver Ruebenacker wrote:


     Hello,

  Let's say the data is like this:

+---------------+-------------------+------------+
| country       | city              | population |
+---------------+-------------------+------------+
| Germany       | Berlin | 3520031    |
| Germany       | Hamburg | 1787408    |
| Germany       | Munich | 1450381    |
| Turkey        | Ankara | 4587558    |
| Turkey        | Istanbul | 14025646   |
| Turkey        | Izmir | 2847691    |
| United States | Chicago, IL | 2670406    |
| United States | Los Angeles, CA | 4085014    |
| United States | New York City, NY | 8622357    |
+---------------+-------------------+------------+

I want to get the largest city in each country:

+---------------+-------------------+
| country       | city              |
+---------------+-------------------+
| Germany       | Berlin    |
| Turkey        | Istanbul    |
| United States | New York City, NY |
+---------------+-------------------+

Thanks!

     Best, Oliver

On Tue, Dec 20, 2022 at 5:52 AM Mich Talebzadeh<[email protected]> wrote:


    Hi,

    Windowing functions were invented to avoid doing lengthy group by etc.

    As usual there is a lot of heat but little light

    Please provide:

     1. Sample input. I gather this data is stored in some csv, tsv,
        table format
     2. The output that you would like to see.


    Have a look at this article of mine Technical Analysis of the
    latest UK House Price Index, Deploying Modern tools
    
<https://www.linkedin.com/pulse/technical-analysis-latest-uk-house-price-index-modern-mich/>


    The PySpark code and windowing functions arehere
    
<https://github.com/michTalebzadeh/DataScience/blob/datascience/src/workoutYearlyAveragePricesAll.py>


    HTH


    **view my Linkedin profile
    <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


    https://en.everybodywiki.com/Mich_Talebzadeh

    *Disclaimer:* Use it at your own risk.Any and all responsibility
    for any loss, damage or destruction of data or any other property
    which may arise from relying on this email's technical content is
    explicitly disclaimed. The author will in no case be liable for
    any monetary damages arising from such loss, damage or destruction.



    On Mon, 19 Dec 2022 at 16:44, Oliver Ruebenacker
    <[email protected]> wrote:


             Hello,

          Thank you for the response!

          I can think of two ways to get the largest city by country,
        but both seem to be inefficient:

          (1) I could group by country, sort each group by population,
        add the row number within each group, and then retain only
        cities with a row number equal to 1. But it seems wasteful to
        sort everything when I only want the largest of each country

          (2) I could group by country, get the maximum city
        population for each country, join that with the original data
        frame, and then retain only cities with population equal to
        the maximum population in the country. But that seems also
        expensive because I need to join.

          Am I missing something?

          Thanks!

             Best, Oliver

        On Mon, Dec 19, 2022 at 10:59 AM Mich Talebzadeh
        <[email protected]> wrote:

            In spark you can use windowing function
            <https://sparkbyexamples.com/spark/spark-sql-window-functions/>s to
            achieve this

            HTH


            **view my Linkedin profile
            <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


            https://en.everybodywiki.com/Mich_Talebzadeh

            *Disclaimer:* Use it at your own risk.Any and all
            responsibility for any loss, damage or destruction of data
            or any other property which may arise from relying on this
            email's technical content is explicitly disclaimed. The
            author will in no case be liable for any monetary damages
            arising from such loss, damage or destruction.



            On Mon, 19 Dec 2022 at 15:28, Oliver Ruebenacker
            <[email protected]> wrote:


                     Hello,

                  How can I retain from each group only the row for
                which one value is the maximum of the group? For
                example, imagine a DataFrame containing all major
                cities in the world, with three columns: (1) City name
                (2) Country (3) population. How would I get a
                DataFrame that only contains the largest city in each
                country? Thanks!

                     Best, Oliver

--Oliver Ruebenacker, Ph.D. (he)

                Senior Software Engineer, Knowledge Portal Network
                <http://kp4cd.org/>, Flannick Lab
                <http://www.flannicklab.org/>, Broad Institute
                <http://www.broadinstitute.org/>

--Oliver Ruebenacker, Ph.D. (he)

        Senior Software Engineer, Knowledge Portal Network
        <http://kp4cd.org/>, Flannick Lab
        <http://www.flannicklab.org/>, Broad Institute
        <http://www.broadinstitute.org/>



--
Oliver Ruebenacker, Ph.D. (he)

Senior Software Engineer, Knowledge Portal Network<http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, BroadInstitute <http://www.broadinstitute.org/>

Re: [PySpark] Getting the best row from each group

Reply via email to