Re: [PySpark] Getting the best row from each group

2022-12-21 Thread Oliver Ruebenacker
Wow, thank you so much! On Wed, Dec 21, 2022 at 10:27 AM Mich Talebzadeh wrote: > OK let us try this > > 1) we have a csv file as below called cities.csv > > country,city,population > Germany,Berlin,3520031 > Germany,Hamburg,1787408 > Germany,Munich,1450381 > Turkey,Ankara,4587558 >

Re: [PySpark] Getting the best row from each group

2022-12-21 Thread Mich Talebzadeh
OK let us try this

1) we have a csv file as below called cities.csv

country,city,population
Germany,Berlin,3520031
Germany,Hamburg,1787408
Germany,Munich,1450381
Turkey,Ankara,4587558
Turkey,Istanbul,14025646
Turkey,Izmir,2847691
United States,Chicago IL,2670406
United States,Los Angeles
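The rest of Mich's walkthrough is cut off in this preview. A minimal sketch of the windowing approach he advocates elsewhere in the thread, applied to this cities.csv (the code below is an assumption for illustration, not his original):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("largest-city-per-country").getOrCreate()

# Read the sample file listed above; the header row supplies the column names.
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv("cities.csv"))

# Rank the cities within each country by descending population and keep rank 1.
w = Window.partitionBy("country").orderBy(F.col("population").desc())

largest = (df.withColumn("rn", F.row_number().over(w))
             .filter(F.col("rn") == 1)
             .drop("rn"))

largest.show()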

Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Artemis User
Try this one:  "select country, city, max(population) from your_table group by country" Please note this returns a table of three columns, instead of two. This is a standard SQL query, and supported by Spark as well. On 12/20/22 3:35 PM, Oliver Ruebenacker wrote: Hello,   Let's say
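As later replies point out, the goal is to keep the city name that goes with the maximum, and selecting city without grouping or aggregating it is rejected by a standard GROUP BY. A hedged sketch of one way to express this in Spark SQL (assuming Spark 3.0+ for the max_by aggregate; your_table is a placeholder name):

# Sketch only: `spark` is an existing SparkSession and your_table a registered
# temp view with country/city/population columns (placeholder names).
# max_by(city, population) returns the city paired with the maximum population
# within each country group.
result = spark.sql("""
    SELECT country,
           max_by(city, population) AS city,
           MAX(population)          AS population
    FROM your_table
    GROUP BY country
""")
result.show()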

Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Bjørn Jørgensen
https://github.com/apache/spark/pull/39134 Tue, Dec 20, 2022, 22:42 Oliver Ruebenacker < oliv...@broadinstitute.org> wrote: > Thank you for the suggestion. This would, however, involve converting my > Dataframe to an RDD (and back later), which involves additional costs. > > On Tue, Dec 20,

Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Oliver Ruebenacker
Thank you for the suggestion. This would, however, involve converting my Dataframe to an RDD (and back later), which involves additional costs. On Tue, Dec 20, 2022 at 7:30 AM Raghavendra Ganesh wrote: > you can groupBy(country). and use mapPartitions method in which you can > iterate over all

Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Oliver Ruebenacker
Hello, Let's say the data is like this:

+---------+---------+------------+
| country | city    | population |
+---------+---------+------------+
| Germany | Berlin  | 3520031    |
| Germany | Hamburg | 1787408

Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Raghavendra Ganesh
you can groupBy(country) and use the mapPartitions method, in which you can iterate over all rows keeping 2 variables for maxPopulationSoFar and the corresponding city, then return the city with max population. I think, as others suggested, it may be possible to use Bucketing; it would give a more friendly
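A rough sketch of the per-group iteration Raghavendra describes, written here with the RDD groupBy plus a plain Python max rather than mapPartitions (my simplification for illustration, not his exact code):

# df is the country/city/population DataFrame used elsewhere in the thread,
# and `spark` an existing SparkSession (assumptions for this sketch).
# Group the underlying RDD of Rows by country, then keep the row with the
# largest population in each group.
largest_rdd = (df.rdd
               .groupBy(lambda row: row.country)
               .mapValues(lambda rows: max(rows, key=lambda r: r.population))
               .values())

largest_df = spark.createDataFrame(largest_rdd)
largest_df.show()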

Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Mich Talebzadeh
Hi, Windowing functions were invented to avoid doing lengthy group by etc. As usual there is a lot of heat but little light. Please provide: 1. Sample input. I gather this data is stored in some csv, tsv, or table format. 2. The output that you would like to see. Have a look at this

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Bjørn Jørgensen
Post an example dataframe and how you would like the result to look. Mon, Dec 19, 2022 at 20:36 Oliver Ruebenacker < oliv...@broadinstitute.org> wrote: > Thank you, that is an interesting idea. Instead of finding the maximum > population, we are finding the maximum (population, city name) tuple. > > On

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Oliver Ruebenacker
Thank you, that is an interesting idea. Instead of finding the maximum population, we are finding the maximum (population, city name) tuple. On Mon, Dec 19, 2022 at 2:10 PM Bjørn Jørgensen wrote: > We have pandas API on spark >
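A sketch of that tuple idea in the DataFrame API: Spark compares structs field by field, so taking the max of a (population, city) struct per country keeps the city name along with the maximum population (assumed code, using the column names from the example data):

from pyspark.sql import functions as F

# df is the country/city/population DataFrame from the example (assumption).
# Structs compare field by field, so max((population, city)) keeps the city
# name that belongs to the largest population.
largest = (df.groupBy("country")
             .agg(F.max(F.struct("population", "city")).alias("top"))
             .select("country", "top.city", "top.population"))

largest.show()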

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Bjørn Jørgensen
We have pandas API on Spark, which is very good. from pyspark import pandas as ps You can use pdf = df.pandas_api() where df is your PySpark dataframe. Does this help you?
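For completeness, a sketch of how the group-wise maximum might look through the pandas API on Spark (assuming sort_values followed by groupby().head(1) behaves as it does in plain pandas):

# df is the PySpark DataFrame from the example (assumption).
pdf = df.pandas_api()

# Sort by population and take the first row of each country group.
largest = (pdf.sort_values("population", ascending=False)
              .groupby("country")
              .head(1))

print(largest)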

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Patrick Tucci
Window functions don't work like traditional GROUP BYs. They allow you to partition data and pull any relevant column, whether it's used in the partition or not. I'm not sure what the syntax is for PySpark, but the standard SQL would be something like this: WITH InputData AS ( SELECT 'USA'
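Patrick's query is truncated in this preview. A sketch of the kind of standard-SQL window query he describes, wrapped in spark.sql (table and column names follow the example data; this is not his original statement):

# Sketch only: `spark` is an existing SparkSession and df the example
# country/city/population DataFrame (assumptions).
df.createOrReplaceTempView("cities")

result = spark.sql("""
    WITH ranked AS (
        SELECT country,
               city,
               population,
               ROW_NUMBER() OVER (
                   PARTITION BY country
                   ORDER BY population DESC
               ) AS rn
        FROM cities
    )
    SELECT country, city, population
    FROM ranked
    WHERE rn = 1
""")
result.show()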

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Oliver Ruebenacker
If we only wanted to know the biggest population, max function would suffice. The problem is I also want the name of the city with the biggest population. On Mon, Dec 19, 2022 at 11:58 AM Sean Owen wrote: > As Mich says, isn't this just max by population partitioned by country in > a window

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Sean Owen
As Mich says, isn't this just max by population partitioned by country in a window function? On Mon, Dec 19, 2022, 9:45 AM Oliver Ruebenacker wrote: > > Hello, > > Thank you for the response! > > I can think of two ways to get the largest city by country, but both > seem to be
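One way to read Sean's suggestion in the DataFrame API: compute the per-country maximum over a window and keep the rows that match it (a sketch; ties would return more than one city per country):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# df is the country/city/population DataFrame from the example (assumption).
w = Window.partitionBy("country")

largest = (df.withColumn("max_pop", F.max("population").over(w))
             .filter(F.col("population") == F.col("max_pop"))
             .drop("max_pop"))

largest.show()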

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Oliver Ruebenacker
Hello, Thank you for the response! I can think of two ways to get the largest city by country, but both seem to be inefficient: (1) I could group by country, sort each group by population, add the row number within each group, and then retain only cities with a row number equal to 1.

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Mich Talebzadeh
In Spark you can use windowing functions to achieve this. HTH View my LinkedIn profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at your own