Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Artemis User
Try this one: "select country, city, max(population) from your_table group by country". Please note this returns a table of three columns, instead of two. This is a standard SQL query, and supported by Spark as well. On 12/20/22 3:35 PM, Oliver Ruebenacker wrote: > Hello, Let's say

Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Bjørn Jørgensen
https://github.com/apache/spark/pull/39134 On Tue, Dec 20, 2022, 22:42, Oliver Ruebenacker <oliv...@broadinstitute.org> wrote: > Thank you for the suggestion. This would, however, involve converting my > Dataframe to an RDD (and back later), which involves additional costs. > > On Tue, Dec 20,

Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Oliver Ruebenacker
Thank you for the suggestion. This would, however, involve converting my Dataframe to an RDD (and back later), which involves additional costs. On Tue, Dec 20, 2022 at 7:30 AM Raghavendra Ganesh wrote: > you can groupBy(country). and use mapPartitions method in which you can > iterate over all

Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Oliver Ruebenacker
Hello, Let's say the data is like this:

+---------+---------+------------+
| country | city    | population |
+---------+---------+------------+
| Germany | Berlin  | 3520031    |
| Germany | Hamburg | 1787408

Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Raghavendra Ganesh
You can groupBy(country) and use the mapPartitions method, in which you can iterate over all rows, keeping two variables: maxPopulationSoFar and the corresponding city. Then return the city with the max population. I think, as others suggested, it may be possible to use bucketing; it would give a more friendly
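
The single-pass scan described above can be sketched in plain Python. This is only an illustration of the "keep the best so far" idea, not Spark API code: the function name best_city_per_country is hypothetical, the "Examplia" rows are made-up sample data, and in Spark a function like this would presumably be applied to each partition's iterator (for example through rdd.mapPartitions), with a further merge step if one country's rows are spread over several partitions.

```python
from typing import Dict, Iterable, Tuple

Row = Tuple[str, str, int]  # (country, city, population)


def best_city_per_country(rows: Iterable[Row]) -> Dict[str, Tuple[str, int]]:
    """Scan the rows once, remembering per country the largest
    population seen so far and the city it belongs to."""
    best: Dict[str, Tuple[str, int]] = {}
    for country, city, population in rows:
        current = best.get(country)
        # Replace the stored entry only when a strictly larger population appears.
        if current is None or population > current[1]:
            best[country] = (city, population)
    return best


rows = [
    ("Germany", "Berlin", 3520031),   # from the thread's sample data
    ("Germany", "Hamburg", 1787408),  # from the thread's sample data
    ("Examplia", "Alpha", 100),       # hypothetical row
    ("Examplia", "Beta", 250),        # hypothetical row
]

result = best_city_per_country(rows)
for country, (city, population) in result.items():
    print(country, city, population)
```

Because the scan keeps the first row it sees among ties, two cities with the same population resolve to whichever arrives first; a real job would need an explicit tie-break rule if that matters.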

Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Mich Talebzadeh
Hi,

Windowing functions were invented to avoid doing lengthy group by etc. As usual, there is a lot of heat but little light. Please provide:

1. Sample input. I gather this data is stored in some csv, tsv, or table format.
2. The output that you would like to see.

Have a look at this
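
A window-function query of the kind mentioned above might look like the sketch below. It ranks rows within each country by population and keeps the top one, so the city travels with the winning row (a plain GROUP BY on country cannot carry city along unless city is also grouped or aggregated). To keep the example self-contained it runs the SQL through Python's built-in sqlite3 module as a stand-in, not through Spark; the table name cities and the "Examplia" rows are assumptions. Spark SQL accepts the same ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) syntax, and the DataFrame API offers Window.partitionBy with row_number for the same purpose.

```python
import sqlite3

rows = [
    ("Germany", "Berlin", 3520031),   # from the thread's sample data
    ("Germany", "Hamburg", 1787408),  # from the thread's sample data
    ("Examplia", "Alpha", 100),       # hypothetical row
    ("Examplia", "Beta", 250),        # hypothetical row
]

# In-memory SQLite database used only as a stand-in SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (country TEXT, city TEXT, population INTEGER)")
conn.executemany("INSERT INTO cities VALUES (?, ?, ?)", rows)

# Number the rows inside each country from largest to smallest population,
# then keep only the first row of every country.
query = """
SELECT country, city, population
FROM (
    SELECT country, city, population,
           ROW_NUMBER() OVER (PARTITION BY country ORDER BY population DESC) AS rn
    FROM cities
) AS ranked
WHERE rn = 1
ORDER BY country
"""
top_rows = conn.execute(query).fetchall()
conn.close()

for country, city, population in top_rows:
    print(country, city, population)
```

ROW_NUMBER returns exactly one row per country even when populations tie; swapping in RANK would return all tied rows instead.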