Try this one: "select country, city, max(population) from your_table
group by country"
Please note this returns a table of three columns instead of two. This
is a standard SQL query, and is supported by Spark as well.
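Whether an engine accepts an ungrouped `city` column next to `max(population)` varies by SQL dialect, so to pin down the intended semantics, here is a plain-Python sketch (no Spark; the function name is illustrative) of "for each country, the city with the largest population", using the sample rows quoted later in the thread:

```python
# Plain-Python sketch of "for each country, the city with the largest
# population". Rows are (country, city, population) tuples, taken from
# the sample data in this thread.
def city_with_max_population(rows):
    best = {}  # country -> (city, population)
    for country, city, population in rows:
        if country not in best or population > best[country][1]:
            best[country] = (city, population)
    return [(country, city, pop) for country, (city, pop) in best.items()]

rows = [
    ("Germany", "Berlin", 3520031),
    ("Germany", "Hamburg", 1787408),
]
print(city_with_max_population(rows))  # [('Germany', 'Berlin', 3520031)]
```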
On 12/20/22 3:35 PM, Oliver Ruebenacker wrote:
Hello,
Let's say
https://github.com/apache/spark/pull/39134
On Tue, Dec 20, 2022, 22:42 Oliver Ruebenacker <
oliv...@broadinstitute.org> wrote:
> Thank you for the suggestion. This would, however, involve converting my
> Dataframe to an RDD (and back later), which involves additional costs.
>
> On Tue, Dec 20,
Thank you for the suggestion. This would, however, involve converting my
Dataframe to an RDD (and back later), which involves additional costs.
On Tue, Dec 20, 2022 at 7:30 AM Raghavendra Ganesh wrote:
> you can groupBy(country). and use mapPartitions method in which you can
> iterate over all
Hello,
Let's say the data is like this:
+---------+---------+------------+
| country | city    | population |
+---------+---------+------------+
| Germany | Berlin  | 3520031    |
| Germany | Hamburg | 1787408    |
+---------+---------+------------+
you can groupBy(country) and use the mapPartitions method, in which you
can iterate over all rows, keeping two variables for maxPopulationSoFar
and the corresponding city. Then return the city with the max population.
I think, as others suggested, it may be possible to use bucketing; it
would give a more friendly
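The two-variable scan described above can be sketched in plain Python (outside Spark; the function name is illustrative), with rows of one country given as (city, population) pairs:

```python
# Sketch of the suggested scan: within one country's rows, keep two
# running variables, maxPopulationSoFar and the corresponding city,
# and return the winner at the end.
def max_city_in_group(rows):
    max_population_so_far = None
    best_city = None
    for city, population in rows:
        if max_population_so_far is None or population > max_population_so_far:
            max_population_so_far = population
            best_city = city
    return best_city, max_population_so_far

print(max_city_in_group([("Berlin", 3520031), ("Hamburg", 1787408)]))
# ('Berlin', 3520031)
```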
Hi,
Windowing functions were invented to avoid doing lengthy group-bys and
self-joins. As usual, there is a lot of heat but little light.
Please provide:
1. Sample input. I gather this data is stored in some CSV, TSV, or table
format.
2. The output that you would like to see.
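For reference, what a windowed query in the style of `ROW_NUMBER() OVER (PARTITION BY country ORDER BY population DESC)` filtered to rank 1 computes can be sketched in plain Python (names illustrative): order each country's rows so its most populous city comes first, then take the first row per country.

```python
from itertools import groupby

# Sketch of the windowed approach: sort rows so that, within each
# country, the city with the largest population comes first, then
# keep only the first row of each country group.
def top_city_per_country(rows):
    ordered = sorted(rows, key=lambda r: (r[0], -r[2]))
    return [next(group) for _, group in groupby(ordered, key=lambda r: r[0])]

rows = [
    ("Germany", "Hamburg", 1787408),
    ("Germany", "Berlin", 3520031),
]
print(top_city_per_country(rows))  # [('Germany', 'Berlin', 3520031)]
```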
Have a look at this