Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Artemis User
Try this one:  "select country, city, max(population) from your_table 
group by country"


Please note this returns a table of three columns, instead of two. This 
is a standard SQL query, and supported by Spark as well.


On 12/20/22 3:35 PM, Oliver Ruebenacker wrote:


 Hello,

  Let's say the data is like this:

+---+---++
| country   | city  | population |
+---+---++
| Germany   | Berlin | 3520031    |
| Germany   | Hamburg | 1787408    |
| Germany   | Munich | 1450381    |
| Turkey    | Ankara | 4587558    |
| Turkey    | Istanbul | 14025646   |
| Turkey    | Izmir | 2847691    |
| United States | Chicago, IL | 2670406    |
| United States | Los Angeles, CA | 4085014    |
| United States | New York City, NY | 8622357    |
+---+---++

I want to get the largest city in each country:

+---+---+
| country   | city  |
+---+---+
| Germany   | Berlin    |
| Turkey    | Istanbul    |
| United States | New York City, NY |
+---+---+

Thanks!

 Best, Oliver

On Tue, Dec 20, 2022 at 5:52 AM Mich Talebzadeh 
 wrote:


Hi,

Windowing functions were invented to avoid doing lengthy group by etc.

As usual there is a lot of heat but little light

Please provide:

 1. Sample input. I gather this data is stored in some csv, tsv,
table format
 2. The output that you would like to see.


Have a look at this article of mine Technical Analysis of the
latest UK House Price Index, Deploying Modern tools




The PySpark code and windowing functions arehere




HTH


**view my Linkedin profile



https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* Use it at your own risk.Any and all responsibility
for any loss, damage or destruction of data or any other property
which may arise from relying on this email's technical content is
explicitly disclaimed. The author will in no case be liable for
any monetary damages arising from such loss, damage or destruction.



On Mon, 19 Dec 2022 at 16:44, Oliver Ruebenacker
 wrote:


 Hello,

  Thank you for the response!

  I can think of two ways to get the largest city by country,
but both seem to be inefficient:

  (1) I could group by country, sort each group by population,
add the row number within each group, and then retain only
cities with a row number equal to 1. But it seems wasteful to
sort everything when I only want the largest of each country

  (2) I could group by country, get the maximum city
population for each country, join that with the original data
frame, and then retain only cities with population equal to
the maximum population in the country. But that seems also
expensive because I need to join.

  Am I missing something?

  Thanks!

 Best, Oliver

On Mon, Dec 19, 2022 at 10:59 AM Mich Talebzadeh
 wrote:

In spark you can use windowing function
s to
achieve this

HTH


**view my Linkedin profile



https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* Use it at your own risk.Any and all
responsibility for any loss, damage or destruction of data
or any other property which may arise from relying on this
email's technical content is explicitly disclaimed. The
author will in no case be liable for any monetary damages
arising from such loss, damage or destruction.



On Mon, 19 Dec 2022 at 15:28, Oliver Ruebenacker
 wrote:


 Hello,

  How can I retain from each group only the row for
which one value is the maximum of the group? For
example, imagine a DataFrame containing all major
cities in the world, with three columns: (1) City name
(2) Country (3) population. How would I get a
DataFrame that only contains the largest city in each
country? Thanks!

 Best, Oliver

-- 
Oliver Ruebenacker, Ph.D. (he)

Senior Software Engineer, Knowledge Portal Network
, Flannick Lab
   

Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Bjørn Jørgensen
https://github.com/apache/spark/pull/39134

tir. 20. des. 2022, 22:42 skrev Oliver Ruebenacker <
oliv...@broadinstitute.org>:

> Thank you for the suggestion. This would, however, involve converting my
> Dataframe to an RDD (and back later), which involves additional costs.
>
> On Tue, Dec 20, 2022 at 7:30 AM Raghavendra Ganesh <
> raghavendr...@gmail.com> wrote:
>
>> you can groupBy(country). and use mapPartitions method in which you can
>> iterate over all rows keeping 2 variables for maxPopulationSoFar and
>> corresponding city. Then return the city with max population.
>> I think as others suggested, it may be possible to use Bucketing, it
>> would give a more friendly SQL'ish way of doing and but not be the best in
>> performance as it needs to order/sort.
>> --
>> Raghavendra
>>
>>
>> On Mon, Dec 19, 2022 at 8:57 PM Oliver Ruebenacker <
>> oliv...@broadinstitute.org> wrote:
>>
>>>
>>>  Hello,
>>>
>>>   How can I retain from each group only the row for which one value is
>>> the maximum of the group? For example, imagine a DataFrame containing all
>>> major cities in the world, with three columns: (1) City name (2) Country
>>> (3) population. How would I get a DataFrame that only contains the largest
>>> city in each country? Thanks!
>>>
>>>  Best, Oliver
>>>
>>> --
>>> Oliver Ruebenacker, Ph.D. (he)
>>> Senior Software Engineer, Knowledge Portal Network , 
>>> Flannick
>>> Lab , Broad Institute
>>> 
>>>
>>
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network , 
> Flannick
> Lab , Broad Institute
> 
>


Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Oliver Ruebenacker
Thank you for the suggestion. This would, however, involve converting my
Dataframe to an RDD (and back later), which involves additional costs.

On Tue, Dec 20, 2022 at 7:30 AM Raghavendra Ganesh 
wrote:

> you can groupBy(country). and use mapPartitions method in which you can
> iterate over all rows keeping 2 variables for maxPopulationSoFar and
> corresponding city. Then return the city with max population.
> I think as others suggested, it may be possible to use Bucketing, it would
> give a more friendly SQL'ish way of doing and but not be the best in
> performance as it needs to order/sort.
> --
> Raghavendra
>
>
> On Mon, Dec 19, 2022 at 8:57 PM Oliver Ruebenacker <
> oliv...@broadinstitute.org> wrote:
>
>>
>>  Hello,
>>
>>   How can I retain from each group only the row for which one value is
>> the maximum of the group? For example, imagine a DataFrame containing all
>> major cities in the world, with three columns: (1) City name (2) Country
>> (3) population. How would I get a DataFrame that only contains the largest
>> city in each country? Thanks!
>>
>>  Best, Oliver
>>
>> --
>> Oliver Ruebenacker, Ph.D. (he)
>> Senior Software Engineer, Knowledge Portal Network , 
>> Flannick
>> Lab , Broad Institute
>> 
>>
>

-- 
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network
, Flannick
Lab , Broad Institute



Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Oliver Ruebenacker
 Hello,

  Let's say the data is like this:

+---+---++
| country   | city  | population |
+---+---++
| Germany   | Berlin| 3520031|
| Germany   | Hamburg   | 1787408|
| Germany   | Munich| 1450381|
| Turkey| Ankara| 4587558|
| Turkey| Istanbul  | 14025646   |
| Turkey| Izmir | 2847691|
| United States | Chicago, IL   | 2670406|
| United States | Los Angeles, CA   | 4085014|
| United States | New York City, NY | 8622357|
+---+---++

I want to get the largest city in each country:

+---+---+
| country   | city  |
+---+---+
| Germany   | Berlin|
| Turkey| Istanbul  |
| United States | New York City, NY |
+---+---+

Thanks!

 Best, Oliver

On Tue, Dec 20, 2022 at 5:52 AM Mich Talebzadeh 
wrote:

> Hi,
>
> Windowing functions were invented to avoid doing lengthy group by etc.
>
> As usual there is a lot of heat but little light
>
> Please provide:
>
>
>1. Sample input. I gather this data is stored in some csv, tsv, table
>format
>2. The output that you would like to see.
>
>
> Have a look at this article of mine  Technical Analysis of the latest UK
> House Price Index, Deploying Modern tools
> 
>
>
> The PySpark code and windowing functions are here
> 
>
>
> HTH
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 19 Dec 2022 at 16:44, Oliver Ruebenacker <
> oliv...@broadinstitute.org> wrote:
>
>>
>>  Hello,
>>
>>   Thank you for the response!
>>
>>   I can think of two ways to get the largest city by country, but both
>> seem to be inefficient:
>>
>>   (1) I could group by country, sort each group by population, add the
>> row number within each group, and then retain only cities with a row number
>> equal to 1. But it seems wasteful to sort everything when I only want the
>> largest of each country
>>
>>   (2) I could group by country, get the maximum city population for each
>> country, join that with the original data frame, and then retain only
>> cities with population equal to the maximum population in the country. But
>> that seems also expensive because I need to join.
>>
>>   Am I missing something?
>>
>>   Thanks!
>>
>>  Best, Oliver
>>
>> On Mon, Dec 19, 2022 at 10:59 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> In spark you can use windowing function
>>> s to
>>> achieve this
>>>
>>> HTH
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Mon, 19 Dec 2022 at 15:28, Oliver Ruebenacker <
>>> oliv...@broadinstitute.org> wrote:
>>>

  Hello,

   How can I retain from each group only the row for which one value is
 the maximum of the group? For example, imagine a DataFrame containing all
 major cities in the world, with three columns: (1) City name (2) Country
 (3) population. How would I get a DataFrame that only contains the largest
 city in each country? Thanks!

  Best, Oliver

 --
 Oliver Ruebenacker, Ph.D. (he)
 Senior Software Engineer, Knowledge Portal Network ,
 Flannick Lab , Broad Institute
 

>>>
>>
>> --
>> Oliver Ruebenacker, Ph.D. (he)
>> Senior Software Engineer, Knowledge Portal Network , 
>> Flannick
>> Lab , Broad Institute
>> 
>>
>

-- 
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network


Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Raghavendra Ganesh
you can groupBy(country). and use mapPartitions method in which you can
iterate over all rows keeping 2 variables for maxPopulationSoFar and
corresponding city. Then return the city with max population.
I think as others suggested, it may be possible to use Bucketing, it would
give a more friendly SQL'ish way of doing and but not be the best in
performance as it needs to order/sort.
--
Raghavendra


On Mon, Dec 19, 2022 at 8:57 PM Oliver Ruebenacker <
oliv...@broadinstitute.org> wrote:

>
>  Hello,
>
>   How can I retain from each group only the row for which one value is the
> maximum of the group? For example, imagine a DataFrame containing all major
> cities in the world, with three columns: (1) City name (2) Country (3)
> population. How would I get a DataFrame that only contains the largest city
> in each country? Thanks!
>
>  Best, Oliver
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network , 
> Flannick
> Lab , Broad Institute
> 
>


Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Mich Talebzadeh
Hi,

Windowing functions were invented to avoid doing lengthy group by etc.

As usual there is a lot of heat but little light

Please provide:


   1. Sample input. I gather this data is stored in some csv, tsv, table
   format
   2. The output that you would like to see.


Have a look at this article of mine  Technical Analysis of the latest UK
House Price Index, Deploying Modern tools



The PySpark code and windowing functions are here



HTH


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 19 Dec 2022 at 16:44, Oliver Ruebenacker 
wrote:

>
>  Hello,
>
>   Thank you for the response!
>
>   I can think of two ways to get the largest city by country, but both
> seem to be inefficient:
>
>   (1) I could group by country, sort each group by population, add the row
> number within each group, and then retain only cities with a row number
> equal to 1. But it seems wasteful to sort everything when I only want the
> largest of each country
>
>   (2) I could group by country, get the maximum city population for each
> country, join that with the original data frame, and then retain only
> cities with population equal to the maximum population in the country. But
> that seems also expensive because I need to join.
>
>   Am I missing something?
>
>   Thanks!
>
>  Best, Oliver
>
> On Mon, Dec 19, 2022 at 10:59 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> In spark you can use windowing function
>> s to
>> achieve this
>>
>> HTH
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Mon, 19 Dec 2022 at 15:28, Oliver Ruebenacker <
>> oliv...@broadinstitute.org> wrote:
>>
>>>
>>>  Hello,
>>>
>>>   How can I retain from each group only the row for which one value is
>>> the maximum of the group? For example, imagine a DataFrame containing all
>>> major cities in the world, with three columns: (1) City name (2) Country
>>> (3) population. How would I get a DataFrame that only contains the largest
>>> city in each country? Thanks!
>>>
>>>  Best, Oliver
>>>
>>> --
>>> Oliver Ruebenacker, Ph.D. (he)
>>> Senior Software Engineer, Knowledge Portal Network , 
>>> Flannick
>>> Lab , Broad Institute
>>> 
>>>
>>
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network , 
> Flannick
> Lab , Broad Institute
> 
>