Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Mich Talebzadeh
In Spark you can use windowing functions to achieve this.
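
For example, a minimal sketch (assuming a DataFrame with City, Country and
Population columns):

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("New York", "USA", 900), ("Miami", "USA", 620),
     ("Kyiv", "Ukraine", 300), ("Kharkiv", "Ukraine", 140)],
    ["City", "Country", "Population"])

# rank the cities within each country by descending population
w = Window.partitionBy("Country").orderBy(F.col("Population").desc())

# keep only the top-ranked city per country
df.withColumn("rn", F.row_number().over(w)).filter("rn = 1").drop("rn").show()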

HTH


view my LinkedIn profile
https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 19 Dec 2022 at 15:28, Oliver Ruebenacker wrote:

>
>  Hello,
>
>   How can I retain from each group only the row for which one value is the
> maximum of the group? For example, imagine a DataFrame containing all major
> cities in the world, with three columns: (1) City name, (2) Country, (3)
> Population. How would I get a DataFrame that only contains the largest city
> in each country? Thanks!
>
>  Best, Oliver
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network, Flannick Lab, Broad Institute
>


Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Oliver Ruebenacker
 Hello,

  Thank you for the response!

  I can think of two ways to get the largest city by country, but both seem
to be inefficient:

  (1) I could group by country, sort each group by population, add the row
number within each group, and then retain only cities with a row number
equal to 1. But it seems wasteful to sort everything when I only want the
largest city of each country.

  (2) I could group by country, get the maximum city population for each
country, join that with the original data frame, and then retain only
cities with population equal to the maximum population in the country. But
that also seems expensive because I need to join.
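
  Roughly, in code (a sketch; column names assumed to be City, Country,
Population):

from pyspark.sql import Window
from pyspark.sql import functions as F

# (1) sort within each country, keep the first row per group
w = Window.partitionBy("Country").orderBy(F.col("Population").desc())
option1 = df.withColumn("rn", F.row_number().over(w)).filter("rn = 1").drop("rn")

# (2) aggregate the max population per country, then join back
maxes = df.groupBy("Country").agg(F.max("Population").alias("MaxPop"))
option2 = (df.join(maxes, "Country")
             .filter(F.col("Population") == F.col("MaxPop"))
             .drop("MaxPop"))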

  Am I missing something?

  Thanks!

 Best, Oliver


-- 
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network, Flannick Lab, Broad Institute



Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Sean Owen
As Mich says, isn't this just max by population partitioned by country in a
window function?
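
Something like this sketch, assuming a DataFrame df with Country and
Population columns:

from pyspark.sql import Window
from pyspark.sql import functions as F

# max population per country as a window aggregate -- no per-group
# sort and no join; ties would keep more than one row per country
w = Window.partitionBy("Country")
result = (df.withColumn("max_pop", F.max("Population").over(w))
            .filter(F.col("Population") == F.col("max_pop"))
            .drop("max_pop"))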



Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Oliver Ruebenacker
If we only wanted to know the biggest population, the max function would
suffice. The problem is I also want the name of the city with the biggest
population.




Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Patrick Tucci
Window functions don't work like traditional GROUP BYs. They allow you to
partition data and pull any relevant column, whether it's used in the
partition or not.

I'm not sure what the syntax is for PySpark, but the standard SQL would be
something like this:

WITH InputData AS
(
  SELECT 'USA' Country, 'New York' City, 900 Population
  UNION
  SELECT 'USA' Country, 'Miami', 620 Population
  UNION
  SELECT 'Ukraine' Country, 'Kyiv', 300 Population
  UNION
  SELECT 'Ukraine' Country, 'Kharkiv', 140 Population
)

SELECT *, ROW_NUMBER() OVER (PARTITION BY Country ORDER BY Population DESC) PopulationRank
FROM InputData;

Results would be something like this:

Country   City       Population   PopulationRank
Ukraine   Kyiv       300          1
Ukraine   Kharkiv    140          2
USA       New York   900          1
USA       Miami      620          2

You could then filter on PopulationRank = 1 in another CTE or subquery.

As I mentioned, I'm not sure how this translates into PySpark, but that's
the general concept in SQL.
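
That said, Spark SQL should accept the query above more or less verbatim,
so one way to run it from PySpark (an untested sketch, assuming an existing
SparkSession named spark) would be:

query = """
WITH InputData AS
(
  SELECT 'USA' Country, 'New York' City, 900 Population
  UNION
  SELECT 'USA', 'Miami', 620
  UNION
  SELECT 'Ukraine', 'Kyiv', 300
  UNION
  SELECT 'Ukraine', 'Kharkiv', 140
)
SELECT * FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY Country ORDER BY Population DESC) PopulationRank
  FROM InputData
) ranked
WHERE PopulationRank = 1
"""
spark.sql(query).show()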



Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Bjørn Jørgensen
We have the pandas API on Spark, which is very good.

from pyspark import pandas as ps

You can use pdf = df.pandas_api(), where df is your PySpark DataFrame.



Does this help you?

df.groupby(['Country'])[['Population', 'City']].max()
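
A self-contained version of that sketch (note that .max() here is evaluated
per column independently, so the City value is the column-wise maximum, not
necessarily the name of the most populous city):

import pyspark.pandas as ps

# hypothetical data matching the thread's example
pdf = ps.DataFrame({
    "Country": ["USA", "USA", "Ukraine", "Ukraine"],
    "City": ["New York", "Miami", "Kyiv", "Kharkiv"],
    "Population": [900, 620, 300, 140],
})

# per-column max within each country
print(pdf.groupby(["Country"])[["Population", "City"]].max())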


-- 
Bjørn Jørgensen

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Oliver Ruebenacker
Thank you, that is an interesting idea. Instead of finding the maximum
population, we are finding the maximum (population, city name) tuple.
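
In PySpark that might look something like this (a sketch, assuming columns
City, Country, Population):

from pyspark.sql import functions as F

# structs compare field by field, so the max of (Population, City) is
# the row with the largest population, ties broken by city name
best = (df.groupBy("Country")
          .agg(F.max(F.struct("Population", "City")).alias("best"))
          .select("Country", "best.City", "best.Population"))

# Spark 3.3+ also has max_by, which expresses the same idea directly:
# df.groupBy("Country").agg(F.max_by("City", "Population").alias("City"),
#                           F.max("Population").alias("Population"))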


Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Bjørn Jørgensen
Post an example DataFrame and how you want the result to look.


Re: [Spark SQL]: unpredictable errors: java.io.IOException: can not read class org.apache.parquet.format.PageHeader

2022-12-19 Thread Eric Hanchrow
We’ve discovered a workaround for this; it’s described here.

From: Eric Hanchrow 
Date: Thursday, December 8, 2022 at 17:03
To: user@spark.apache.org 
Subject: [Spark SQL]: unpredictable errors: java.io.IOException: can not read class org.apache.parquet.format.PageHeader
My company runs java code that uses Spark to read from, and write to, Azure 
Blob storage.  This code runs more or less 24x7.

Recently we've noticed a few failures that leave stack traces in our logs; what 
they have in common are exceptions that look variously like

Caused by: java.io.IOException: can not read class org.apache.parquet.format.PageHeader: Unrecognized type 0
Caused by: java.io.IOException: can not read class org.apache.parquet.format.PageHeader : don't know what type: 14
Caused by: java.io.IOException: can not read class org.apache.parquet.format.PageHeader Required field 'num_values' was not found in serialized data!
Caused by: java.io.IOException: can not read class org.apache.parquet.format.PageHeader Required field 'uncompressed_page_size' was not found in serialized data!

I searched
https://stackoverflow.com/search?q=%5Bapache-spark%5D+java.io.IOException+can+not+read+class+org.apache.parquet.format.PageHeader
and found exactly one marginally relevant hit:
https://stackoverflow.com/questions/47211392/required-field-uncompressed-page-size-was-not-found-in-serialized-data-parque
It contains a suggested workaround which I haven't yet tried, but intend to soon.

I searched the ASF archive for user@spark.apache.org; the only hit is
https://lists.apache.org/list?user@spark.apache.org:2022-9:can%20not%20read%20class%20org.apache.parquet.format.PageHeader
which is relevant but unhelpful.

It cites https://issues.apache.org/jira/browse/SPARK-11844 which is quite 
relevant, but again unhelpful.

Unfortunately, we cannot provide the relevant parquet file to the mailing list, 
since it of course contains proprietary data.

I've posted the stack trace at 
https://gist.github.com/erich-truveta/f30d77441186a8c30c5f22f9c44bf59f

Here are various maven dependencies that might be relevant (gotten from the 
output of `mvn dependency:tree`):

org.apache.hadoop.thirdparty:hadoop-shaded-guava:jar:1.1.1
org.apache.hadoop.thirdparty:hadoop-shaded-protobuf_3_7 :jar:1.1.1

org.apache.hadoop:hadoop-annotations:jar:3.3.4
org.apache.hadoop:hadoop-auth   :jar:3.3.4
org.apache.hadoop:hadoop-azure  :jar:3.3.4
org.apache.hadoop:hadoop-client-api :jar:3.3.4
org.apache.hadoop:hadoop-client-runtime :jar:3.3.4
org.apache.hadoop:hadoop-client :jar:3.3.4
org.apache.hadoop:hadoop-common :jar:3.3.4
org.apache.hadoop:hadoop-hdfs-client:jar:3.3.4
org.apache.hadoop:hadoop-mapreduce-client-common:jar:3.3.4
org.apache.hadoop:hadoop-mapreduce-client-core  :jar:3.3.4
org.apache.hadoop:hadoop-mapreduce-client-jobclient :jar:3.3.4
org.apache.hadoop:hadoop-yarn-api   :jar:3.3.4
org.apache.hadoop:hadoop-yarn-client:jar:3.3.4
org.apache.hadoop:hadoop-yarn-common:jar:3.3.4

org.apache.hive:hive-storage-api :jar:2.7.2

org.apache.parquet:parquet-column:jar:1.12.2
org.apache.parquet:parquet-common:jar:1.12.2
org.apache.parquet:parquet-encoding  :jar:1.12.2
org.apache.parquet:parquet-format-structures :jar:1.12.2
org.apache.parquet:parquet-hadoop:jar:1.12.2
org.apache.parquet:parquet-jackson   :jar:1.12.2

org.apache.spark:spark-catalyst_2.12:jar:3.3.1
org.apache.spark:spark-core_2.12:jar:3.3.1
org.apache.spark:spark-kvstore_2.12 :jar:3.3.1
org.apache.spark:spark-launcher_2.12:jar:3.3.1
org.apache.spark:spark-network-common_2.12  :jar:3.3.1
org.apache.spark:spark-network-shuffle_2.12 :jar:3.3.1
org.apache.spark:spark-sketch_2.12  :jar:3.3.1
org.apache.spark:spark-sql_2.12 :jar:3.3.1
org.apache.spark:spark-tags_2.12:jar:3.3.1
org.apache.spark:spark-unsafe_2.12  :jar:3.3.1

Thank you for any help you can provide!