Re: [Spark SQL]: unpredictable errors: java.io.IOException: can not read class org.apache.parquet.format.PageHeader

2022-12-19 Thread Eric Hanchrow
We’ve discovered a workaround for this; it’s described here. From: Eric Hanchrow Date: Thursday, December 8, 2022 at 17:03 To: user@spark.apache.org Subject: [Spark SQL]: unpredictable errors: java.io.IOException: can not read class

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Bjørn Jørgensen
Post an example dataframe and show how you want the result to look. On Mon, Dec 19, 2022 at 20:36, Oliver Ruebenacker < oliv...@broadinstitute.org> wrote: > Thank you, that is an interesting idea. Instead of finding the maximum > population, we are finding the maximum (population, city name) tuple. > > On

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Oliver Ruebenacker
Thank you, that is an interesting idea. Instead of finding the maximum population, we are finding the maximum (population, city name) tuple. On Mon, Dec 19, 2022 at 2:10 PM Bjørn Jørgensen wrote: > We have pandas API on spark >
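The tuple idea mentioned above can be sketched in plain Python (not PySpark, so it runs without a cluster). The rows, the function name best_city_per_country, and the population figures are made up for illustration. Tuples compare element by element, so the largest (population, city) tuple carries the city name along with the maximum population. In PySpark the same idea is often written as a max over a struct of the two columns, but that is an assumption about how the thread's suggestion would be implemented, not something the message confirms.

```python
# Illustrative rows only: (country, city, population). The figures are made up.
rows = [
    ("A", "Alpha", 500),
    ("A", "Beta", 900),
    ("B", "Gamma", 300),
    ("B", "Delta", 300),
]


def best_city_per_country(rows):
    """Return {country: (population, city)}, keeping the largest tuple per country."""
    best = {}
    for country, city, population in rows:
        candidate = (population, city)
        # Tuple comparison checks population first, then city name as a tie-breaker.
        if country not in best or candidate > best[country]:
            best[country] = candidate
    return best


result = best_city_per_country(rows)
for country, (population, city) in sorted(result.items()):
    print(country, city, population)
```

Note that with this ordering a tie on population is resolved in favor of the city name that sorts last, which may or may not be the desired rule.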

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Bjørn Jørgensen
We have the pandas API on Spark, which is very good: from pyspark import pandas as ps. You can use pdf = df.pandas_api(), where df is your PySpark dataframe. Does this help you?

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Patrick Tucci
Window functions don't work like traditional GROUP BYs. They allow you to partition data and pull any relevant column, whether it's used in the partition or not. I'm not sure what the syntax is for PySpark, but the standard SQL would be something like this: WITH InputData AS ( SELECT 'USA'
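As a rough sketch of the window-function approach described above, the following runs a standard SQL ROW_NUMBER() query through Python's built-in sqlite3 module rather than Spark. The table name cities, the column names, and the rows are hypothetical, and this is not the query from the original message (which is truncated here). It assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added.

```python
import sqlite3

# Hypothetical table and made-up figures, used only to show the query shape.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (country TEXT, city TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?)",
    [
        ("A", "Alpha", 500),
        ("A", "Beta", 900),
        ("B", "Gamma", 300),
        ("B", "Delta", 700),
    ],
)

# Rank cities within each country by population, then keep the top-ranked row.
# Any column (here: city) can be selected, even though it is not in PARTITION BY.
query = """
WITH ranked AS (
    SELECT country, city, population,
           ROW_NUMBER() OVER (
               PARTITION BY country
               ORDER BY population DESC, city
           ) AS rn
    FROM cities
)
SELECT country, city, population
FROM ranked
WHERE rn = 1
ORDER BY country
"""
largest = conn.execute(query).fetchall()
conn.close()

for row in largest:
    print(row)
```

The secondary ORDER BY on city only makes ties deterministic; drop it if any one of the tied rows is acceptable.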

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Oliver Ruebenacker
If we only wanted to know the biggest population, max function would suffice. The problem is I also want the name of the city with the biggest population. On Mon, Dec 19, 2022 at 11:58 AM Sean Owen wrote: > As Mich says, isn't this just max by population partitioned by country in > a window

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Sean Owen
As Mich says, isn't this just max by population partitioned by country in a window function? On Mon, Dec 19, 2022, 9:45 AM Oliver Ruebenacker wrote: > > Hello, > > Thank you for the response! > > I can think of two ways to get the largest city by country, but both > seem to be

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Oliver Ruebenacker
Hello, Thank you for the response! I can think of two ways to get the largest city by country, but both seem to be inefficient: (1) I could group by country, sort each group by population, add the row number within each group, and then retain only cities with a row number equal to 1.
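Approach (1) above can be sketched in plain Python to make the steps concrete: group by country, sort each group by population, number the rows, and keep row number 1. The data and the function name largest_by_row_number are made up for illustration, and this says nothing about how efficiently Spark would execute the equivalent plan.

```python
from itertools import groupby
from operator import itemgetter

# Illustrative rows only: (country, city, population). The figures are made up.
rows = [
    ("A", "Alpha", 500),
    ("B", "Gamma", 300),
    ("A", "Beta", 900),
    ("B", "Delta", 700),
]


def largest_by_row_number(rows):
    """Keep the row numbered 1 in each country after sorting by population descending."""
    # Sort by country so groupby sees each country as one run,
    # and by descending population inside each country.
    ordered = sorted(rows, key=lambda r: (r[0], -r[2]))
    kept = []
    for _country, group in groupby(ordered, key=itemgetter(0)):
        for row_number, row in enumerate(group, start=1):
            if row_number == 1:
                kept.append(row)
    return kept


top_rows = largest_by_row_number(rows)
for row in top_rows:
    print(row)
```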

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Mich Talebzadeh
In Spark you can use window functions to achieve this. HTH. View my LinkedIn profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at your own