Why can't this be done with a window function? Or is min_by just shorthand for that?
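For context, the window-based version I have in mind would look something like this (an untested sketch, assuming the same df with productId, sellerId and price columns):

import pyspark.sql.functions as F
from pyspark.sql import Window

# Rank sellers within each product by ascending price, keep the cheapest one.
w = Window.partitionBy('productId').orderBy(F.col('price').asc())
cheapest_sellers_df = (
    df.withColumn('rn', F.row_number().over(w))
      .filter(F.col('rn') == 1)
      .drop('rn')
)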
On Tue, 22 Feb 2022 at 12:42 am, Sean Owen <sro...@gmail.com> wrote:

> From the source code, it looks like this function was added to pyspark in
> Spark 3.3, up for release soon. It exists in SQL. You can still use it in
> SQL with `spark.sql(...)` in Python though, not hard.
>
> On Mon, Feb 21, 2022 at 4:01 AM David Diebold <davidjdieb...@gmail.com> wrote:
>
>> Hello all,
>>
>> I'm trying to use the spark.sql min_by aggregation function with pyspark.
>> I'm relying on this distribution of Spark: spark-3.2.1-bin-hadoop3.2
>>
>> I have a dataframe made of these columns:
>> - productId : int
>> - sellerId : int
>> - price : double
>>
>> For each product, I want to get the seller who sells the product for the
>> cheapest price.
>>
>> A naive approach would be to do this, but I would expect two shuffles:
>>
>> import pyspark.sql.functions as F
>> cheapest_prices_df = df.groupby('productId').agg(F.min('price').alias('price'))
>> cheapest_sellers_df = df.join(cheapest_prices_df, on=['productId', 'price'])
>>
>> I would have loved to do this instead:
>>
>> import pyspark.sql.functions as F
>> cheapest_sellers_df = df.groupby('productId').agg(F.min('price'), F.min_by('sellerId', 'price'))
>>
>> Unfortunately, min_by does not seem to be available in the pyspark sql
>> functions, whereas I can see it in the docs:
>> https://spark.apache.org/docs/latest/api/sql/index.html
>>
>> I have managed to use min_by with this approach, but it looks slow (maybe
>> because of the temp table creation?):
>>
>> df.createOrReplaceTempView("table")
>> cheapest_sellers_df = spark.sql("select min_by(sellerId, price) sellerId, min(price) from table group by productId")
>>
>> Is there a way I can rely on min_by directly in groupby?
>> Is there some code missing in the pyspark wrapper to make min_by visible somehow?
>>
>> Thank you in advance for your help.
>>
>> Cheers
>> David
>>

--
Best Regards,
Ayan Guha
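P.S. For what it's worth, one way to reach the SQL min_by from pyspark on 3.2 without a temp view is F.expr. This is only a sketch against the same df, and I have not benchmarked it:

import pyspark.sql.functions as F

# min_by exists in Spark SQL on 3.2 even though the Python wrapper only
# arrives in 3.3, so it can be invoked through F.expr inside agg.
cheapest_sellers_df = (
    df.groupby('productId')
      .agg(F.min('price').alias('price'),
           F.expr('min_by(sellerId, price)').alias('sellerId'))
)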