Hi everyone,
Hope you are doing well.
I have the following DataFrame:
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [
        [1, 'AB', 12, '2022-01-01']
        , [1, 'AA', 22, '2022-01-10']
        , [1, 'AC', 11, '2022-01-11']
        , [2, 'AB', 22, '2022-02-01']
        , [2, 'AA', 28, '2022-02-10']
        , [2, 'AC', 25, '2022-02-22']
    ]
    , 'code: int, doc_type: string, amount: int, load_date: string'
)
df = df.withColumn('load_date', F.to_date('load_date'))
I want to pivot the amount, but I only want a single first load_date per code, not one date column per doc_type.
This is what I tried, and it is not giving me the desired results:
(
df.groupBy('code')
.pivot('doc_type', ['AB', 'AA', 'AC'])
.agg(F.sum('amount').alias('amnt'), F.first('load_date').alias('ldt'))
.show()
)
+----+-------+----------+-------+----------+-------+----------+
|code|AB_amnt| AB_ldt|AA_amnt| AA_ldt|AC_amnt| AC_ldt|
+----+-------+----------+-------+----------+-------+----------+
| 1| 12|2022-01-01| 22|2022-01-10| 11|2022-01-11|
| 2| 22|2022-02-01| 28|2022-02-10| 25|2022-02-22|
+----+-------+----------+-------+----------+-------+----------+
This is what I want, which I got with conditional aggregation instead of pivot:
(
df.groupBy('code')
.agg(
F.sum(F.when(F.col('doc_type') == 'AB',
F.col('amount'))).alias('AB_amnt')
, F.sum(F.when(F.col('doc_type') == 'AA',
F.col('amount'))).alias('AA_amnt')
, F.sum(F.when(F.col('doc_type') == 'AC',
F.col('amount'))).alias('AC_amnt')
, F.first('load_date').alias('load_date')
)
.show()
)
+----+-------+-------+-------+----------+
|code|AB_amnt|AA_amnt|AC_amnt| load_date|
+----+-------+-------+-------+----------+
| 1| 12| 22| 11|2022-01-01|
| 2| 22| 28| 25|2022-02-01|
+----+-------+-------+-------+----------+
Is there a simpler way to do it? I have more than one column to pivot, and also more than one non-pivoted column to keep.
I am using Databricks 14.3 LTS with Spark 3.5.0.
Thanks & Regards
Dhruv