You can use selectExpr and stack to achieve the same effect in PySpark:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("your_file.csv", header=True, inferSchema=True)

# Date columns are the ones whose names contain a slash, e.g. 01/01/2022
date_columns = [col for col in df.columns if '/' in col]

# stack() takes pairs of (date literal, column reference) and unpivots them into Date/Status rows
stack_expr = ", ".join(f"'{col}', `{col}`" for col in date_columns)

df = df.selectExpr(
    "`Employee ID`", "`Name`", "`Client`", "`Project`", "`Team`",
    f"stack({len(date_columns)}, {stack_expr}) as (`Date`, `Status`)",
)
result = df.groupby("Date", "Status").count()
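Since your question was about pandas, here is a minimal sketch of the equivalent melt-and-count there. The file name attendance.csv and the id_vars list are assumptions you would adjust to your data:

import pandas as pd

df = pd.read_csv("attendance.csv")  # hypothetical file name

# Columns that identify the employee; everything else is a date column
id_cols = ["Employee ID", "Name", "Client", "Project", "Team"]
date_cols = [c for c in df.columns if c not in id_cols]

# melt() unpivots the date columns into long-format (Date, Status) rows
long_df = df.melt(id_vars=id_cols, value_vars=date_cols,
                  var_name="Date", value_name="Status")

# Count employees per date and status
result = long_df.groupby(["Date", "Status"]).size().reset_index(name="count")
print(result)

melt() does the reshaping and groupby().size() does the counting; set_index() is not needed for this.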
> On 21 Jun 2023, at 11:45, John Paul Jayme <[email protected]> wrote:
>
> Hi,
>
> This is currently my column definition :
> Employee ID  Name     Client   Project  Team    01/01/2022  02/01/2022  03/01/2022  04/01/2022  05/01/2022
> 12345        Dummy x  Dummy a  abc      team a  OFF         WO          WH          WH          WH
>
> As you can see, the outer columns are just daily attendance dates. My goal is
> to count the employees who were OFF / WO / WH on said dates. I need to
> transpose them so it would look like this :
>
> <image.png>
>
> I am still new to pandas. Can you guide me on how to produce this? I am
> reading about melt() and set_index() but I am not sure if they are the
> correct functions to use.