Hello there,

I ran into a problem in PySpark: when using the groupby/cogroup functionality on the same DataFrame, it silently drops columns from the first grouped instance. Minimal example:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([["2017-08-17", 1]], schema=["day", "value"]).cache()

    def in_pandas(df1, df2):
        assert "value" in df1.columns
        return df1

    df = (
        df.groupby("day")
        .cogroup(df.groupby("day"))
        .applyInPandas(in_pandas, schema=df.schema)
    )
    df.show(20, False)

This fails with an AssertionError: inside in_pandas, the "value" column is missing from df1.

My versions:

    import sys
    import pandas as pd
    import pyarrow
    import pyspark

    print(sys.version)          # 3.8.10 (default, Jun 22 2022, 20:18:18)
                                # [GCC 9.4.0]
    print(pyspark.__version__)  # 3.3.1
    print(pd.__version__)       # 1.5.2
    print(pyarrow.__version__)  # 10.0.1

It works on an AWS Glue session (version details were in an attached screenshot, not reproduced here), where it prints:

    +----------+-----+
    |day       |value|
    +----------+-----+
    |2017-08-17|1    |
    +----------+-----+

as expected.

Thank you,
Michael
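P.S. To make the expected behaviour concrete: conceptually, cogroup pairs up the groups of two datasets by key and hands each pair of groups to the user function, with both sides keeping all their columns. A minimal pure-Python sketch of that idea (plain dicts standing in for DataFrames; the helper name `cogroup` here is mine, not a PySpark API):

```python
from collections import defaultdict

def cogroup(left, right, key):
    """Pair up the groups of two record lists by key, mimicking what
    DataFrame.groupby(key).cogroup(other.groupby(key)) does conceptually."""
    def group_by(rows):
        groups = defaultdict(list)
        for row in rows:
            groups[row[key]].append(row)
        return groups

    lg, rg = group_by(left), group_by(right)
    # every key seen on either side yields one (left_group, right_group) pair
    for k in sorted(set(lg) | set(rg)):
        yield k, lg.get(k, []), rg.get(k, [])

rows = [{"day": "2017-08-17", "value": 1}]

# cogrouping a dataset with itself: both sides of each pair should still
# carry every original column -- this is exactly what the PySpark repro
# above asserts and where it fails
for day, left, right in cogroup(rows, rows, "day"):
    assert all("value" in r for r in left)
    assert all("value" in r for r in right)
```

This is only a semantic sketch, of course; the real applyInPandas path goes through Arrow serialization, which is presumably where the column gets lost.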