Willi Raschkowski created SPARK-38166:
-----------------------------------------

             Summary: Duplicates after task failure in dropDuplicates and repartition
                 Key: SPARK-38166
                 URL: https://issues.apache.org/jira/browse/SPARK-38166
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.2
         Environment: Cluster runs on K8s. AQE is enabled.
            Reporter: Willi Raschkowski


We're seeing duplicates after running the following

{code}
def compute_shipments(shipments):
    shipments = shipments.dropDuplicates(["ship_trck_num"])
    shipments = shipments.repartition(4)
    return shipments
{code}

and observing lost executors (OOMs) and task retries in the repartition stage.

We're seeing this reliably in one of our pipelines. But I haven't managed to reproduce outside of that pipeline. I'll attach driver logs and the notionalized input data - maybe you have ideas.