[ https://issues.apache.org/jira/browse/SPARK-38166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Willi Raschkowski updated SPARK-38166:
--------------------------------------
    Attachment: driver.log

> Duplicates after task failure in dropDuplicates and repartition
> ---------------------------------------------------------------
>
>                 Key: SPARK-38166
>                 URL: https://issues.apache.org/jira/browse/SPARK-38166
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.2
>         Environment: Cluster runs on K8s. AQE is enabled.
>            Reporter: Willi Raschkowski
>            Priority: Major
>              Labels: correctness
>         Attachments: driver.log
>
>
> We're seeing duplicates after running the following
> {code}
> def compute_shipments(shipments):
>     shipments = shipments.dropDuplicates(["ship_trck_num"])
>     shipments = shipments.repartition(4)
>     return shipments
> {code}
> and observing lost executors (OOMs) and task retries in the repartition stage.
> We're seeing this reliably in one of our pipelines. But I haven't managed to
> reproduce outside of that pipeline. I'll attach driver logs and the
> notionalized input data - maybe you have ideas.
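
For illustration only: the report does not identify a root cause, so the following is a hypothetical, pure-Python sketch of one way a position-based (round-robin) repartition could produce duplicates after a retry. It assumes, without evidence from this ticket, that a retried upstream task emits the same deduplicated rows in a different order and that only some output partitions are recomputed. The row keys and the helper `round_robin` are made up for the example and are not Spark APIs.

```python
from collections import Counter


def round_robin(rows, num_partitions):
    """Assign rows to partitions purely by their position in the input."""
    partitions = [[] for _ in range(num_partitions)]
    for index, row in enumerate(rows):
        partitions[index % num_partitions].append(row)
    return partitions


# Hypothetical, already-deduplicated tracking numbers.
rows = ["A", "B", "C", "D", "E", "F", "G", "H"]

# First attempt: rows arrive in their original order.
first_attempt = round_robin(rows, 4)

# Assumed retry: same rows, but the upstream order differs.
retry_attempt = round_robin(list(reversed(rows)), 4)

# Assumption: partitions 0 and 1 survive from the first attempt,
# while partitions 2 and 3 are recomputed from the retried input.
result = (
    first_attempt[0] + first_attempt[1] + retry_attempt[2] + retry_attempt[3]
)

counts = Counter(result)
duplicates = sorted(key for key, n in counts.items() if n > 1)
missing = sorted(set(rows) - set(result))

print("duplicated keys:", duplicates)
print("missing keys:", missing)
```

In this toy model the row count stays the same, yet some keys appear twice and others disappear, which is why a plain row count would not reveal the problem. Whether anything like this happens in the reported pipeline is not established by the report.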