[ https://issues.apache.org/jira/browse/SPARK-38166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489579#comment-17489579 ]
Willi Raschkowski commented on SPARK-38166:
-------------------------------------------

Linking SPARK-23207 (which is closed but looks very related) and SPARK-25342 (which is open but I understand would only explain this if we were operating on RDDs).

> Duplicates after task failure in dropDuplicates and repartition
> ---------------------------------------------------------------
>
>                 Key: SPARK-38166
>                 URL: https://issues.apache.org/jira/browse/SPARK-38166
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.2
>         Environment: Cluster runs on K8s. AQE is enabled.
>            Reporter: Willi Raschkowski
>            Priority: Major
>              Labels: correctness
>         Attachments: driver.log
>
>
> We're seeing duplicates after running the following
> {code}
> def compute_shipments(shipments):
>     shipments = shipments.dropDuplicates(["ship_trck_num"])
>     shipments = shipments.repartition(4)
>     return shipments
> {code}
> and observing lost executors (OOMs) and task retries in the repartition stage.
> We're seeing this reliably in one of our pipelines. But I haven't managed to
> reproduce outside of that pipeline. I'll attach driver logs and the
> notionalized input data - maybe you have ideas.
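
For readers unfamiliar with SPARK-23207, the failure mode it describes can be sketched in plain Python. This is a simplified toy model, not Spark code: the helper names `round_robin` and `simulate_partial_retry` are hypothetical, and whether this mechanism actually explains SPARK-38166 is not established here. The idea is that a round-robin repartition assigns rows by position, so if a retried upstream task emits the same rows in a different order while some downstream partitions keep data from the first attempt, rows can be duplicated and others lost.

```python
def round_robin(rows, num_partitions):
    """Assign rows to partitions purely by position (toy round-robin partitioner)."""
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts


def simulate_partial_retry(first_order, retry_order, num_partitions, already_fetched):
    """Model a retry where partitions in `already_fetched` keep the first attempt's
    output and all other partitions read the retried attempt's output."""
    first = round_robin(first_order, num_partitions)
    retry = round_robin(retry_order, num_partitions)
    return [
        first[p] if p in already_fetched else retry[p]
        for p in range(num_partitions)
    ]


# Hypothetical, already-deduplicated keys; the retry yields them in reverse order.
rows = ["a", "b", "c", "d"]
result = simulate_partial_retry(rows, list(reversed(rows)), 2, already_fetched={0})
flat = [row for part in result for row in part]

duplicated = sorted({row for row in flat if flat.count(row) > 1})
lost = sorted(set(rows) - set(flat))
print("duplicated:", duplicated, "lost:", lost)
```

In this toy run the combined output contains some keys twice and is missing others, even though the input was unique. A real pipeline differs in many ways (shuffle files, stage resubmission, AQE), so this only illustrates why input-order sensitivity matters after a task failure.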