[ https://issues.apache.org/jira/browse/SPARK-38166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489523#comment-17489523 ]
Willi Raschkowski commented on SPARK-38166:
-------------------------------------------

Attaching driver logs: [^driver.log]

Notable lines are probably:

{code:java}
...
INFO [2021-11-11T23:04:13.68737Z] org.apache.spark.scheduler.TaskSetManager: Task 1.1 in stage 6.0 (TID 60) failed, but the task will not be re-executed (either because the task failed with a shuffle data fetch failure, so the previous stage needs to be re-run, or because a different copy of the task has already succeeded).
INFO [2021-11-11T23:04:13.687562Z] org.apache.spark.scheduler.DAGScheduler: Marking ResultStage 6 (writeAndRead at CustomSaveDatasetCommand.scala:218) as failed due to a fetch failure from ShuffleMapStage 5 (writeAndRead at CustomSaveDatasetCommand.scala:218)
INFO [2021-11-11T23:04:13.688643Z] org.apache.spark.scheduler.DAGScheduler: ResultStage 6 (writeAndRead at CustomSaveDatasetCommand.scala:218) failed in 1012.545 s due to org.apache.spark.shuffle.FetchFailedException: The relative remote executor(Id: 2), which maintains the block data to fetch is dead.
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:748)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:663)
...
Caused by: org.apache.spark.ExecutorDeadException: The relative remote executor(Id: 2), which maintains the block data to fetch is dead.
    at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:132)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
...
INFO [2021-11-11T23:04:13.690385Z] org.apache.spark.scheduler.DAGScheduler: Resubmitting ShuffleMapStage 5 (writeAndRead at CustomSaveDatasetCommand.scala:218) and ResultStage 6 (writeAndRead at CustomSaveDatasetCommand.scala:218) due to fetch failure
INFO [2021-11-11T23:04:13.894248Z] org.apache.spark.scheduler.DAGScheduler: Resubmitting failed stages
...
{code}

> Duplicates after task failure in dropDuplicates and repartition
> ---------------------------------------------------------------
>
>                 Key: SPARK-38166
>                 URL: https://issues.apache.org/jira/browse/SPARK-38166
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.2
>        Environment: Cluster runs on K8s. AQE is enabled.
>            Reporter: Willi Raschkowski
>            Priority: Major
>              Labels: correctness
>        Attachments: driver.log
>
>
> We're seeing duplicates after running the following:
> {code}
> def compute_shipments(shipments):
>     shipments = shipments.dropDuplicates(["ship_trck_num"])
>     shipments = shipments.repartition(4)
>     return shipments
> {code}
> and observing lost executors (OOMs) and task retries in the repartition
> stage.
>
> We're seeing this reliably in one of our pipelines, but I haven't managed
> to reproduce it outside of that pipeline. I'll attach the driver logs and
> the notionalized input data - maybe you have ideas.
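
A minimal sketch of a possible workaround, assuming (not confirmed for this ticket) that the duplicates stem from {{repartition(4)}}'s round-robin assignment of rows to partitions, which depends on the order rows arrive in and so can change when tasks are recomputed after a fetch failure like the one above. Hash-partitioning on the dedup key instead makes the row-to-partition mapping a deterministic function of the data, so a retried task rebuilds the same partitions; the column name {{ship_trck_num}} is taken from the snippet in the description:

{code:python}
# Sketch under the assumption above; not a confirmed fix for SPARK-38166.
def compute_shipments(shipments):
    shipments = shipments.dropDuplicates(["ship_trck_num"])
    # repartition(n, col) hash-partitions on the key instead of round-robin,
    # so a recomputed task assigns each row to the same partition as the
    # original attempt did.
    shipments = shipments.repartition(4, "ship_trck_num")
    return shipments
{code}

The trade-off is that key-based partitioning can produce skewed partitions if the key distribution is skewed, whereas round-robin balances partition sizes evenly.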