[ https://issues.apache.org/jira/browse/SPARK-38166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489523#comment-17489523 ]

Willi Raschkowski commented on SPARK-38166:
-------------------------------------------

Attaching driver logs:  [^driver.log]

Notable lines are probably:
{code:java}
...
INFO  [2021-11-11T23:04:13.68737Z] org.apache.spark.scheduler.TaskSetManager: 
Task 1.1 in stage 6.0 (TID 60) failed, but the task will not be re-executed 
(either because the task failed with a shuffle data fetch failure, so the 
previous stage needs to be re-run, or because a different copy of the task has 
already succeeded).
INFO  [2021-11-11T23:04:13.687562Z] org.apache.spark.scheduler.DAGScheduler: 
Marking ResultStage 6 (writeAndRead at CustomSaveDatasetCommand.scala:218) as 
failed due to a fetch failure from ShuffleMapStage 5 (writeAndRead at 
CustomSaveDatasetCommand.scala:218)
INFO  [2021-11-11T23:04:13.688643Z] org.apache.spark.scheduler.DAGScheduler: 
ResultStage 6 (writeAndRead at CustomSaveDatasetCommand.scala:218) failed in 
1012.545 s due to org.apache.spark.shuffle.FetchFailedException: The relative 
remote executor(Id: 2), which maintains the block data to fetch is dead.
        at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:748)
        at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:663)
        ...
Caused by: org.apache.spark.ExecutorDeadException: The relative remote 
executor(Id: 2), which maintains the block data to fetch is dead.
        at 
org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:132)
        at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
        ...

INFO  [2021-11-11T23:04:13.690385Z] org.apache.spark.scheduler.DAGScheduler: 
Resubmitting ShuffleMapStage 5 (writeAndRead at 
CustomSaveDatasetCommand.scala:218) and ResultStage 6 (writeAndRead at 
CustomSaveDatasetCommand.scala:218) due to fetch failure
INFO  [2021-11-11T23:04:13.894248Z] org.apache.spark.scheduler.DAGScheduler: 
Resubmitting failed stages
...
{code}
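
In case it's useful when reading the log alongside the output data, this is roughly how we check for the duplicates (a minimal sketch; {{shipments_out}} stands in for the dataset the job writes, and {{ship_trck_num}} is the column from the snippet in the issue description):
{code:python}
from pyspark.sql import functions as F

# Total rows vs. distinct tracking numbers; these should match because of the
# upstream dropDuplicates(["ship_trck_num"]).
total = shipments_out.count()
distinct = shipments_out.select("ship_trck_num").distinct().count()
print(total, distinct)

# List the tracking numbers that appear more than once.
(shipments_out
    .groupBy("ship_trck_num")
    .count()
    .filter(F.col("count") > 1)
    .show())
{code}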

> Duplicates after task failure in dropDuplicates and repartition
> ---------------------------------------------------------------
>
>                 Key: SPARK-38166
>                 URL: https://issues.apache.org/jira/browse/SPARK-38166
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.2
>         Environment: Cluster runs on K8s. AQE is enabled.
>            Reporter: Willi Raschkowski
>            Priority: Major
>              Labels: correctness
>         Attachments: driver.log
>
>
> We're seeing duplicates after running the following 
> {code}
> def compute_shipments(shipments):
>     shipments = shipments.dropDuplicates(["ship_trck_num"])
>     shipments = shipments.repartition(4)
>     return shipments
> {code}
> and observing lost executors (OOMs) and task retries in the repartition stage.
> We're seeing this reliably in one of our pipelines, but I haven't managed to 
> reproduce it outside of that pipeline. I'll attach the driver logs and the 
> notionalized input data - maybe you have ideas.
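> For reference, this is how the transform can be run locally (a sketch only; 
> the tiny input below is made up, and by itself this has not reproduced the 
> issue - that seems to require the lost executors and task retries):
> {code}
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.master("local[4]").getOrCreate()
>
> # Hypothetical input; the real pipeline reads the (notional) shipments data.
> shipments = spark.createDataFrame(
>     [("a",), ("a",), ("b",), ("c",)], ["ship_trck_num"])
>
> deduped = compute_shipments(shipments)  # function from the snippet above
> assert deduped.count() == deduped.select("ship_trck_num").distinct().count()
> {code}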


