Willi Raschkowski created SPARK-38166:
-----------------------------------------

             Summary: Duplicates after task failure in dropDuplicates and repartition
                 Key: SPARK-38166
                 URL: https://issues.apache.org/jira/browse/SPARK-38166
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.2
         Environment: Cluster runs on K8s. AQE is enabled.
            Reporter: Willi Raschkowski


We're seeing duplicates in the output after running the following:

{code}
def compute_shipments(shipments):
    # Keep one row per tracking number.
    shipments = shipments.dropDuplicates(["ship_trck_num"])
    # Shuffle the deduplicated rows into 4 partitions.
    shipments = shipments.repartition(4)
    return shipments
{code}

The duplicates show up in runs where we also observe lost executors (OOMs)
and task retries in the repartition stage.

We're seeing this reliably in one of our pipelines, but I haven't managed to
reproduce it outside of that pipeline. I'll attach driver logs and the
notionalized input data; maybe you have ideas.
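
For anyone trying to confirm the symptom, here is a minimal sketch of a duplicate check on the key column. It is not part of the pipeline: the helper name `duplicate_keys` and the sample rows are hypothetical, and the rows are plain dicts standing in for the output of `compute_shipments`.

```python
from collections import Counter


def duplicate_keys(rows, key):
    """Return {key value: count} for every key value seen more than once."""
    counts = Counter(row[key] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}


# Hypothetical sample rows, not taken from the real input data.
sample = [
    {"ship_trck_num": "A1"},
    {"ship_trck_num": "B2"},
    {"ship_trck_num": "A1"},
]

print(duplicate_keys(sample, "ship_trck_num"))
```

Against the real output, the same helper could probably be fed with `compute_shipments(shipments).select("ship_trck_num").toLocalIterator()`, since PySpark `Row` objects support lookup by column name. That pulls every key to the driver, so for large outputs a `groupBy("ship_trck_num").count()` filtered to counts above one is likely the better fit.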


