[ 
https://issues.apache.org/jira/browse/SPARK-36966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sam updated SPARK-36966:
------------------------
    Affects Version/s: 1.6.0

> Spark evicts RDD partitions instead of allowing OOM
> ---------------------------------------------------
>
>                 Key: SPARK-36966
>                 URL: https://issues.apache.org/jira/browse/SPARK-36966
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0, 2.4.4
>            Reporter: sam
>            Priority: Major
>
> In the past (pre-1.6), Spark jobs would throw an OOM if an RDD could not fit 
> into memory (when cached with MEMORY_ONLY). These days Spark instead evicts 
> partitions from the cache and recomputes them from scratch.
> We have some jobs that cache an RDD then traverse it 300 times.  In order to 
> know when we need to increase the memory on our cluster, we need to know when 
> it has run out of memory.
> The new behaviour of Spark makes this difficult: rather than the job OOMing 
> (as in the past), the job just takes forever (our surrounding logic eventually 
> times out the job).  Diagnosing why the job failed becomes difficult because 
> it is not immediately obvious from the logs that the job has run out of 
> memory (no OOM is thrown); the only evidence is the occasional "evicted" log 
> line.
> As a bit of a hack, we are using accumulators with mapPartitionsWithIndex to 
> count the number of times each partition is traversed, then forcibly blow up 
> the job when this count reaches 2 or more (indicating an eviction).  This 
> hack doesn't work very well as it seems to give false positives (root cause 
> not yet understood; it does not appear to be speculative execution, and tasks 
> are not failing: `egrep "task .* in stage [0-9]+\.1" | wc -l` gives 0).
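A rough sketch of the hack described above. The names here are illustrative, not from the actual job: in the real job the counts would live in a Spark accumulator and `check_partition` would be passed to `rdd.mapPartitionsWithIndex` before caching; a plain dict stands in for the accumulator so the logic is self-contained.

```python
from collections import defaultdict

# Stand-in for a Spark accumulator: partition index -> times computed.
traversal_counts = defaultdict(int)

def check_partition(index, iterator, max_traversals=1):
    """Count how many times partition `index` is computed and blow up
    if it is ever recomputed, which would suggest the cached partition
    was evicted and Spark is silently rebuilding it."""
    traversal_counts[index] += 1
    if traversal_counts[index] > max_traversals:
        raise RuntimeError(
            f"partition {index} computed {traversal_counts[index]} times: "
            "likely evicted from the MEMORY_ONLY cache")
    # Pass the partition's rows through unchanged.
    yield from iterator
```

In PySpark this would be wired up as something like `rdd.mapPartitionsWithIndex(check_partition).persist(StorageLevel.MEMORY_ONLY)`; as noted above, in practice it gives false positives.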
> *Question 1*: Is there a way to disable this new behaviour of Spark and make 
> it behave as it used to (i.e. just blow up with an OOM)?  I've looked through 
> the Spark configuration and cannot find anything like "disable-eviction".
> *Question 2*: Is there any method on the SparkSession or SparkContext that we 
> can call to easily detect when eviction is happening?
> If not, then for our use case this is an effective regression: we need a way 
> to make Spark behave predictably, or at least a way to determine 
> automatically when Spark is running slowly due to lack of memory.
> FOR CONTEXT
> > Borrowed storage memory may be evicted when memory pressure arises
> From 
> https://issues.apache.org/jira/secure/attachment/12765646/unified-memory-management-spark-10000.pdf
> (attached to https://issues.apache.org/jira/browse/SPARK-10000, which was 
> implemented in 1.6: https://spark.apache.org/releases/spark-release-1-6-0.html).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
