Milan Straka created SPARK-3731:
-----------------------------------

             Summary: RDD caching stops working in pyspark after some time
                 Key: SPARK-3731
                 URL: https://issues.apache.org/jira/browse/SPARK-3731
             Project: Spark
          Issue Type: Bug
          Components: PySpark, Spark Core
    Affects Versions: 1.1.0
         Environment: Linux, 32bit, standalone mode
            Reporter: Milan Straka


Consider a file F which, when loaded with sc.textFile and cached, takes up 
slightly more than half of the free memory available for the RDD cache.

When the following is executed in PySpark:
  1) a = sc.textFile(F)
  2) a.cache().count()
  3) b = sc.textFile(F)
  4) b.cache().count()
and then the following is repeated (for example 10 times):
  a) a.unpersist().cache().count()
  b) b.unpersist().cache().count()
then after some time there are no RDDs cached in memory (a runnable sketch of these steps follows below).
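A minimal sketch of the sequence above, assuming a standalone SparkContext; the path to F is a placeholder, only its size relative to the RDD cache matters:

{code}
from pyspark import SparkContext

sc = SparkContext(appName="rdd-cache-repro")

# Placeholder path: F should be a text file whose cached size is
# slightly more than half of the memory available for the RDD cache.
F = "/path/to/F"

a = sc.textFile(F)
a.cache().count()
b = sc.textFile(F)
b.cache().count()

# Repeatedly drop and re-cache both RDDs; after a few iterations the
# Executors tab reports 0 MB used and nothing ever gets cached again.
for _ in range(10):
    a.unpersist().cache().count()
    b.unpersist().cache().count()
{code}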

Also, from that point on, no other RDD ever gets cached: the worker always reports 
something like "WARN CacheManager: Not enough space to cache partition rdd_23_5 
in memory! Free memory is 277478190 bytes.", even though rdd_23_5 is only ~50 MB. The 
Executors tab of the Application Detail UI shows 0 MB of memory used on every 
executor, which is consistent with the CacheManager warning.

When doing the same in Scala, everything works as expected.

I understand that this is a vague description, but I do not know how to describe 
the problem better.


