Antony Mayi created SPARK-12511:
-----------------------------------

             Summary: streaming driver with checkpointing unable to finalize 
leading to OOM
                 Key: SPARK-12511
                 URL: https://issues.apache.org/jira/browse/SPARK-12511
             Project: Spark
          Issue Type: Bug
    Affects Versions: 1.5.2
         Environment: pyspark 1.5.2
yarn 2.6.0
python 2.6
centos 6.5
openjdk 1.8.0
            Reporter: Antony Mayi
            Priority: Critical


A Spark streaming application configured with checkpointing fills the driver's heap with ZipFileInputStream instances, apparently because spark-assembly.jar (and potentially other jars, for example snappy-java.jar) is repeatedly referenced (loaded?). The Java Finalizer cannot keep up with finalizing these ZipFileInputStream instances, so they eventually consume the whole heap and the driver crashes with an OOM error.

h2. Steps to reproduce:
* Submit the attached bug.py to Spark (a minimal sketch of what such a script might look like is shown after these steps)
* Leave it running and monitor the driver Java process heap
** with a heap dump you will primarily see a growing number of byte array instances (here accumulating the zip payload of the jar references):
{noformat}
 num     #instances         #bytes  class name
----------------------------------------------
   1:         32653       32735296  [B
   2:         48000        5135816  [C
   3:            41        1344144  [Lscala.concurrent.forkjoin.ForkJoinTask;
   4:         11362        1261816  java.lang.Class
   5:         47054        1129296  java.lang.String
   6:         25460        1018400  java.lang.ref.Finalizer
   7:          9802         789400  [Ljava.lang.Object;
{noformat}
** with VisualVM you can see:
*** an increasing number of objects pending finalization
*** an increasing number of ZipFileInputStream instances related to spark-assembly.jar, referenced by the Finalizer
* Depending on the heap size and running time, this eventually leads to a driver OOM crash
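For reference, a minimal sketch of what such a script might look like (the actual attached bug.py may differ; the checkpoint directory, host and port below are placeholders):

{noformat}
# Hypothetical sketch of a checkpointed streaming job similar to bug.py
# (placeholder checkpoint directory, host and port).
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

CHECKPOINT_DIR = 'hdfs:///tmp/spark-12511-checkpoint'  # placeholder path

def create_context():
    sc = SparkContext(appName='SPARK-12511-repro')
    ssc = StreamingContext(sc, 10)                     # 10 second batches
    ssc.checkpoint(CHECKPOINT_DIR)                     # checkpointing enabled
    lines = ssc.socketTextStream('localhost', 9999)    # placeholder input
    lines.count().pprint()                             # trivial pipeline; no data needed
    return ssc

ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
ssc.start()
ssc.awaitTermination()
{noformat}

The heap histogram above can be taken with e.g. {{jmap -histo <driver-pid>}} on the running driver process.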

h2. Comments
* bug.py is a lightweight demonstration of the problem. In production the effect is quite rapid: within a few hours it consumes gigabytes of heap and kills the application.
* If the same bug.py is run without checkpointing there is no issue whatsoever (see the comparison sketch after this list).
* Not sure whether this is specific to pyspark.
* bug.py uses the socketTextStream input, but the problem appears to be independent of the input type (in production I see the same issue with the Kafka direct stream, and I have seen it even with textFileStream).
* It happens even if the input stream does not produce any data.
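For comparison, a minimal sketch of the same pipeline without checkpointing (same placeholder host and port), which does not show the heap growth described above:

{noformat}
# Hypothetical no-checkpoint variant for comparison (placeholders as above).
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName='SPARK-12511-no-checkpoint')
ssc = StreamingContext(sc, 10)
# no ssc.checkpoint(...) call and no StreamingContext.getOrCreate(...)
ssc.socketTextStream('localhost', 9999).count().pprint()
ssc.start()
ssc.awaitTermination()
{noformat}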


