[ https://issues.apache.org/jira/browse/SPARK-12511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15073344#comment-15073344 ]

Shixiong Zhu edited comment on SPARK-12511 at 12/29/15 1:33 AM:
----------------------------------------------------------------

I haven't figured out the root cause yet. Here is what I have found so far: the
"Finalizer" thread is blocked by py4j, so the finalizer queue keeps growing.

{code}
"Finalizer" #3 daemon prio=8 os_prio=31 tid=0x00007feaa380e000 nid=0x3503 
runnable [0x0000000117ca4000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:170)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
        - locked <0x00000007813be228> (a java.io.InputStreamReader)
        at java.io.InputStreamReader.read(InputStreamReader.java:184)
        at java.io.BufferedReader.fill(BufferedReader.java:161)
        at java.io.BufferedReader.readLine(BufferedReader.java:324)
        - locked <0x00000007813be228> (a java.io.InputStreamReader)
        at java.io.BufferedReader.readLine(BufferedReader.java:389)
        at py4j.CallbackConnection.sendCommand(CallbackConnection.java:82)
        at py4j.CallbackClient.sendCommand(CallbackClient.java:236)
        at py4j.reflection.PythonProxyHandler.finalize(PythonProxyHandler.java:81)
        at java.lang.System$2.invokeFinalize(System.java:1270)
        at java.lang.ref.Finalizer.runFinalizer(Finalizer.java:98)
        at java.lang.ref.Finalizer.access$100(Finalizer.java:34)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:210)
{code}
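
The mechanism is worth spelling out: the JVM runs every finalize() method on a
single "Finalizer" thread (the FinalizerThread visible at the bottom of the
dump), so once one finalizer blocks on a socket read, as
py4j.reflection.PythonProxyHandler.finalize does above, every object queued
behind it is never finalized and java.lang.ref.Finalizer references pile up. A
pure-Python analogy of that single-consumer bottleneck (this is not Spark or
py4j code, just an illustration):

{code}
import queue
import threading
import time

cleanup_queue = queue.Queue()
never = threading.Event()  # never set: stands in for a socket read that never returns

def finalizer_thread():
    # Single consumer, like the JVM's lone "Finalizer" thread.
    while True:
        task = cleanup_queue.get()
        task()

threading.Thread(target=finalizer_thread, daemon=True).start()

cleanup_queue.put(never.wait)        # the "py4j" finalizer: blocks forever
for _ in range(10000):
    cleanup_queue.put(lambda: None)  # harmless finalizers queued behind it

time.sleep(1)
print("pending finalizers:", cleanup_queue.qsize())  # stays near 10000
{code}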


> streaming driver with checkpointing unable to finalize leading to OOM
> ---------------------------------------------------------------------
>
>                 Key: SPARK-12511
>                 URL: https://issues.apache.org/jira/browse/SPARK-12511
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Streaming
>    Affects Versions: 1.5.2
>         Environment: pyspark 1.5.2
> yarn 2.6.0
> python 2.6
> centos 6.5
> openjdk 1.8.0
>            Reporter: Antony Mayi
>            Assignee: Shixiong Zhu
>            Priority: Critical
>         Attachments: bug.py, finalizer-classes.png, finalizer-pending.png, 
> finalizer-spark_assembly.png
>
>
> A Spark streaming application configured with checkpointing fills the 
> driver's heap with ZipFileInputStream instances, apparently because 
> spark-assembly.jar (and potentially other jars, for example snappy-java.jar) 
> is repeatedly referenced (loaded?). The Java Finalizer cannot finalize these 
> ZipFileInputStream instances, so they eventually consume the whole heap and 
> crash the driver with an OOM.
> h2. Steps to reproduce:
> * Submit the attached [^bug.py] to Spark (a hypothetical sketch of such a 
> job appears after this list)
> * Leave it running and monitor the driver's Java process heap
> ** in a heap dump (e.g. from {{jmap -histo <driver pid>}}) you will primarily 
> see a growing number of byte arrays, i.e. the accumulated zip payloads of the 
> jar references:
> {noformat}
>  num     #instances         #bytes  class name
> ----------------------------------------------
>    1:         32653       32735296  [B
>    2:         48000        5135816  [C
>    3:            41        1344144  [Lscala.concurrent.forkjoin.ForkJoinTask;
>    4:         11362        1261816  java.lang.Class
>    5:         47054        1129296  java.lang.String
>    6:         25460        1018400  java.lang.ref.Finalizer
>    7:          9802         789400  [Ljava.lang.Object;
> {noformat}
> ** with VisualVM you can see:
> *** an increasing number of objects pending finalization
> !finalizer-pending.png!
> *** an increasing number of ZipFileInputStream instances related to 
> spark-assembly.jar, referenced by the Finalizer
> !finalizer-spark_assembly.png!
> * Depending on the heap size and running time, this leads to a driver OOM 
> crash
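> A hypothetical sketch of what such a checkpointed job might look like (the 
> actual [^bug.py] is attached to this issue and may differ; the host, port, 
> and checkpoint path below are placeholders):
> {code}
> # Hypothetical reproduction sketch -- not the attached bug.py itself.
> # Minimal PySpark streaming job with checkpointing enabled.
> from pyspark import SparkContext
> from pyspark.streaming import StreamingContext
>
> CHECKPOINT_DIR = "hdfs:///tmp/spark-12511-checkpoint"  # placeholder path
>
> def create_context():
>     sc = SparkContext(appName="spark-12511-repro")
>     ssc = StreamingContext(sc, 5)   # 5-second batches
>     ssc.checkpoint(CHECKPOINT_DIR)  # checkpointing is what triggers the leak
>     # Input type reportedly doesn't matter; socket input as in bug.py.
>     lines = ssc.socketTextStream("localhost", 9999)
>     lines.count().pprint()          # any trivial output operation
>     return ssc
>
> ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
> ssc.start()
> ssc.awaitTermination()
> {code}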
> h2. Comments
> * The attached [^bug.py] is a lightweight proof of the problem. In 
> production the effect is quite rapid - within a few hours it eats gigabytes 
> of heap and kills the app.
> * If the same [^bug.py] is run without checkpointing, there is no issue 
> whatsoever.
> * Not sure whether this is PySpark-specific.
> * In [^bug.py] I am using the socketTextStream input, but the problem seems 
> to be independent of the input type (in production I see the same problem 
> with the Kafka direct stream, and I have seen it even with textFileStream).
> * It happens even if the input stream doesn't produce any data.


