David Watson created SPARK-11711:
------------------------------------

             Summary: Finalizer memory leak in pyspark
                 Key: SPARK-11711
                 URL: https://issues.apache.org/jira/browse/SPARK-11711
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.5.1
            Reporter: David Watson


I've been seeing very consistent memory leaks in the Java process of my Python 
Spark Streaming scripts on the driver.  A heap profile analysis showed MILLIONS 
of Finalizer objects.
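
For anyone trying to reproduce this, the pile-up is easy to confirm by pointing 
jmap -histo (the standard JDK histogram tool) at the driver JVM.  The little 
wrapper below is just for illustration, not part of my scripts:
"""
import subprocess
import sys

# Print the heap-histogram rows mentioning Finalizer.  Pass the driver
# JVM's pid as the first argument; jmap -histo is a standard JDK tool.
pid = sys.argv[1]
histo = subprocess.check_output(["jmap", "-histo", pid]).decode()
for line in histo.splitlines():
    if "java.lang.ref.Finalizer" in line:
        print(line)
"""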

The Spark web UI, under Executor Thread Dump, shows:
Thread 3: Finalizer (WAITING):
java.net.SocketInputStream.socketRead0(Native Method)
java.net.SocketInputStream.read(SocketInputStream.java:152)
java.net.SocketInputStream.read(SocketInputStream.java:122)
sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283)
sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325)
sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
java.io.InputStreamReader.read(InputStreamReader.java:184)
java.io.BufferedReader.fill(BufferedReader.java:154)
java.io.BufferedReader.readLine(BufferedReader.java:317)
java.io.BufferedReader.readLine(BufferedReader.java:382)
py4j.CallbackConnection.sendCommand(CallbackConnection.java:82)
py4j.CallbackClient.sendCommand(CallbackClient.java:236)
py4j.reflection.PythonProxyHandler.finalize(PythonProxyHandler.java:81)
java.lang.System$2.invokeFinalize(System.java:1213)
java.lang.ref.Finalizer.runFinalizer(Finalizer.java:98)
java.lang.ref.Finalizer.access$100(Finalizer.java:34)
java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:210)

It appears the problem is in py4j.  I don't have a patch because the buggy code 
lives inside the python/lib/py4j-0.8.2.1-src.zip file, but I've monkey patched 
it and that appears to fix the problem.

In py4j.java_gateway.CallbackConnection.run() (java_gateway.py:1186):
"""
                elif command == GARBAGE_COLLECT_PROXY_COMMAND_NAME:
                    self.input.readline()
                    del(self.pool[obj_id])
"""
NOTE: it doesn't write a response to the socket!

And on the Java side, in CallbackConnection.sendCommand() (CallbackConnection.java:82):
"""
                        returnCommand = reader.readLine();
"""
I don't know what the protocol is supposed to be, but the Java side expects a 
response and the Python side never sends one.  As the stack trace shows, this 
blocks the JVM's FinalizerThread, which keeps anything from getting finalized, 
Spark related or not.
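
To illustrate the mechanism with a toy example (plain Python sockets, nothing 
py4j specific): readline() below blocks exactly the way reader.readLine() does 
in the stack trace above, until a newline arrives or the connection closes.
"""
import socket
import threading

# Toy demo of the hang: one thread blocks on a one-line reply, like
# CallbackConnection.sendCommand() does on the Java side.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)

client = socket.socket()            # plays the Python callback side
client.connect(server.getsockname())
conn, _ = server.accept()           # plays the Java side

def java_side():
    # Roughly what the Java side does for a GC command: write the command,
    # then block waiting for a one-line response (wire format approximated).
    conn.sendall(b"g\nsome_obj_id\n\n")
    reply = conn.makefile("rb").readline()  # <- blocks, like readLine()
    print("got reply: %r" % reply)

t = threading.Thread(target=java_side)
t.daemon = True
t.start()

t.join(timeout=2)
print("blocked: %s" % t.is_alive())  # True: no response was ever written

client.sendall(b"\n")                # the one-line fix: ack with a newline
t.join(timeout=2)
print("blocked: %s" % t.is_alive())  # False: readline() returned
"""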

My monkey patch to py4j.java_gateway.CallbackConnection.run() (java_gateway.py:1186):
"""
                elif command == GARBAGE_COLLECT_PROXY_COMMAND_NAME:
                    self.input.readline()
                    del(self.pool[obj_id])
+                    ## PATCH: send an empty response!
+                    self.socket.sendall("\n")
+                    ##
"""

This bug appears to exist in the current py4j as well, but I can't find the 
repository for the 0.8.2.1 version embedded in Spark.

I'm not entirely sure, but I suspect that (at least on the driver) this doesn't 
normally get triggered until the program ends, because the Java object 
references held by Python are long lived; the FinalizerThread doesn't jam up 
until they finally get collected.

My code peeks at the checkpoint file (code below) before starting the script, 
which looks like it jams things up right at the beginning, and every other 
finalizable object (Scala? Java?) piles up behind it.

"""
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def loadCheckpoint(checkpointPath):
    StreamingContext._ensure_initialized()
    gw = SparkContext._gateway

    # Check whether valid checkpoint information exists in the given path
    cp = gw.jvm.CheckpointReader.read(checkpointPath)
    assert cp is not None, "Couldn't load %s" % checkpointPath
    return cp.get()
"""

At any rate, I can confirm that the same situation exists on the worker nodes 
as well as the driver, and the monkey patch fixes both.


