[ https://issues.apache.org/jira/browse/SPARK-18523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148544#comment-16148544 ]
Kai Londenberg commented on SPARK-18523:
----------------------------------------

In PySpark 2.2.0 this issue was not really fixed for me. While I could close the SparkContext (with an exception message), I could not open any new SparkContext afterwards.

Resetting the global SparkContext class variables like this made it work:

{{
def reset_spark():
    import pyspark
    from threading import RLock
    pyspark.SparkContext._jvm = None
    pyspark.SparkContext._gateway = None
    pyspark.SparkContext._next_accum_id = 0
    pyspark.SparkContext._active_spark_context = None
    pyspark.SparkContext._lock = RLock()
    pyspark.SparkContext._python_includes = None

reset_spark()
}}

> OOM killer may leave SparkContext in broken state causing Connection Refused errors
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-18523
>                 URL: https://issues.apache.org/jira/browse/SPARK-18523
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.1, 2.0.0
>            Reporter: Alexander Shorin
>            Assignee: Alexander Shorin
>             Fix For: 2.1.0
>
>
> When you run a memory-heavy Spark job, the Spark driver may consume more
> memory than the host has available.
> In this case the OOM killer steps in and kills the spark-submit process.
> The pyspark.SparkContext cannot handle this situation and becomes
> completely broken.
> You cannot stop it, because on stop it tries to call the stop method of the
> bound Java context (jsc) and fails with Py4JError: that process no longer
> exists, and neither does the connection to it.
> You cannot start a new SparkContext, because the broken one is still
> registered as the active one and pyspark treats SparkContext as a sort of
> singleton.
> The only thing you can do is shut down your IPython Notebook and start it
> over, or dive into the SparkContext internal attributes and reset them
> manually to their initial None state.
> The OOM killer is just one of many cases: any spark-submit crash in the
> middle of something leaves SparkContext in a broken state.
> Example of the error log from {{sc.stop()}} in the broken state:
> {code}
> ERROR:root:Exception while sending command.
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 883, in send_command
>     response = connection.send_command(command)
>   File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 1040, in send_command
>     "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> Py4JNetworkError: Error while receiving
> ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:59911)
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 963, in start
>     self.socket.connect((self.address, self.port))
>   File "/usr/local/lib/python2.7/socket.py", line 224, in meth
>     return getattr(self._sock,name)(*args)
> error: [Errno 61] Connection refused
> ---------------------------------------------------------------------------
> Py4JError                                 Traceback (most recent call last)
> <ipython-input-2-f154e069615b> in <module>()
> ----> 1 sc.stop()
>
> /usr/local/share/spark/python/pyspark/context.py in stop(self)
>     360         """
>     361         if getattr(self, "_jsc", None):
> --> 362             self._jsc.stop()
>     363             self._jsc = None
>     364         if getattr(self, "_accumulatorServer", None):
>
> /usr/local/lib/python2.7/site-packages/py4j/java_gateway.pyc in __call__(self, *args)
>    1131         answer = self.gateway_client.send_command(command)
>    1132         return_value = get_return_value(
> -> 1133             answer, self.gateway_client, self.target_id, self.name)
>    1134
>    1135         for temp_arg in temp_args:
>
> /usr/local/share/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>      43     def deco(*a, **kw):
>      44         try:
> ---> 45             return f(*a, **kw)
>      46         except py4j.protocol.Py4JJavaError as e:
>      47             s = e.java_exception.toString()
>
> /usr/local/lib/python2.7/site-packages/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
>     325             raise Py4JError(
>     326                 "An error occurred while calling {0}{1}{2}".
> --> 327                 format(target_id, ".", name))
>     328         else:
>     329             type = answer[1]
>
> Py4JError: An error occurred while calling o47.stop
> {code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
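
The workaround from the comment can be combined with the failing {{sc.stop()}} call into one recovery step. The following is only a minimal sketch under stated assumptions: the helper names `reset_context_class` and `stop_and_reset` are hypothetical, the attribute names and reset values are copied from the comment, and because they are private PySpark internals they may differ in other versions. The context class is passed in as a parameter so that the sketch does not import pyspark itself.

```python
from threading import RLock

# Private class-level attributes named in the comment, with the values it
# resets them to. These are PySpark internals and may change between versions.
_RESET_VALUES = {
    "_jvm": None,
    "_gateway": None,
    "_next_accum_id": 0,
    "_active_spark_context": None,
    "_python_includes": None,
}


def reset_context_class(context_cls):
    """Reset the singleton-style class state on context_cls
    (intended for pyspark.SparkContext)."""
    for name, value in _RESET_VALUES.items():
        setattr(context_cls, name, value)
    # A fresh lock, as in the comment, in case the old one is still held.
    context_cls._lock = RLock()


def stop_and_reset(sc, context_cls):
    """Try to stop sc; if the JVM is gone the call raises, so report the
    error and reset the class state either way."""
    try:
        sc.stop()
    except Exception as exc:  # e.g. the Py4JError shown in the quoted log
        print("stop() failed, resetting anyway:", exc)
    finally:
        reset_context_class(context_cls)
```

Assuming pyspark is installed, the intended use would be `stop_and_reset(sc, pyspark.SparkContext)` followed by creating a new SparkContext as usual. Catching the broad `Exception` is a deliberate simplification for the sketch; narrowing it to the py4j error types would be stricter.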