[ 
https://issues.apache.org/jira/browse/SPARK-18523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-18523:
-----------------------------------
    Labels: pull-request-available  (was: )

> OOM killer may leave SparkContext in broken state causing Connection Refused errors
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-18523
>                 URL: https://issues.apache.org/jira/browse/SPARK-18523
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.1, 2.0.0
>            Reporter: Alexander Shorin
>            Assignee: Alexander Shorin
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.1.0
>
>
> When you run a memory-heavy Spark job, the Spark driver may consume more 
> memory than the host can provide.
> In this case the OOM killer steps in and kills the spark-submit process.
> The pyspark.SparkContext cannot handle this situation and becomes 
> completely broken. 
> You cannot stop it, because on stop it tries to call the stop method of the 
> bound Java context (jsc) and fails with Py4JError: that process no longer 
> exists, and neither does the connection to it. 
> You cannot start a new SparkContext either, because the broken one is still 
> registered as the active one and pyspark treats SparkContext as a kind of 
> singleton.
> The only thing you can do is shut down your IPython Notebook and start it 
> over, or dive into the SparkContext internal attributes and reset them 
> manually to their initial None state.
> The OOM killer is just one of many cases: any spark-submit crash mid-run 
> leaves the SparkContext in a broken state.
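A minimal sketch of the manual reset described in the report. It runs against stand-in classes so that it works without Spark installed. `GatewayError`, `DeadJavaContext`, `StubContext` and `force_stop` are hypothetical names introduced for illustration. `_jsc` appears in the traceback, while `_active_spark_context` is an assumption about how pyspark tracks its singleton.

```python
class GatewayError(Exception):
    """Hypothetical stand-in for py4j's Py4JError."""


class DeadJavaContext:
    """Stand-in for a Java context whose JVM process has been killed."""

    def stop(self):
        raise GatewayError("An error occurred while calling o47.stop")


class StubContext:
    """Hypothetical stand-in for pyspark.SparkContext's singleton bookkeeping."""

    # Assumed name of the class-level slot that holds the active context.
    _active_spark_context = None

    def __init__(self):
        if StubContext._active_spark_context is not None:
            raise ValueError("Cannot run multiple contexts at once")
        self._jsc = DeadJavaContext()
        StubContext._active_spark_context = self

    def stop(self):
        # Same shape as the stop() shown in the traceback: if the JVM call
        # raises, none of the cleanup after it runs.
        if getattr(self, "_jsc", None):
            self._jsc.stop()
            self._jsc = None
        StubContext._active_spark_context = None


def force_stop(ctx, gateway_errors=(GatewayError,)):
    """Stop ctx; if the JVM side is gone, reset the Python-side state by hand.

    Returns True on a clean stop, False if the state had to be reset manually.
    """
    try:
        ctx.stop()
        return True
    except gateway_errors:
        ctx._jsc = None
        type(ctx)._active_spark_context = None
        return False


ctx = StubContext()
clean = force_stop(ctx)      # the dead JVM makes stop() raise; state is reset
replacement = StubContext()  # works because the singleton slot is free again
```

With a real context one would presumably pass `gateway_errors=(py4j.protocol.Py4JError,)`. Whether resetting these two attributes is enough for a given pyspark version is not something this sketch verifies.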
> Example error log from {{sc.stop()}} in the broken state:
> {code}
> ERROR:root:Exception while sending command.
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 883, in send_command
>     response = connection.send_command(command)
>   File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 1040, in send_command
>     "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> Py4JNetworkError: Error while receiving
> ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:59911)
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 963, in start
>     self.socket.connect((self.address, self.port))
>   File "/usr/local/lib/python2.7/socket.py", line 224, in meth
>     return getattr(self._sock,name)(*args)
> error: [Errno 61] Connection refused
> ---------------------------------------------------------------------------
> Py4JError                                 Traceback (most recent call last)
> <ipython-input-2-f154e069615b> in <module>()
> ----> 1 sc.stop()
> /usr/local/share/spark/python/pyspark/context.py in stop(self)
>     360         """
>     361         if getattr(self, "_jsc", None):
> --> 362             self._jsc.stop()
>     363             self._jsc = None
>     364         if getattr(self, "_accumulatorServer", None):
> /usr/local/lib/python2.7/site-packages/py4j/java_gateway.pyc in __call__(self, *args)
>    1131         answer = self.gateway_client.send_command(command)
>    1132         return_value = get_return_value(
> -> 1133             answer, self.gateway_client, self.target_id, self.name)
>    1134 
>    1135         for temp_arg in temp_args:
> /usr/local/share/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>      43     def deco(*a, **kw):
>      44         try:
> ---> 45             return f(*a, **kw)
>      46         except py4j.protocol.Py4JJavaError as e:
>      47             s = e.java_exception.toString()
> /usr/local/lib/python2.7/site-packages/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
>     325             raise Py4JError(
>     326                 "An error occurred while calling {0}{1}{2}".
> --> 327                 format(target_id, ".", name))
>     328     else:
>     329         type = answer[1]
> Py4JError: An error occurred while calling o47.stop
> {code}
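The `[Errno 61] Connection refused` line in the log indicates that nothing is listening on the gateway port any more. One rough way to probe for that before calling into the context is sketched below, assuming the gateway host and port are known. `gateway_alive` is a hypothetical helper, not part of pyspark or py4j, and the demonstration uses a throwaway local listener rather than a real JVM.

```python
import socket


def gateway_alive(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


# Demonstration against a local listener standing in for the JVM gateway.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]

alive_before = gateway_alive("127.0.0.1", port)
listener.close()                 # simulates the JVM process being killed
alive_after = gateway_alive("127.0.0.1", port)
```

A successful probe only shows that some process is listening on the port, not that it is the original JVM, so this is a hint rather than a guarantee.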


