[ https://issues.apache.org/jira/browse/SPARK-21881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kai Londenberg updated SPARK-21881:
-----------------------------------
    Description: 
This is a duplicate of SPARK-18523, which was not really fixed for me (PySpark 
2.2.0, Python 3.5, py4j 0.10.4).

*Original Summary:*

When you run a memory-heavy Spark job, the Spark driver may consume more memory 
than the host can provide.

In that case the OOM killer steps in and kills the spark-submit process.
pyspark.SparkContext cannot handle this situation and is left completely broken.
You cannot stop it: on stop it tries to call the stop method of the bound Java 
context (jsc) and fails with a Py4JError, because that process no longer exists, 
and neither does the connection to it.
You cannot start a new SparkContext either, because the broken one is still 
registered as the active one, and PySpark still treats the SparkContext as a 
singleton.
The only things you can do are to shut down your IPython Notebook and start it 
over, or to dive into the SparkContext's internal attributes and reset them 
manually to their initial None state.

The OOM killer is just one case of many: any crash of spark-submit in the 
middle of a job leaves the SparkContext in a broken state.

*Latest Comment*

In PySpark 2.2.0 this issue is not really fixed. While I could close the 
SparkContext (it raised an exception, but it was closed afterwards), I could 
not open any new SparkContext.

*Current Workaround*

Resetting the global SparkContext state like this worked for me:

{code:python}
def reset_spark():
    # Force-reset SparkContext's class-level singleton state.
    # These are private attributes of pyspark.SparkContext, so this
    # pokes at internals and may break with other PySpark versions.
    import pyspark
    from threading import RLock
    pyspark.SparkContext._jvm = None
    pyspark.SparkContext._gateway = None
    pyspark.SparkContext._next_accum_id = 0
    pyspark.SparkContext._active_spark_context = None
    pyspark.SparkContext._lock = RLock()
    pyspark.SparkContext._python_includes = None

reset_spark()
{code}
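The failure mode can be reproduced without Spark at all. The sketch below uses a minimal stand-in class (hypothetical, not part of PySpark) to show why a singleton whose stop() can fail gets stuck, and why clearing the class-level state is enough to recover:

```python
from threading import RLock

class Context:
    """Minimal stand-in for pyspark.SparkContext's class-level singleton state."""
    _active = None      # mirrors SparkContext._active_spark_context
    _lock = RLock()     # mirrors SparkContext._lock

    def __init__(self):
        with Context._lock:
            if Context._active is not None:
                raise ValueError("an active context already exists")
            Context._active = self
        self.alive = True

    def stop(self):
        # Mimics the Py4JError raised when the backing JVM process is gone.
        if not self.alive:
            raise RuntimeError("connection to backing process lost")
        with Context._lock:
            Context._active = None

def reset(cls):
    # The manual reset from the workaround: clear the singleton bookkeeping.
    cls._active = None
    cls._lock = RLock()

c1 = Context()
c1.alive = False           # simulate spark-submit being OOM-killed
try:
    c1.stop()              # fails, so the dead context stays registered as active
except RuntimeError:
    reset(Context)         # manual reset makes the class usable again
c2 = Context()             # a fresh context can now be created
```

The real workaround above does the same thing against pyspark.SparkContext's private attributes, which is why it lets a new context start after a crash.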






> Again: OOM killer may leave SparkContext in broken state causing Connection 
> Refused errors
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-21881
>                 URL: https://issues.apache.org/jira/browse/SPARK-21881
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.0
>            Reporter: Kai Londenberg
>            Assignee: Alexander Shorin
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
