[jira] [Commented] (SPARK-14740) CrossValidatorModel.bestModel does not include hyper-parameters

2016-04-20 Thread Paul Shearer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250055#comment-15250055
 ] 

Paul Shearer commented on SPARK-14740:
--

It appears the value is accessible through cvModel.bestModel._java_obj.getRegParam(); it just needs to be exposed in the right place in the code.
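
For reference, here is a minimal sketch of that workaround against the toy pipeline from the issue description, assuming Spark 1.6.x with a SQLContext available as sqlContext; _java_obj is a private attribute of the Python wrapper (taken from the comment above), so this is a stop-gap rather than a public API and may differ between versions.

{noformat}
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Same toy pipeline as in the issue description.
dataset = sqlContext.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator())
cvModel = cv.fit(dataset)

# The Python wrapper reports no params (extractParamMap() is empty), but the
# underlying JVM model still carries them, so read the winning regularization
# parameter through the wrapped Java object.
print(cvModel.bestModel._java_obj.getRegParam())
{noformat}

Other tuned hyper-parameters can presumably be read the same way through the corresponding getters on the Java object, but since _java_obj is an implementation detail, the proper fix is still to surface these params on the Python side.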

> CrossValidatorModel.bestModel does not include hyper-parameters
> ---
>
> Key: SPARK-14740
> URL: https://issues.apache.org/jira/browse/SPARK-14740
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.6.1
>Reporter: Paul Shearer
>
> If you tune hyperparameters using a CrossValidator object in PySpark, you may 
> not be able to extract the parameter values of the best model.
> {noformat}
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.mllib.linalg import Vectors
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
> dataset = sqlContext.createDataFrame(
> [(Vectors.dense([0.0]), 0.0),
>  (Vectors.dense([0.4]), 1.0),
>  (Vectors.dense([0.5]), 0.0),
>  (Vectors.dense([0.6]), 1.0),
>  (Vectors.dense([1.0]), 1.0)] * 10,
> ["features", "label"])
> lr = LogisticRegression()
> grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
> 0.0001]).build()
> evaluator = BinaryClassificationEvaluator()
> cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, 
> evaluator=evaluator)
> cvModel = cv.fit(dataset)
> {noformat}
> I can get the regression coefficient out, but I can't get the regularization 
> parameter
> {noformat}
> In [3]: cvModel.bestModel.coefficients
> Out[3]: DenseVector([3.1573])
> In [4]: cvModel.bestModel.explainParams()
> Out[4]: ''
> In [5]: cvModel.bestModel.extractParamMap()
> Out[5]: {}
> In [15]: cvModel.params
> Out[15]: []
> In [36]: cvModel.bestModel.params
> Out[36]: []
> {noformat}
> For the original issue raised on StackOverflow please see 
> http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark






[jira] [Updated] (SPARK-14740) CrossValidatorModel.bestModel does not include hyper-parameters

2016-04-20 Thread Paul Shearer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Shearer updated SPARK-14740:
-
Component/s: (was: Spark Core)
 PySpark
 MLlib

> CrossValidatorModel.bestModel does not include hyper-parameters
> ---
>
> Key: SPARK-14740
> URL: https://issues.apache.org/jira/browse/SPARK-14740
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.6.1
>Reporter: Paul Shearer
>
> If you tune hyperparameters using a CrossValidator object in PySpark, you may 
> not be able to extract the parameter values of the best model.
> {noformat}
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.mllib.linalg import Vectors
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
> dataset = sqlContext.createDataFrame(
> [(Vectors.dense([0.0]), 0.0),
>  (Vectors.dense([0.4]), 1.0),
>  (Vectors.dense([0.5]), 0.0),
>  (Vectors.dense([0.6]), 1.0),
>  (Vectors.dense([1.0]), 1.0)] * 10,
> ["features", "label"])
> lr = LogisticRegression()
> grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
> 0.0001]).build()
> evaluator = BinaryClassificationEvaluator()
> cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, 
> evaluator=evaluator)
> cvModel = cv.fit(dataset)
> {noformat}
> I can get the regression coefficient out, but I can't get the regularization 
> parameter
> {noformat}
> In [3]: cvModel.bestModel.coefficients
> Out[3]: DenseVector([3.1573])
> In [4]: cvModel.bestModel.explainParams()
> Out[4]: ''
> In [5]: cvModel.bestModel.extractParamMap()
> Out[5]: {}
> In [15]: cvModel.params
> Out[15]: []
> In [36]: cvModel.bestModel.params
> Out[36]: []
> {noformat}
> For the original issue raised on StackOverflow please see 
> http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark






[jira] [Commented] (SPARK-13973) `ipython notebook` is going away...

2016-04-20 Thread Paul Shearer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249921#comment-15249921
 ] 

Paul Shearer commented on SPARK-13973:
--

Done: https://github.com/apache/spark/pull/12528

I'm a bit new to this, happy to fix any issues.

> `ipython notebook` is going away...
> ---
>
> Key: SPARK-13973
> URL: https://issues.apache.org/jira/browse/SPARK-13973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
> Environment: spark-1.6.1-bin-hadoop2.6
> Anaconda2-2.5.0-Linux-x86_64
>Reporter: Bogdan Pirvu
>Assignee: Rekha Joshi
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Starting {{pyspark}} with following environment variables:
> {code:none}
> export IPYTHON=1
> export IPYTHON_OPTS="notebook --no-browser"
> {code}
> yields this warning
> {code:none}
> [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated 
> and will be removed in future versions.
> [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... 
> continue in 5 sec. Press Ctrl-C to quit now.
> {code}
> Changing line 52 from
> {code:none}
> PYSPARK_DRIVER_PYTHON="ipython"
> {code}
> to
> {code:none}
> PYSPARK_DRIVER_PYTHON="jupyter"
> {code}
> in https://github.com/apache/spark/blob/master/bin/pyspark works for me to 
> solve this issue, but I'm not sure if it's sustainable as I'm not familiar 
> with the rest of the code...
> This is the relevant part of my Python environment:
> {code:none}
> ipython   4.1.2py27_0  
> ipython-genutils  0.1.0 
> ipython_genutils  0.1.0py27_0  
> ipywidgets4.1.1py27_0  
> ...
> jupyter   1.0.0py27_1  
> jupyter-client4.2.1 
> jupyter-console   4.1.1 
> jupyter-core  4.1.0 
> jupyter_client4.2.1py27_0  
> jupyter_console   4.1.1py27_0  
> jupyter_core  4.1.0py27_0
> {code}






[jira] [Comment Edited] (SPARK-13973) `ipython notebook` is going away...

2016-04-20 Thread Paul Shearer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249705#comment-15249705
 ] 

Paul Shearer edited comment on SPARK-13973 at 4/20/16 12:09 PM:


Bottom line... I think IPYTHON=1 should either 

(1) mean what it appears to mean - IPython and not necessarily the notebook - 
or 
(2) be removed entirely as too confusing.

As the code now stands, pyspark is silently broken for IPython shell users who happen to have set a now-deprecated option under an older version of pyspark. Fixable, yes, but more confusing than necessary, and contrary to the intent of backwards compatibility.


was (Author: pshearer):
Bottom line... I think IPYTHON=1 should either 

(1) mean what it appears to mean - IPython and not necessarily the notebook - 
or 
(2) be removed entirely as too confusing.

As the code now stands, the IPython shell user's pyspark is silently broken if 
they happen to have set a deprecated option when using an older version of 
pyspark. Not exactly the definition of backwards compatibility.

> `ipython notebook` is going away...
> ---
>
> Key: SPARK-13973
> URL: https://issues.apache.org/jira/browse/SPARK-13973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
> Environment: spark-1.6.1-bin-hadoop2.6
> Anaconda2-2.5.0-Linux-x86_64
>Reporter: Bogdan Pirvu
>Assignee: Rekha Joshi
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Starting {{pyspark}} with following environment variables:
> {code:none}
> export IPYTHON=1
> export IPYTHON_OPTS="notebook --no-browser"
> {code}
> yields this warning
> {code:none}
> [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated 
> and will be removed in future versions.
> [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... 
> continue in 5 sec. Press Ctrl-C to quit now.
> {code}
> Changing line 52 from
> {code:none}
> PYSPARK_DRIVER_PYTHON="ipython"
> {code}
> to
> {code:none}
> PYSPARK_DRIVER_PYTHON="jupyter"
> {code}
> in https://github.com/apache/spark/blob/master/bin/pyspark works for me to 
> solve this issue, but I'm not sure if it's sustainable as I'm not familiar 
> with the rest of the code...
> This is the relevant part of my Python environment:
> {code:none}
> ipython   4.1.2py27_0  
> ipython-genutils  0.1.0 
> ipython_genutils  0.1.0py27_0  
> ipywidgets4.1.1py27_0  
> ...
> jupyter   1.0.0py27_1  
> jupyter-client4.2.1 
> jupyter-console   4.1.1 
> jupyter-core  4.1.0 
> jupyter_client4.2.1py27_0  
> jupyter_console   4.1.1py27_0  
> jupyter_core  4.1.0py27_0
> {code}






[jira] [Comment Edited] (SPARK-13973) `ipython notebook` is going away...

2016-04-20 Thread Paul Shearer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249705#comment-15249705
 ] 

Paul Shearer edited comment on SPARK-13973 at 4/20/16 12:08 PM:


Bottom line... I think IPYTHON=1 should either 

(1) mean what it appears to mean - IPython and not necessarily the notebook - 
or 
(2) be removed entirely as too confusing.

As the code now stands, pyspark is silently broken for IPython shell users who happen to have set a now-deprecated option under an older version of pyspark. Not exactly the definition of backwards compatibility.


was (Author: pshearer):
Bottom line... I think IPYTHON=1 should either 

(1) mean what it appears to mean - IPython and not necessarily the notebook - 
or 
(2) be removed entirely as too confusing.

> `ipython notebook` is going away...
> ---
>
> Key: SPARK-13973
> URL: https://issues.apache.org/jira/browse/SPARK-13973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
> Environment: spark-1.6.1-bin-hadoop2.6
> Anaconda2-2.5.0-Linux-x86_64
>Reporter: Bogdan Pirvu
>Assignee: Rekha Joshi
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Starting {{pyspark}} with following environment variables:
> {code:none}
> export IPYTHON=1
> export IPYTHON_OPTS="notebook --no-browser"
> {code}
> yields this warning
> {code:none}
> [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated 
> and will be removed in future versions.
> [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... 
> continue in 5 sec. Press Ctrl-C to quit now.
> {code}
> Changing line 52 from
> {code:none}
> PYSPARK_DRIVER_PYTHON="ipython"
> {code}
> to
> {code:none}
> PYSPARK_DRIVER_PYTHON="jupyter"
> {code}
> in https://github.com/apache/spark/blob/master/bin/pyspark works for me to 
> solve this issue, but I'm not sure if it's sustainable as I'm not familiar 
> with the rest of the code...
> This is the relevant part of my Python environment:
> {code:none}
> ipython   4.1.2py27_0  
> ipython-genutils  0.1.0 
> ipython_genutils  0.1.0py27_0  
> ipywidgets4.1.1py27_0  
> ...
> jupyter   1.0.0py27_1  
> jupyter-client4.2.1 
> jupyter-console   4.1.1 
> jupyter-core  4.1.0 
> jupyter_client4.2.1py27_0  
> jupyter_console   4.1.1py27_0  
> jupyter_core  4.1.0py27_0
> {code}






[jira] [Commented] (SPARK-13973) `ipython notebook` is going away...

2016-04-20 Thread Paul Shearer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249705#comment-15249705
 ] 

Paul Shearer commented on SPARK-13973:
--

Bottom line... I think IPYTHON=1 should either 

(1) mean what it appears to mean - IPython and not necessarily the notebook - 
or 
(2) be removed entirely as too confusing.

> `ipython notebook` is going away...
> ---
>
> Key: SPARK-13973
> URL: https://issues.apache.org/jira/browse/SPARK-13973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
> Environment: spark-1.6.1-bin-hadoop2.6
> Anaconda2-2.5.0-Linux-x86_64
>Reporter: Bogdan Pirvu
>Assignee: Rekha Joshi
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Starting {{pyspark}} with following environment variables:
> {code:none}
> export IPYTHON=1
> export IPYTHON_OPTS="notebook --no-browser"
> {code}
> yields this warning
> {code:none}
> [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated 
> and will be removed in future versions.
> [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... 
> continue in 5 sec. Press Ctrl-C to quit now.
> {code}
> Changing line 52 from
> {code:none}
> PYSPARK_DRIVER_PYTHON="ipython"
> {code}
> to
> {code:none}
> PYSPARK_DRIVER_PYTHON="jupyter"
> {code}
> in https://github.com/apache/spark/blob/master/bin/pyspark works for me to 
> solve this issue, but I'm not sure if it's sustainable as I'm not familiar 
> with the rest of the code...
> This is the relevant part of my Python environment:
> {code:none}
> ipython   4.1.2py27_0  
> ipython-genutils  0.1.0 
> ipython_genutils  0.1.0py27_0  
> ipywidgets4.1.1py27_0  
> ...
> jupyter   1.0.0py27_1  
> jupyter-client4.2.1 
> jupyter-console   4.1.1 
> jupyter-core  4.1.0 
> jupyter_client4.2.1py27_0  
> jupyter_console   4.1.1py27_0  
> jupyter_core  4.1.0py27_0
> {code}






[jira] [Comment Edited] (SPARK-13973) `ipython notebook` is going away...

2016-04-20 Thread Paul Shearer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249656#comment-15249656
 ] 

Paul Shearer edited comment on SPARK-13973 at 4/20/16 11:36 AM:


In my development experience, and that of the developers I know, the notebook is not the only or even the main use case for IPython. It's just the most visible one because it's easy to make slides and such. IPython, without the notebook, is a productive general-purpose programming environment for data scientists.

Just to get us on the same page - pyspark is a python shell with SparkContext 
pre-loaded. The python shell can be either the default python shell or the enhanced 
IPython shell. The latter offers tab completion of object attributes, nicely 
formatted tracebacks, "magic" macros for executing pasted code snippets, 
scripting, debugging, and accessing help/docstrings. It is a separate entity 
from the notebook and long predates the notebook. 

Most working scientists and analysts I know, including myself, use the IPython 
shell much more than the notebook. The notebook is more of a presentation and 
exploratory analysis tool, while the IPython shell is better for general 
programming / power users.

And yes, not overriding `PYSPARK_DRIVER_PYTHON` would avoid breaking my setup. 
I found it easy enough just to comment out `IPYTHON=1` once I realized that was the problem. Alternatively, you could print an alert that the option is deprecated and tell people how to change it, so you don't have to keep supporting it.
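
For reference, a minimal sketch of the replacement configuration such an alert could point to, using the PYSPARK_DRIVER_PYTHON variable discussed in this thread together with PYSPARK_DRIVER_PYTHON_OPTS, which is assumed here to carry the driver options that IPYTHON_OPTS used to; exact variable handling may differ between Spark versions.

{code:none}
# Plain IPython shell as the pyspark driver (no notebook):
export PYSPARK_DRIVER_PYTHON=ipython

# Jupyter notebook as the driver, mirroring the old IPYTHON_OPTS usage:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser"
{code}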


was (Author: pshearer):
Just to get us on the same page - pyspark is a python shell with SparkContext 
pre-loaded. The python shell can be either default python shell or the enhanced 
IPython shell. The latter offers tab completion of object attributes, nicely 
formatted tracebacks, "magic" macros for executing pasted code snippets, 
scripting, debugging, and accessing help/docstrings. It is a separate entity 
from the notebook and long predates the notebook. 

Most working scientists and analysts I know, including myself, use the IPython 
shell much more than the notebook. The notebook is more of a presentation and 
exploratory analysis tool, while the IPython shell is better for general 
programming / power users.

And yes, not overriding `PYSPARK_DRIVER_PYTHON` would avoid breaking my setup. 
I found it easy enough just to comment out `IPYTHON=1` once I realized that was 
the problem, so alternatively you could print some alert that this is 
deprecated and tell people how to change it, so you don't have to keep 
supporting it.

> `ipython notebook` is going away...
> ---
>
> Key: SPARK-13973
> URL: https://issues.apache.org/jira/browse/SPARK-13973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
> Environment: spark-1.6.1-bin-hadoop2.6
> Anaconda2-2.5.0-Linux-x86_64
>Reporter: Bogdan Pirvu
>Assignee: Rekha Joshi
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Starting {{pyspark}} with following environment variables:
> {code:none}
> export IPYTHON=1
> export IPYTHON_OPTS="notebook --no-browser"
> {code}
> yields this warning
> {code:none}
> [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated 
> and will be removed in future versions.
> [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... 
> continue in 5 sec. Press Ctrl-C to quit now.
> {code}
> Changing line 52 from
> {code:none}
> PYSPARK_DRIVER_PYTHON="ipython"
> {code}
> to
> {code:none}
> PYSPARK_DRIVER_PYTHON="jupyter"
> {code}
> in https://github.com/apache/spark/blob/master/bin/pyspark works for me to 
> solve this issue, but I'm not sure if it's sustainable as I'm not familiar 
> with the rest of the code...
> This is the relevant part of my Python environment:
> {code:none}
> ipython   4.1.2py27_0  
> ipython-genutils  0.1.0 
> ipython_genutils  0.1.0py27_0  
> ipywidgets4.1.1py27_0  
> ...
> jupyter   1.0.0py27_1  
> jupyter-client4.2.1 
> jupyter-console   4.1.1 
> jupyter-core  4.1.0 
> jupyter_client4.2.1py27_0  
> jupyter_console   4.1.1py27_0  
> jupyter_core  4.1.0py27_0
> {code}






[jira] [Comment Edited] (SPARK-13973) `ipython notebook` is going away...

2016-04-20 Thread Paul Shearer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249656#comment-15249656
 ] 

Paul Shearer edited comment on SPARK-13973 at 4/20/16 11:36 AM:


In my development experience, and that of the developers I know, the notebook is not the only or even the main use case for IPython. It's just the most visible one because it's easy to make slides and webpages and such. IPython, without the notebook, is a productive general-purpose programming environment for data scientists.

Just to get us on the same page - pyspark is a python shell with SparkContext 
pre-loaded. The python shell can be either the default python shell or the enhanced 
IPython shell. The latter offers tab completion of object attributes, nicely 
formatted tracebacks, "magic" macros for executing pasted code snippets, 
scripting, debugging, and accessing help/docstrings. It is a separate entity 
from the notebook and long predates the notebook. 

Most working scientists and analysts I know, including myself, use the IPython 
shell much more than the notebook. The notebook is more of a presentation and 
exploratory analysis tool, while the IPython shell is better for general 
programming / power users.

And yes, not overriding `PYSPARK_DRIVER_PYTHON` would avoid breaking my setup. 
I found it easy enough just to comment out `IPYTHON=1` once I realized that was the problem. Alternatively, you could print an alert that the option is deprecated and tell people how to change it, so you don't have to keep supporting it.


was (Author: pshearer):
In my development experience and those I know, the notebook is not the only or 
even the main use case for IPython. It's just the most visible because it's 
easy to make slides and such. IPython, without the notebook, is a productive 
general-purpose programming environment for data scientists.

Just to get us on the same page - pyspark is a python shell with SparkContext 
pre-loaded. The python shell can be either default python shell or the enhanced 
IPython shell. The latter offers tab completion of object attributes, nicely 
formatted tracebacks, "magic" macros for executing pasted code snippets, 
scripting, debugging, and accessing help/docstrings. It is a separate entity 
from the notebook and long predates the notebook. 

Most working scientists and analysts I know, including myself, use the IPython 
shell much more than the notebook. The notebook is more of a presentation and 
exploratory analysis tool, while the IPython shell is better for general 
programming / power users.

And yes, not overriding `PYSPARK_DRIVER_PYTHON` would avoid breaking my setup. 
I found it easy enough just to comment out `IPYTHON=1` once I realized that was 
the problem, so alternatively you could print some alert that this is 
deprecated and tell people how to change it, so you don't have to keep 
supporting it.

> `ipython notebook` is going away...
> ---
>
> Key: SPARK-13973
> URL: https://issues.apache.org/jira/browse/SPARK-13973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
> Environment: spark-1.6.1-bin-hadoop2.6
> Anaconda2-2.5.0-Linux-x86_64
>Reporter: Bogdan Pirvu
>Assignee: Rekha Joshi
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Starting {{pyspark}} with following environment variables:
> {code:none}
> export IPYTHON=1
> export IPYTHON_OPTS="notebook --no-browser"
> {code}
> yields this warning
> {code:none}
> [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated 
> and will be removed in future versions.
> [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... 
> continue in 5 sec. Press Ctrl-C to quit now.
> {code}
> Changing line 52 from
> {code:none}
> PYSPARK_DRIVER_PYTHON="ipython"
> {code}
> to
> {code:none}
> PYSPARK_DRIVER_PYTHON="jupyter"
> {code}
> in https://github.com/apache/spark/blob/master/bin/pyspark works for me to 
> solve this issue, but I'm not sure if it's sustainable as I'm not familiar 
> with the rest of the code...
> This is the relevant part of my Python environment:
> {code:none}
> ipython   4.1.2py27_0  
> ipython-genutils  0.1.0 
> ipython_genutils  0.1.0py27_0  
> ipywidgets4.1.1py27_0  
> ...
> jupyter   1.0.0py27_1  
> jupyter-client4.2.1 
> jupyter-console   4.1.1 
> jupyter-core  4.1.0 
> jupyter_client4.2.1py27_0  
> jupyter_console   4.1.1py27_0  
> jupyter_core  4.1.0py27_0
> {code}




[jira] [Comment Edited] (SPARK-13973) `ipython notebook` is going away...

2016-04-20 Thread Paul Shearer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249656#comment-15249656
 ] 

Paul Shearer edited comment on SPARK-13973 at 4/20/16 11:18 AM:


Just to get us on the same page - pyspark is a python shell with SparkContext 
pre-loaded. The python shell can be either the default python shell or the enhanced 
IPython shell. The latter offers tab completion of object attributes, nicely 
formatted tracebacks, "magic" macros for executing pasted code snippets, 
scripting, debugging, and accessing help/docstrings. It is a separate entity 
from the notebook and long predates the notebook. 

Most working scientists and analysts I know, including myself, use the IPython 
shell much more than the notebook. The notebook is more of a presentation and 
exploratory analysis tool, while the IPython shell is better for general 
programming / power users.

And yes, not overriding `PYSPARK_DRIVER_PYTHON` would avoid breaking my setup. 
I found it easy enough just to comment out `IPYTHON=1` once I realized that was the problem. Alternatively, you could print an alert that the option is deprecated and tell people how to change it, so you don't have to keep supporting it.


was (Author: pshearer):
Just to get us on the same page - pyspark is a python shell with SparkContext 
pre-loaded. The python shell can be either default python shell or the enhanced 
IPython shell. The latter offers tab completion of object attributes, nicely 
formatted tracebacks, "magic" macros for executing pasted code snippets, 
scripting, debugging, and accessing help/docstrings. It is a separate entity 
from the notebook and long predates the notebook. 

Most working scientists and analysts I know, including myself, use the IPython 
shell much more than the notebook. The notebook is more of a presentation and 
exploratory analysis tool, while the IPython shell is better for power users.

And yes, not overriding `PYSPARK_DRIVER_PYTHON` would avoid breaking my setup. 
I found it easy enough just to comment out `IPYTHON=1` once I realized that was 
the problem, so alternatively you could print some alert that this is 
deprecated and tell people how to change it, so you don't have to keep 
supporting it.

> `ipython notebook` is going away...
> ---
>
> Key: SPARK-13973
> URL: https://issues.apache.org/jira/browse/SPARK-13973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
> Environment: spark-1.6.1-bin-hadoop2.6
> Anaconda2-2.5.0-Linux-x86_64
>Reporter: Bogdan Pirvu
>Assignee: Rekha Joshi
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Starting {{pyspark}} with following environment variables:
> {code:none}
> export IPYTHON=1
> export IPYTHON_OPTS="notebook --no-browser"
> {code}
> yields this warning
> {code:none}
> [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated 
> and will be removed in future versions.
> [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... 
> continue in 5 sec. Press Ctrl-C to quit now.
> {code}
> Changing line 52 from
> {code:none}
> PYSPARK_DRIVER_PYTHON="ipython"
> {code}
> to
> {code:none}
> PYSPARK_DRIVER_PYTHON="jupyter"
> {code}
> in https://github.com/apache/spark/blob/master/bin/pyspark works for me to 
> solve this issue, but I'm not sure if it's sustainable as I'm not familiar 
> with the rest of the code...
> This is the relevant part of my Python environment:
> {code:none}
> ipython   4.1.2py27_0  
> ipython-genutils  0.1.0 
> ipython_genutils  0.1.0py27_0  
> ipywidgets4.1.1py27_0  
> ...
> jupyter   1.0.0py27_1  
> jupyter-client4.2.1 
> jupyter-console   4.1.1 
> jupyter-core  4.1.0 
> jupyter_client4.2.1py27_0  
> jupyter_console   4.1.1py27_0  
> jupyter_core  4.1.0py27_0
> {code}






[jira] [Comment Edited] (SPARK-13973) `ipython notebook` is going away...

2016-04-20 Thread Paul Shearer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249656#comment-15249656
 ] 

Paul Shearer edited comment on SPARK-13973 at 4/20/16 11:17 AM:


Just to get us on the same page - pyspark is a python shell with SparkContext 
pre-loaded. The python shell can be either the default python shell or the enhanced 
IPython shell. The latter offers tab completion of object attributes, nicely 
formatted tracebacks, "magic" macros for executing pasted code snippets, 
scripting, debugging, and accessing help/docstrings. It is a separate entity 
from the notebook and long predates the notebook. 

Most working scientists and analysts I know, including myself, use the IPython 
shell much more than the notebook. The notebook is more of a presentation and 
exploratory analysis tool, while the IPython shell is better for power users.

And yes, not overriding `PYSPARK_DRIVER_PYTHON` would avoid breaking my setup. 
I found it easy enough just to comment out `IPYTHON=1` once I realized that was the problem. Alternatively, you could print an alert that the option is deprecated and tell people how to change it, so you don't have to keep supporting it.


was (Author: pshearer):
Just to get us on the same page - pyspark is a python shell with SparkContext 
pre-loaded. The python shell can be either default python shell or the enhanced 
IPython shell. The latter offers tab completion of object attributes, nicely 
formatted tracebacks, "magic" macros for executing pasted code snippets, 
scripting, debugging, and accessing help/docstrings. It is a separate entity 
from the notebook and long predates the notebook. 

Most working scientists and analysts I know, including myself, use the IPython 
shell much more than the notebook. The notebook is more of a presentation and 
exploratory analysis tool, while the IPython shell is better for power users.

> `ipython notebook` is going away...
> ---
>
> Key: SPARK-13973
> URL: https://issues.apache.org/jira/browse/SPARK-13973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
> Environment: spark-1.6.1-bin-hadoop2.6
> Anaconda2-2.5.0-Linux-x86_64
>Reporter: Bogdan Pirvu
>Assignee: Rekha Joshi
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Starting {{pyspark}} with following environment variables:
> {code:none}
> export IPYTHON=1
> export IPYTHON_OPTS="notebook --no-browser"
> {code}
> yields this warning
> {code:none}
> [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated 
> and will be removed in future versions.
> [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... 
> continue in 5 sec. Press Ctrl-C to quit now.
> {code}
> Changing line 52 from
> {code:none}
> PYSPARK_DRIVER_PYTHON="ipython"
> {code}
> to
> {code:none}
> PYSPARK_DRIVER_PYTHON="jupyter"
> {code}
> in https://github.com/apache/spark/blob/master/bin/pyspark works for me to 
> solve this issue, but I'm not sure if it's sustainable as I'm not familiar 
> with the rest of the code...
> This is the relevant part of my Python environment:
> {code:none}
> ipython   4.1.2py27_0  
> ipython-genutils  0.1.0 
> ipython_genutils  0.1.0py27_0  
> ipywidgets4.1.1py27_0  
> ...
> jupyter   1.0.0py27_1  
> jupyter-client4.2.1 
> jupyter-console   4.1.1 
> jupyter-core  4.1.0 
> jupyter_client4.2.1py27_0  
> jupyter_console   4.1.1py27_0  
> jupyter_core  4.1.0py27_0
> {code}






[jira] [Commented] (SPARK-13973) `ipython notebook` is going away...

2016-04-20 Thread Paul Shearer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249656#comment-15249656
 ] 

Paul Shearer commented on SPARK-13973:
--

Just to get us on the same page - pyspark is a python shell with SparkContext 
pre-loaded. The python shell can be either the default python shell or the enhanced 
IPython shell. The latter offers tab completion of object attributes, nicely 
formatted tracebacks, "magic" macros for executing pasted code snippets, 
scripting, debugging, and accessing help/docstrings. It is a separate entity 
from the notebook and long predates the notebook. 

Most working scientists and analysts I know, including myself, use the IPython 
shell much more than the notebook. The notebook is more of a presentation and 
exploratory analysis tool, while the IPython shell is better for power users.

> `ipython notebook` is going away...
> ---
>
> Key: SPARK-13973
> URL: https://issues.apache.org/jira/browse/SPARK-13973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
> Environment: spark-1.6.1-bin-hadoop2.6
> Anaconda2-2.5.0-Linux-x86_64
>Reporter: Bogdan Pirvu
>Assignee: Rekha Joshi
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Starting {{pyspark}} with following environment variables:
> {code:none}
> export IPYTHON=1
> export IPYTHON_OPTS="notebook --no-browser"
> {code}
> yields this warning
> {code:none}
> [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated 
> and will be removed in future versions.
> [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... 
> continue in 5 sec. Press Ctrl-C to quit now.
> {code}
> Changing line 52 from
> {code:none}
> PYSPARK_DRIVER_PYTHON="ipython"
> {code}
> to
> {code:none}
> PYSPARK_DRIVER_PYTHON="jupyter"
> {code}
> in https://github.com/apache/spark/blob/master/bin/pyspark works for me to 
> solve this issue, but I'm not sure if it's sustainable as I'm not familiar 
> with the rest of the code...
> This is the relevant part of my Python environment:
> {code:none}
> ipython   4.1.2py27_0  
> ipython-genutils  0.1.0 
> ipython_genutils  0.1.0py27_0  
> ipywidgets4.1.1py27_0  
> ...
> jupyter   1.0.0py27_1  
> jupyter-client4.2.1 
> jupyter-console   4.1.1 
> jupyter-core  4.1.0 
> jupyter_client4.2.1py27_0  
> jupyter_console   4.1.1py27_0  
> jupyter_core  4.1.0py27_0
> {code}






[jira] [Comment Edited] (SPARK-13973) `ipython notebook` is going away...

2016-04-20 Thread Paul Shearer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249628#comment-15249628
 ] 

Paul Shearer edited comment on SPARK-13973 at 4/20/16 10:48 AM:


The problem with this change is that it creates a bug for users who simply want 
the IPython interactive shell, as opposed to the notebook. `ipython` with no 
arguments starts the IPython shell, but `jupyter` with no arguments results in 
the following error:

{noformat}
usage: jupyter [-h] [--version] [--config-dir] [--data-dir] [--runtime-dir]
   [--paths] [--json]
   [subcommand]
jupyter: error: one of the arguments --version subcommand --config-dir 
--data-dir --runtime-dir --paths is required
{noformat}

I can't speak for the general Python community, but as a data scientist I personally find the IPython notebook suitable only for very basic exploratory analysis - any sort of application development is much better served by the IPython shell, so I'm always using the shell and rarely the notebook. So I prefer the old script.

Perhaps the best answer is to stop maintaining backwards compatibility that is no longer sustainable. The committed change broke the pyspark startup script in my case, and the old startup script will eventually be broken when `ipython notebook` is removed. So perhaps `IPYTHON=1` should just result in some kind of error message prompting the user to switch to the new PYSPARK_DRIVER_PYTHON config style. Most Spark users know the installation process is not seamless and requires mucking about with environment variables - they might as well be told to do it in a way that's convenient for the development team.
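
As an illustration only (not the actual script), a guard along these lines near the top of bin/pyspark would produce that kind of error message; the wording and the companion PYSPARK_DRIVER_PYTHON_OPTS variable are assumptions.

{code:none}
# Hypothetical guard: refuse the old switches and point users at the new variables.
if [[ -n "$IPYTHON" || -n "$IPYTHON_OPTS" ]]; then
  echo "IPYTHON and IPYTHON_OPTS are no longer supported; set PYSPARK_DRIVER_PYTHON" \
       "and PYSPARK_DRIVER_PYTHON_OPTS instead." 1>&2
  exit 1
fi
{code}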


was (Author: pshearer):
The problem with this change is that it creates a bug for users who simply want 
the IPython interactive shell, as opposed to the notebook. `ipython` with no 
arguments starts the IPython shell, but `jupyter` with no arguments results in 
the following error:

{noformat}
usage: jupyter [-h] [--version] [--config-dir] [--data-dir] [--runtime-dir]
   [--paths] [--json]
   [subcommand]
jupyter: error: one of the arguments --version subcommand --config-dir 
--data-dir --runtime-dir --paths is required
{noformat}

I can't speak for the general Python community but as a data scientist, 
personally I find the IPython notebook only suitable for very basic exploratory 
analysis - any sort of application development is much better served by the 
IPython shell, so I'm always using the shell and rarely the notebook.

It seems like maintaining this old configuration switch is no longer 
sustainable. The change breaks it for my case, and the old state will 
eventually be broken when `ipython notebook` is deprecated. So perhaps 
`IPYTHON=1` should just result in some kind of error message prompting the user 
to switch to the new PYSPARK_DRIVER_PYTHON config style.

> `ipython notebook` is going away...
> ---
>
> Key: SPARK-13973
> URL: https://issues.apache.org/jira/browse/SPARK-13973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
> Environment: spark-1.6.1-bin-hadoop2.6
> Anaconda2-2.5.0-Linux-x86_64
>Reporter: Bogdan Pirvu
>Assignee: Rekha Joshi
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Starting {{pyspark}} with following environment variables:
> {code:none}
> export IPYTHON=1
> export IPYTHON_OPTS="notebook --no-browser"
> {code}
> yields this warning
> {code:none}
> [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated 
> and will be removed in future versions.
> [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... 
> continue in 5 sec. Press Ctrl-C to quit now.
> {code}
> Changing line 52 from
> {code:none}
> PYSPARK_DRIVER_PYTHON="ipython"
> {code}
> to
> {code:none}
> PYSPARK_DRIVER_PYTHON="jupyter"
> {code}
> in https://github.com/apache/spark/blob/master/bin/pyspark works for me to 
> solve this issue, but I'm not sure if it's sustainable as I'm not familiar 
> with the rest of the code...
> This is the relevant part of my Python environment:
> {code:none}
> ipython   4.1.2py27_0  
> ipython-genutils  0.1.0 
> ipython_genutils  0.1.0py27_0  
> ipywidgets4.1.1py27_0  
> ...
> jupyter   1.0.0py27_1  
> jupyter-client4.2.1 
> jupyter-console   4.1.1 
> jupyter-core  4.1.0 
> jupyter_client4.2.1py27_0  
> jupyter_console   4.1.1py27_0  
> jupyter_core  4.1.0py27_0
> {code}




[jira] [Commented] (SPARK-13973) `ipython notebook` is going away...

2016-04-20 Thread Paul Shearer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249628#comment-15249628
 ] 

Paul Shearer commented on SPARK-13973:
--

The problem with this change is that it creates a bug for users who simply want 
the IPython interactive shell, as opposed to the notebook. `ipython` with no 
arguments starts the IPython shell, but `jupyter` with no arguments results in 
the following error:

{noformat}
usage: jupyter [-h] [--version] [--config-dir] [--data-dir] [--runtime-dir]
   [--paths] [--json]
   [subcommand]
jupyter: error: one of the arguments --version subcommand --config-dir 
--data-dir --runtime-dir --paths is required
{noformat}

I can't speak for the general Python community, but as a data scientist I personally find the IPython notebook suitable only for very basic exploratory analysis - any sort of application development is much better served by the IPython shell, so I'm always using the shell and rarely the notebook.

It seems like maintaining this old configuration switch is no longer sustainable. The change breaks it in my case, and the old behavior will eventually be broken when `ipython notebook` is removed. So perhaps `IPYTHON=1` should just result in some kind of error message prompting the user to switch to the new PYSPARK_DRIVER_PYTHON config style.

> `ipython notebook` is going away...
> ---
>
> Key: SPARK-13973
> URL: https://issues.apache.org/jira/browse/SPARK-13973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
> Environment: spark-1.6.1-bin-hadoop2.6
> Anaconda2-2.5.0-Linux-x86_64
>Reporter: Bogdan Pirvu
>Assignee: Rekha Joshi
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Starting {{pyspark}} with following environment variables:
> {code:none}
> export IPYTHON=1
> export IPYTHON_OPTS="notebook --no-browser"
> {code}
> yields this warning
> {code:none}
> [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated 
> and will be removed in future versions.
> [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... 
> continue in 5 sec. Press Ctrl-C to quit now.
> {code}
> Changing line 52 from
> {code:none}
> PYSPARK_DRIVER_PYTHON="ipython"
> {code}
> to
> {code:none}
> PYSPARK_DRIVER_PYTHON="jupyter"
> {code}
> in https://github.com/apache/spark/blob/master/bin/pyspark works for me to 
> solve this issue, but I'm not sure if it's sustainable as I'm not familiar 
> with the rest of the code...
> This is the relevant part of my Python environment:
> {code:none}
> ipython   4.1.2py27_0  
> ipython-genutils  0.1.0 
> ipython_genutils  0.1.0py27_0  
> ipywidgets4.1.1py27_0  
> ...
> jupyter   1.0.0py27_1  
> jupyter-client4.2.1 
> jupyter-console   4.1.1 
> jupyter-core  4.1.0 
> jupyter_client4.2.1py27_0  
> jupyter_console   4.1.1py27_0  
> jupyter_core  4.1.0py27_0
> {code}






[jira] [Updated] (SPARK-14740) CrossValidatorModel.bestModel does not include hyper-parameters

2016-04-19 Thread Paul Shearer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Shearer updated SPARK-14740:
-
Description: 
If you tune hyperparameters using a CrossValidator object in PySpark, you may 
not be able to extract the parameter values of the best model.

{noformat}
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
[(Vectors.dense([0.0]), 0.0),
 (Vectors.dense([0.4]), 1.0),
 (Vectors.dense([0.5]), 0.0),
 (Vectors.dense([0.6]), 1.0),
 (Vectors.dense([1.0]), 1.0)] * 10,
["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
{noformat}

I can get the regression coefficient out, but I can't get the regularization 
parameter

{noformat}
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])

In [4]: cvModel.bestModel.explainParams()
Out[4]: ''

In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}

In [15]: cvModel.params
Out[15]: []

In [36]: cvModel.bestModel.params
Out[36]: []
{noformat}

For the original issue raised on StackOverflow please see 
http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark



  was:
If you tune hyperparameters using a CrossValidator object in PySpark, you may 
not be able to extract the parameter values of the best model.

{noformat}
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
[(Vectors.dense([0.0]), 0.0),
 (Vectors.dense([0.4]), 1.0),
 (Vectors.dense([0.5]), 0.0),
 (Vectors.dense([0.6]), 1.0),
 (Vectors.dense([1.0]), 1.0)] * 10,
["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
{noformat}

I can get the regression coefficient out, but I can't get the regularization 
parameter

{noformat}
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])

In [4]: cvModel.bestModel.explainParams()
Out[4]: ''

In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}

In [15]: cvModel.params
Out[15]: []

In [36]: cvModel.bestModel.params
Out[36]: []
{noformat}

For a simple example please see 
http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark




> CrossValidatorModel.bestModel does not include hyper-parameters
> ---
>
> Key: SPARK-14740
> URL: https://issues.apache.org/jira/browse/SPARK-14740
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Paul Shearer
>
> If you tune hyperparameters using a CrossValidator object in PySpark, you may 
> not be able to extract the parameter values of the best model.
> {noformat}
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.mllib.linalg import Vectors
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
> dataset = sqlContext.createDataFrame(
> [(Vectors.dense([0.0]), 0.0),
>  (Vectors.dense([0.4]), 1.0),
>  (Vectors.dense([0.5]), 0.0),
>  (Vectors.dense([0.6]), 1.0),
>  (Vectors.dense([1.0]), 1.0)] * 10,
> ["features", "label"])
> lr = LogisticRegression()
> grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
> 0.0001]).build()
> evaluator = BinaryClassificationEvaluator()
> cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, 
> evaluator=evaluator)
> cvModel = cv.fit(dataset)
> {noformat}
> I can get the regression coefficient out, but I can't get the regularization 
> parameter
> {noformat}
> In [3]: cvModel.bestModel.coefficients
> Out[3]: DenseVector([3.1573])
> In [4]: cvModel.bestModel.explainParams()
> Out[4]: ''
> In [5]: cvModel.bestModel.extractParamMap()
> Out[5]: {}
> In [15]: cvModel.params
> Out[15]: []
> In [36]: cvModel.bestModel.params
> Out[36]: []
> {noformat}
> For the original issue raised on StackOverflow please see 
> http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark




[jira] [Updated] (SPARK-14740) CrossValidatorModel.bestModel does not include hyper-parameters

2016-04-19 Thread Paul Shearer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Shearer updated SPARK-14740:
-
Description: 
If you tune hyperparameters using a CrossValidator object in PySpark, you may 
not be able to extract the parameter values of the best model.

{noformat}
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
[(Vectors.dense([0.0]), 0.0),
 (Vectors.dense([0.4]), 1.0),
 (Vectors.dense([0.5]), 0.0),
 (Vectors.dense([0.6]), 1.0),
 (Vectors.dense([1.0]), 1.0)] * 10,
["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
{noformat}

I can get the regression coefficient out, but I can't get the regularization 
parameter

{noformat}
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])

In [4]: cvModel.bestModel.explainParams()
Out[4]: ''

In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}

In [15]: cvModel.params
Out[15]: []

In [36]: cvModel.bestModel.params
Out[36]: []
{noformat}

For a simple example please see 
http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark



  was:
If you tune hyperparameters using a CrossValidator object in PySpark, you may 
not be able to extract the parameter values of the best model.

{{
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
[(Vectors.dense([0.0]), 0.0),
 (Vectors.dense([0.4]), 1.0),
 (Vectors.dense([0.5]), 0.0),
 (Vectors.dense([0.6]), 1.0),
 (Vectors.dense([1.0]), 1.0)] * 10,
["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
}}

I can get the regression coefficient out, but I can't get the regularization 
parameter

{{
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])

In [4]: cvModel.bestModel.explainParams()
Out[4]: ''

In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}

In [15]: cvModel.params
Out[15]: []

In [36]: cvModel.bestModel.params
Out[36]: []
}}

For a simple example please see 
http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark




> CrossValidatorModel.bestModel does not include hyper-parameters
> ---
>
> Key: SPARK-14740
> URL: https://issues.apache.org/jira/browse/SPARK-14740
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Paul Shearer
>
> If you tune hyperparameters using a CrossValidator object in PySpark, you may 
> not be able to extract the parameter values of the best model.
> {noformat}
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.mllib.linalg import Vectors
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
> dataset = sqlContext.createDataFrame(
> [(Vectors.dense([0.0]), 0.0),
>  (Vectors.dense([0.4]), 1.0),
>  (Vectors.dense([0.5]), 0.0),
>  (Vectors.dense([0.6]), 1.0),
>  (Vectors.dense([1.0]), 1.0)] * 10,
> ["features", "label"])
> lr = LogisticRegression()
> grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
> 0.0001]).build()
> evaluator = BinaryClassificationEvaluator()
> cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, 
> evaluator=evaluator)
> cvModel = cv.fit(dataset)
> {noformat}
> I can get the regression coefficient out, but I can't get the regularization 
> parameter
> {noformat}
> In [3]: cvModel.bestModel.coefficients
> Out[3]: DenseVector([3.1573])
> In [4]: cvModel.bestModel.explainParams()
> Out[4]: ''
> In [5]: cvModel.bestModel.extractParamMap()
> Out[5]: {}
> In [15]: cvModel.params
> Out[15]: []
> In [36]: cvModel.bestModel.params
> Out[36]: []
> {noformat}
> For a simple example please see 
> http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark





[jira] [Updated] (SPARK-14740) CrossValidatorModel.bestModel does not include hyper-parameters

2016-04-19 Thread Paul Shearer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Shearer updated SPARK-14740:
-
Description: 
If you tune hyperparameters using a CrossValidator object in PySpark, you may 
not be able to extract the parameter values of the best model.

{noformat}
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
[(Vectors.dense([0.0]), 0.0),
 (Vectors.dense([0.4]), 1.0),
 (Vectors.dense([0.5]), 0.0),
 (Vectors.dense([0.6]), 1.0),
 (Vectors.dense([1.0]), 1.0)] * 10,
["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
{noformat}

I can get the regression coefficient out, but I can't get the regularization 
parameter

{noformat}
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])

In [4]: cvModel.bestModel.explainParams()
Out[4]: ''

In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}

In [15]: cvModel.params
Out[15]: []

In [36]: cvModel.bestModel.params
Out[36]: []
{noformat}

For the original issue on StackOverflow please see 
http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark



  was:
If you tune hyperparameters using a CrossValidator object in PySpark, you may 
not be able to extract the parameter values of the best model.

```
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
[(Vectors.dense([0.0]), 0.0),
 (Vectors.dense([0.4]), 1.0),
 (Vectors.dense([0.5]), 0.0),
 (Vectors.dense([0.6]), 1.0),
 (Vectors.dense([1.0]), 1.0)] * 10,
["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
```

I can get the regression coefficient out, but I can't get the regularization 
parameter

```
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])

In [4]: cvModel.bestModel.explainParams()
Out[4]: ''

In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}

In [15]: cvModel.params
Out[15]: []

In [36]: cvModel.bestModel.params
Out[36]: []
```

For a simple example please see 
http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark




> CrossValidatorModel.bestModel does not include hyper-parameters
> ---
>
> Key: SPARK-14740
> URL: https://issues.apache.org/jira/browse/SPARK-14740
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Paul Shearer
>
> If you tune hyperparameters using a CrossValidator object in PySpark, you may 
> not be able to extract the parameter values of the best model.
> {noformat}
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.mllib.linalg import Vectors
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
> dataset = sqlContext.createDataFrame(
> [(Vectors.dense([0.0]), 0.0),
>  (Vectors.dense([0.4]), 1.0),
>  (Vectors.dense([0.5]), 0.0),
>  (Vectors.dense([0.6]), 1.0),
>  (Vectors.dense([1.0]), 1.0)] * 10,
> ["features", "label"])
> lr = LogisticRegression()
> grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
> 0.0001]).build()
> evaluator = BinaryClassificationEvaluator()
> cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, 
> evaluator=evaluator)
> cvModel = cv.fit(dataset)
> {noformat}
> I can get the regression coefficient out, but I can't get the regularization 
> parameter
> {noformat}
> In [3]: cvModel.bestModel.coefficients
> Out[3]: DenseVector([3.1573])
> In [4]: cvModel.bestModel.explainParams()
> Out[4]: ''
> In [5]: cvModel.bestModel.extractParamMap()
> Out[5]: {}
> In [15]: cvModel.params
> Out[15]: []
> In [36]: cvModel.bestModel.params
> Out[36]: []
> {noformat}
> For the original issue on StackOverflow please see 
> http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark





[jira] [Updated] (SPARK-14740) CrossValidatorModel.bestModel does not include hyper-parameters

2016-04-19 Thread Paul Shearer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Shearer updated SPARK-14740:
-
Description: 
If you tune hyperparameters using a CrossValidator object in PySpark, you may 
not be able to extract the parameter values of the best model.

{{
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
[(Vectors.dense([0.0]), 0.0),
 (Vectors.dense([0.4]), 1.0),
 (Vectors.dense([0.5]), 0.0),
 (Vectors.dense([0.6]), 1.0),
 (Vectors.dense([1.0]), 1.0)] * 10,
["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
}}

I can get the regression coefficient out, but I can't get the regularization 
parameter

{{
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])

In [4]: cvModel.bestModel.explainParams()
Out[4]: ''

In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}

In [15]: cvModel.params
Out[15]: []

In [36]: cvModel.bestModel.params
Out[36]: []
}}

For a simple example please see 
http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark



  was:
If you tune hyperparameters using a CrossValidator object in PySpark, you may 
not be able to extract the parameter values of the best model.

{noformat}
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
[(Vectors.dense([0.0]), 0.0),
 (Vectors.dense([0.4]), 1.0),
 (Vectors.dense([0.5]), 0.0),
 (Vectors.dense([0.6]), 1.0),
 (Vectors.dense([1.0]), 1.0)] * 10,
["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
{noformat}

I can get the regression coefficient out, but I can't get the regularization 
parameter

{noformat}
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])

In [4]: cvModel.bestModel.explainParams()
Out[4]: ''

In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}

In [15]: cvModel.params
Out[15]: []

In [36]: cvModel.bestModel.params
Out[36]: []
{noformat}

For the original issue on StackOverflow please see 
http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark




> CrossValidatorModel.bestModel does not include hyper-parameters
> ---
>
> Key: SPARK-14740
> URL: https://issues.apache.org/jira/browse/SPARK-14740
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Paul Shearer
>
> If you tune hyperparameters using a CrossValidator object in PySpark, you may 
> not be able to extract the parameter values of the best model.
> {{
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.mllib.linalg import Vectors
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
> dataset = sqlContext.createDataFrame(
> [(Vectors.dense([0.0]), 0.0),
>  (Vectors.dense([0.4]), 1.0),
>  (Vectors.dense([0.5]), 0.0),
>  (Vectors.dense([0.6]), 1.0),
>  (Vectors.dense([1.0]), 1.0)] * 10,
> ["features", "label"])
> lr = LogisticRegression()
> grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
> 0.0001]).build()
> evaluator = BinaryClassificationEvaluator()
> cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, 
> evaluator=evaluator)
> cvModel = cv.fit(dataset)
> }}
> I can get the regression coefficient out, but I can't get the regularization 
> parameter
> {{
> In [3]: cvModel.bestModel.coefficients
> Out[3]: DenseVector([3.1573])
> In [4]: cvModel.bestModel.explainParams()
> Out[4]: ''
> In [5]: cvModel.bestModel.extractParamMap()
> Out[5]: {}
> In [15]: cvModel.params
> Out[15]: []
> In [36]: cvModel.bestModel.params
> Out[36]: []
> }}
> For a simple example please see 
> http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (SPARK-14740) CrossValidatorModel.bestModel does not include hyper-parameters

2016-04-19 Thread Paul Shearer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Shearer updated SPARK-14740:
-
Description: 
If you tune hyperparameters using a CrossValidator object in PySpark, you may 
not be able to extract the parameter values of the best model.

```
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
[(Vectors.dense([0.0]), 0.0),
 (Vectors.dense([0.4]), 1.0),
 (Vectors.dense([0.5]), 0.0),
 (Vectors.dense([0.6]), 1.0),
 (Vectors.dense([1.0]), 1.0)] * 10,
["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
```

I can get the regression coefficient out, but I can't get the regularization 
parameter

```
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])

In [4]: cvModel.bestModel.explainParams()
Out[4]: ''

In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}

In [15]: cvModel.params
Out[15]: []

In [36]: cvModel.bestModel.params
Out[36]: []
```

For a simple example please see 
http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark



  was:
If you tune hyperparameters using a CrossValidator object in PySpark, you may 
not be able to extract the parameter values of the best model.

`
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
[(Vectors.dense([0.0]), 0.0),
 (Vectors.dense([0.4]), 1.0),
 (Vectors.dense([0.5]), 0.0),
 (Vectors.dense([0.6]), 1.0),
 (Vectors.dense([1.0]), 1.0)] * 10,
["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
`

I can get the regression coefficient out, but I can't get the regularization 
parameter

`
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])

In [4]: cvModel.bestModel.explainParams()
Out[4]: ''

In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}

In [15]: cvModel.params
Out[15]: []

In [36]: cvModel.bestModel.params
Out[36]: []
`

For a simple example please see 
http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark




> CrossValidatorModel.bestModel does not include hyper-parameters
> ---
>
> Key: SPARK-14740
> URL: https://issues.apache.org/jira/browse/SPARK-14740
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Paul Shearer
>
> If you tune hyperparameters using a CrossValidator object in PySpark, you may 
> not be able to extract the parameter values of the best model.
> ```
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.mllib.linalg import Vectors
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
> dataset = sqlContext.createDataFrame(
> [(Vectors.dense([0.0]), 0.0),
>  (Vectors.dense([0.4]), 1.0),
>  (Vectors.dense([0.5]), 0.0),
>  (Vectors.dense([0.6]), 1.0),
>  (Vectors.dense([1.0]), 1.0)] * 10,
> ["features", "label"])
> lr = LogisticRegression()
> grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
> 0.0001]).build()
> evaluator = BinaryClassificationEvaluator()
> cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, 
> evaluator=evaluator)
> cvModel = cv.fit(dataset)
> ```
> I can get the regression coefficient out, but I can't get the regularization 
> parameter
> ```
> In [3]: cvModel.bestModel.coefficients
> Out[3]: DenseVector([3.1573])
> In [4]: cvModel.bestModel.explainParams()
> Out[4]: ''
> In [5]: cvModel.bestModel.extractParamMap()
> Out[5]: {}
> In [15]: cvModel.params
> Out[15]: []
> In [36]: cvModel.bestModel.params
> Out[36]: []
> ```
> For a simple example please see 
> http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-14740) CrossValidatorModel.bestModel does not include hyper-parameters

2016-04-19 Thread Paul Shearer (JIRA)
Paul Shearer created SPARK-14740:


 Summary: CrossValidatorModel.bestModel does not include 
hyper-parameters
 Key: SPARK-14740
 URL: https://issues.apache.org/jira/browse/SPARK-14740
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.1
Reporter: Paul Shearer


If you tune hyperparameters using a CrossValidator object in PySpark, you may 
not be able to extract the parameter values of the best model.

`
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
[(Vectors.dense([0.0]), 0.0),
 (Vectors.dense([0.4]), 1.0),
 (Vectors.dense([0.5]), 0.0),
 (Vectors.dense([0.6]), 1.0),
 (Vectors.dense([1.0]), 1.0)] * 10,
["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
`

I can get the regression coefficient out, but I can't get the regularization 
parameter

`
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])

In [4]: cvModel.bestModel.explainParams()
Out[4]: ''

In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}

In [15]: cvModel.params
Out[15]: []

In [36]: cvModel.bestModel.params
Out[36]: []
`

For a simple example please see 
http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14241) Output of monotonically_increasing_id lacks stable relation with rows of DataFrame

2016-03-29 Thread Paul Shearer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Shearer updated SPARK-14241:
-
Description: 
If you use monotonically_increasing_id() to append a column of IDs to a 
DataFrame, the IDs do not have a stable, deterministic relationship to the rows 
they are appended to. A given ID value can land on different rows depending on 
what happens in the task graph:

http://stackoverflow.com/questions/35705038/how-do-i-add-an-persistent-column-of-row-ids-to-spark-dataframe/35706321#35706321

From a user perspective this behavior is very unexpected, and many things one 
would normally like to do with an ID column are in fact only possible under 
very narrow circumstances. The function should either be made deterministic, 
or there should be a prominent warning note in the API docs regarding its 
behavior.
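
(Editorial aside, not part of the original report: until the semantics are 
clarified, one way to attach IDs that keep a fixed relationship to their rows 
is to go through the RDD API's zipWithIndex(), which numbers rows by partition 
and position rather than through the task graph. The sketch below assumes an 
existing DataFrame df and the usual sqlContext.)

{noformat}
# Editorial sketch under the assumptions stated above.
from pyspark.sql import Row

def with_row_ids(df):
    # Pair every Row with its index, then rebuild a DataFrame with a row_id column.
    indexed = df.rdd.zipWithIndex()
    return indexed.map(
        lambda pair: Row(row_id=pair[1], **pair[0].asDict())
    ).toDF()

# df_with_ids = with_row_ids(df)  # ids assigned by partition and position, not by the task graph
{noformat}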

  was:
If you use monotonically_increasing_id() to append a column of IDs to a 
DataFrame, the IDs do not have a stable, deterministic relationship to the rows 
they are appended to. A given ID value can land on different rows depending on 
what happens in the task graph:

http://stackoverflow.com/questions/35705038/how-do-i-add-an-persistent-column-of-row-ids-to-spark-dataframe/35706321#35706321

From a user perspective this behavior is very unexpected, and many things one 
would like to do with an ID column are only possible under very narrow 
circumstances. The function should either be made deterministic, or there 
should be a prominent warning note in the API docs regarding its behavior.


> Output of monotonically_increasing_id lacks stable relation with rows of 
> DataFrame
> --
>
> Key: SPARK-14241
> URL: https://issues.apache.org/jira/browse/SPARK-14241
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Paul Shearer
>
> If you use monotonically_increasing_id() to append a column of IDs to a 
> DataFrame, the IDs do not have a stable, deterministic relationship to the 
> rows they are appended to. A given ID value can land on different rows 
> depending on what happens in the task graph:
> http://stackoverflow.com/questions/35705038/how-do-i-add-an-persistent-column-of-row-ids-to-spark-dataframe/35706321#35706321
> From a user perspective this behavior is very unexpected, and many things one 
> would normally like to do with an ID column are in fact only possible under 
> very narrow circumstances. The function should either be made deterministic, 
> or there should be a prominent warning note in the API docs regarding its 
> behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14241) Output of monotonically_increasing_id lacks stable relation with rows of DataFrame

2016-03-29 Thread Paul Shearer (JIRA)
Paul Shearer created SPARK-14241:


 Summary: Output of monotonically_increasing_id lacks stable 
relation with rows of DataFrame
 Key: SPARK-14241
 URL: https://issues.apache.org/jira/browse/SPARK-14241
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 1.6.1, 1.6.0
Reporter: Paul Shearer


If you use monotonically_increasing_id() to append a column of IDs to a 
DataFrame, the IDs do not have a stable, deterministic relationship to the rows 
they are appended to. A given ID value can land on different rows depending on 
what happens in the task graph:

http://stackoverflow.com/questions/35705038/how-do-i-add-an-persistent-column-of-row-ids-to-spark-dataframe/35706321#35706321

From a user perspective this behavior is very unexpected, and many things one 
would like to do with an ID column are only possible under very narrow 
circumstances. The function should either be made deterministic, or there 
should be a prominent warning note in the API docs regarding its behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12824) Failure to maintain consistent RDD references in pyspark

2016-01-14 Thread Paul Shearer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Shearer updated SPARK-12824:
-
Description: 
Below is a simple `pyspark` script that tries to split an RDD into a dictionary 
containing several RDDs. 

As the **sample run** shows, the script only works if we do a `collect()` on 
the intermediate RDDs as they are created. Of course I would not want to do 
that in practice, since it doesn't scale.

What's really strange is, I'm not assigning the intermediate `collect()` 
results to any variable. So the difference in behavior is due solely to a 
hidden side-effect of the computation triggered by the `collect()` call. 

Spark is supposed to be a very functional framework with minimal side effects. 
Why is it only possible to get the desired behavior by triggering some 
mysterious side effect using `collect()`?

The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0.

spark_script.py
```
from pprint import PrettyPrinter
pp = PrettyPrinter(indent=4).pprint
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org"). setLevel( logger.Level.ERROR )
logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR )

def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False):
d = dict()
for key_value in key_values:
d[key_value] = rdd.filter(lambda row: row[key_field] == key_value)
if collect_in_loop:
d[key_value].collect()
return d
def print_results(d):
for k in d:
print k
pp(d[k].collect())

rdd = sc.parallelize([
{'color':'red','size':3},
{'color':'red', 'size':7},
{'color':'red', 'size':8},
{'color':'red', 'size':10},
{'color':'green', 'size':9},
{'color':'green', 'size':5},
{'color':'green', 'size':50},
{'color':'blue', 'size':4},
{'color':'purple', 'size':6}])
key_field = 'color'
key_values = ['red', 'green', 'blue', 'purple']

print '### run WITH collect in loop: '
d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=True)
print_results(d)
print '### run WITHOUT collect in loop: '
d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False)
print_results(d)
```

Sample run in IPython shell

```
In [1]: execfile('spark_script.py')
### run WITH collect in loop: 
blue
[{   'color': 'blue', 'size': 4}]
purple
[{   'color': 'purple', 'size': 6}]
green
[   {   'color': 'green', 'size': 9},
{   'color': 'green', 'size': 5},
{   'color': 'green', 'size': 50}]
red
[   {   'color': 'red', 'size': 3},
{   'color': 'red', 'size': 7},
{   'color': 'red', 'size': 8},
{   'color': 'red', 'size': 10}]
### run WITHOUT collect in loop: 
blue
[{   'color': 'purple', 'size': 6}]
purple
[{   'color': 'purple', 'size': 6}]
green
[{   'color': 'purple', 'size': 6}]
red
[{   'color': 'purple', 'size': 6}]
```

  was:
Below is a simple `pyspark` script that tries to split an RDD into a dictionary 
containing several RDDs. 

As the **sample run** shows, the script only works if we do a `collect()` on 
the intermediate RDDs as they are created. Of course I would not want to do 
that in practice, since it doesn't scale.

What's really strange is, I'm not assigning the intermediate `collect()` 
results to any variable. So the difference in behavior is due solely to a 
hidden side-effect of the computation triggered by the `collect()` call. 

Spark is supposed to be a very functional framework with minimal side effects. 
Why is it only possible to get the desired behavior by triggering some 
mysterious side effect using `collect()`?

The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0.

### spark_script.py

from pprint import PrettyPrinter
pp = PrettyPrinter(indent=4).pprint
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org"). setLevel( logger.Level.ERROR )
logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR )

def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False):
d = dict()
for key_value in key_values:
d[key_value] = rdd.filter(lambda row: row[key_field] == key_value)
if collect_in_loop:
d[key_value].collect()
return d
def print_results(d):
for k in d:
print k
pp(d[k].collect())

rdd = sc.parallelize([
{'color':'red','size':3},
{'color':'red', 'size':7},
{'color':'red', 'size':8},
{'color':'red', 'size':10},
{'color':'green', 'size':9},
{'color':'green', 'size':5},
{'color':'green', 'size':50},
{'color':'blue', 'size':4},

[jira] [Updated] (SPARK-12824) Failure to maintain consistent RDD references in pyspark

2016-01-14 Thread Paul Shearer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Shearer updated SPARK-12824:
-
Affects Version/s: 1.5.2

> Failure to maintain consistent RDD references in pyspark
> 
>
> Key: SPARK-12824
> URL: https://issues.apache.org/jira/browse/SPARK-12824
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
> Environment: Spark 1.5.2, Python 2.7.10, and IPython 4.0.0.
>Reporter: Paul Shearer
>
> Below is a simple {{pyspark}} script that tries to split an RDD into a 
> dictionary containing several RDDs. 
> As the *sample run* shows, the script only works if we do a {{collect()}} on 
> the intermediate RDDs as they are created. Of course I would not want to do 
> that in practice, since it doesn't scale.
> What's really strange is, I'm not assigning the intermediate {{collect()}} 
> results to any variable. So the difference in behavior is due solely to a 
> hidden side-effect of the computation triggered by the {{collect()}} call. 
> Spark is supposed to be a very functional framework with minimal side 
> effects. Why is it only possible to get the desired behavior by triggering 
> some mysterious side effect using {{collect()}}? 
> It seems that pyspark is not keeping consistent track of the filter 
> transformation applied to the RDD, so the object assigned to the dictionary 
> is always the same, even though the RDDs are supposed to be different.
> The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0.
> h3. spark_script.py
> {noformat}
> from pprint import PrettyPrinter
> pp = PrettyPrinter(indent=4).pprint
> logger = sc._jvm.org.apache.log4j
> logger.LogManager.getLogger("org"). setLevel( logger.Level.ERROR )
> logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR )
> 
> def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False):
> d = dict()
> for key_value in key_values:
> d[key_value] = rdd.filter(lambda row: row[key_field] == key_value)
> if collect_in_loop:
> d[key_value].collect()
> return d
> def print_results(d):
> for k in d:
> print k
> pp(d[k].collect())
> 
> rdd = sc.parallelize([
> {'color':'red','size':3},
> {'color':'red', 'size':7},
> {'color':'red', 'size':8},
> {'color':'red', 'size':10},
> {'color':'green', 'size':9},
> {'color':'green', 'size':5},
> {'color':'green', 'size':50},
> {'color':'blue', 'size':4},
> {'color':'purple', 'size':6}])
> key_field = 'color'
> key_values = ['red', 'green', 'blue', 'purple']
> 
> print '### run WITH collect in loop: '
> d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=True)
> print_results(d)
> print '### run WITHOUT collect in loop: '
> d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False)
> print_results(d)
> {noformat}
> h3. Sample run in IPython shell
> {noformat}
> In [1]: execfile('spark_script.py')
> ### run WITH collect in loop: 
> blue
> [{   'color': 'blue', 'size': 4}]
> purple
> [{   'color': 'purple', 'size': 6}]
> green
> [   {   'color': 'green', 'size': 9},
> {   'color': 'green', 'size': 5},
> {   'color': 'green', 'size': 50}]
> red
> [   {   'color': 'red', 'size': 3},
> {   'color': 'red', 'size': 7},
> {   'color': 'red', 'size': 8},
> {   'color': 'red', 'size': 10}]
> ### run WITHOUT collect in loop: 
> blue
> [{   'color': 'purple', 'size': 6}]
> purple
> [{   'color': 'purple', 'size': 6}]
> green
> [{   'color': 'purple', 'size': 6}]
> red
> [{   'color': 'purple', 'size': 6}]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12824) Failure to maintain consistent RDD references in pyspark

2016-01-14 Thread Paul Shearer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Shearer updated SPARK-12824:
-
Description: 
Below is a simple {{pyspark}} script that tries to split an RDD into a 
dictionary containing several RDDs. 

As the *sample run* shows, the script only works if we do a {{collect()}} on 
the intermediate RDDs as they are created. Of course I would not want to do 
that in practice, since it doesn't scale.

What's really strange is, I'm not assigning the intermediate {{collect()}} 
results to any variable. So the difference in behavior is due solely to a 
hidden side-effect of the computation triggered by the {{collect()}} call. 

Spark is supposed to be a very functional framework with minimal side effects. 
Why is it only possible to get the desired behavior by triggering some 
mysterious side effect using {{collect()}}?

The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0.

h3. spark_script.py

{noformat}
from pprint import PrettyPrinter
pp = PrettyPrinter(indent=4).pprint
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org"). setLevel( logger.Level.ERROR )
logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR )

def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False):
d = dict()
for key_value in key_values:
d[key_value] = rdd.filter(lambda row: row[key_field] == key_value)
if collect_in_loop:
d[key_value].collect()
return d
def print_results(d):
for k in d:
print k
pp(d[k].collect())

rdd = sc.parallelize([
{'color':'red','size':3},
{'color':'red', 'size':7},
{'color':'red', 'size':8},
{'color':'red', 'size':10},
{'color':'green', 'size':9},
{'color':'green', 'size':5},
{'color':'green', 'size':50},
{'color':'blue', 'size':4},
{'color':'purple', 'size':6}])
key_field = 'color'
key_values = ['red', 'green', 'blue', 'purple']

print '### run WITH collect in loop: '
d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=True)
print_results(d)
print '### run WITHOUT collect in loop: '
d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False)
print_results(d)
{noformat}

h3. Sample run in IPython shell

{noformat}
In [1]: execfile('spark_script.py')
### run WITH collect in loop: 
blue
[{   'color': 'blue', 'size': 4}]
purple
[{   'color': 'purple', 'size': 6}]
green
[   {   'color': 'green', 'size': 9},
{   'color': 'green', 'size': 5},
{   'color': 'green', 'size': 50}]
red
[   {   'color': 'red', 'size': 3},
{   'color': 'red', 'size': 7},
{   'color': 'red', 'size': 8},
{   'color': 'red', 'size': 10}]
### run WITHOUT collect in loop: 
blue
[{   'color': 'purple', 'size': 6}]
purple
[{   'color': 'purple', 'size': 6}]
green
[{   'color': 'purple', 'size': 6}]
red
[{   'color': 'purple', 'size': 6}]
{noformat}

  was:
Below is a simple `pyspark` script that tries to split an RDD into a dictionary 
containing several RDDs. 

As the **sample run** shows, the script only works if we do a `collect()` on 
the intermediate RDDs as they are created. Of course I would not want to do 
that in practice, since it doesn't scale.

What's really strange is, I'm not assigning the intermediate `collect()` 
results to any variable. So the difference in behavior is due solely to a 
hidden side-effect of the computation triggered by the `collect()` call. 

Spark is supposed to be a very functional framework with minimal side effects. 
Why is it only possible to get the desired behavior by triggering some 
mysterious side effect using `collect()`?

The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0.

spark_script.py
{code}
from pprint import PrettyPrinter
pp = PrettyPrinter(indent=4).pprint
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org"). setLevel( logger.Level.ERROR )
logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR )

def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False):
d = dict()
for key_value in key_values:
d[key_value] = rdd.filter(lambda row: row[key_field] == key_value)
if collect_in_loop:
d[key_value].collect()
return d
def print_results(d):
for k in d:
print k
pp(d[k].collect())

rdd = sc.parallelize([
{'color':'red','size':3},
{'color':'red', 'size':7},
{'color':'red', 'size':8},
{'color':'red', 'size':10},
{'color':'green', 'size':9},
{'color':'green', 'size':5},
{'color':'green', 

[jira] [Created] (SPARK-12824) Failure to maintain consistent RDD references in pyspark

2016-01-14 Thread Paul Shearer (JIRA)
Paul Shearer created SPARK-12824:


 Summary: Failure to maintain consistent RDD references in pyspark
 Key: SPARK-12824
 URL: https://issues.apache.org/jira/browse/SPARK-12824
 Project: Spark
  Issue Type: Bug
 Environment: Spark 1.5.2, Python 2.7.10, and IPython 4.0.0.

Reporter: Paul Shearer


Below is a simple `pyspark` script that tries to split an RDD into a dictionary 
containing several RDDs. 

As the **sample run** shows, the script only works if we do a `collect()` on 
the intermediate RDDs as they are created. Of course I would not want to do 
that in practice, since it doesn't scale.

What's really strange is, I'm not assigning the intermediate `collect()` 
results to any variable. So the difference in behavior is due solely to a 
hidden side-effect of the computation triggered by the `collect()` call. 

Spark is supposed to be a very functional framework with minimal side effects. 
Why is it only possible to get the desired behavior by triggering some 
mysterious side effect using `collect()`?

The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0.

### spark_script.py

from pprint import PrettyPrinter
pp = PrettyPrinter(indent=4).pprint
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org"). setLevel( logger.Level.ERROR )
logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR )

def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False):
d = dict()
for key_value in key_values:
d[key_value] = rdd.filter(lambda row: row[key_field] == key_value)
if collect_in_loop:
d[key_value].collect()
return d
def print_results(d):
for k in d:
print k
pp(d[k].collect())

rdd = sc.parallelize([
{'color':'red','size':3},
{'color':'red', 'size':7},
{'color':'red', 'size':8},
{'color':'red', 'size':10},
{'color':'green', 'size':9},
{'color':'green', 'size':5},
{'color':'green', 'size':50},
{'color':'blue', 'size':4},
{'color':'purple', 'size':6}])
key_field = 'color'
key_values = ['red', 'green', 'blue', 'purple']

print '### run WITH collect in loop: '
d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=True)
print_results(d)
print '### run WITHOUT collect in loop: '
d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False)
print_results(d)

### Sample run in IPython shell

In [1]: execfile('spark_script.py')
### run WITH collect in loop: 
blue
[{   'color': 'blue', 'size': 4}]
purple
[{   'color': 'purple', 'size': 6}]
green
[   {   'color': 'green', 'size': 9},
{   'color': 'green', 'size': 5},
{   'color': 'green', 'size': 50}]
red
[   {   'color': 'red', 'size': 3},
{   'color': 'red', 'size': 7},
{   'color': 'red', 'size': 8},
{   'color': 'red', 'size': 10}]
### run WITHOUT collect in loop: 
blue
[{   'color': 'purple', 'size': 6}]
purple
[{   'color': 'purple', 'size': 6}]
green
[{   'color': 'purple', 'size': 6}]
red
[{   'color': 'purple', 'size': 6}]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12824) Failure to maintain consistent RDD references in pyspark

2016-01-14 Thread Paul Shearer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Shearer updated SPARK-12824:
-
Description: 
Below is a simple {{pyspark}} script that tries to split an RDD into a 
dictionary containing several RDDs. 

As the *sample run* shows, the script only works if we do a {{collect()}} on 
the intermediate RDDs as they are created. Of course I would not want to do 
that in practice, since it doesn't scale.

What's really strange is, I'm not assigning the intermediate {{collect()}} 
results to any variable. So the difference in behavior is due solely to a 
hidden side-effect of the computation triggered by the {{collect()}} call. 

Spark is supposed to be a very functional framework with minimal side effects. 
Why is it only possible to get the desired behavior by triggering some 
mysterious side effect using {{collect()}}? 

It seems that pyspark is not keeping consistent track of the filter 
transformation applied to the RDD, so the object assigned to the dictionary is 
always the same, even though the RDDs are supposed to be different.

The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0.

h3. spark_script.py

{noformat}
from pprint import PrettyPrinter
pp = PrettyPrinter(indent=4).pprint
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org"). setLevel( logger.Level.ERROR )
logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR )

def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False):
d = dict()
for key_value in key_values:
d[key_value] = rdd.filter(lambda row: row[key_field] == key_value)
if collect_in_loop:
d[key_value].collect()
return d
def print_results(d):
for k in d:
print k
pp(d[k].collect())

rdd = sc.parallelize([
{'color':'red','size':3},
{'color':'red', 'size':7},
{'color':'red', 'size':8},
{'color':'red', 'size':10},
{'color':'green', 'size':9},
{'color':'green', 'size':5},
{'color':'green', 'size':50},
{'color':'blue', 'size':4},
{'color':'purple', 'size':6}])
key_field = 'color'
key_values = ['red', 'green', 'blue', 'purple']

print '### run WITH collect in loop: '
d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=True)
print_results(d)
print '### run WITHOUT collect in loop: '
d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False)
print_results(d)
{noformat}

h3. Sample run in IPython shell

{noformat}
In [1]: execfile('spark_script.py')
### run WITH collect in loop: 
blue
[{   'color': 'blue', 'size': 4}]
purple
[{   'color': 'purple', 'size': 6}]
green
[   {   'color': 'green', 'size': 9},
{   'color': 'green', 'size': 5},
{   'color': 'green', 'size': 50}]
red
[   {   'color': 'red', 'size': 3},
{   'color': 'red', 'size': 7},
{   'color': 'red', 'size': 8},
{   'color': 'red', 'size': 10}]
### run WITHOUT collect in loop: 
blue
[{   'color': 'purple', 'size': 6}]
purple
[{   'color': 'purple', 'size': 6}]
green
[{   'color': 'purple', 'size': 6}]
red
[{   'color': 'purple', 'size': 6}]
{noformat}

  was:
Below is a simple {{pyspark}} script that tries to split an RDD into a 
dictionary containing several RDDs. 

As the *sample run* shows, the script only works if we do a {{collect()}} on 
the intermediate RDDs as they are created. Of course I would not want to do 
that in practice, since it doesn't scale.

What's really strange is, I'm not assigning the intermediate {{collect()}} 
results to any variable. So the difference in behavior is due solely to a 
hidden side-effect of the computation triggered by the {{collect()}} call. 

Spark is supposed to be a very functional framework with minimal side effects. 
Why is it only possible to get the desired behavior by triggering some 
mysterious side effect using {{collect()}}?

The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0.

h3. spark_script.py

{noformat}
from pprint import PrettyPrinter
pp = PrettyPrinter(indent=4).pprint
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org"). setLevel( logger.Level.ERROR )
logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR )

def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False):
d = dict()
for key_value in key_values:
d[key_value] = rdd.filter(lambda row: row[key_field] == key_value)
if collect_in_loop:
d[key_value].collect()
return d
def print_results(d):
for k in d:
print k
pp(d[k].collect())

rdd = sc.parallelize([

[jira] [Updated] (SPARK-12824) Failure to maintain consistent RDD references in pyspark

2016-01-14 Thread Paul Shearer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Shearer updated SPARK-12824:
-
Description: 
Below is a simple `pyspark` script that tries to split an RDD into a dictionary 
containing several RDDs. 

As the **sample run** shows, the script only works if we do a `collect()` on 
the intermediate RDDs as they are created. Of course I would not want to do 
that in practice, since it doesn't scale.

What's really strange is, I'm not assigning the intermediate `collect()` 
results to any variable. So the difference in behavior is due solely to a 
hidden side-effect of the computation triggered by the `collect()` call. 

Spark is supposed to be a very functional framework with minimal side effects. 
Why is it only possible to get the desired behavior by triggering some 
mysterious side effect using `collect()`?

The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0.

h3. spark_script.py

{noformat}
from pprint import PrettyPrinter
pp = PrettyPrinter(indent=4).pprint
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org"). setLevel( logger.Level.ERROR )
logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR )

def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False):
d = dict()
for key_value in key_values:
d[key_value] = rdd.filter(lambda row: row[key_field] == key_value)
if collect_in_loop:
d[key_value].collect()
return d
def print_results(d):
for k in d:
print k
pp(d[k].collect())

rdd = sc.parallelize([
{'color':'red','size':3},
{'color':'red', 'size':7},
{'color':'red', 'size':8},
{'color':'red', 'size':10},
{'color':'green', 'size':9},
{'color':'green', 'size':5},
{'color':'green', 'size':50},
{'color':'blue', 'size':4},
{'color':'purple', 'size':6}])
key_field = 'color'
key_values = ['red', 'green', 'blue', 'purple']

print '### run WITH collect in loop: '
d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=True)
print_results(d)
print '### run WITHOUT collect in loop: '
d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False)
print_results(d)
{noformat}

h3. Sample run in IPython shell

{noformat}
In [1]: execfile('spark_script.py')
### run WITH collect in loop: 
blue
[{   'color': 'blue', 'size': 4}]
purple
[{   'color': 'purple', 'size': 6}]
green
[   {   'color': 'green', 'size': 9},
{   'color': 'green', 'size': 5},
{   'color': 'green', 'size': 50}]
red
[   {   'color': 'red', 'size': 3},
{   'color': 'red', 'size': 7},
{   'color': 'red', 'size': 8},
{   'color': 'red', 'size': 10}]
### run WITHOUT collect in loop: 
blue
[{   'color': 'purple', 'size': 6}]
purple
[{   'color': 'purple', 'size': 6}]
green
[{   'color': 'purple', 'size': 6}]
red
[{   'color': 'purple', 'size': 6}]
{noformat}

  was:
Below is a simple `pyspark` script that tries to split an RDD into a dictionary 
containing several RDDs. 

As the **sample run** shows, the script only works if we do a `collect()` on 
the intermediate RDDs as they are created. Of course I would not want to do 
that in practice, since it doesn't scale.

What's really strange is, I'm not assigning the intermediate `collect()` 
results to any variable. So the difference in behavior is due solely to a 
hidden side-effect of the computation triggered by the `collect()` call. 

Spark is supposed to be a very functional framework with minimal side effects. 
Why is it only possible to get the desired behavior by triggering some 
mysterious side effect using `collect()`?

The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0.

spark_script.py
```
from pprint import PrettyPrinter
pp = PrettyPrinter(indent=4).pprint
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org"). setLevel( logger.Level.ERROR )
logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR )

def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False):
d = dict()
for key_value in key_values:
d[key_value] = rdd.filter(lambda row: row[key_field] == key_value)
if collect_in_loop:
d[key_value].collect()
return d
def print_results(d):
for k in d:
print k
pp(d[k].collect())

rdd = sc.parallelize([
{'color':'red','size':3},
{'color':'red', 'size':7},
{'color':'red', 'size':8},
{'color':'red', 'size':10},
{'color':'green', 'size':9},
{'color':'green', 'size':5},
{'color':'green', 'size':50},

[jira] [Updated] (SPARK-12824) Failure to maintain consistent RDD references in pyspark

2016-01-14 Thread Paul Shearer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Shearer updated SPARK-12824:
-
Description: 
Below is a simple {{pyspark}} script that tries to split an RDD into a 
dictionary containing several RDDs. 

As the *sample run* shows, the script only works if we do a {{collect()}} on 
the intermediate RDDs as they are created. Of course I would not want to do 
that in practice, since it doesn't scale.

What's really strange is, I'm not assigning the intermediate {{collect()}} 
results to any variable. So the difference in behavior is due solely to a 
hidden side-effect of the computation triggered by the {{collect()}} call. 

Spark is supposed to be a very functional framework with minimal side effects. 
Why is it only possible to get the desired behavior by triggering some 
mysterious side effect using {{collect()}}? 

It seems that all the keys in the dictionary are referencing the same object 
even though in the code they are clearly supposed to be different objects.

The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0.

h3. spark_script.py

{noformat}
from pprint import PrettyPrinter
pp = PrettyPrinter(indent=4).pprint
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org"). setLevel( logger.Level.ERROR )
logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR )

def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False):
d = dict()
for key_value in key_values:
d[key_value] = rdd.filter(lambda row: row[key_field] == key_value)
if collect_in_loop:
d[key_value].collect()
return d
def print_results(d):
for k in d:
print k
pp(d[k].collect())

rdd = sc.parallelize([
{'color':'red','size':3},
{'color':'red', 'size':7},
{'color':'red', 'size':8},
{'color':'red', 'size':10},
{'color':'green', 'size':9},
{'color':'green', 'size':5},
{'color':'green', 'size':50},
{'color':'blue', 'size':4},
{'color':'purple', 'size':6}])
key_field = 'color'
key_values = ['red', 'green', 'blue', 'purple']

print '### run WITH collect in loop: '
d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=True)
print_results(d)
print '### run WITHOUT collect in loop: '
d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False)
print_results(d)
{noformat}

h3. Sample run in IPython shell

{noformat}
In [1]: execfile('spark_script.py')
### run WITH collect in loop: 
blue
[{   'color': 'blue', 'size': 4}]
purple
[{   'color': 'purple', 'size': 6}]
green
[   {   'color': 'green', 'size': 9},
{   'color': 'green', 'size': 5},
{   'color': 'green', 'size': 50}]
red
[   {   'color': 'red', 'size': 3},
{   'color': 'red', 'size': 7},
{   'color': 'red', 'size': 8},
{   'color': 'red', 'size': 10}]
### run WITHOUT collect in loop: 
blue
[{   'color': 'purple', 'size': 6}]
purple
[{   'color': 'purple', 'size': 6}]
green
[{   'color': 'purple', 'size': 6}]
red
[{   'color': 'purple', 'size': 6}]
{noformat}

  was:
Below is a simple {{pyspark}} script that tries to split an RDD into a 
dictionary containing several RDDs. 

As the *sample run* shows, the script only works if we do a {{collect()}} on 
the intermediate RDDs as they are created. Of course I would not want to do 
that in practice, since it doesn't scale.

What's really strange is, I'm not assigning the intermediate {{collect()}} 
results to any variable. So the difference in behavior is due solely to a 
hidden side-effect of the computation triggered by the {{collect()}} call. 

Spark is supposed to be a very functional framework with minimal side effects. 
Why is it only possible to get the desired behavior by triggering some 
mysterious side effect using {{collect()}}? 

It seems that pyspark is not keeping consistent track of the filter 
transformation applied to the RDD, so the object assigned to the dictionary is 
always the same, even though the RDDs are supposed to be different.

The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0.

h3. spark_script.py

{noformat}
from pprint import PrettyPrinter
pp = PrettyPrinter(indent=4).pprint
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org"). setLevel( logger.Level.ERROR )
logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR )

def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False):
d = dict()
for key_value in key_values:
d[key_value] = rdd.filter(lambda row: row[key_field] == key_value)
if collect_in_loop:
d[key_value].collect()

[jira] [Resolved] (SPARK-12824) Failure to maintain consistent RDD references in pyspark

2016-01-14 Thread Paul Shearer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Shearer resolved SPARK-12824.
--
Resolution: Not A Problem

This is not so much a Spark issue as a Python gotcha. The lambda closes over 
key_value, and that name is looked up late, when the RDD is actually evaluated, 
not when the closure is defined. If the collect() happens inside the loop, the 
lookup occurs at the "right" time; if not, it occurs only after the loop has 
finished, by which point key_value holds the last value in the list ('purple'), 
so every filter sees the same key.

A quick hack to force early binding is to add a default argument:

{noformat}
lambda row, key_value=key_value: row[key_field] == key_value
{noformat}

The other way is to bind the value with functools.partial (this second approach 
is per Nick Chammas); a sketch of both fixes follows.
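
A sketch of both fixes applied to the split_RDD_by_key loop from the report 
(editorial illustration; it assumes the rdd, key_field and key_values variables 
from the script above):

{noformat}
import functools

def split_RDD_by_key_default_arg(rdd, key_field, key_values):
    d = dict()
    for key_value in key_values:
        # The default argument freezes the current key_value when the lambda is defined.
        d[key_value] = rdd.filter(
            lambda row, key_value=key_value: row[key_field] == key_value)
    return d

def _row_matches(key_field, key_value, row):
    return row[key_field] == key_value

def split_RDD_by_key_partial(rdd, key_field, key_values):
    d = dict()
    for key_value in key_values:
        # functools.partial also binds key_field and key_value eagerly, at loop time.
        d[key_value] = rdd.filter(
            functools.partial(_row_matches, key_field, key_value))
    return d
{noformat}

With either variant the run without collect() in the loop should print the same 
per-color groups as the run with it.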

> Failure to maintain consistent RDD references in pyspark
> 
>
> Key: SPARK-12824
> URL: https://issues.apache.org/jira/browse/SPARK-12824
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
> Environment: Spark 1.5.2, Python 2.7.10, and IPython 4.0.0.
>Reporter: Paul Shearer
>
> Below is a simple {{pyspark}} script that tries to split an RDD into a 
> dictionary containing several RDDs. 
> As the *sample run* shows, the script only works if we do a {{collect()}} on 
> the intermediate RDDs as they are created. Of course I would not want to do 
> that in practice, since it doesn't scale.
> What's really strange is, I'm not assigning the intermediate {{collect()}} 
> results to any variable. So the difference in behavior is due solely to a 
> hidden side-effect of the computation triggered by the {{collect()}} call. 
> Spark is supposed to be a very functional framework with minimal side 
> effects. Why is it only possible to get the desired behavior by triggering 
> some mysterious side effect using {{collect()}}? 
> It seems that all the keys in the dictionary are referencing the same object 
> even though in the code they are clearly supposed to be different objects.
> The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0.
> h3. spark_script.py
> {noformat}
> from pprint import PrettyPrinter
> pp = PrettyPrinter(indent=4).pprint
> logger = sc._jvm.org.apache.log4j
> logger.LogManager.getLogger("org"). setLevel( logger.Level.ERROR )
> logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR )
> 
> def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False):
> d = dict()
> for key_value in key_values:
> d[key_value] = rdd.filter(lambda row: row[key_field] == key_value)
> if collect_in_loop:
> d[key_value].collect()
> return d
> def print_results(d):
> for k in d:
> print k
> pp(d[k].collect())
> 
> rdd = sc.parallelize([
> {'color':'red','size':3},
> {'color':'red', 'size':7},
> {'color':'red', 'size':8},
> {'color':'red', 'size':10},
> {'color':'green', 'size':9},
> {'color':'green', 'size':5},
> {'color':'green', 'size':50},
> {'color':'blue', 'size':4},
> {'color':'purple', 'size':6}])
> key_field = 'color'
> key_values = ['red', 'green', 'blue', 'purple']
> 
> print '### run WITH collect in loop: '
> d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=True)
> print_results(d)
> print '### run WITHOUT collect in loop: '
> d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False)
> print_results(d)
> {noformat}
> h3. Sample run in IPython shell
> {noformat}
> In [1]: execfile('spark_script.py')
> ### run WITH collect in loop: 
> blue
> [{   'color': 'blue', 'size': 4}]
> purple
> [{   'color': 'purple', 'size': 6}]
> green
> [   {   'color': 'green', 'size': 9},
> {   'color': 'green', 'size': 5},
> {   'color': 'green', 'size': 50}]
> red
> [   {   'color': 'red', 'size': 3},
> {   'color': 'red', 'size': 7},
> {   'color': 'red', 'size': 8},
> {   'color': 'red', 'size': 10}]
> ### run WITHOUT collect in loop: 
> blue
> [{   'color': 'purple', 'size': 6}]
> purple
> [{   'color': 'purple', 'size': 6}]
> green
> [{   'color': 'purple', 'size': 6}]
> red
> [{   'color': 'purple', 'size': 6}]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12519) "Managed memory leak detected" when using distinct on PySpark DataFrame

2015-12-24 Thread Paul Shearer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Shearer updated SPARK-12519:
-
Description: 
After running the distinct() method to transform a DataFrame, subsequent 
actions like count() and show() may report a managed memory leak.

h1. Script
{noformat}
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org"). setLevel( logger.Level.WARN )
logger.LogManager.getLogger("akka").setLevel( logger.Level.WARN )

import string
import random
def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
return ''.join(random.choice(chars) for _ in range(size))

nrow   = 8
ncol   = 20
ndrow  = 4 # number distinct rows
tmp = [id_generator() for i in xrange(ndrow*ncol)]
tmp = [tuple(tmp[ncol*(i % ndrow)+0:ncol*(i % ndrow)+ncol]) for i in 
xrange(nrow)] 

dat = sc.parallelize(tmp,1000).toDF()
dat = dat.distinct()
ct=dat.count()
print ct  
# memory leak warning prints at this point in the code
dat.show()  
{noformat}

h1. Output

When this script is run in PySpark (with the IPython kernel and 12G of driver 
and executor memory), I get this error:
{noformat}
$ pyspark --executor-memory 12G --driver-memory 12G
Python 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015, 09:33:12) 
Type "copyright", "credits" or "license" for more information.

IPython 4.0.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help  -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/12/24 09:31:52 INFO SparkContext: Running Spark version 1.5.2
15/12/24 09:31:52 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
15/12/24 09:31:52 INFO SecurityManager: Changing view acls to: mm86267
15/12/24 09:31:52 INFO SecurityManager: Changing modify acls to: mm86267
15/12/24 09:31:52 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(mm86267); users 
with modify permissions: Set(mm86267)
15/12/24 09:31:53 INFO Slf4jLogger: Slf4jLogger started
15/12/24 09:31:53 INFO Remoting: Starting remoting
15/12/24 09:31:53 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sparkDriver@192.168.1.8:49487]
15/12/24 09:31:53 INFO Utils: Successfully started service 'sparkDriver' on 
port 49487.
15/12/24 09:31:53 INFO SparkEnv: Registering MapOutputTracker
15/12/24 09:31:53 INFO SparkEnv: Registering BlockManagerMaster
15/12/24 09:31:53 INFO DiskBlockManager: Created local directory at 
/private/var/folders/yf/56fv4x6j089bnr32mxhf83g0gr/T/blockmgr-b22b2b7b-b3e1-4f33-b1a7-7343a9b88d7d
15/12/24 09:31:53 INFO MemoryStore: MemoryStore started with capacity 6.2 GB
15/12/24 09:31:53 INFO HttpFileServer: HTTP File server directory is 
/private/var/folders/yf/56fv4x6j089bnr32mxhf83g0gr/T/spark-55616fd4-f43b-411d-b578-9335e87ece86/httpd-82c2c6b2-102c-4cde-ae57-c01a3e38df06
15/12/24 09:31:53 INFO HttpServer: Starting HTTP Server
15/12/24 09:31:53 INFO Server: jetty-8.y.z-SNAPSHOT
15/12/24 09:31:53 INFO AbstractConnector: Started SocketConnector@0.0.0.0:49488
15/12/24 09:31:53 INFO Utils: Successfully started service 'HTTP file server' 
on port 49488.
15/12/24 09:31:53 INFO SparkEnv: Registering OutputCommitCoordinator
15/12/24 09:31:53 INFO Server: jetty-8.y.z-SNAPSHOT
15/12/24 09:31:53 INFO AbstractConnector: Started 
SelectChannelConnector@0.0.0.0:4040
15/12/24 09:31:53 INFO Utils: Successfully started service 'SparkUI' on port 
4040.
15/12/24 09:31:53 INFO SparkUI: Started SparkUI at http://192.168.1.8:4040
15/12/24 09:31:53 WARN MetricsSystem: Using default name DAGScheduler for 
source because spark.app.id is not set.
15/12/24 09:31:53 INFO Executor: Starting executor ID driver on host localhost
15/12/24 09:31:53 INFO Utils: Successfully started service 
'org.apache.spark.network.netty.NettyBlockTransferService' on port 49489.
15/12/24 09:31:53 INFO NettyBlockTransferService: Server created on 49489
15/12/24 09:31:53 INFO BlockManagerMaster: Trying to register BlockManager
15/12/24 09:31:53 INFO BlockManagerMasterEndpoint: Registering block manager 
localhost:49489 with 6.2 GB RAM, BlockManagerId(driver, localhost, 49489)
15/12/24 09:31:53 INFO BlockManagerMaster: Registered BlockManager
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Python version 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015 09:33:12)
SparkContext available as sc, SQLContext available as sqlContext.

In [1]: execfile('bugtest.py')
4
15/12/24 09:33:14 ERROR Executor: Managed memory leak detected; size = 16777216 
bytes, TID = 2202

[jira] [Updated] (SPARK-12519) "Managed memory leak detected" when using distinct on PySpark DataFrame

2015-12-24 Thread Paul Shearer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Shearer updated SPARK-12519:
-
Description: 
After running the distinct() method to transform a DataFrame, subsequent 
actions like count() and show() may report a managed memory leak. Here is an 
MRE (minimal reproducible example) that produces the bug on my machine:

h1. Script
{noformat}
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org"). setLevel( logger.Level.WARN )
logger.LogManager.getLogger("akka").setLevel( logger.Level.WARN )

import string
import random
def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
return ''.join(random.choice(chars) for _ in range(size))

nrow   = 8
ncol   = 20
ndrow  = 4 # number distinct rows
tmp = [id_generator() for i in xrange(ndrow*ncol)]
tmp = [tuple(tmp[ncol*(i % ndrow)+0:ncol*(i % ndrow)+ncol]) for i in 
xrange(nrow)] 

dat = sc.parallelize(tmp,1000).toDF()
dat = dat.distinct()
ct = dat.count()
print ct  
# memory leak warning prints at this point in the code
dat.show()  
{noformat}

h1. Output

When this script is run in PySpark (with IPython kernel), I get this error:
{noformat}
$ pyspark --executor-memory 12G --driver-memory 12G
Python 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015, 09:33:12) 
Type "copyright", "credits" or "license" for more information.

IPython 4.0.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help  -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
<<<... usual loading info...>>>
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Python version 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015 09:33:12)
SparkContext available as sc, SQLContext available as sqlContext.

In [1]: execfile('bugtest.py')
4
15/12/24 09:33:14 ERROR Executor: Managed memory leak detected; size = 16777216 
bytes, TID = 2202
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|_1|_2|_3|_4|_5|_6|_7|_8|_9|   _10|   _11|  
 _12|   _13|   _14|   _15|   _16|   _17|   _18|   _19|   _20|
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|83I981|09B1ZK|J5UB1A|BPYI80|7JTIMU|HVPQVY|XS4YM2|6N4YO3|AB9GQZ|92RCHR|1N46EU|THPZFH|5IXNR1|KL4LGD|B0S50O|DZH5QP|FKTHHF|MLOCTD|ZVV5BY|D76KRK|
|BNLSVC|CYYYMD|W6ZXF6|Z0QXDT|4JPRX6|YSXIBK|WCB6YD|C86MPS|ZRA42Z|8W8GX8|2DW3AA|ZZ1U0O|EVXX3L|683UOL|5M6TOZ|PI4QX8|6V7SOS|THQVVJ|0ULB14|DJ2LP5|
|IZYG7Q|Q0NCUG|0FSTPN|UVT8Y6|TBAEF6|5CGN50|WNGOSB|NX2Y8R|XWPW7Y|WPTLIV|NPF00K|92YSNO|FP50AU|CW0K3K|8ULT74|SZM6HK|4XPQU9|L109TB|02X1UC|TV8BLZ|
|S7AWK6|7DQ8JP|YSIHVQ|1NKN5G|UOD1TN|ZSL6K4|86SDUW|NHLK9P|Z2ZBFL|QTOA89|D6D1NK|UXUJMG|B0A0ZF|94HB2S|HGLX19|VCVF05|HMAXNE|Y265LD|DHNR78|9L23XR|
|U6JCLP|PKJEOB|66C408|HNAUQK|1Q9O2X|NFW976|YLAXD4|0XC334|NMKW62|W297XR|WL9KMG|8K1P94|T5P7LP|WAQ7PT|Q5JYG0|2A9H44|9DOW5P|9SOPFH|M0NNK5|W877FV|
|3M39A1|K97EL6|7JFM9G|23I3JT|FIS25Z|HIY6VN|2ORNRG|MTGYMT|32IEH8|RX41EH|EJSSKX|H6QY8J|8G0R0H|AAPYPI|HDEVZ4|WP3VCW|2KNQZ0|U8V254|61T6SH|VJJP4L|
|XT3CON|WG8XST|KKJ67T|5RBQB0|OC4LJT|GYSIBI|XGVGUP|8RND4A|38CY23|W3Q26Z|K0ARWU|FLA3O7|I3DGN7|IY080I|HAQW3T|EQDQHD|1Z8E3X|I0J5WN|P4B6IO|1S23KL|
|4GMPF8|FFZLKK|Y4UW1Q|AF5J2H|VQ32TO|VMU7PG|WS66ZH|VXSYVK|S0GVCY|OL5I4Q|LFB98K|BCQVZK|XW03W6|F5YGTS|NTYCKZ|JTJ5YY|DR0VSC|KIUJMN|HCPYS4|QG9WYL|
|USOIHJ|HPGNXC|DIGTPY|BL0QZ4|2957GI|8A7EC5|GOMEFU|568QPG|6EA6Z2|W7P0Z8|TSP1BF|XXYS8Q|TMN7OA|3ZL2R4|7W1856|DS3LHW|QH32TF|3Y7XPC|EUO5O6|95CIMH|
|0CQR4E|ZV8SYE|UZNOLC|19JG2Q|G4RJVC|D2YUGB|HUKQUK|T0HSQH|9K0B9T|EHVBJF|R07A6C|6LS1FL|1NWXKJ|X7TIWZ|MPVWCO|RSO4F9|J5DQG6|AGMXTS|MFFRMX|PEDHGU|
|1LQTDY|JV1HJY|7OH9HL|0AUWC7|LQFF5G|DUK4GW|HU6VLJ|PHY36G|BLMOYU|PY7E64|Y6XHYS|3IA38F|RF4LQ5|PIXEM2|0H5GIW|6V3M9C|0VBIUC|U4ZWRH|68M496|6UUVWZ|
|B7A7TT|9K5MRI|8CJWX2|YUZ8SY|JLB0MX|3JNIN6|PJP0S5|9W7N5C|LIJSXB|488P8Y|PHWN5N|E6TF76|FGYZQ2|MGDN65|YNLUJE|5D6455|JI4J2K|C3J8K8|BTJ131|D5C7CD|
|G9AKQ5|UPEQDN|JAWFI2|I0EKX2|YG8TN4|8NNJBO|X3GMYR|RXG2RX|CRS9US|53VX2Q|S72E08|H5PR14|JRDDMT|Q8G6PR|KOJA0W|1U4AX8|844N9D|SKN5F7|H0C29Z|7U7GHH|
|A10ZUQ|HEI32J|VP99PD|44UP47|4W5BPO|X0QE8Q|H3UQVM|47VU9U|3AUPR8|TCGT7L|65WLUU|6PX6IW|5NCTC7|ES2S38|T86EI9|G20RFI|SX2V3V|5XT724|HV8HVS|T3JYJD|
|3USH3X|NHXB4D|QPL3QC|8CN92J|MJF9JZ|DFA2IV|XT7C4S|CUB4IJ|4BD3OR|T3EK2S|V81146|LWXTMJ|PCVJ5N|R8H3H6|0W5DLU|GVAO4D|I7SNKJ|6TLMAV|E57PMA|OGCVQM|
|FZEDSN|WO4JEN|000HBA|HA2GAN|5ROPXM|5K6NUG|2HWCJ0|OPX5AT|6PT5ZV|HGB74S|FCQT9S|NNODZP|G0ZMSJ|SHIFDQ|MYSHAT|KZDNA4|M25MPR|4XD9J9|JBFZZ0|XLIE31|
|6UPQWB|BEV7YK|Q0JHJG|4C91TM|7HBK81|RIKZ9D|ZK96YJ|O4KZ48|GUYUHO|GYYO8P|4O1QUM|74I38Z|CAXQDE|URVY7R|PQ4WM4|4QOQ81|4PPV8B|SWKFCD|S8TC2W|QTJIS6|

[jira] [Updated] (SPARK-12519) "Managed memory leak detected" when using distinct on PySpark DataFrame

2015-12-24 Thread Paul Shearer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Shearer updated SPARK-12519:
-
Description: 
After running the distinct() method to transform a DataFrame, subsequent 
actions like count() and show() may report a managed memory leak. Here is a 
reproducible example:

h1. Script
{noformat}
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org").setLevel(logger.Level.WARN)
logger.LogManager.getLogger("akka").setLevel(logger.Level.WARN)

import string
import random

def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

nrow   = 8
ncol   = 20
ndrow  = 4  # number of distinct rows
tmp = [id_generator() for i in xrange(ndrow*ncol)]
tmp = [tuple(tmp[ncol*(i % ndrow)+0:ncol*(i % ndrow)+ncol]) for i in xrange(nrow)]

dat = sc.parallelize(tmp,1000).toDF()
dat = dat.distinct()
ct = dat.count()
print ct  
# memory leak warning prints at this point in the code
dat.show()  
{noformat}

h1. Output

When this script is run in PySpark (with IPython kernel), I get this error:
{noformat}
$ pyspark --executor-memory 12G --driver-memory 12G
Python 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015, 09:33:12) 
Type "copyright", "credits" or "license" for more information.

IPython 4.0.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help  -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
<<<... usual loading info...>>>
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Python version 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015 09:33:12)
SparkContext available as sc, SQLContext available as sqlContext.

In [1]: execfile('bugtest.py')
4
15/12/24 09:33:14 ERROR Executor: Managed memory leak detected; size = 16777216 
bytes, TID = 2202
+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|    _1|    _2|    _3|    _4|    _5|    _6|    _7|    _8|    _9|   _10|   _11|   _12|   _13|   _14|   _15|   _16|   _17|   _18|   _19|   _20|
+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|83I981|09B1ZK|J5UB1A|BPYI80|7JTIMU|HVPQVY|XS4YM2|6N4YO3|AB9GQZ|92RCHR|1N46EU|THPZFH|5IXNR1|KL4LGD|B0S50O|DZH5QP|FKTHHF|MLOCTD|ZVV5BY|D76KRK|
|BNLSVC|CYYYMD|W6ZXF6|Z0QXDT|4JPRX6|YSXIBK|WCB6YD|C86MPS|ZRA42Z|8W8GX8|2DW3AA|ZZ1U0O|EVXX3L|683UOL|5M6TOZ|PI4QX8|6V7SOS|THQVVJ|0ULB14|DJ2LP5|
|IZYG7Q|Q0NCUG|0FSTPN|UVT8Y6|TBAEF6|5CGN50|WNGOSB|NX2Y8R|XWPW7Y|WPTLIV|NPF00K|92YSNO|FP50AU|CW0K3K|8ULT74|SZM6HK|4XPQU9|L109TB|02X1UC|TV8BLZ|
|S7AWK6|7DQ8JP|YSIHVQ|1NKN5G|UOD1TN|ZSL6K4|86SDUW|NHLK9P|Z2ZBFL|QTOA89|D6D1NK|UXUJMG|B0A0ZF|94HB2S|HGLX19|VCVF05|HMAXNE|Y265LD|DHNR78|9L23XR|
|U6JCLP|PKJEOB|66C408|HNAUQK|1Q9O2X|NFW976|YLAXD4|0XC334|NMKW62|W297XR|WL9KMG|8K1P94|T5P7LP|WAQ7PT|Q5JYG0|2A9H44|9DOW5P|9SOPFH|M0NNK5|W877FV|
|3M39A1|K97EL6|7JFM9G|23I3JT|FIS25Z|HIY6VN|2ORNRG|MTGYMT|32IEH8|RX41EH|EJSSKX|H6QY8J|8G0R0H|AAPYPI|HDEVZ4|WP3VCW|2KNQZ0|U8V254|61T6SH|VJJP4L|
|XT3CON|WG8XST|KKJ67T|5RBQB0|OC4LJT|GYSIBI|XGVGUP|8RND4A|38CY23|W3Q26Z|K0ARWU|FLA3O7|I3DGN7|IY080I|HAQW3T|EQDQHD|1Z8E3X|I0J5WN|P4B6IO|1S23KL|
|4GMPF8|FFZLKK|Y4UW1Q|AF5J2H|VQ32TO|VMU7PG|WS66ZH|VXSYVK|S0GVCY|OL5I4Q|LFB98K|BCQVZK|XW03W6|F5YGTS|NTYCKZ|JTJ5YY|DR0VSC|KIUJMN|HCPYS4|QG9WYL|
|USOIHJ|HPGNXC|DIGTPY|BL0QZ4|2957GI|8A7EC5|GOMEFU|568QPG|6EA6Z2|W7P0Z8|TSP1BF|XXYS8Q|TMN7OA|3ZL2R4|7W1856|DS3LHW|QH32TF|3Y7XPC|EUO5O6|95CIMH|
|0CQR4E|ZV8SYE|UZNOLC|19JG2Q|G4RJVC|D2YUGB|HUKQUK|T0HSQH|9K0B9T|EHVBJF|R07A6C|6LS1FL|1NWXKJ|X7TIWZ|MPVWCO|RSO4F9|J5DQG6|AGMXTS|MFFRMX|PEDHGU|
|1LQTDY|JV1HJY|7OH9HL|0AUWC7|LQFF5G|DUK4GW|HU6VLJ|PHY36G|BLMOYU|PY7E64|Y6XHYS|3IA38F|RF4LQ5|PIXEM2|0H5GIW|6V3M9C|0VBIUC|U4ZWRH|68M496|6UUVWZ|
|B7A7TT|9K5MRI|8CJWX2|YUZ8SY|JLB0MX|3JNIN6|PJP0S5|9W7N5C|LIJSXB|488P8Y|PHWN5N|E6TF76|FGYZQ2|MGDN65|YNLUJE|5D6455|JI4J2K|C3J8K8|BTJ131|D5C7CD|
|G9AKQ5|UPEQDN|JAWFI2|I0EKX2|YG8TN4|8NNJBO|X3GMYR|RXG2RX|CRS9US|53VX2Q|S72E08|H5PR14|JRDDMT|Q8G6PR|KOJA0W|1U4AX8|844N9D|SKN5F7|H0C29Z|7U7GHH|
|A10ZUQ|HEI32J|VP99PD|44UP47|4W5BPO|X0QE8Q|H3UQVM|47VU9U|3AUPR8|TCGT7L|65WLUU|6PX6IW|5NCTC7|ES2S38|T86EI9|G20RFI|SX2V3V|5XT724|HV8HVS|T3JYJD|
|3USH3X|NHXB4D|QPL3QC|8CN92J|MJF9JZ|DFA2IV|XT7C4S|CUB4IJ|4BD3OR|T3EK2S|V81146|LWXTMJ|PCVJ5N|R8H3H6|0W5DLU|GVAO4D|I7SNKJ|6TLMAV|E57PMA|OGCVQM|
|FZEDSN|WO4JEN|000HBA|HA2GAN|5ROPXM|5K6NUG|2HWCJ0|OPX5AT|6PT5ZV|HGB74S|FCQT9S|NNODZP|G0ZMSJ|SHIFDQ|MYSHAT|KZDNA4|M25MPR|4XD9J9|JBFZZ0|XLIE31|
|6UPQWB|BEV7YK|Q0JHJG|4C91TM|7HBK81|RIKZ9D|ZK96YJ|O4KZ48|GUYUHO|GYYO8P|4O1QUM|74I38Z|CAXQDE|URVY7R|PQ4WM4|4QOQ81|4PPV8B|SWKFCD|S8TC2W|QTJIS6|

[jira] [Updated] (SPARK-12519) "Managed memory leak detected" when using distinct on PySpark DataFrame

2015-12-24 Thread Paul Shearer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Shearer updated SPARK-12519:
-
Description: 
After running the distinct() method to transform a DataFrame, subsequent 
actions like count() and show() may report a managed memory leak. Here is a 
minimal example that reproduces the bug on my machine:

h1. Script
{noformat}
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org").setLevel(logger.Level.WARN)
logger.LogManager.getLogger("akka").setLevel(logger.Level.WARN)

import string
import random

def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

nrow   = 8
ncol   = 20
ndrow  = 4  # number of distinct rows
tmp = [id_generator() for i in xrange(ndrow*ncol)]
tmp = [tuple(tmp[ncol*(i % ndrow)+0:ncol*(i % ndrow)+ncol]) for i in xrange(nrow)]

dat = sc.parallelize(tmp,1000).toDF()
dat = dat.distinct()
ct = dat.count()
print ct  
# memory leak warning prints at this point in the code
dat.show()  
{noformat}

h1. Output

When this script is run in PySpark (with IPython kernel), I get this error:
{noformat}
$ pyspark --executor-memory 12G --driver-memory 12G
Python 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015, 09:33:12) 
Type "copyright", "credits" or "license" for more information.

IPython 4.0.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help  -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
<<<... usual loading info...>>>
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Python version 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015 09:33:12)
SparkContext available as sc, SQLContext available as sqlContext.

In [1]: execfile('bugtest.py')
4
15/12/24 09:33:14 ERROR Executor: Managed memory leak detected; size = 16777216 
bytes, TID = 2202
+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|    _1|    _2|    _3|    _4|    _5|    _6|    _7|    _8|    _9|   _10|   _11|   _12|   _13|   _14|   _15|   _16|   _17|   _18|   _19|   _20|
+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|83I981|09B1ZK|J5UB1A|BPYI80|7JTIMU|HVPQVY|XS4YM2|6N4YO3|AB9GQZ|92RCHR|1N46EU|THPZFH|5IXNR1|KL4LGD|B0S50O|DZH5QP|FKTHHF|MLOCTD|ZVV5BY|D76KRK|
|BNLSVC|CYYYMD|W6ZXF6|Z0QXDT|4JPRX6|YSXIBK|WCB6YD|C86MPS|ZRA42Z|8W8GX8|2DW3AA|ZZ1U0O|EVXX3L|683UOL|5M6TOZ|PI4QX8|6V7SOS|THQVVJ|0ULB14|DJ2LP5|
|IZYG7Q|Q0NCUG|0FSTPN|UVT8Y6|TBAEF6|5CGN50|WNGOSB|NX2Y8R|XWPW7Y|WPTLIV|NPF00K|92YSNO|FP50AU|CW0K3K|8ULT74|SZM6HK|4XPQU9|L109TB|02X1UC|TV8BLZ|
|S7AWK6|7DQ8JP|YSIHVQ|1NKN5G|UOD1TN|ZSL6K4|86SDUW|NHLK9P|Z2ZBFL|QTOA89|D6D1NK|UXUJMG|B0A0ZF|94HB2S|HGLX19|VCVF05|HMAXNE|Y265LD|DHNR78|9L23XR|
|U6JCLP|PKJEOB|66C408|HNAUQK|1Q9O2X|NFW976|YLAXD4|0XC334|NMKW62|W297XR|WL9KMG|8K1P94|T5P7LP|WAQ7PT|Q5JYG0|2A9H44|9DOW5P|9SOPFH|M0NNK5|W877FV|
|3M39A1|K97EL6|7JFM9G|23I3JT|FIS25Z|HIY6VN|2ORNRG|MTGYMT|32IEH8|RX41EH|EJSSKX|H6QY8J|8G0R0H|AAPYPI|HDEVZ4|WP3VCW|2KNQZ0|U8V254|61T6SH|VJJP4L|
|XT3CON|WG8XST|KKJ67T|5RBQB0|OC4LJT|GYSIBI|XGVGUP|8RND4A|38CY23|W3Q26Z|K0ARWU|FLA3O7|I3DGN7|IY080I|HAQW3T|EQDQHD|1Z8E3X|I0J5WN|P4B6IO|1S23KL|
|4GMPF8|FFZLKK|Y4UW1Q|AF5J2H|VQ32TO|VMU7PG|WS66ZH|VXSYVK|S0GVCY|OL5I4Q|LFB98K|BCQVZK|XW03W6|F5YGTS|NTYCKZ|JTJ5YY|DR0VSC|KIUJMN|HCPYS4|QG9WYL|
|USOIHJ|HPGNXC|DIGTPY|BL0QZ4|2957GI|8A7EC5|GOMEFU|568QPG|6EA6Z2|W7P0Z8|TSP1BF|XXYS8Q|TMN7OA|3ZL2R4|7W1856|DS3LHW|QH32TF|3Y7XPC|EUO5O6|95CIMH|
|0CQR4E|ZV8SYE|UZNOLC|19JG2Q|G4RJVC|D2YUGB|HUKQUK|T0HSQH|9K0B9T|EHVBJF|R07A6C|6LS1FL|1NWXKJ|X7TIWZ|MPVWCO|RSO4F9|J5DQG6|AGMXTS|MFFRMX|PEDHGU|
|1LQTDY|JV1HJY|7OH9HL|0AUWC7|LQFF5G|DUK4GW|HU6VLJ|PHY36G|BLMOYU|PY7E64|Y6XHYS|3IA38F|RF4LQ5|PIXEM2|0H5GIW|6V3M9C|0VBIUC|U4ZWRH|68M496|6UUVWZ|
|B7A7TT|9K5MRI|8CJWX2|YUZ8SY|JLB0MX|3JNIN6|PJP0S5|9W7N5C|LIJSXB|488P8Y|PHWN5N|E6TF76|FGYZQ2|MGDN65|YNLUJE|5D6455|JI4J2K|C3J8K8|BTJ131|D5C7CD|
|G9AKQ5|UPEQDN|JAWFI2|I0EKX2|YG8TN4|8NNJBO|X3GMYR|RXG2RX|CRS9US|53VX2Q|S72E08|H5PR14|JRDDMT|Q8G6PR|KOJA0W|1U4AX8|844N9D|SKN5F7|H0C29Z|7U7GHH|
|A10ZUQ|HEI32J|VP99PD|44UP47|4W5BPO|X0QE8Q|H3UQVM|47VU9U|3AUPR8|TCGT7L|65WLUU|6PX6IW|5NCTC7|ES2S38|T86EI9|G20RFI|SX2V3V|5XT724|HV8HVS|T3JYJD|
|3USH3X|NHXB4D|QPL3QC|8CN92J|MJF9JZ|DFA2IV|XT7C4S|CUB4IJ|4BD3OR|T3EK2S|V81146|LWXTMJ|PCVJ5N|R8H3H6|0W5DLU|GVAO4D|I7SNKJ|6TLMAV|E57PMA|OGCVQM|
|FZEDSN|WO4JEN|000HBA|HA2GAN|5ROPXM|5K6NUG|2HWCJ0|OPX5AT|6PT5ZV|HGB74S|FCQT9S|NNODZP|G0ZMSJ|SHIFDQ|MYSHAT|KZDNA4|M25MPR|4XD9J9|JBFZZ0|XLIE31|
|6UPQWB|BEV7YK|Q0JHJG|4C91TM|7HBK81|RIKZ9D|ZK96YJ|O4KZ48|GUYUHO|GYYO8P|4O1QUM|74I38Z|CAXQDE|URVY7R|PQ4WM4|4QOQ81|4PPV8B|SWKFCD|S8TC2W|QTJIS6|

[jira] [Created] (SPARK-12519) "Managed memory leak detected" when using distinct on PySpark DataFrame

2015-12-24 Thread Paul Shearer (JIRA)
Paul Shearer created SPARK-12519:


 Summary: "Managed memory leak detected" when using distinct on 
PySpark DataFrame
 Key: SPARK-12519
 URL: https://issues.apache.org/jira/browse/SPARK-12519
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.5.2
 Environment: OS X 10.9.5, Java 1.8.0_66
Reporter: Paul Shearer


After running the distinct() method to transform a DataFrame, subsequent 
actions like count() and show() may report a managed memory leak.

# Script
{noformat}
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org").setLevel(logger.Level.WARN)
logger.LogManager.getLogger("akka").setLevel(logger.Level.WARN)

import string
import random

def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

nrow   = 8
ncol   = 20
ndrow  = 4  # number of distinct rows
tmp = [id_generator() for i in xrange(ndrow*ncol)]
tmp = [tuple(tmp[ncol*(i % ndrow)+0:ncol*(i % ndrow)+ncol]) for i in xrange(nrow)]

dat = sc.parallelize(tmp,1000).toDF()
dat = dat.distinct()
ct=dat.count()
print ct  
# memory leak warning prints at this point in the code
dat.show()  
{noformat}

# Output
{noformat}
In [1]: execfile('bugtest.py')
4
15/12/24 09:33:14 ERROR Executor: Managed memory leak detected; size = 16777216 
bytes, TID = 2202
+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|    _1|    _2|    _3|    _4|    _5|    _6|    _7|    _8|    _9|   _10|   _11|   _12|   _13|   _14|   _15|   _16|   _17|   _18|   _19|   _20|
+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|83I981|09B1ZK|J5UB1A|BPYI80|7JTIMU|HVPQVY|XS4YM2|6N4YO3|AB9GQZ|92RCHR|1N46EU|THPZFH|5IXNR1|KL4LGD|B0S50O|DZH5QP|FKTHHF|MLOCTD|ZVV5BY|D76KRK|
|BNLSVC|CYYYMD|W6ZXF6|Z0QXDT|4JPRX6|YSXIBK|WCB6YD|C86MPS|ZRA42Z|8W8GX8|2DW3AA|ZZ1U0O|EVXX3L|683UOL|5M6TOZ|PI4QX8|6V7SOS|THQVVJ|0ULB14|DJ2LP5|
|IZYG7Q|Q0NCUG|0FSTPN|UVT8Y6|TBAEF6|5CGN50|WNGOSB|NX2Y8R|XWPW7Y|WPTLIV|NPF00K|92YSNO|FP50AU|CW0K3K|8ULT74|SZM6HK|4XPQU9|L109TB|02X1UC|TV8BLZ|
|S7AWK6|7DQ8JP|YSIHVQ|1NKN5G|UOD1TN|ZSL6K4|86SDUW|NHLK9P|Z2ZBFL|QTOA89|D6D1NK|UXUJMG|B0A0ZF|94HB2S|HGLX19|VCVF05|HMAXNE|Y265LD|DHNR78|9L23XR|
|U6JCLP|PKJEOB|66C408|HNAUQK|1Q9O2X|NFW976|YLAXD4|0XC334|NMKW62|W297XR|WL9KMG|8K1P94|T5P7LP|WAQ7PT|Q5JYG0|2A9H44|9DOW5P|9SOPFH|M0NNK5|W877FV|
|3M39A1|K97EL6|7JFM9G|23I3JT|FIS25Z|HIY6VN|2ORNRG|MTGYMT|32IEH8|RX41EH|EJSSKX|H6QY8J|8G0R0H|AAPYPI|HDEVZ4|WP3VCW|2KNQZ0|U8V254|61T6SH|VJJP4L|
|XT3CON|WG8XST|KKJ67T|5RBQB0|OC4LJT|GYSIBI|XGVGUP|8RND4A|38CY23|W3Q26Z|K0ARWU|FLA3O7|I3DGN7|IY080I|HAQW3T|EQDQHD|1Z8E3X|I0J5WN|P4B6IO|1S23KL|
|4GMPF8|FFZLKK|Y4UW1Q|AF5J2H|VQ32TO|VMU7PG|WS66ZH|VXSYVK|S0GVCY|OL5I4Q|LFB98K|BCQVZK|XW03W6|F5YGTS|NTYCKZ|JTJ5YY|DR0VSC|KIUJMN|HCPYS4|QG9WYL|
|USOIHJ|HPGNXC|DIGTPY|BL0QZ4|2957GI|8A7EC5|GOMEFU|568QPG|6EA6Z2|W7P0Z8|TSP1BF|XXYS8Q|TMN7OA|3ZL2R4|7W1856|DS3LHW|QH32TF|3Y7XPC|EUO5O6|95CIMH|
|0CQR4E|ZV8SYE|UZNOLC|19JG2Q|G4RJVC|D2YUGB|HUKQUK|T0HSQH|9K0B9T|EHVBJF|R07A6C|6LS1FL|1NWXKJ|X7TIWZ|MPVWCO|RSO4F9|J5DQG6|AGMXTS|MFFRMX|PEDHGU|
|1LQTDY|JV1HJY|7OH9HL|0AUWC7|LQFF5G|DUK4GW|HU6VLJ|PHY36G|BLMOYU|PY7E64|Y6XHYS|3IA38F|RF4LQ5|PIXEM2|0H5GIW|6V3M9C|0VBIUC|U4ZWRH|68M496|6UUVWZ|
|B7A7TT|9K5MRI|8CJWX2|YUZ8SY|JLB0MX|3JNIN6|PJP0S5|9W7N5C|LIJSXB|488P8Y|PHWN5N|E6TF76|FGYZQ2|MGDN65|YNLUJE|5D6455|JI4J2K|C3J8K8|BTJ131|D5C7CD|
|G9AKQ5|UPEQDN|JAWFI2|I0EKX2|YG8TN4|8NNJBO|X3GMYR|RXG2RX|CRS9US|53VX2Q|S72E08|H5PR14|JRDDMT|Q8G6PR|KOJA0W|1U4AX8|844N9D|SKN5F7|H0C29Z|7U7GHH|
|A10ZUQ|HEI32J|VP99PD|44UP47|4W5BPO|X0QE8Q|H3UQVM|47VU9U|3AUPR8|TCGT7L|65WLUU|6PX6IW|5NCTC7|ES2S38|T86EI9|G20RFI|SX2V3V|5XT724|HV8HVS|T3JYJD|
|3USH3X|NHXB4D|QPL3QC|8CN92J|MJF9JZ|DFA2IV|XT7C4S|CUB4IJ|4BD3OR|T3EK2S|V81146|LWXTMJ|PCVJ5N|R8H3H6|0W5DLU|GVAO4D|I7SNKJ|6TLMAV|E57PMA|OGCVQM|
|FZEDSN|WO4JEN|000HBA|HA2GAN|5ROPXM|5K6NUG|2HWCJ0|OPX5AT|6PT5ZV|HGB74S|FCQT9S|NNODZP|G0ZMSJ|SHIFDQ|MYSHAT|KZDNA4|M25MPR|4XD9J9|JBFZZ0|XLIE31|
|6UPQWB|BEV7YK|Q0JHJG|4C91TM|7HBK81|RIKZ9D|ZK96YJ|O4KZ48|GUYUHO|GYYO8P|4O1QUM|74I38Z|CAXQDE|URVY7R|PQ4WM4|4QOQ81|4PPV8B|SWKFCD|S8TC2W|QTJIS6|
|L2DC9N|N0NQSH|N3SMU0|VRSSPM|7TGRZ9|1FP90A|Z9KB0U|CWOH6I|O2WNSY|IJEUNA|MTJQXG|CAT0VD|5SL8A0|R6SX6H|9ZSVL1|HWPTBR|4SBQPN|4GPD0Z|ZQ72K5|EIVYSE|
|X6MH6R|VM5M86|ZV1H22|Z5V1FX|XRZSGC|L39Q1R|1OT5XB|84NY6I|IXKYXQ|KY2U4G|F13S00|CZRR3E|ZIAVU0|DU2BAB|27KBZ8|XWBB7G|09V69R|LTXJ4U|8GP3EM|P3WVAX|
|1IPOKL|9EIG2Q|UQJV00|RXJGCK|X20VBH|CZB7SQ|THZ95A|V90YSH|9QTKCW|0RLYJO|WSTNYK|UXZYST|WT8OHL|KE31OO|C0ZKRE|9VSDJF|6Z3JAR|RR0KMB|R3J61U|EPNRZL|
+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
{noformat}
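
For scale, a quick check (plain arithmetic, not tied to any particular Spark 
internals) shows that the size reported in the leak message is exactly 16 MiB:
{noformat}
size = 16777216                   # bytes, as reported by the executor above
print size == 16 * 1024 * 1024    # True: exactly 16 MiB
print size / (1024.0 * 1024.0)    # 16.0
{noformat}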



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (SPARK-12519) "Managed memory leak detected" when using distinct on PySpark DataFrame

2015-12-24 Thread Paul Shearer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Shearer updated SPARK-12519:
-
Description: 
After running the distinct() method to transform a DataFrame, subsequent 
actions like count() and show() may report a managed memory leak.

h1. Script
{noformat}
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org").setLevel(logger.Level.WARN)
logger.LogManager.getLogger("akka").setLevel(logger.Level.WARN)

import string
import random

def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

nrow   = 8
ncol   = 20
ndrow  = 4  # number of distinct rows
tmp = [id_generator() for i in xrange(ndrow*ncol)]
tmp = [tuple(tmp[ncol*(i % ndrow)+0:ncol*(i % ndrow)+ncol]) for i in xrange(nrow)]

dat = sc.parallelize(tmp,1000).toDF()
dat = dat.distinct()
ct=dat.count()
print ct  
# memory leak warning prints at this point in the code
dat.show()  
{noformat}

h1. Output

When this script is run in PySpark (with 12G of driver and executor memory), I 
get this error:
{noformat}
pyspark --executor-memory 12G --driver-memory 12G

In [1]: execfile('bugtest.py')
4
15/12/24 09:33:14 ERROR Executor: Managed memory leak detected; size = 16777216 
bytes, TID = 2202
+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|    _1|    _2|    _3|    _4|    _5|    _6|    _7|    _8|    _9|   _10|   _11|   _12|   _13|   _14|   _15|   _16|   _17|   _18|   _19|   _20|
+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|83I981|09B1ZK|J5UB1A|BPYI80|7JTIMU|HVPQVY|XS4YM2|6N4YO3|AB9GQZ|92RCHR|1N46EU|THPZFH|5IXNR1|KL4LGD|B0S50O|DZH5QP|FKTHHF|MLOCTD|ZVV5BY|D76KRK|
|BNLSVC|CYYYMD|W6ZXF6|Z0QXDT|4JPRX6|YSXIBK|WCB6YD|C86MPS|ZRA42Z|8W8GX8|2DW3AA|ZZ1U0O|EVXX3L|683UOL|5M6TOZ|PI4QX8|6V7SOS|THQVVJ|0ULB14|DJ2LP5|
|IZYG7Q|Q0NCUG|0FSTPN|UVT8Y6|TBAEF6|5CGN50|WNGOSB|NX2Y8R|XWPW7Y|WPTLIV|NPF00K|92YSNO|FP50AU|CW0K3K|8ULT74|SZM6HK|4XPQU9|L109TB|02X1UC|TV8BLZ|
|S7AWK6|7DQ8JP|YSIHVQ|1NKN5G|UOD1TN|ZSL6K4|86SDUW|NHLK9P|Z2ZBFL|QTOA89|D6D1NK|UXUJMG|B0A0ZF|94HB2S|HGLX19|VCVF05|HMAXNE|Y265LD|DHNR78|9L23XR|
|U6JCLP|PKJEOB|66C408|HNAUQK|1Q9O2X|NFW976|YLAXD4|0XC334|NMKW62|W297XR|WL9KMG|8K1P94|T5P7LP|WAQ7PT|Q5JYG0|2A9H44|9DOW5P|9SOPFH|M0NNK5|W877FV|
|3M39A1|K97EL6|7JFM9G|23I3JT|FIS25Z|HIY6VN|2ORNRG|MTGYMT|32IEH8|RX41EH|EJSSKX|H6QY8J|8G0R0H|AAPYPI|HDEVZ4|WP3VCW|2KNQZ0|U8V254|61T6SH|VJJP4L|
|XT3CON|WG8XST|KKJ67T|5RBQB0|OC4LJT|GYSIBI|XGVGUP|8RND4A|38CY23|W3Q26Z|K0ARWU|FLA3O7|I3DGN7|IY080I|HAQW3T|EQDQHD|1Z8E3X|I0J5WN|P4B6IO|1S23KL|
|4GMPF8|FFZLKK|Y4UW1Q|AF5J2H|VQ32TO|VMU7PG|WS66ZH|VXSYVK|S0GVCY|OL5I4Q|LFB98K|BCQVZK|XW03W6|F5YGTS|NTYCKZ|JTJ5YY|DR0VSC|KIUJMN|HCPYS4|QG9WYL|
|USOIHJ|HPGNXC|DIGTPY|BL0QZ4|2957GI|8A7EC5|GOMEFU|568QPG|6EA6Z2|W7P0Z8|TSP1BF|XXYS8Q|TMN7OA|3ZL2R4|7W1856|DS3LHW|QH32TF|3Y7XPC|EUO5O6|95CIMH|
|0CQR4E|ZV8SYE|UZNOLC|19JG2Q|G4RJVC|D2YUGB|HUKQUK|T0HSQH|9K0B9T|EHVBJF|R07A6C|6LS1FL|1NWXKJ|X7TIWZ|MPVWCO|RSO4F9|J5DQG6|AGMXTS|MFFRMX|PEDHGU|
|1LQTDY|JV1HJY|7OH9HL|0AUWC7|LQFF5G|DUK4GW|HU6VLJ|PHY36G|BLMOYU|PY7E64|Y6XHYS|3IA38F|RF4LQ5|PIXEM2|0H5GIW|6V3M9C|0VBIUC|U4ZWRH|68M496|6UUVWZ|
|B7A7TT|9K5MRI|8CJWX2|YUZ8SY|JLB0MX|3JNIN6|PJP0S5|9W7N5C|LIJSXB|488P8Y|PHWN5N|E6TF76|FGYZQ2|MGDN65|YNLUJE|5D6455|JI4J2K|C3J8K8|BTJ131|D5C7CD|
|G9AKQ5|UPEQDN|JAWFI2|I0EKX2|YG8TN4|8NNJBO|X3GMYR|RXG2RX|CRS9US|53VX2Q|S72E08|H5PR14|JRDDMT|Q8G6PR|KOJA0W|1U4AX8|844N9D|SKN5F7|H0C29Z|7U7GHH|
|A10ZUQ|HEI32J|VP99PD|44UP47|4W5BPO|X0QE8Q|H3UQVM|47VU9U|3AUPR8|TCGT7L|65WLUU|6PX6IW|5NCTC7|ES2S38|T86EI9|G20RFI|SX2V3V|5XT724|HV8HVS|T3JYJD|
|3USH3X|NHXB4D|QPL3QC|8CN92J|MJF9JZ|DFA2IV|XT7C4S|CUB4IJ|4BD3OR|T3EK2S|V81146|LWXTMJ|PCVJ5N|R8H3H6|0W5DLU|GVAO4D|I7SNKJ|6TLMAV|E57PMA|OGCVQM|
|FZEDSN|WO4JEN|000HBA|HA2GAN|5ROPXM|5K6NUG|2HWCJ0|OPX5AT|6PT5ZV|HGB74S|FCQT9S|NNODZP|G0ZMSJ|SHIFDQ|MYSHAT|KZDNA4|M25MPR|4XD9J9|JBFZZ0|XLIE31|
|6UPQWB|BEV7YK|Q0JHJG|4C91TM|7HBK81|RIKZ9D|ZK96YJ|O4KZ48|GUYUHO|GYYO8P|4O1QUM|74I38Z|CAXQDE|URVY7R|PQ4WM4|4QOQ81|4PPV8B|SWKFCD|S8TC2W|QTJIS6|
|L2DC9N|N0NQSH|N3SMU0|VRSSPM|7TGRZ9|1FP90A|Z9KB0U|CWOH6I|O2WNSY|IJEUNA|MTJQXG|CAT0VD|5SL8A0|R6SX6H|9ZSVL1|HWPTBR|4SBQPN|4GPD0Z|ZQ72K5|EIVYSE|
|X6MH6R|VM5M86|ZV1H22|Z5V1FX|XRZSGC|L39Q1R|1OT5XB|84NY6I|IXKYXQ|KY2U4G|F13S00|CZRR3E|ZIAVU0|DU2BAB|27KBZ8|XWBB7G|09V69R|LTXJ4U|8GP3EM|P3WVAX|
|1IPOKL|9EIG2Q|UQJV00|RXJGCK|X20VBH|CZB7SQ|THZ95A|V90YSH|9QTKCW|0RLYJO|WSTNYK|UXZYST|WT8OHL|KE31OO|C0ZKRE|9VSDJF|6Z3JAR|RR0KMB|R3J61U|EPNRZL|
+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
{noformat}

  was:
After running the distinct() method to transform a DataFrame, subsequent 
actions like count() and show() may report a managed memory leak.

# Script
{noformat}
logger = sc._jvm.org.apache.log4j

[jira] [Updated] (SPARK-12519) "Managed memory leak detected" when using distinct on PySpark DataFrame

2015-12-24 Thread Paul Shearer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Shearer updated SPARK-12519:
-
Description: 
After running the distinct() method to transform a DataFrame, subsequent 
actions like count() and show() may report a managed memory leak.

h1. Script
{noformat}
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org").setLevel(logger.Level.WARN)
logger.LogManager.getLogger("akka").setLevel(logger.Level.WARN)

import string
import random

def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

nrow   = 8
ncol   = 20
ndrow  = 4  # number of distinct rows
tmp = [id_generator() for i in xrange(ndrow*ncol)]
tmp = [tuple(tmp[ncol*(i % ndrow)+0:ncol*(i % ndrow)+ncol]) for i in xrange(nrow)]

dat = sc.parallelize(tmp,1000).toDF()
dat = dat.distinct()
ct=dat.count()
print ct  
# memory leak warning prints at this point in the code
dat.show()  
{noformat}

h1. Output

When this script is run in PySpark (with IPython kernel and 12G of driver and 
executor memory), I get this error:
{noformat}
$ pyspark --executor-memory 12G --driver-memory 12G
Python 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015, 09:33:12) 
Type "copyright", "credits" or "license" for more information.

IPython 4.0.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help  -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
<<<... usual loading info...>>>
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Python version 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015 09:33:12)
SparkContext available as sc, SQLContext available as sqlContext.

In [1]: execfile('bugtest.py')
4
15/12/24 09:33:14 ERROR Executor: Managed memory leak detected; size = 16777216 
bytes, TID = 2202
+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|    _1|    _2|    _3|    _4|    _5|    _6|    _7|    _8|    _9|   _10|   _11|   _12|   _13|   _14|   _15|   _16|   _17|   _18|   _19|   _20|
+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|83I981|09B1ZK|J5UB1A|BPYI80|7JTIMU|HVPQVY|XS4YM2|6N4YO3|AB9GQZ|92RCHR|1N46EU|THPZFH|5IXNR1|KL4LGD|B0S50O|DZH5QP|FKTHHF|MLOCTD|ZVV5BY|D76KRK|
|BNLSVC|CYYYMD|W6ZXF6|Z0QXDT|4JPRX6|YSXIBK|WCB6YD|C86MPS|ZRA42Z|8W8GX8|2DW3AA|ZZ1U0O|EVXX3L|683UOL|5M6TOZ|PI4QX8|6V7SOS|THQVVJ|0ULB14|DJ2LP5|
|IZYG7Q|Q0NCUG|0FSTPN|UVT8Y6|TBAEF6|5CGN50|WNGOSB|NX2Y8R|XWPW7Y|WPTLIV|NPF00K|92YSNO|FP50AU|CW0K3K|8ULT74|SZM6HK|4XPQU9|L109TB|02X1UC|TV8BLZ|
|S7AWK6|7DQ8JP|YSIHVQ|1NKN5G|UOD1TN|ZSL6K4|86SDUW|NHLK9P|Z2ZBFL|QTOA89|D6D1NK|UXUJMG|B0A0ZF|94HB2S|HGLX19|VCVF05|HMAXNE|Y265LD|DHNR78|9L23XR|
|U6JCLP|PKJEOB|66C408|HNAUQK|1Q9O2X|NFW976|YLAXD4|0XC334|NMKW62|W297XR|WL9KMG|8K1P94|T5P7LP|WAQ7PT|Q5JYG0|2A9H44|9DOW5P|9SOPFH|M0NNK5|W877FV|
|3M39A1|K97EL6|7JFM9G|23I3JT|FIS25Z|HIY6VN|2ORNRG|MTGYMT|32IEH8|RX41EH|EJSSKX|H6QY8J|8G0R0H|AAPYPI|HDEVZ4|WP3VCW|2KNQZ0|U8V254|61T6SH|VJJP4L|
|XT3CON|WG8XST|KKJ67T|5RBQB0|OC4LJT|GYSIBI|XGVGUP|8RND4A|38CY23|W3Q26Z|K0ARWU|FLA3O7|I3DGN7|IY080I|HAQW3T|EQDQHD|1Z8E3X|I0J5WN|P4B6IO|1S23KL|
|4GMPF8|FFZLKK|Y4UW1Q|AF5J2H|VQ32TO|VMU7PG|WS66ZH|VXSYVK|S0GVCY|OL5I4Q|LFB98K|BCQVZK|XW03W6|F5YGTS|NTYCKZ|JTJ5YY|DR0VSC|KIUJMN|HCPYS4|QG9WYL|
|USOIHJ|HPGNXC|DIGTPY|BL0QZ4|2957GI|8A7EC5|GOMEFU|568QPG|6EA6Z2|W7P0Z8|TSP1BF|XXYS8Q|TMN7OA|3ZL2R4|7W1856|DS3LHW|QH32TF|3Y7XPC|EUO5O6|95CIMH|
|0CQR4E|ZV8SYE|UZNOLC|19JG2Q|G4RJVC|D2YUGB|HUKQUK|T0HSQH|9K0B9T|EHVBJF|R07A6C|6LS1FL|1NWXKJ|X7TIWZ|MPVWCO|RSO4F9|J5DQG6|AGMXTS|MFFRMX|PEDHGU|
|1LQTDY|JV1HJY|7OH9HL|0AUWC7|LQFF5G|DUK4GW|HU6VLJ|PHY36G|BLMOYU|PY7E64|Y6XHYS|3IA38F|RF4LQ5|PIXEM2|0H5GIW|6V3M9C|0VBIUC|U4ZWRH|68M496|6UUVWZ|
|B7A7TT|9K5MRI|8CJWX2|YUZ8SY|JLB0MX|3JNIN6|PJP0S5|9W7N5C|LIJSXB|488P8Y|PHWN5N|E6TF76|FGYZQ2|MGDN65|YNLUJE|5D6455|JI4J2K|C3J8K8|BTJ131|D5C7CD|
|G9AKQ5|UPEQDN|JAWFI2|I0EKX2|YG8TN4|8NNJBO|X3GMYR|RXG2RX|CRS9US|53VX2Q|S72E08|H5PR14|JRDDMT|Q8G6PR|KOJA0W|1U4AX8|844N9D|SKN5F7|H0C29Z|7U7GHH|
|A10ZUQ|HEI32J|VP99PD|44UP47|4W5BPO|X0QE8Q|H3UQVM|47VU9U|3AUPR8|TCGT7L|65WLUU|6PX6IW|5NCTC7|ES2S38|T86EI9|G20RFI|SX2V3V|5XT724|HV8HVS|T3JYJD|
|3USH3X|NHXB4D|QPL3QC|8CN92J|MJF9JZ|DFA2IV|XT7C4S|CUB4IJ|4BD3OR|T3EK2S|V81146|LWXTMJ|PCVJ5N|R8H3H6|0W5DLU|GVAO4D|I7SNKJ|6TLMAV|E57PMA|OGCVQM|
|FZEDSN|WO4JEN|000HBA|HA2GAN|5ROPXM|5K6NUG|2HWCJ0|OPX5AT|6PT5ZV|HGB74S|FCQT9S|NNODZP|G0ZMSJ|SHIFDQ|MYSHAT|KZDNA4|M25MPR|4XD9J9|JBFZZ0|XLIE31|
|6UPQWB|BEV7YK|Q0JHJG|4C91TM|7HBK81|RIKZ9D|ZK96YJ|O4KZ48|GUYUHO|GYYO8P|4O1QUM|74I38Z|CAXQDE|URVY7R|PQ4WM4|4QOQ81|4PPV8B|SWKFCD|S8TC2W|QTJIS6|

[jira] [Updated] (SPARK-12519) "Managed memory leak detected" when using distinct on PySpark DataFrame

2015-12-24 Thread Paul Shearer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Shearer updated SPARK-12519:
-
Description: 
After running the distinct() method to transform a DataFrame, subsequent 
actions like count() and show() may report a managed memory leak.

h1. Script
{noformat}
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org").setLevel(logger.Level.WARN)
logger.LogManager.getLogger("akka").setLevel(logger.Level.WARN)

import string
import random

def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

nrow   = 8
ncol   = 20
ndrow  = 4  # number of distinct rows
tmp = [id_generator() for i in xrange(ndrow*ncol)]
tmp = [tuple(tmp[ncol*(i % ndrow)+0:ncol*(i % ndrow)+ncol]) for i in xrange(nrow)]

dat = sc.parallelize(tmp,1000).toDF()
dat = dat.distinct()
ct = dat.count()
print ct  
# memory leak warning prints at this point in the code
dat.show()  
{noformat}

h1. Output

When this script is run in PySpark (with IPython kernel), I get this error:
{noformat}
$ pyspark --executor-memory 12G --driver-memory 12G
Python 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015, 09:33:12) 
Type "copyright", "credits" or "license" for more information.

IPython 4.0.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help  -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
<<<... usual loading info...>>>
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Python version 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015 09:33:12)
SparkContext available as sc, SQLContext available as sqlContext.

In [1]: execfile('bugtest.py')
4
15/12/24 09:33:14 ERROR Executor: Managed memory leak detected; size = 16777216 
bytes, TID = 2202
+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|    _1|    _2|    _3|    _4|    _5|    _6|    _7|    _8|    _9|   _10|   _11|   _12|   _13|   _14|   _15|   _16|   _17|   _18|   _19|   _20|
+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|83I981|09B1ZK|J5UB1A|BPYI80|7JTIMU|HVPQVY|XS4YM2|6N4YO3|AB9GQZ|92RCHR|1N46EU|THPZFH|5IXNR1|KL4LGD|B0S50O|DZH5QP|FKTHHF|MLOCTD|ZVV5BY|D76KRK|
|BNLSVC|CYYYMD|W6ZXF6|Z0QXDT|4JPRX6|YSXIBK|WCB6YD|C86MPS|ZRA42Z|8W8GX8|2DW3AA|ZZ1U0O|EVXX3L|683UOL|5M6TOZ|PI4QX8|6V7SOS|THQVVJ|0ULB14|DJ2LP5|
|IZYG7Q|Q0NCUG|0FSTPN|UVT8Y6|TBAEF6|5CGN50|WNGOSB|NX2Y8R|XWPW7Y|WPTLIV|NPF00K|92YSNO|FP50AU|CW0K3K|8ULT74|SZM6HK|4XPQU9|L109TB|02X1UC|TV8BLZ|
|S7AWK6|7DQ8JP|YSIHVQ|1NKN5G|UOD1TN|ZSL6K4|86SDUW|NHLK9P|Z2ZBFL|QTOA89|D6D1NK|UXUJMG|B0A0ZF|94HB2S|HGLX19|VCVF05|HMAXNE|Y265LD|DHNR78|9L23XR|
|U6JCLP|PKJEOB|66C408|HNAUQK|1Q9O2X|NFW976|YLAXD4|0XC334|NMKW62|W297XR|WL9KMG|8K1P94|T5P7LP|WAQ7PT|Q5JYG0|2A9H44|9DOW5P|9SOPFH|M0NNK5|W877FV|
|3M39A1|K97EL6|7JFM9G|23I3JT|FIS25Z|HIY6VN|2ORNRG|MTGYMT|32IEH8|RX41EH|EJSSKX|H6QY8J|8G0R0H|AAPYPI|HDEVZ4|WP3VCW|2KNQZ0|U8V254|61T6SH|VJJP4L|
|XT3CON|WG8XST|KKJ67T|5RBQB0|OC4LJT|GYSIBI|XGVGUP|8RND4A|38CY23|W3Q26Z|K0ARWU|FLA3O7|I3DGN7|IY080I|HAQW3T|EQDQHD|1Z8E3X|I0J5WN|P4B6IO|1S23KL|
|4GMPF8|FFZLKK|Y4UW1Q|AF5J2H|VQ32TO|VMU7PG|WS66ZH|VXSYVK|S0GVCY|OL5I4Q|LFB98K|BCQVZK|XW03W6|F5YGTS|NTYCKZ|JTJ5YY|DR0VSC|KIUJMN|HCPYS4|QG9WYL|
|USOIHJ|HPGNXC|DIGTPY|BL0QZ4|2957GI|8A7EC5|GOMEFU|568QPG|6EA6Z2|W7P0Z8|TSP1BF|XXYS8Q|TMN7OA|3ZL2R4|7W1856|DS3LHW|QH32TF|3Y7XPC|EUO5O6|95CIMH|
|0CQR4E|ZV8SYE|UZNOLC|19JG2Q|G4RJVC|D2YUGB|HUKQUK|T0HSQH|9K0B9T|EHVBJF|R07A6C|6LS1FL|1NWXKJ|X7TIWZ|MPVWCO|RSO4F9|J5DQG6|AGMXTS|MFFRMX|PEDHGU|
|1LQTDY|JV1HJY|7OH9HL|0AUWC7|LQFF5G|DUK4GW|HU6VLJ|PHY36G|BLMOYU|PY7E64|Y6XHYS|3IA38F|RF4LQ5|PIXEM2|0H5GIW|6V3M9C|0VBIUC|U4ZWRH|68M496|6UUVWZ|
|B7A7TT|9K5MRI|8CJWX2|YUZ8SY|JLB0MX|3JNIN6|PJP0S5|9W7N5C|LIJSXB|488P8Y|PHWN5N|E6TF76|FGYZQ2|MGDN65|YNLUJE|5D6455|JI4J2K|C3J8K8|BTJ131|D5C7CD|
|G9AKQ5|UPEQDN|JAWFI2|I0EKX2|YG8TN4|8NNJBO|X3GMYR|RXG2RX|CRS9US|53VX2Q|S72E08|H5PR14|JRDDMT|Q8G6PR|KOJA0W|1U4AX8|844N9D|SKN5F7|H0C29Z|7U7GHH|
|A10ZUQ|HEI32J|VP99PD|44UP47|4W5BPO|X0QE8Q|H3UQVM|47VU9U|3AUPR8|TCGT7L|65WLUU|6PX6IW|5NCTC7|ES2S38|T86EI9|G20RFI|SX2V3V|5XT724|HV8HVS|T3JYJD|
|3USH3X|NHXB4D|QPL3QC|8CN92J|MJF9JZ|DFA2IV|XT7C4S|CUB4IJ|4BD3OR|T3EK2S|V81146|LWXTMJ|PCVJ5N|R8H3H6|0W5DLU|GVAO4D|I7SNKJ|6TLMAV|E57PMA|OGCVQM|
|FZEDSN|WO4JEN|000HBA|HA2GAN|5ROPXM|5K6NUG|2HWCJ0|OPX5AT|6PT5ZV|HGB74S|FCQT9S|NNODZP|G0ZMSJ|SHIFDQ|MYSHAT|KZDNA4|M25MPR|4XD9J9|JBFZZ0|XLIE31|
|6UPQWB|BEV7YK|Q0JHJG|4C91TM|7HBK81|RIKZ9D|ZK96YJ|O4KZ48|GUYUHO|GYYO8P|4O1QUM|74I38Z|CAXQDE|URVY7R|PQ4WM4|4QOQ81|4PPV8B|SWKFCD|S8TC2W|QTJIS6|

[jira] [Updated] (SPARK-12519) "Managed memory leak detected" when using distinct on PySpark DataFrame

2015-12-24 Thread Paul Shearer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Shearer updated SPARK-12519:
-
Description: 
After running the distinct() method to transform a DataFrame, subsequent 
actions like count() and show() may report a managed memory leak.

h1. Script
{noformat}
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org").setLevel(logger.Level.WARN)
logger.LogManager.getLogger("akka").setLevel(logger.Level.WARN)

import string
import random

def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

nrow   = 8
ncol   = 20
ndrow  = 4  # number of distinct rows
tmp = [id_generator() for i in xrange(ndrow*ncol)]
tmp = [tuple(tmp[ncol*(i % ndrow)+0:ncol*(i % ndrow)+ncol]) for i in xrange(nrow)]

dat = sc.parallelize(tmp,1000).toDF()
dat = dat.distinct()
ct=dat.count()
print ct  
# memory leak warning prints at this point in the code
dat.show()  
{noformat}

h1. Output

When this script is run in PySpark (with IPython kernel), I get this error:
{noformat}
$ pyspark --executor-memory 12G --driver-memory 12G
Python 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015, 09:33:12) 
Type "copyright", "credits" or "license" for more information.

IPython 4.0.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help  -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
<<<... usual loading info...>>>
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Python version 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015 09:33:12)
SparkContext available as sc, SQLContext available as sqlContext.

In [1]: execfile('bugtest.py')
4
15/12/24 09:33:14 ERROR Executor: Managed memory leak detected; size = 16777216 
bytes, TID = 2202
+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|    _1|    _2|    _3|    _4|    _5|    _6|    _7|    _8|    _9|   _10|   _11|   _12|   _13|   _14|   _15|   _16|   _17|   _18|   _19|   _20|
+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|83I981|09B1ZK|J5UB1A|BPYI80|7JTIMU|HVPQVY|XS4YM2|6N4YO3|AB9GQZ|92RCHR|1N46EU|THPZFH|5IXNR1|KL4LGD|B0S50O|DZH5QP|FKTHHF|MLOCTD|ZVV5BY|D76KRK|
|BNLSVC|CYYYMD|W6ZXF6|Z0QXDT|4JPRX6|YSXIBK|WCB6YD|C86MPS|ZRA42Z|8W8GX8|2DW3AA|ZZ1U0O|EVXX3L|683UOL|5M6TOZ|PI4QX8|6V7SOS|THQVVJ|0ULB14|DJ2LP5|
|IZYG7Q|Q0NCUG|0FSTPN|UVT8Y6|TBAEF6|5CGN50|WNGOSB|NX2Y8R|XWPW7Y|WPTLIV|NPF00K|92YSNO|FP50AU|CW0K3K|8ULT74|SZM6HK|4XPQU9|L109TB|02X1UC|TV8BLZ|
|S7AWK6|7DQ8JP|YSIHVQ|1NKN5G|UOD1TN|ZSL6K4|86SDUW|NHLK9P|Z2ZBFL|QTOA89|D6D1NK|UXUJMG|B0A0ZF|94HB2S|HGLX19|VCVF05|HMAXNE|Y265LD|DHNR78|9L23XR|
|U6JCLP|PKJEOB|66C408|HNAUQK|1Q9O2X|NFW976|YLAXD4|0XC334|NMKW62|W297XR|WL9KMG|8K1P94|T5P7LP|WAQ7PT|Q5JYG0|2A9H44|9DOW5P|9SOPFH|M0NNK5|W877FV|
|3M39A1|K97EL6|7JFM9G|23I3JT|FIS25Z|HIY6VN|2ORNRG|MTGYMT|32IEH8|RX41EH|EJSSKX|H6QY8J|8G0R0H|AAPYPI|HDEVZ4|WP3VCW|2KNQZ0|U8V254|61T6SH|VJJP4L|
|XT3CON|WG8XST|KKJ67T|5RBQB0|OC4LJT|GYSIBI|XGVGUP|8RND4A|38CY23|W3Q26Z|K0ARWU|FLA3O7|I3DGN7|IY080I|HAQW3T|EQDQHD|1Z8E3X|I0J5WN|P4B6IO|1S23KL|
|4GMPF8|FFZLKK|Y4UW1Q|AF5J2H|VQ32TO|VMU7PG|WS66ZH|VXSYVK|S0GVCY|OL5I4Q|LFB98K|BCQVZK|XW03W6|F5YGTS|NTYCKZ|JTJ5YY|DR0VSC|KIUJMN|HCPYS4|QG9WYL|
|USOIHJ|HPGNXC|DIGTPY|BL0QZ4|2957GI|8A7EC5|GOMEFU|568QPG|6EA6Z2|W7P0Z8|TSP1BF|XXYS8Q|TMN7OA|3ZL2R4|7W1856|DS3LHW|QH32TF|3Y7XPC|EUO5O6|95CIMH|
|0CQR4E|ZV8SYE|UZNOLC|19JG2Q|G4RJVC|D2YUGB|HUKQUK|T0HSQH|9K0B9T|EHVBJF|R07A6C|6LS1FL|1NWXKJ|X7TIWZ|MPVWCO|RSO4F9|J5DQG6|AGMXTS|MFFRMX|PEDHGU|
|1LQTDY|JV1HJY|7OH9HL|0AUWC7|LQFF5G|DUK4GW|HU6VLJ|PHY36G|BLMOYU|PY7E64|Y6XHYS|3IA38F|RF4LQ5|PIXEM2|0H5GIW|6V3M9C|0VBIUC|U4ZWRH|68M496|6UUVWZ|
|B7A7TT|9K5MRI|8CJWX2|YUZ8SY|JLB0MX|3JNIN6|PJP0S5|9W7N5C|LIJSXB|488P8Y|PHWN5N|E6TF76|FGYZQ2|MGDN65|YNLUJE|5D6455|JI4J2K|C3J8K8|BTJ131|D5C7CD|
|G9AKQ5|UPEQDN|JAWFI2|I0EKX2|YG8TN4|8NNJBO|X3GMYR|RXG2RX|CRS9US|53VX2Q|S72E08|H5PR14|JRDDMT|Q8G6PR|KOJA0W|1U4AX8|844N9D|SKN5F7|H0C29Z|7U7GHH|
|A10ZUQ|HEI32J|VP99PD|44UP47|4W5BPO|X0QE8Q|H3UQVM|47VU9U|3AUPR8|TCGT7L|65WLUU|6PX6IW|5NCTC7|ES2S38|T86EI9|G20RFI|SX2V3V|5XT724|HV8HVS|T3JYJD|
|3USH3X|NHXB4D|QPL3QC|8CN92J|MJF9JZ|DFA2IV|XT7C4S|CUB4IJ|4BD3OR|T3EK2S|V81146|LWXTMJ|PCVJ5N|R8H3H6|0W5DLU|GVAO4D|I7SNKJ|6TLMAV|E57PMA|OGCVQM|
|FZEDSN|WO4JEN|000HBA|HA2GAN|5ROPXM|5K6NUG|2HWCJ0|OPX5AT|6PT5ZV|HGB74S|FCQT9S|NNODZP|G0ZMSJ|SHIFDQ|MYSHAT|KZDNA4|M25MPR|4XD9J9|JBFZZ0|XLIE31|
|6UPQWB|BEV7YK|Q0JHJG|4C91TM|7HBK81|RIKZ9D|ZK96YJ|O4KZ48|GUYUHO|GYYO8P|4O1QUM|74I38Z|CAXQDE|URVY7R|PQ4WM4|4QOQ81|4PPV8B|SWKFCD|S8TC2W|QTJIS6|

[jira] [Updated] (SPARK-12519) "Managed memory leak detected" when using distinct on PySpark DataFrame

2015-12-24 Thread Paul Shearer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Shearer updated SPARK-12519:
-
Description: 
After running the distinct() method to transform a DataFrame, subsequent 
actions like count() and show() may report a managed memory leak. Here is a 
minimal example that reproduces the bug on my machine:

h1. Script
{noformat}
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org").setLevel(logger.Level.WARN)
logger.LogManager.getLogger("akka").setLevel(logger.Level.WARN)

import string
import random

def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

nrow   = 8
ncol   = 20
ndrow  = 4  # number of distinct rows
tmp = [id_generator() for i in xrange(ndrow*ncol)]
tmp = [tuple(tmp[ncol*(i % ndrow)+0:ncol*(i % ndrow)+ncol]) for i in xrange(nrow)]

dat = sc.parallelize(tmp,1000).toDF()
dat = dat.distinct()  # if this line is commented out, no memory leak will be reported
ct = dat.count()
print ct  
# memory leak warning prints at this point in the code
dat.show()  
{noformat}
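
As the comment inside the script indicates, the warning appears tied to the 
DataFrame distinct() step. Below is a minimal control sketch of that variant 
(assuming the same sc and tmp as in the script above), which per that comment 
does not trigger the message:
{noformat}
# Control run: identical pipeline but with distinct() skipped; per the comment
# in the script above, no managed-memory-leak warning is reported in this case.
dat_ctrl = sc.parallelize(tmp, 1000).toDF()
print dat_ctrl.count()   # 8: duplicates are kept since distinct() is not applied
dat_ctrl.show()
{noformat}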

h1. Output

When this script is run in PySpark (with IPython kernel), I get this error:
{noformat}
$ pyspark --executor-memory 12G --driver-memory 12G
Python 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015, 09:33:12) 
Type "copyright", "credits" or "license" for more information.

IPython 4.0.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help  -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
<<<... usual loading info...>>>
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Python version 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015 09:33:12)
SparkContext available as sc, SQLContext available as sqlContext.

In [1]: execfile('bugtest.py')
4
15/12/24 09:33:14 ERROR Executor: Managed memory leak detected; size = 16777216 
bytes, TID = 2202
+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|    _1|    _2|    _3|    _4|    _5|    _6|    _7|    _8|    _9|   _10|   _11|   _12|   _13|   _14|   _15|   _16|   _17|   _18|   _19|   _20|
+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|83I981|09B1ZK|J5UB1A|BPYI80|7JTIMU|HVPQVY|XS4YM2|6N4YO3|AB9GQZ|92RCHR|1N46EU|THPZFH|5IXNR1|KL4LGD|B0S50O|DZH5QP|FKTHHF|MLOCTD|ZVV5BY|D76KRK|
|BNLSVC|CYYYMD|W6ZXF6|Z0QXDT|4JPRX6|YSXIBK|WCB6YD|C86MPS|ZRA42Z|8W8GX8|2DW3AA|ZZ1U0O|EVXX3L|683UOL|5M6TOZ|PI4QX8|6V7SOS|THQVVJ|0ULB14|DJ2LP5|
|IZYG7Q|Q0NCUG|0FSTPN|UVT8Y6|TBAEF6|5CGN50|WNGOSB|NX2Y8R|XWPW7Y|WPTLIV|NPF00K|92YSNO|FP50AU|CW0K3K|8ULT74|SZM6HK|4XPQU9|L109TB|02X1UC|TV8BLZ|
|S7AWK6|7DQ8JP|YSIHVQ|1NKN5G|UOD1TN|ZSL6K4|86SDUW|NHLK9P|Z2ZBFL|QTOA89|D6D1NK|UXUJMG|B0A0ZF|94HB2S|HGLX19|VCVF05|HMAXNE|Y265LD|DHNR78|9L23XR|
|U6JCLP|PKJEOB|66C408|HNAUQK|1Q9O2X|NFW976|YLAXD4|0XC334|NMKW62|W297XR|WL9KMG|8K1P94|T5P7LP|WAQ7PT|Q5JYG0|2A9H44|9DOW5P|9SOPFH|M0NNK5|W877FV|
|3M39A1|K97EL6|7JFM9G|23I3JT|FIS25Z|HIY6VN|2ORNRG|MTGYMT|32IEH8|RX41EH|EJSSKX|H6QY8J|8G0R0H|AAPYPI|HDEVZ4|WP3VCW|2KNQZ0|U8V254|61T6SH|VJJP4L|
|XT3CON|WG8XST|KKJ67T|5RBQB0|OC4LJT|GYSIBI|XGVGUP|8RND4A|38CY23|W3Q26Z|K0ARWU|FLA3O7|I3DGN7|IY080I|HAQW3T|EQDQHD|1Z8E3X|I0J5WN|P4B6IO|1S23KL|
|4GMPF8|FFZLKK|Y4UW1Q|AF5J2H|VQ32TO|VMU7PG|WS66ZH|VXSYVK|S0GVCY|OL5I4Q|LFB98K|BCQVZK|XW03W6|F5YGTS|NTYCKZ|JTJ5YY|DR0VSC|KIUJMN|HCPYS4|QG9WYL|
|USOIHJ|HPGNXC|DIGTPY|BL0QZ4|2957GI|8A7EC5|GOMEFU|568QPG|6EA6Z2|W7P0Z8|TSP1BF|XXYS8Q|TMN7OA|3ZL2R4|7W1856|DS3LHW|QH32TF|3Y7XPC|EUO5O6|95CIMH|
|0CQR4E|ZV8SYE|UZNOLC|19JG2Q|G4RJVC|D2YUGB|HUKQUK|T0HSQH|9K0B9T|EHVBJF|R07A6C|6LS1FL|1NWXKJ|X7TIWZ|MPVWCO|RSO4F9|J5DQG6|AGMXTS|MFFRMX|PEDHGU|
|1LQTDY|JV1HJY|7OH9HL|0AUWC7|LQFF5G|DUK4GW|HU6VLJ|PHY36G|BLMOYU|PY7E64|Y6XHYS|3IA38F|RF4LQ5|PIXEM2|0H5GIW|6V3M9C|0VBIUC|U4ZWRH|68M496|6UUVWZ|
|B7A7TT|9K5MRI|8CJWX2|YUZ8SY|JLB0MX|3JNIN6|PJP0S5|9W7N5C|LIJSXB|488P8Y|PHWN5N|E6TF76|FGYZQ2|MGDN65|YNLUJE|5D6455|JI4J2K|C3J8K8|BTJ131|D5C7CD|
|G9AKQ5|UPEQDN|JAWFI2|I0EKX2|YG8TN4|8NNJBO|X3GMYR|RXG2RX|CRS9US|53VX2Q|S72E08|H5PR14|JRDDMT|Q8G6PR|KOJA0W|1U4AX8|844N9D|SKN5F7|H0C29Z|7U7GHH|
|A10ZUQ|HEI32J|VP99PD|44UP47|4W5BPO|X0QE8Q|H3UQVM|47VU9U|3AUPR8|TCGT7L|65WLUU|6PX6IW|5NCTC7|ES2S38|T86EI9|G20RFI|SX2V3V|5XT724|HV8HVS|T3JYJD|
|3USH3X|NHXB4D|QPL3QC|8CN92J|MJF9JZ|DFA2IV|XT7C4S|CUB4IJ|4BD3OR|T3EK2S|V81146|LWXTMJ|PCVJ5N|R8H3H6|0W5DLU|GVAO4D|I7SNKJ|6TLMAV|E57PMA|OGCVQM|
|FZEDSN|WO4JEN|000HBA|HA2GAN|5ROPXM|5K6NUG|2HWCJ0|OPX5AT|6PT5ZV|HGB74S|FCQT9S|NNODZP|G0ZMSJ|SHIFDQ|MYSHAT|KZDNA4|M25MPR|4XD9J9|JBFZZ0|XLIE31|

[jira] [Updated] (SPARK-12519) "Managed memory leak detected" when using distinct on PySpark DataFrame

2015-12-24 Thread Paul Shearer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Shearer updated SPARK-12519:
-
Description: 
After running the distinct() method to transform a DataFrame, subsequent 
actions like count() and show() may report a managed memory leak. Here is a 
minimal example that reproduces the bug on my machine:

h1. Script
{noformat}
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org").setLevel(logger.Level.WARN)
logger.LogManager.getLogger("akka").setLevel(logger.Level.WARN)

import string
import random

def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

nrow   = 8
ncol   = 20
ndrow  = 4  # number of distinct rows
tmp = [id_generator() for i in xrange(ndrow*ncol)]
tmp = [tuple(tmp[ncol*(i % ndrow)+0:ncol*(i % ndrow)+ncol]) for i in xrange(nrow)]

dat = sc.parallelize(tmp,1000).toDF()
dat = dat.distinct()  # if this line is commented out, no memory leak will be reported
# dat = dat.rdd.distinct().toDF()  # if this line is used instead of the above, no leak
ct = dat.count()
print ct  
# memory leak warning prints at this point in the code
dat.show()  
{noformat}
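
For completeness, here is a sketch of the RDD-based variant mentioned in the 
commented-out line above (assuming the same sc and tmp as in the script); per 
that comment it produces the same de-duplicated DataFrame without the leak 
warning:
{noformat}
# Workaround hinted at in the script: de-duplicate through the RDD API and
# convert back, instead of calling DataFrame.distinct() directly.
dat2 = sc.parallelize(tmp, 1000).toDF()
dat2 = dat2.rdd.distinct().toDF()   # RDD-level distinct, then back to a DataFrame
print dat2.count()                  # 4, same result as DataFrame.distinct()
dat2.show()                         # no leak warning reported, per the comment above
{noformat}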

h1. Output

When this script is run in PySpark (with IPython kernel), I get this error:
{noformat}
$ pyspark --executor-memory 12G --driver-memory 12G
Python 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015, 09:33:12) 
Type "copyright", "credits" or "license" for more information.

IPython 4.0.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help  -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
<<<... usual loading info...>>>
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Python version 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015 09:33:12)
SparkContext available as sc, SQLContext available as sqlContext.

In [1]: execfile('bugtest.py')
4
15/12/24 09:33:14 ERROR Executor: Managed memory leak detected; size = 16777216 
bytes, TID = 2202
+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|    _1|    _2|    _3|    _4|    _5|    _6|    _7|    _8|    _9|   _10|   _11|   _12|   _13|   _14|   _15|   _16|   _17|   _18|   _19|   _20|
+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|83I981|09B1ZK|J5UB1A|BPYI80|7JTIMU|HVPQVY|XS4YM2|6N4YO3|AB9GQZ|92RCHR|1N46EU|THPZFH|5IXNR1|KL4LGD|B0S50O|DZH5QP|FKTHHF|MLOCTD|ZVV5BY|D76KRK|
|BNLSVC|CYYYMD|W6ZXF6|Z0QXDT|4JPRX6|YSXIBK|WCB6YD|C86MPS|ZRA42Z|8W8GX8|2DW3AA|ZZ1U0O|EVXX3L|683UOL|5M6TOZ|PI4QX8|6V7SOS|THQVVJ|0ULB14|DJ2LP5|
|IZYG7Q|Q0NCUG|0FSTPN|UVT8Y6|TBAEF6|5CGN50|WNGOSB|NX2Y8R|XWPW7Y|WPTLIV|NPF00K|92YSNO|FP50AU|CW0K3K|8ULT74|SZM6HK|4XPQU9|L109TB|02X1UC|TV8BLZ|
|S7AWK6|7DQ8JP|YSIHVQ|1NKN5G|UOD1TN|ZSL6K4|86SDUW|NHLK9P|Z2ZBFL|QTOA89|D6D1NK|UXUJMG|B0A0ZF|94HB2S|HGLX19|VCVF05|HMAXNE|Y265LD|DHNR78|9L23XR|
|U6JCLP|PKJEOB|66C408|HNAUQK|1Q9O2X|NFW976|YLAXD4|0XC334|NMKW62|W297XR|WL9KMG|8K1P94|T5P7LP|WAQ7PT|Q5JYG0|2A9H44|9DOW5P|9SOPFH|M0NNK5|W877FV|
|3M39A1|K97EL6|7JFM9G|23I3JT|FIS25Z|HIY6VN|2ORNRG|MTGYMT|32IEH8|RX41EH|EJSSKX|H6QY8J|8G0R0H|AAPYPI|HDEVZ4|WP3VCW|2KNQZ0|U8V254|61T6SH|VJJP4L|
|XT3CON|WG8XST|KKJ67T|5RBQB0|OC4LJT|GYSIBI|XGVGUP|8RND4A|38CY23|W3Q26Z|K0ARWU|FLA3O7|I3DGN7|IY080I|HAQW3T|EQDQHD|1Z8E3X|I0J5WN|P4B6IO|1S23KL|
|4GMPF8|FFZLKK|Y4UW1Q|AF5J2H|VQ32TO|VMU7PG|WS66ZH|VXSYVK|S0GVCY|OL5I4Q|LFB98K|BCQVZK|XW03W6|F5YGTS|NTYCKZ|JTJ5YY|DR0VSC|KIUJMN|HCPYS4|QG9WYL|
|USOIHJ|HPGNXC|DIGTPY|BL0QZ4|2957GI|8A7EC5|GOMEFU|568QPG|6EA6Z2|W7P0Z8|TSP1BF|XXYS8Q|TMN7OA|3ZL2R4|7W1856|DS3LHW|QH32TF|3Y7XPC|EUO5O6|95CIMH|
|0CQR4E|ZV8SYE|UZNOLC|19JG2Q|G4RJVC|D2YUGB|HUKQUK|T0HSQH|9K0B9T|EHVBJF|R07A6C|6LS1FL|1NWXKJ|X7TIWZ|MPVWCO|RSO4F9|J5DQG6|AGMXTS|MFFRMX|PEDHGU|
|1LQTDY|JV1HJY|7OH9HL|0AUWC7|LQFF5G|DUK4GW|HU6VLJ|PHY36G|BLMOYU|PY7E64|Y6XHYS|3IA38F|RF4LQ5|PIXEM2|0H5GIW|6V3M9C|0VBIUC|U4ZWRH|68M496|6UUVWZ|
|B7A7TT|9K5MRI|8CJWX2|YUZ8SY|JLB0MX|3JNIN6|PJP0S5|9W7N5C|LIJSXB|488P8Y|PHWN5N|E6TF76|FGYZQ2|MGDN65|YNLUJE|5D6455|JI4J2K|C3J8K8|BTJ131|D5C7CD|
|G9AKQ5|UPEQDN|JAWFI2|I0EKX2|YG8TN4|8NNJBO|X3GMYR|RXG2RX|CRS9US|53VX2Q|S72E08|H5PR14|JRDDMT|Q8G6PR|KOJA0W|1U4AX8|844N9D|SKN5F7|H0C29Z|7U7GHH|
|A10ZUQ|HEI32J|VP99PD|44UP47|4W5BPO|X0QE8Q|H3UQVM|47VU9U|3AUPR8|TCGT7L|65WLUU|6PX6IW|5NCTC7|ES2S38|T86EI9|G20RFI|SX2V3V|5XT724|HV8HVS|T3JYJD|
|3USH3X|NHXB4D|QPL3QC|8CN92J|MJF9JZ|DFA2IV|XT7C4S|CUB4IJ|4BD3OR|T3EK2S|V81146|LWXTMJ|PCVJ5N|R8H3H6|0W5DLU|GVAO4D|I7SNKJ|6TLMAV|E57PMA|OGCVQM|
|FZEDSN|WO4JEN|000HBA|HA2GAN|5ROPXM|5K6NUG|2HWCJ0|OPX5AT|6PT5ZV|HGB74S|FCQT9S|NNODZP|G0ZMSJ|SHIFDQ|MYSHAT|KZDNA4|M25MPR|4XD9J9|JBFZZ0|XLIE31|