[jira] [Commented] (SPARK-14740) CrossValidatorModel.bestModel does not include hyper-parameters
[ https://issues.apache.org/jira/browse/SPARK-14740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250055#comment-15250055 ]

Paul Shearer commented on SPARK-14740:
--------------------------------------

It appears the value is accessible through cvModel.bestModel._java_obj.getRegParam(); it just needs to be exposed in the right place in the code.

> CrossValidatorModel.bestModel does not include hyper-parameters
> ---------------------------------------------------------------
>
>                 Key: SPARK-14740
>                 URL: https://issues.apache.org/jira/browse/SPARK-14740
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib, PySpark
>    Affects Versions: 1.6.1
>            Reporter: Paul Shearer
>
> If you tune hyperparameters using a CrossValidator object in PySpark, you may
> not be able to extract the parameter values of the best model.
> {noformat}
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.mllib.linalg import Vectors
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
>
> dataset = sqlContext.createDataFrame(
>     [(Vectors.dense([0.0]), 0.0),
>      (Vectors.dense([0.4]), 1.0),
>      (Vectors.dense([0.5]), 0.0),
>      (Vectors.dense([0.6]), 1.0),
>      (Vectors.dense([1.0]), 1.0)] * 10,
>     ["features", "label"])
>
> lr = LogisticRegression()
> grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build()
> evaluator = BinaryClassificationEvaluator()
> cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
> cvModel = cv.fit(dataset)
> {noformat}
> I can get the regression coefficients out, but I can't get the regularization
> parameter:
> {noformat}
> In [3]: cvModel.bestModel.coefficients
> Out[3]: DenseVector([3.1573])
>
> In [4]: cvModel.bestModel.explainParams()
> Out[4]: ''
>
> In [5]: cvModel.bestModel.extractParamMap()
> Out[5]: {}
>
> In [15]: cvModel.params
> Out[15]: []
>
> In [36]: cvModel.bestModel.params
> Out[36]: []
> {noformat}
> For the original issue raised on StackOverflow please see
> http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
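The workaround mentioned in the comment can be sketched as follows. This is a hedged continuation of the {noformat} example in the issue description, not an official API: it assumes the `cvModel` produced by `cv.fit(dataset)` above and a PySpark 1.6.x runtime, and `_java_obj` is a private attribute of the Python wrapper, so it may break across versions.

```python
# Hedged sketch (assumes `cvModel` from the cv.fit(dataset) example above).
# In PySpark 1.6.x the tuned hyper-parameters are not exposed on the Python
# wrapper, but they can be read from the underlying JVM model object.
best_lr = cvModel.bestModel

# `_java_obj` is the Py4J handle to the JVM LogisticRegressionModel;
# its getters expose the Params that the Python wrapper loses.
reg_param = best_lr._java_obj.getRegParam()
print("best regParam:", reg_param)

# The same pattern should work for other JVM-side Params, e.g. the
# elastic-net mixing parameter:
elastic_net = best_lr._java_obj.getElasticNetParam()
```

Because this reaches into a private attribute, it is a stopgap rather than a fix; the issue itself asks for these values to be surfaced on the Python side.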
[jira] [Updated] (SPARK-14740) CrossValidatorModel.bestModel does not include hyper-parameters
[ https://issues.apache.org/jira/browse/SPARK-14740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Shearer updated SPARK-14740:
---------------------------------
    Component/s:     (was: Spark Core)
                     PySpark
                     MLlib
[jira] [Commented] (SPARK-13973) `ipython notebook` is going away...
[ https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249921#comment-15249921 ]

Paul Shearer commented on SPARK-13973:
--------------------------------------

Done: https://github.com/apache/spark/pull/12528

I'm a bit new to this, happy to fix any issues.

> `ipython notebook` is going away...
> -----------------------------------
>
>                 Key: SPARK-13973
>                 URL: https://issues.apache.org/jira/browse/SPARK-13973
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>         Environment: spark-1.6.1-bin-hadoop2.6
>                      Anaconda2-2.5.0-Linux-x86_64
>            Reporter: Bogdan Pirvu
>            Assignee: Rekha Joshi
>            Priority: Trivial
>             Fix For: 2.0.0
>
> Starting {{pyspark}} with the following environment variables:
> {code:none}
> export IPYTHON=1
> export IPYTHON_OPTS="notebook --no-browser"
> {code}
> yields this warning:
> {code:none}
> [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated
> and will be removed in future versions.
> [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`...
> continue in 5 sec. Press Ctrl-C to quit now.
> {code}
> Changing line 52 from
> {code:none}
> PYSPARK_DRIVER_PYTHON="ipython"
> {code}
> to
> {code:none}
> PYSPARK_DRIVER_PYTHON="jupyter"
> {code}
> in https://github.com/apache/spark/blob/master/bin/pyspark works for me to
> solve this issue, but I'm not sure if it's sustainable as I'm not familiar
> with the rest of the code...
> This is the relevant part of my Python environment:
> {code:none}
> ipython           4.1.2    py27_0
> ipython-genutils  0.1.0
> ipython_genutils  0.1.0    py27_0
> ipywidgets        4.1.1    py27_0
> ...
> jupyter           1.0.0    py27_1
> jupyter-client    4.2.1
> jupyter-console   4.1.1
> jupyter-core      4.1.0
> jupyter_client    4.2.1    py27_0
> jupyter_console   4.1.1    py27_0
> jupyter_core      4.1.0    py27_0
> {code}
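For reference, the configuration style this thread moves toward can be sketched as shell environment settings. The variable name `PYSPARK_DRIVER_PYTHON` comes from the issue description above; `PYSPARK_DRIVER_PYTHON_OPTS` and the exact launch behavior are assumptions based on the linked pull request, not a quote of the merged script.

```shell
# Hedged sketch of the post-IPYTHON configuration for launching pyspark.
# These variables supersede the deprecated IPYTHON / IPYTHON_OPTS pair.

# Plain IPython shell as the driver front end:
export PYSPARK_DRIVER_PYTHON=ipython

# Or the Jupyter notebook, mirroring the old
# IPYTHON_OPTS="notebook --no-browser" setup:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser"

./bin/pyspark
```

Keeping the driver choice in an environment variable rather than a hard-coded `IPYTHON=1` flag is what lets both shell and notebook users configure pyspark without editing the startup script.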
[jira] [Comment Edited] (SPARK-13973) `ipython notebook` is going away...
[ https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249705#comment-15249705 ]

Paul Shearer edited comment on SPARK-13973 at 4/20/16 12:09 PM:
----------------------------------------------------------------

Bottom line... I think IPYTHON=1 should either (1) mean what it appears to mean - IPython, and not necessarily the notebook - or (2) be removed entirely as too confusing. As the code now stands, the IPython shell user's pyspark is silently broken if they happen to have set a now-deprecated option when using an older version of pyspark. Fixable, yes, but more confusing than necessary, and the opposite of the intention of backwards compatibility.

was (Author: pshearer):
Bottom line... I think IPYTHON=1 should either (1) mean what it appears to mean - IPython, and not necessarily the notebook - or (2) be removed entirely as too confusing. As the code now stands, the IPython shell user's pyspark is silently broken if they happen to have set a now-deprecated option when using an older version of pyspark. Not exactly the definition of backwards compatibility.
[jira] [Commented] (SPARK-13973) `ipython notebook` is going away...
[ https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249705#comment-15249705 ] Paul Shearer commented on SPARK-13973: -- Bottom line... I think IPYTHON=1 should either (1) mean what it appears to mean - IPython and not necessarily the notebook - or (2) be removed entirely as too confusing. > `ipython notebook` is going away... > --- > > Key: SPARK-13973 > URL: https://issues.apache.org/jira/browse/SPARK-13973 > Project: Spark > Issue Type: Improvement > Components: PySpark > Environment: spark-1.6.1-bin-hadoop2.6 > Anaconda2-2.5.0-Linux-x86_64 >Reporter: Bogdan Pirvu >Assignee: Rekha Joshi >Priority: Trivial > Fix For: 2.0.0 > > > Starting {{pyspark}} with following environment variables: > {code:none} > export IPYTHON=1 > export IPYTHON_OPTS="notebook --no-browser" > {code} > yields this warning > {code:none} > [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated > and will be removed in future versions. > [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... > continue in 5 sec. Press Ctrl-C to quit now. > {code} > Changing line 52 from > {code:none} > PYSPARK_DRIVER_PYTHON="ipython" > {code} > to > {code:none} > PYSPARK_DRIVER_PYTHON="jupyter" > {code} > in https://github.com/apache/spark/blob/master/bin/pyspark works for me to > solve this issue, but I'm not sure if it's sustainable as I'm not familiar > with the rest of the code... > This is the relevant part of my Python environment: > {code:none} > ipython 4.1.2py27_0 > ipython-genutils 0.1.0 > ipython_genutils 0.1.0py27_0 > ipywidgets4.1.1py27_0 > ... 
> jupyter 1.0.0py27_1 > jupyter-client4.2.1 > jupyter-console 4.1.1 > jupyter-core 4.1.0 > jupyter_client4.2.1py27_0 > jupyter_console 4.1.1py27_0 > jupyter_core 4.1.0py27_0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13973) `ipython notebook` is going away...
[ https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249656#comment-15249656 ]

Paul Shearer edited comment on SPARK-13973 at 4/20/16 11:36 AM:
----------------------------------------------------------------

In my development experience and that of those I know, the notebook is not the only or even the main use case for IPython. It's just the most visible because it's easy to make slides and webpages and such. IPython, without the notebook, is a productive general-purpose programming environment for data scientists.

Just to get us on the same page - pyspark is a Python shell with a SparkContext pre-loaded. That shell can be either the default python shell or the enhanced IPython shell. The latter offers tab completion of object attributes, nicely formatted tracebacks, and "magic" macros for executing pasted code snippets, scripting, debugging, and accessing help/docstrings. It is a separate entity from the notebook and long predates the notebook.

Most working scientists and analysts I know, including myself, use the IPython shell much more than the notebook. The notebook is more of a presentation and exploratory analysis tool, while the IPython shell is better for general programming / power users.

And yes, not overriding `PYSPARK_DRIVER_PYTHON` would avoid breaking my setup. I found it easy enough just to comment out `IPYTHON=1` once I realized that was the problem, so alternatively you could print an alert that this is deprecated and tell people how to change it, so you don't have to keep supporting it.
[jira] [Commented] (SPARK-13973) `ipython notebook` is going away...
[ https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249656#comment-15249656 ] Paul Shearer commented on SPARK-13973: -- Just to get us on the same page - pyspark is a python shell with SparkContext pre-loaded. The python shell can be either default python shell or the enhanced IPython shell. The latter offers tab completion of object attributes, nicely formatted tracebacks, "magic" macros for executing pasted code snippets, scripting, debugging, and accessing help/docstrings. It is a separate entity from the notebook and long predates the notebook. Most working scientists and analysts I know, including myself, use the IPython shell much more than the notebook. The notebook is more of a presentation and exploratory analysis tool, while the IPython shell is better for power users. > `ipython notebook` is going away... > --- > > Key: SPARK-13973 > URL: https://issues.apache.org/jira/browse/SPARK-13973 > Project: Spark > Issue Type: Improvement > Components: PySpark > Environment: spark-1.6.1-bin-hadoop2.6 > Anaconda2-2.5.0-Linux-x86_64 >Reporter: Bogdan Pirvu >Assignee: Rekha Joshi >Priority: Trivial > Fix For: 2.0.0 > > > Starting {{pyspark}} with following environment variables: > {code:none} > export IPYTHON=1 > export IPYTHON_OPTS="notebook --no-browser" > {code} > yields this warning > {code:none} > [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated > and will be removed in future versions. > [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... > continue in 5 sec. Press Ctrl-C to quit now. > {code} > Changing line 52 from > {code:none} > PYSPARK_DRIVER_PYTHON="ipython" > {code} > to > {code:none} > PYSPARK_DRIVER_PYTHON="jupyter" > {code} > in https://github.com/apache/spark/blob/master/bin/pyspark works for me to > solve this issue, but I'm not sure if it's sustainable as I'm not familiar > with the rest of the code... 
> This is the relevant part of my Python environment: > {code:none} > ipython 4.1.2py27_0 > ipython-genutils 0.1.0 > ipython_genutils 0.1.0py27_0 > ipywidgets4.1.1py27_0 > ... > jupyter 1.0.0py27_1 > jupyter-client4.2.1 > jupyter-console 4.1.1 > jupyter-core 4.1.0 > jupyter_client4.2.1py27_0 > jupyter_console 4.1.1py27_0 > jupyter_core 4.1.0py27_0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
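For reference, the new-style configuration discussed in this thread can be sketched as follows. This is a hedged example, not the official snippet from the Spark docs; the variable names `PYSPARK_DRIVER_PYTHON` and `PYSPARK_DRIVER_PYTHON_OPTS` are the standard Spark driver overrides:

```shell
# Launch pyspark with the plain IPython shell (not the notebook)
export PYSPARK_DRIVER_PYTHON=ipython

# Or launch it as a Jupyter notebook server instead:
# export PYSPARK_DRIVER_PYTHON=jupyter
# export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser"
```

With this style both shell and notebook users are served, which is what the deprecated `IPYTHON=1` / `IPYTHON_OPTS` pair could not cleanly do.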
[jira] [Comment Edited] (SPARK-13973) `ipython notebook` is going away...
[ https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249628#comment-15249628 ] Paul Shearer edited comment on SPARK-13973 at 4/20/16 10:48 AM: The problem with this change is that it creates a bug for users who simply want the IPython interactive shell, as opposed to the notebook. `ipython` with no arguments starts the IPython shell, but `jupyter` with no arguments results in the following error: {noformat} usage: jupyter [-h] [--version] [--config-dir] [--data-dir] [--runtime-dir] [--paths] [--json] [subcommand] jupyter: error: one of the arguments --version subcommand --config-dir --data-dir --runtime-dir --paths is required {noformat} I can't speak for the general Python community, but as a data scientist I personally find the IPython notebook only suitable for very basic exploratory analysis - any sort of application development is much better served by the IPython shell, so I'm always using the shell and rarely the notebook. So I prefer the old script. Perhaps the best answer is to stop maintaining an unsustainable backwards compatibility. The committed change broke the pyspark startup script in my case, and the old startup script will eventually be broken when `ipython notebook` is removed. So perhaps `IPYTHON=1` should just result in some kind of error message prompting the user to switch to the new PYSPARK_DRIVER_PYTHON config style. Most Spark users know the installation process is not seamless and requires mucking about with environment variables - they might as well be told to do it in a way that's convenient to the development team. was (Author: pshearer): The problem with this change is that it creates a bug for users who simply want the IPython interactive shell, as opposed to the notebook. 
`ipython` with no arguments starts the IPython shell, but `jupyter` with no arguments results in the following error: {noformat} usage: jupyter [-h] [--version] [--config-dir] [--data-dir] [--runtime-dir] [--paths] [--json] [subcommand] jupyter: error: one of the arguments --version subcommand --config-dir --data-dir --runtime-dir --paths is required {noformat} I can't speak for the general Python community but as a data scientist, personally I find the IPython notebook only suitable for very basic exploratory analysis - any sort of application development is much better served by the IPython shell, so I'm always using the shell and rarely the notebook. It seems like maintaining this old configuration switch is no longer sustainable. The change breaks it for my case, and the old state will eventually be broken when `ipython notebook` is deprecated. So perhaps `IPYTHON=1` should just result in some kind of error message prompting the user to switch to the new PYSPARK_DRIVER_PYTHON config style. > `ipython notebook` is going away... > --- > > Key: SPARK-13973 > URL: https://issues.apache.org/jira/browse/SPARK-13973 > Project: Spark > Issue Type: Improvement > Components: PySpark > Environment: spark-1.6.1-bin-hadoop2.6 > Anaconda2-2.5.0-Linux-x86_64 >Reporter: Bogdan Pirvu >Assignee: Rekha Joshi >Priority: Trivial > Fix For: 2.0.0 > > > Starting {{pyspark}} with following environment variables: > {code:none} > export IPYTHON=1 > export IPYTHON_OPTS="notebook --no-browser" > {code} > yields this warning > {code:none} > [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated > and will be removed in future versions. > [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... > continue in 5 sec. Press Ctrl-C to quit now. 
> {code} > Changing line 52 from > {code:none} > PYSPARK_DRIVER_PYTHON="ipython" > {code} > to > {code:none} > PYSPARK_DRIVER_PYTHON="jupyter" > {code} > in https://github.com/apache/spark/blob/master/bin/pyspark works for me to > solve this issue, but I'm not sure if it's sustainable as I'm not familiar > with the rest of the code... > This is the relevant part of my Python environment: > {code:none} > ipython 4.1.2py27_0 > ipython-genutils 0.1.0 > ipython_genutils 0.1.0py27_0 > ipywidgets4.1.1py27_0 > ... > jupyter 1.0.0py27_1 > jupyter-client4.2.1 > jupyter-console 4.1.1 > jupyter-core 4.1.0 > jupyter_client4.2.1py27_0 > jupyter_console 4.1.1py27_0 > jupyter_core 4.1.0py27_0 > {code} --
[jira] [Commented] (SPARK-13973) `ipython notebook` is going away...
[ https://issues.apache.org/jira/browse/SPARK-13973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249628#comment-15249628 ] Paul Shearer commented on SPARK-13973: -- The problem with this change is that it creates a bug for users who simply want the IPython interactive shell, as opposed to the notebook. `ipython` with no arguments starts the IPython shell, but `jupyter` with no arguments results in the following error: {noformat} usage: jupyter [-h] [--version] [--config-dir] [--data-dir] [--runtime-dir] [--paths] [--json] [subcommand] jupyter: error: one of the arguments --version subcommand --config-dir --data-dir --runtime-dir --paths is required {noformat} I can't speak for the general Python community but as a data scientist, personally I find the IPython notebook only suitable for very basic exploratory analysis - any sort of application development is much better served by the IPython shell, so I'm always using the shell and rarely the notebook. It seems like maintaining this old configuration switch is no longer sustainable. The change breaks it for my case, and the old state will eventually be broken when `ipython notebook` is deprecated. So perhaps `IPYTHON=1` should just result in some kind of error message prompting the user to switch to the new PYSPARK_DRIVER_PYTHON config style. > `ipython notebook` is going away... > --- > > Key: SPARK-13973 > URL: https://issues.apache.org/jira/browse/SPARK-13973 > Project: Spark > Issue Type: Improvement > Components: PySpark > Environment: spark-1.6.1-bin-hadoop2.6 > Anaconda2-2.5.0-Linux-x86_64 >Reporter: Bogdan Pirvu >Assignee: Rekha Joshi >Priority: Trivial > Fix For: 2.0.0 > > > Starting {{pyspark}} with following environment variables: > {code:none} > export IPYTHON=1 > export IPYTHON_OPTS="notebook --no-browser" > {code} > yields this warning > {code:none} > [TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated > and will be removed in future versions. 
> [TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook`... > continue in 5 sec. Press Ctrl-C to quit now. > {code} > Changing line 52 from > {code:none} > PYSPARK_DRIVER_PYTHON="ipython" > {code} > to > {code:none} > PYSPARK_DRIVER_PYTHON="jupyter" > {code} > in https://github.com/apache/spark/blob/master/bin/pyspark works for me to > solve this issue, but I'm not sure if it's sustainable as I'm not familiar > with the rest of the code... > This is the relevant part of my Python environment: > {code:none} > ipython 4.1.2py27_0 > ipython-genutils 0.1.0 > ipython_genutils 0.1.0py27_0 > ipywidgets4.1.1py27_0 > ... > jupyter 1.0.0py27_1 > jupyter-client4.2.1 > jupyter-console 4.1.1 > jupyter-core 4.1.0 > jupyter_client4.2.1py27_0 > jupyter_console 4.1.1py27_0 > jupyter_core 4.1.0py27_0 > {code}
[jira] [Updated] (SPARK-14740) CrossValidatorModel.bestModel does not include hyper-parameters
[ https://issues.apache.org/jira/browse/SPARK-14740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Shearer updated SPARK-14740: - Description: If you tune hyperparameters using a CrossValidator object in PySpark, you may not be able to extract the parameter values of the best model. {noformat} from pyspark.ml.classification import LogisticRegression from pyspark.ml.evaluation import BinaryClassificationEvaluator from pyspark.mllib.linalg import Vectors from pyspark.ml.tuning import ParamGridBuilder, CrossValidator dataset = sqlContext.createDataFrame( [(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0), (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0), (Vectors.dense([1.0]), 1.0)] * 10, ["features", "label"]) lr = LogisticRegression() grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build() evaluator = BinaryClassificationEvaluator() cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator) cvModel = cv.fit(dataset) {noformat} I can get the regression coefficient out, but I can't get the regularization parameter {noformat} In [3]: cvModel.bestModel.coefficients Out[3]: DenseVector([3.1573]) In [4]: cvModel.bestModel.explainParams() Out[4]: '' In [5]: cvModel.bestModel.extractParamMap() Out[5]: {} In [15]: cvModel.params Out[15]: [] In [36]: cvModel.bestModel.params Out[36]: [] {noformat} For the original issue raised on StackOverflow please see http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark was: If you tune hyperparameters using a CrossValidator object in PySpark, you may not be able to extract the parameter values of the best model. 
{noformat} from pyspark.ml.classification import LogisticRegression from pyspark.ml.evaluation import BinaryClassificationEvaluator from pyspark.mllib.linalg import Vectors from pyspark.ml.tuning import ParamGridBuilder, CrossValidator dataset = sqlContext.createDataFrame( [(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0), (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0), (Vectors.dense([1.0]), 1.0)] * 10, ["features", "label"]) lr = LogisticRegression() grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build() evaluator = BinaryClassificationEvaluator() cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator) cvModel = cv.fit(dataset) {noformat} I can get the regression coefficient out, but I can't get the regularization parameter {noformat} In [3]: cvModel.bestModel.coefficients Out[3]: DenseVector([3.1573]) In [4]: cvModel.bestModel.explainParams() Out[4]: '' In [5]: cvModel.bestModel.extractParamMap() Out[5]: {} In [15]: cvModel.params Out[15]: [] In [36]: cvModel.bestModel.params Out[36]: [] {noformat} For a simple example please see http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark > CrossValidatorModel.bestModel does not include hyper-parameters > --- > > Key: SPARK-14740 > URL: https://issues.apache.org/jira/browse/SPARK-14740 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Paul Shearer > > If you tune hyperparameters using a CrossValidator object in PySpark, you may > not be able to extract the parameter values of the best model. 
> {noformat} > from pyspark.ml.classification import LogisticRegression > from pyspark.ml.evaluation import BinaryClassificationEvaluator > from pyspark.mllib.linalg import Vectors > from pyspark.ml.tuning import ParamGridBuilder, CrossValidator > dataset = sqlContext.createDataFrame( > [(Vectors.dense([0.0]), 0.0), > (Vectors.dense([0.4]), 1.0), > (Vectors.dense([0.5]), 0.0), > (Vectors.dense([0.6]), 1.0), > (Vectors.dense([1.0]), 1.0)] * 10, > ["features", "label"]) > lr = LogisticRegression() > grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, > 0.0001]).build() > evaluator = BinaryClassificationEvaluator() > cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, > evaluator=evaluator) > cvModel = cv.fit(dataset) > {noformat} > I can get the regression coefficient out, but I can't get the regularization > parameter > {noformat} > In [3]: cvModel.bestModel.coefficients > Out[3]: DenseVector([3.1573]) > In [4]: cvModel.bestModel.explainParams() > Out[4]: '' > In [5]: cvModel.bestModel.extractParamMap() > Out[5]: {} > In [15]: cvModel.params > Out[15]: [] > In [36]: cvModel.bestModel.params > Out[36]: [] > {noformat} > For the original issue raised on StackOverflow please see > http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark
[jira] [Updated] (SPARK-14740) CrossValidatorModel.bestModel does not include hyper-parameters
[ https://issues.apache.org/jira/browse/SPARK-14740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Shearer updated SPARK-14740: - Description: If you tune hyperparameters using a CrossValidator object in PySpark, you may not be able to extract the parameter values of the best model. {noformat} from pyspark.ml.classification import LogisticRegression from pyspark.ml.evaluation import BinaryClassificationEvaluator from pyspark.mllib.linalg import Vectors from pyspark.ml.tuning import ParamGridBuilder, CrossValidator dataset = sqlContext.createDataFrame( [(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0), (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0), (Vectors.dense([1.0]), 1.0)] * 10, ["features", "label"]) lr = LogisticRegression() grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build() evaluator = BinaryClassificationEvaluator() cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator) cvModel = cv.fit(dataset) {noformat} I can get the regression coefficient out, but I can't get the regularization parameter {noformat} In [3]: cvModel.bestModel.coefficients Out[3]: DenseVector([3.1573]) In [4]: cvModel.bestModel.explainParams() Out[4]: '' In [5]: cvModel.bestModel.extractParamMap() Out[5]: {} In [15]: cvModel.params Out[15]: [] In [36]: cvModel.bestModel.params Out[36]: [] {noformat} For a simple example please see http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark was: If you tune hyperparameters using a CrossValidator object in PySpark, you may not be able to extract the parameter values of the best model. 
{{ from pyspark.ml.classification import LogisticRegression from pyspark.ml.evaluation import BinaryClassificationEvaluator from pyspark.mllib.linalg import Vectors from pyspark.ml.tuning import ParamGridBuilder, CrossValidator dataset = sqlContext.createDataFrame( [(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0), (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0), (Vectors.dense([1.0]), 1.0)] * 10, ["features", "label"]) lr = LogisticRegression() grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build() evaluator = BinaryClassificationEvaluator() cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator) cvModel = cv.fit(dataset) }} I can get the regression coefficient out, but I can't get the regularization parameter {{ In [3]: cvModel.bestModel.coefficients Out[3]: DenseVector([3.1573]) In [4]: cvModel.bestModel.explainParams() Out[4]: '' In [5]: cvModel.bestModel.extractParamMap() Out[5]: {} In [15]: cvModel.params Out[15]: [] In [36]: cvModel.bestModel.params Out[36]: [] }} For a simple example please see http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark > CrossValidatorModel.bestModel does not include hyper-parameters > --- > > Key: SPARK-14740 > URL: https://issues.apache.org/jira/browse/SPARK-14740 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Paul Shearer > > If you tune hyperparameters using a CrossValidator object in PySpark, you may > not be able to extract the parameter values of the best model. 
> {noformat} > from pyspark.ml.classification import LogisticRegression > from pyspark.ml.evaluation import BinaryClassificationEvaluator > from pyspark.mllib.linalg import Vectors > from pyspark.ml.tuning import ParamGridBuilder, CrossValidator > dataset = sqlContext.createDataFrame( > [(Vectors.dense([0.0]), 0.0), > (Vectors.dense([0.4]), 1.0), > (Vectors.dense([0.5]), 0.0), > (Vectors.dense([0.6]), 1.0), > (Vectors.dense([1.0]), 1.0)] * 10, > ["features", "label"]) > lr = LogisticRegression() > grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, > 0.0001]).build() > evaluator = BinaryClassificationEvaluator() > cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, > evaluator=evaluator) > cvModel = cv.fit(dataset) > {noformat} > I can get the regression coefficient out, but I can't get the regularization > parameter > {noformat} > In [3]: cvModel.bestModel.coefficients > Out[3]: DenseVector([3.1573]) > In [4]: cvModel.bestModel.explainParams() > Out[4]: '' > In [5]: cvModel.bestModel.extractParamMap() > Out[5]: {} > In [15]: cvModel.params > Out[15]: [] > In [36]: cvModel.bestModel.params > Out[36]: [] > {noformat} > For a simple example please see > http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark
[jira] [Updated] (SPARK-14740) CrossValidatorModel.bestModel does not include hyper-parameters
[ https://issues.apache.org/jira/browse/SPARK-14740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Shearer updated SPARK-14740: - Description: If you tune hyperparameters using a CrossValidator object in PySpark, you may not be able to extract the parameter values of the best model. {noformat} from pyspark.ml.classification import LogisticRegression from pyspark.ml.evaluation import BinaryClassificationEvaluator from pyspark.mllib.linalg import Vectors from pyspark.ml.tuning import ParamGridBuilder, CrossValidator dataset = sqlContext.createDataFrame( [(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0), (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0), (Vectors.dense([1.0]), 1.0)] * 10, ["features", "label"]) lr = LogisticRegression() grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build() evaluator = BinaryClassificationEvaluator() cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator) cvModel = cv.fit(dataset) {noformat} I can get the regression coefficient out, but I can't get the regularization parameter {noformat} In [3]: cvModel.bestModel.coefficients Out[3]: DenseVector([3.1573]) In [4]: cvModel.bestModel.explainParams() Out[4]: '' In [5]: cvModel.bestModel.extractParamMap() Out[5]: {} In [15]: cvModel.params Out[15]: [] In [36]: cvModel.bestModel.params Out[36]: [] {noformat} For the original issue on StackOverflow please see http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark was: If you tune hyperparameters using a CrossValidator object in PySpark, you may not be able to extract the parameter values of the best model. 
``` from pyspark.ml.classification import LogisticRegression from pyspark.ml.evaluation import BinaryClassificationEvaluator from pyspark.mllib.linalg import Vectors from pyspark.ml.tuning import ParamGridBuilder, CrossValidator dataset = sqlContext.createDataFrame( [(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0), (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0), (Vectors.dense([1.0]), 1.0)] * 10, ["features", "label"]) lr = LogisticRegression() grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build() evaluator = BinaryClassificationEvaluator() cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator) cvModel = cv.fit(dataset) ``` I can get the regression coefficient out, but I can't get the regularization parameter ``` In [3]: cvModel.bestModel.coefficients Out[3]: DenseVector([3.1573]) In [4]: cvModel.bestModel.explainParams() Out[4]: '' In [5]: cvModel.bestModel.extractParamMap() Out[5]: {} In [15]: cvModel.params Out[15]: [] In [36]: cvModel.bestModel.params Out[36]: [] ``` For a simple example please see http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark > CrossValidatorModel.bestModel does not include hyper-parameters > --- > > Key: SPARK-14740 > URL: https://issues.apache.org/jira/browse/SPARK-14740 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Paul Shearer > > If you tune hyperparameters using a CrossValidator object in PySpark, you may > not be able to extract the parameter values of the best model. 
> {noformat} > from pyspark.ml.classification import LogisticRegression > from pyspark.ml.evaluation import BinaryClassificationEvaluator > from pyspark.mllib.linalg import Vectors > from pyspark.ml.tuning import ParamGridBuilder, CrossValidator > dataset = sqlContext.createDataFrame( > [(Vectors.dense([0.0]), 0.0), > (Vectors.dense([0.4]), 1.0), > (Vectors.dense([0.5]), 0.0), > (Vectors.dense([0.6]), 1.0), > (Vectors.dense([1.0]), 1.0)] * 10, > ["features", "label"]) > lr = LogisticRegression() > grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, > 0.0001]).build() > evaluator = BinaryClassificationEvaluator() > cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, > evaluator=evaluator) > cvModel = cv.fit(dataset) > {noformat} > I can get the regression coefficient out, but I can't get the regularization > parameter > {noformat} > In [3]: cvModel.bestModel.coefficients > Out[3]: DenseVector([3.1573]) > In [4]: cvModel.bestModel.explainParams() > Out[4]: '' > In [5]: cvModel.bestModel.extractParamMap() > Out[5]: {} > In [15]: cvModel.params > Out[15]: [] > In [36]: cvModel.bestModel.params > Out[36]: [] > {noformat} > For the original issue on StackOverflow please see > http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark
[jira] [Updated] (SPARK-14740) CrossValidatorModel.bestModel does not include hyper-parameters
[ https://issues.apache.org/jira/browse/SPARK-14740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Shearer updated SPARK-14740: - Description: If you tune hyperparameters using a CrossValidator object in PySpark, you may not be able to extract the parameter values of the best model. {{ from pyspark.ml.classification import LogisticRegression from pyspark.ml.evaluation import BinaryClassificationEvaluator from pyspark.mllib.linalg import Vectors from pyspark.ml.tuning import ParamGridBuilder, CrossValidator dataset = sqlContext.createDataFrame( [(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0), (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0), (Vectors.dense([1.0]), 1.0)] * 10, ["features", "label"]) lr = LogisticRegression() grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build() evaluator = BinaryClassificationEvaluator() cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator) cvModel = cv.fit(dataset) }} I can get the regression coefficient out, but I can't get the regularization parameter {{ In [3]: cvModel.bestModel.coefficients Out[3]: DenseVector([3.1573]) In [4]: cvModel.bestModel.explainParams() Out[4]: '' In [5]: cvModel.bestModel.extractParamMap() Out[5]: {} In [15]: cvModel.params Out[15]: [] In [36]: cvModel.bestModel.params Out[36]: [] }} For a simple example please see http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark was: If you tune hyperparameters using a CrossValidator object in PySpark, you may not be able to extract the parameter values of the best model. 
{noformat} from pyspark.ml.classification import LogisticRegression from pyspark.ml.evaluation import BinaryClassificationEvaluator from pyspark.mllib.linalg import Vectors from pyspark.ml.tuning import ParamGridBuilder, CrossValidator dataset = sqlContext.createDataFrame( [(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0), (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0), (Vectors.dense([1.0]), 1.0)] * 10, ["features", "label"]) lr = LogisticRegression() grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build() evaluator = BinaryClassificationEvaluator() cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator) cvModel = cv.fit(dataset) {noformat} I can get the regression coefficient out, but I can't get the regularization parameter {noformat} In [3]: cvModel.bestModel.coefficients Out[3]: DenseVector([3.1573]) In [4]: cvModel.bestModel.explainParams() Out[4]: '' In [5]: cvModel.bestModel.extractParamMap() Out[5]: {} In [15]: cvModel.params Out[15]: [] In [36]: cvModel.bestModel.params Out[36]: [] {noformat} For the original issue on StackOverflow please see http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark > CrossValidatorModel.bestModel does not include hyper-parameters > --- > > Key: SPARK-14740 > URL: https://issues.apache.org/jira/browse/SPARK-14740 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Paul Shearer > > If you tune hyperparameters using a CrossValidator object in PySpark, you may > not be able to extract the parameter values of the best model. 
> {{ > from pyspark.ml.classification import LogisticRegression > from pyspark.ml.evaluation import BinaryClassificationEvaluator > from pyspark.mllib.linalg import Vectors > from pyspark.ml.tuning import ParamGridBuilder, CrossValidator > dataset = sqlContext.createDataFrame( > [(Vectors.dense([0.0]), 0.0), > (Vectors.dense([0.4]), 1.0), > (Vectors.dense([0.5]), 0.0), > (Vectors.dense([0.6]), 1.0), > (Vectors.dense([1.0]), 1.0)] * 10, > ["features", "label"]) > lr = LogisticRegression() > grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, > 0.0001]).build() > evaluator = BinaryClassificationEvaluator() > cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, > evaluator=evaluator) > cvModel = cv.fit(dataset) > }} > I can get the regression coefficient out, but I can't get the regularization > parameter > {{ > In [3]: cvModel.bestModel.coefficients > Out[3]: DenseVector([3.1573]) > In [4]: cvModel.bestModel.explainParams() > Out[4]: '' > In [5]: cvModel.bestModel.extractParamMap() > Out[5]: {} > In [15]: cvModel.params > Out[15]: [] > In [36]: cvModel.bestModel.params > Out[36]: [] > }} > For a simple example please see > http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark
[jira] [Updated] (SPARK-14740) CrossValidatorModel.bestModel does not include hyper-parameters
[ https://issues.apache.org/jira/browse/SPARK-14740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Shearer updated SPARK-14740: - Description: If you tune hyperparameters using a CrossValidator object in PySpark, you may not be able to extract the parameter values of the best model. ``` from pyspark.ml.classification import LogisticRegression from pyspark.ml.evaluation import BinaryClassificationEvaluator from pyspark.mllib.linalg import Vectors from pyspark.ml.tuning import ParamGridBuilder, CrossValidator dataset = sqlContext.createDataFrame( [(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0), (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0), (Vectors.dense([1.0]), 1.0)] * 10, ["features", "label"]) lr = LogisticRegression() grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build() evaluator = BinaryClassificationEvaluator() cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator) cvModel = cv.fit(dataset) ``` I can get the regression coefficient out, but I can't get the regularization parameter ``` In [3]: cvModel.bestModel.coefficients Out[3]: DenseVector([3.1573]) In [4]: cvModel.bestModel.explainParams() Out[4]: '' In [5]: cvModel.bestModel.extractParamMap() Out[5]: {} In [15]: cvModel.params Out[15]: [] In [36]: cvModel.bestModel.params Out[36]: [] ``` For a simple example please see http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark was: If you tune hyperparameters using a CrossValidator object in PySpark, you may not be able to extract the parameter values of the best model. 
` from pyspark.ml.classification import LogisticRegression from pyspark.ml.evaluation import BinaryClassificationEvaluator from pyspark.mllib.linalg import Vectors from pyspark.ml.tuning import ParamGridBuilder, CrossValidator dataset = sqlContext.createDataFrame( [(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0), (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0), (Vectors.dense([1.0]), 1.0)] * 10, ["features", "label"]) lr = LogisticRegression() grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build() evaluator = BinaryClassificationEvaluator() cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator) cvModel = cv.fit(dataset) ` I can get the regression coefficient out, but I can't get the regularization parameter ` In [3]: cvModel.bestModel.coefficients Out[3]: DenseVector([3.1573]) In [4]: cvModel.bestModel.explainParams() Out[4]: '' In [5]: cvModel.bestModel.extractParamMap() Out[5]: {} In [15]: cvModel.params Out[15]: [] In [36]: cvModel.bestModel.params Out[36]: [] ` For a simple example please see http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark > CrossValidatorModel.bestModel does not include hyper-parameters > --- > > Key: SPARK-14740 > URL: https://issues.apache.org/jira/browse/SPARK-14740 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Paul Shearer > > If you tune hyperparameters using a CrossValidator object in PySpark, you may > not be able to extract the parameter values of the best model. 
> ``` > from pyspark.ml.classification import LogisticRegression > from pyspark.ml.evaluation import BinaryClassificationEvaluator > from pyspark.mllib.linalg import Vectors > from pyspark.ml.tuning import ParamGridBuilder, CrossValidator > dataset = sqlContext.createDataFrame( > [(Vectors.dense([0.0]), 0.0), > (Vectors.dense([0.4]), 1.0), > (Vectors.dense([0.5]), 0.0), > (Vectors.dense([0.6]), 1.0), > (Vectors.dense([1.0]), 1.0)] * 10, > ["features", "label"]) > lr = LogisticRegression() > grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, > 0.0001]).build() > evaluator = BinaryClassificationEvaluator() > cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, > evaluator=evaluator) > cvModel = cv.fit(dataset) > ``` > I can get the regression coefficient out, but I can't get the regularization > parameter > ``` > In [3]: cvModel.bestModel.coefficients > Out[3]: DenseVector([3.1573]) > In [4]: cvModel.bestModel.explainParams() > Out[4]: '' > In [5]: cvModel.bestModel.extractParamMap() > Out[5]: {} > In [15]: cvModel.params > Out[15]: [] > In [36]: cvModel.bestModel.params > Out[36]: [] > ``` > For a simple example please see > http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark
[jira] [Created] (SPARK-14740) CrossValidatorModel.bestModel does not include hyper-parameters
Paul Shearer created SPARK-14740: Summary: CrossValidatorModel.bestModel does not include hyper-parameters Key: SPARK-14740 URL: https://issues.apache.org/jira/browse/SPARK-14740 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.1 Reporter: Paul Shearer If you tune hyperparameters using a CrossValidator object in PySpark, you may not be able to extract the parameter values of the best model. ` from pyspark.ml.classification import LogisticRegression from pyspark.ml.evaluation import BinaryClassificationEvaluator from pyspark.mllib.linalg import Vectors from pyspark.ml.tuning import ParamGridBuilder, CrossValidator dataset = sqlContext.createDataFrame( [(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0), (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0), (Vectors.dense([1.0]), 1.0)] * 10, ["features", "label"]) lr = LogisticRegression() grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build() evaluator = BinaryClassificationEvaluator() cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator) cvModel = cv.fit(dataset) ` I can get the regression coefficient out, but I can't get the regularization parameter ` In [3]: cvModel.bestModel.coefficients Out[3]: DenseVector([3.1573]) In [4]: cvModel.bestModel.explainParams() Out[4]: '' In [5]: cvModel.bestModel.extractParamMap() Out[5]: {} In [15]: cvModel.params Out[15]: [] In [36]: cvModel.bestModel.params Out[36]: [] ` For a simple example please see http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark
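One workaround for the issue above, until the best model's params survive the Py4J wrapper: the winning grid point can be recovered from the cross-validator itself. `CrossValidator` exposes `getEstimatorParamMaps()` and (in newer Spark versions) the fitted model exposes `avgMetrics`, one averaged CV score per grid point; zipping them and taking the argmax yields the best hyper-parameters. A minimal sketch of that selection logic, with plain Python lists standing in for those Spark objects (the values below are hypothetical, not taken from a real run):

```python
# Stand-ins for cv.getEstimatorParamMaps() and cvModel.avgMetrics;
# in a real PySpark session these come from the fitted CrossValidatorModel.
param_maps = [{"regParam": 0.1}, {"regParam": 0.01},
              {"regParam": 0.001}, {"regParam": 0.0001}]
avg_metrics = [0.78, 0.85, 0.83, 0.80]  # hypothetical areaUnderROC per grid point

# Pick the parameter map whose averaged cross-validation metric is best
best_params = max(zip(avg_metrics, param_maps), key=lambda pair: pair[0])[1]
print(best_params)  # -> {'regParam': 0.01}
```

This sidesteps `bestModel.extractParamMap()` entirely, which is why it works even while that call returns an empty map.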
[jira] [Updated] (SPARK-14241) Output of monotonically_increasing_id lacks stable relation with rows of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Shearer updated SPARK-14241:
---------------------------------
Description:

If you use monotonically_increasing_id() to append a column of IDs to a DataFrame, the IDs do not have a stable, deterministic relationship to the rows they are appended to. A given ID value can land on different rows depending on what happens in the task graph: http://stackoverflow.com/questions/35705038/how-do-i-add-an-persistent-column-of-row-ids-to-spark-dataframe/35706321#35706321

From a user perspective this behavior is very unexpected, and many things one would normally like to do with an ID column are in fact only possible under very narrow circumstances. The function should either be made deterministic, or there should be a prominent warning note in the API docs regarding its behavior.

was:

If you use monotonically_increasing_id() to append a column of IDs to a DataFrame, the IDs do not have a stable, deterministic relationship to the rows they are appended to. A given ID value can land on different rows depending on what happens in the task graph: http://stackoverflow.com/questions/35705038/how-do-i-add-an-persistent-column-of-row-ids-to-spark-dataframe/35706321#35706321

From a user perspective this behavior is very unexpected, and many things one would like to do with an ID column are only possible under very narrow circumstances. The function should either be made deterministic, or there should be a prominent warning note in the API docs regarding its behavior.
> Output of monotonically_increasing_id lacks stable relation with rows of DataFrame
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-14241
>                 URL: https://issues.apache.org/jira/browse/SPARK-14241
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 1.6.0, 1.6.1
>            Reporter: Paul Shearer
>
> If you use monotonically_increasing_id() to append a column of IDs to a DataFrame, the IDs do not have a stable, deterministic relationship to the rows they are appended to. A given ID value can land on different rows depending on what happens in the task graph:
> http://stackoverflow.com/questions/35705038/how-do-i-add-an-persistent-column-of-row-ids-to-spark-dataframe/35706321#35706321
>
> From a user perspective this behavior is very unexpected, and many things one would normally like to do with an ID column are in fact only possible under very narrow circumstances. The function should either be made deterministic, or there should be a prominent warning note in the API docs regarding its behavior.
[jira] [Created] (SPARK-14241) Output of monotonically_increasing_id lacks stable relation with rows of DataFrame
Paul Shearer created SPARK-14241:
------------------------------------

             Summary: Output of monotonically_increasing_id lacks stable relation with rows of DataFrame
                 Key: SPARK-14241
                 URL: https://issues.apache.org/jira/browse/SPARK-14241
             Project: Spark
          Issue Type: Bug
          Components: PySpark, Spark Core
    Affects Versions: 1.6.1, 1.6.0
            Reporter: Paul Shearer

If you use monotonically_increasing_id() to append a column of IDs to a DataFrame, the IDs do not have a stable, deterministic relationship to the rows they are appended to. A given ID value can land on different rows depending on what happens in the task graph: http://stackoverflow.com/questions/35705038/how-do-i-add-an-persistent-column-of-row-ids-to-spark-dataframe/35706321#35706321

From a user perspective this behavior is very unexpected, and many things one would like to do with an ID column are only possible under very narrow circumstances. The function should either be made deterministic, or there should be a prominent warning note in the API docs regarding its behavior.
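The instability follows from how the ID is constructed: per the function's documentation, the generated value packs the partition index into the upper bits and the record's position within its partition into the lower 33 bits. A pure-Python sketch of that scheme (the helper name is ours, for illustration) shows why any change in partitioning or task scheduling moves a given ID onto a different row:

```python
def mono_id(partition_id: int, record_number: int) -> int:
    # Sketch of the documented layout: upper bits = partition index,
    # lower 33 bits = record position within that partition.
    return (partition_id << 33) | record_number

# The same logical row gets a different ID if re-evaluation of the
# task graph lands it in a different partition or at a different offset:
print(mono_id(0, 5))  # -> 5
print(mono_id(1, 5))  # -> 8589934597  (= 2**33 + 5)
```

The IDs are guaranteed monotonically increasing and unique, but not consecutive and not tied to row content, which is exactly the behavior the report describes.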
[jira] [Updated] (SPARK-12824) Failure to maintain consistent RDD references in pyspark
[ https://issues.apache.org/jira/browse/SPARK-12824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Shearer updated SPARK-12824: - Description: Below is a simple `pyspark` script that tries to split an RDD into a dictionary containing several RDDs. As the **sample run** shows, the script only works if we do a `collect()` on the intermediate RDDs as they are created. Of course I would not want to do that in practice, since it doesn't scale. What's really strange is, I'm not assigning the intermediate `collect()` results to any variable. So the difference in behavior is due solely to a hidden side-effect of the computation triggered by the `collect()` call. Spark is supposed to be a very functional framework with minimal side effects. Why is it only possible to get the desired behavior by triggering some mysterious side effect using `collect()`? The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0. spark_script.py ``` from pprint import PrettyPrinter pp = PrettyPrinter(indent=4).pprint logger = sc._jvm.org.apache.log4j logger.LogManager.getLogger("org"). 
setLevel( logger.Level.ERROR ) logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR ) def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False): d = dict() for key_value in key_values: d[key_value] = rdd.filter(lambda row: row[key_field] == key_value) if collect_in_loop: d[key_value].collect() return d def print_results(d): for k in d: print k pp(d[k].collect()) rdd = sc.parallelize([ {'color':'red','size':3}, {'color':'red', 'size':7}, {'color':'red', 'size':8}, {'color':'red', 'size':10}, {'color':'green', 'size':9}, {'color':'green', 'size':5}, {'color':'green', 'size':50}, {'color':'blue', 'size':4}, {'color':'purple', 'size':6}]) key_field = 'color' key_values = ['red', 'green', 'blue', 'purple'] print '### run WITH collect in loop: ' d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=True) print_results(d) print '### run WITHOUT collect in loop: ' d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False) print_results(d) ``` Sample run in IPython shell ``` In [1]: execfile('spark_script.py') ### run WITH collect in loop: blue [{ 'color': 'blue', 'size': 4}] purple [{ 'color': 'purple', 'size': 6}] green [ { 'color': 'green', 'size': 9}, { 'color': 'green', 'size': 5}, { 'color': 'green', 'size': 50}] red [ { 'color': 'red', 'size': 3}, { 'color': 'red', 'size': 7}, { 'color': 'red', 'size': 8}, { 'color': 'red', 'size': 10}] ### run WITHOUT collect in loop: blue [{ 'color': 'purple', 'size': 6}] purple [{ 'color': 'purple', 'size': 6}] green [{ 'color': 'purple', 'size': 6}] red [{ 'color': 'purple', 'size': 6}] ``` was: Below is a simple `pyspark` script that tries to split an RDD into a dictionary containing several RDDs. As the **sample run** shows, the script only works if we do a `collect()` on the intermediate RDDs as they are created. Of course I would not want to do that in practice, since it doesn't scale. 
What's really strange is, I'm not assigning the intermediate `collect()` results to any variable. So the difference in behavior is due solely to a hidden side-effect of the computation triggered by the `collect()` call. Spark is supposed to be a very functional framework with minimal side effects. Why is it only possible to get the desired behavior by triggering some mysterious side effect using `collect()`? The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0. ### spark_script.py from pprint import PrettyPrinter pp = PrettyPrinter(indent=4).pprint logger = sc._jvm.org.apache.log4j logger.LogManager.getLogger("org"). setLevel( logger.Level.ERROR ) logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR ) def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False): d = dict() for key_value in key_values: d[key_value] = rdd.filter(lambda row: row[key_field] == key_value) if collect_in_loop: d[key_value].collect() return d def print_results(d): for k in d: print k pp(d[k].collect()) rdd = sc.parallelize([ {'color':'red','size':3}, {'color':'red', 'size':7}, {'color':'red', 'size':8}, {'color':'red', 'size':10}, {'color':'green', 'size':9}, {'color':'green', 'size':5}, {'color':'green', 'size':50}, {'color':'blue', 'size':4},
[jira] [Updated] (SPARK-12824) Failure to maintain consistent RDD references in pyspark
[ https://issues.apache.org/jira/browse/SPARK-12824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Shearer updated SPARK-12824:
---------------------------------
Affects Version/s: 1.5.2

> Failure to maintain consistent RDD references in pyspark
> --------------------------------------------------------
>
>                 Key: SPARK-12824
>                 URL: https://issues.apache.org/jira/browse/SPARK-12824
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.5.2
>        Environment: Spark 1.5.2, Python 2.7.10, and IPython 4.0.0.
>            Reporter: Paul Shearer
>
> Below is a simple {{pyspark}} script that tries to split an RDD into a dictionary containing several RDDs.
>
> As the *sample run* shows, the script only works if we do a {{collect()}} on the intermediate RDDs as they are created. Of course I would not want to do that in practice, since it doesn't scale.
>
> What's really strange is, I'm not assigning the intermediate {{collect()}} results to any variable. So the difference in behavior is due solely to a hidden side-effect of the computation triggered by the {{collect()}} call. Spark is supposed to be a very functional framework with minimal side effects. Why is it only possible to get the desired behavior by triggering some mysterious side effect using {{collect()}}?
>
> It seems that pyspark is not keeping consistent track of the filter transformation applied to the RDD, so the object assigned to the dictionary is always the same, even though the RDDs are supposed to be different.
>
> The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0.
>
> h3. spark_script.py
> {noformat}
> from pprint import PrettyPrinter
> pp = PrettyPrinter(indent=4).pprint
> logger = sc._jvm.org.apache.log4j
> logger.LogManager.getLogger("org").setLevel( logger.Level.ERROR )
> logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR )
>
> def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False):
>     d = dict()
>     for key_value in key_values:
>         d[key_value] = rdd.filter(lambda row: row[key_field] == key_value)
>         if collect_in_loop:
>             d[key_value].collect()
>     return d
>
> def print_results(d):
>     for k in d:
>         print k
>         pp(d[k].collect())
>
> rdd = sc.parallelize([
>     {'color':'red','size':3},
>     {'color':'red', 'size':7},
>     {'color':'red', 'size':8},
>     {'color':'red', 'size':10},
>     {'color':'green', 'size':9},
>     {'color':'green', 'size':5},
>     {'color':'green', 'size':50},
>     {'color':'blue', 'size':4},
>     {'color':'purple', 'size':6}])
> key_field = 'color'
> key_values = ['red', 'green', 'blue', 'purple']
>
> print '### run WITH collect in loop: '
> d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=True)
> print_results(d)
> print '### run WITHOUT collect in loop: '
> d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False)
> print_results(d)
> {noformat}
>
> h3. Sample run in IPython shell
> {noformat}
> In [1]: execfile('spark_script.py')
> ### run WITH collect in loop:
> blue
> [{ 'color': 'blue', 'size': 4}]
> purple
> [{ 'color': 'purple', 'size': 6}]
> green
> [ { 'color': 'green', 'size': 9},
>   { 'color': 'green', 'size': 5},
>   { 'color': 'green', 'size': 50}]
> red
> [ { 'color': 'red', 'size': 3},
>   { 'color': 'red', 'size': 7},
>   { 'color': 'red', 'size': 8},
>   { 'color': 'red', 'size': 10}]
> ### run WITHOUT collect in loop:
> blue
> [{ 'color': 'purple', 'size': 6}]
> purple
> [{ 'color': 'purple', 'size': 6}]
> green
> [{ 'color': 'purple', 'size': 6}]
> red
> [{ 'color': 'purple', 'size': 6}]
> {noformat}
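The behavior in the issue above can be reproduced without Spark at all, under Python 3, whose built-in `filter` returns a lazy iterator, much like an un-collected RDD transformation. The lambda created in each loop iteration captures the loop variable by reference, so unless each filter is consumed inside the loop, every one of them ends up testing against the final key. A minimal stand-in for the reporter's script:

```python
rows = [{'color': 'red', 'size': 3}, {'color': 'green', 'size': 9},
        {'color': 'blue', 'size': 4}, {'color': 'purple', 'size': 6}]

d = {}
for key_value in ['red', 'green', 'blue', 'purple']:
    # Lazy, like an un-collected RDD filter; key_value is captured by
    # reference, not by value.
    d[key_value] = filter(lambda row: row['color'] == key_value, rows)

# By the time the filters actually run, the loop has finished and
# key_value == 'purple' for every one of them:
results = {k: list(v) for k, v in d.items()}
print(results['red'])   # -> [{'color': 'purple', 'size': 6}]
print(results['blue'])  # -> [{'color': 'purple', 'size': 6}]
```

Calling `collect()` inside the loop "works" for the same reason consuming each `filter` inside the loop would here: the predicate runs while `key_value` still holds that iteration's value.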
[jira] [Updated] (SPARK-12824) Failure to maintain consistent RDD references in pyspark
[ https://issues.apache.org/jira/browse/SPARK-12824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Shearer updated SPARK-12824: - Description: Below is a simple {{pyspark}} script that tries to split an RDD into a dictionary containing several RDDs. As the *sample run* shows, the script only works if we do a {{collect()}} on the intermediate RDDs as they are created. Of course I would not want to do that in practice, since it doesn't scale. What's really strange is, I'm not assigning the intermediate {{collect()}} results to any variable. So the difference in behavior is due solely to a hidden side-effect of the computation triggered by the {{collect()}} call. Spark is supposed to be a very functional framework with minimal side effects. Why is it only possible to get the desired behavior by triggering some mysterious side effect using {{collect()}}? The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0. h3. spark_script.py {noformat} from pprint import PrettyPrinter pp = PrettyPrinter(indent=4).pprint logger = sc._jvm.org.apache.log4j logger.LogManager.getLogger("org"). 
setLevel( logger.Level.ERROR ) logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR ) def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False): d = dict() for key_value in key_values: d[key_value] = rdd.filter(lambda row: row[key_field] == key_value) if collect_in_loop: d[key_value].collect() return d def print_results(d): for k in d: print k pp(d[k].collect()) rdd = sc.parallelize([ {'color':'red','size':3}, {'color':'red', 'size':7}, {'color':'red', 'size':8}, {'color':'red', 'size':10}, {'color':'green', 'size':9}, {'color':'green', 'size':5}, {'color':'green', 'size':50}, {'color':'blue', 'size':4}, {'color':'purple', 'size':6}]) key_field = 'color' key_values = ['red', 'green', 'blue', 'purple'] print '### run WITH collect in loop: ' d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=True) print_results(d) print '### run WITHOUT collect in loop: ' d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False) print_results(d) {noformat} h3. Sample run in IPython shell {noformat} In [1]: execfile('spark_script.py') ### run WITH collect in loop: blue [{ 'color': 'blue', 'size': 4}] purple [{ 'color': 'purple', 'size': 6}] green [ { 'color': 'green', 'size': 9}, { 'color': 'green', 'size': 5}, { 'color': 'green', 'size': 50}] red [ { 'color': 'red', 'size': 3}, { 'color': 'red', 'size': 7}, { 'color': 'red', 'size': 8}, { 'color': 'red', 'size': 10}] ### run WITHOUT collect in loop: blue [{ 'color': 'purple', 'size': 6}] purple [{ 'color': 'purple', 'size': 6}] green [{ 'color': 'purple', 'size': 6}] red [{ 'color': 'purple', 'size': 6}] {noformat} was: Below is a simple `pyspark` script that tries to split an RDD into a dictionary containing several RDDs. As the **sample run** shows, the script only works if we do a `collect()` on the intermediate RDDs as they are created. Of course I would not want to do that in practice, since it doesn't scale. 
What's really strange is, I'm not assigning the intermediate `collect()` results to any variable. So the difference in behavior is due solely to a hidden side-effect of the computation triggered by the `collect()` call. Spark is supposed to be a very functional framework with minimal side effects. Why is it only possible to get the desired behavior by triggering some mysterious side effect using `collect()`? The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0. spark_script.py {code} from pprint import PrettyPrinter pp = PrettyPrinter(indent=4).pprint logger = sc._jvm.org.apache.log4j logger.LogManager.getLogger("org"). setLevel( logger.Level.ERROR ) logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR ) def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False): d = dict() for key_value in key_values: d[key_value] = rdd.filter(lambda row: row[key_field] == key_value) if collect_in_loop: d[key_value].collect() return d def print_results(d): for k in d: print k pp(d[k].collect()) rdd = sc.parallelize([ {'color':'red','size':3}, {'color':'red', 'size':7}, {'color':'red', 'size':8}, {'color':'red', 'size':10}, {'color':'green', 'size':9}, {'color':'green', 'size':5}, {'color':'green',
[jira] [Created] (SPARK-12824) Failure to maintain consistent RDD references in pyspark
Paul Shearer created SPARK-12824:
------------------------------------

             Summary: Failure to maintain consistent RDD references in pyspark
                 Key: SPARK-12824
                 URL: https://issues.apache.org/jira/browse/SPARK-12824
             Project: Spark
          Issue Type: Bug
         Environment: Spark 1.5.2, Python 2.7.10, and IPython 4.0.0.
            Reporter: Paul Shearer

Below is a simple `pyspark` script that tries to split an RDD into a dictionary containing several RDDs.

As the **sample run** shows, the script only works if we do a `collect()` on the intermediate RDDs as they are created. Of course I would not want to do that in practice, since it doesn't scale.

What's really strange is, I'm not assigning the intermediate `collect()` results to any variable. So the difference in behavior is due solely to a hidden side-effect of the computation triggered by the `collect()` call. Spark is supposed to be a very functional framework with minimal side effects. Why is it only possible to get the desired behavior by triggering some mysterious side effect using `collect()`?

The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0.

### spark_script.py

    from pprint import PrettyPrinter
    pp = PrettyPrinter(indent=4).pprint
    logger = sc._jvm.org.apache.log4j
    logger.LogManager.getLogger("org").setLevel( logger.Level.ERROR )
    logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR )

    def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False):
        d = dict()
        for key_value in key_values:
            d[key_value] = rdd.filter(lambda row: row[key_field] == key_value)
            if collect_in_loop:
                d[key_value].collect()
        return d

    def print_results(d):
        for k in d:
            print k
            pp(d[k].collect())

    rdd = sc.parallelize([
        {'color':'red','size':3},
        {'color':'red', 'size':7},
        {'color':'red', 'size':8},
        {'color':'red', 'size':10},
        {'color':'green', 'size':9},
        {'color':'green', 'size':5},
        {'color':'green', 'size':50},
        {'color':'blue', 'size':4},
        {'color':'purple', 'size':6}])
    key_field = 'color'
    key_values = ['red', 'green', 'blue', 'purple']

    print '### run WITH collect in loop: '
    d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=True)
    print_results(d)
    print '### run WITHOUT collect in loop: '
    d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False)
    print_results(d)

### Sample run in IPython shell

    In [1]: execfile('spark_script.py')
    ### run WITH collect in loop:
    blue
    [{ 'color': 'blue', 'size': 4}]
    purple
    [{ 'color': 'purple', 'size': 6}]
    green
    [ { 'color': 'green', 'size': 9},
      { 'color': 'green', 'size': 5},
      { 'color': 'green', 'size': 50}]
    red
    [ { 'color': 'red', 'size': 3},
      { 'color': 'red', 'size': 7},
      { 'color': 'red', 'size': 8},
      { 'color': 'red', 'size': 10}]
    ### run WITHOUT collect in loop:
    blue
    [{ 'color': 'purple', 'size': 6}]
    purple
    [{ 'color': 'purple', 'size': 6}]
    green
    [{ 'color': 'purple', 'size': 6}]
    red
    [{ 'color': 'purple', 'size': 6}]
[jira] [Updated] (SPARK-12824) Failure to maintain consistent RDD references in pyspark
[ https://issues.apache.org/jira/browse/SPARK-12824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Shearer updated SPARK-12824: - Description: Below is a simple {{pyspark}} script that tries to split an RDD into a dictionary containing several RDDs. As the *sample run* shows, the script only works if we do a {{collect()}} on the intermediate RDDs as they are created. Of course I would not want to do that in practice, since it doesn't scale. What's really strange is, I'm not assigning the intermediate {{collect()}} results to any variable. So the difference in behavior is due solely to a hidden side-effect of the computation triggered by the {{collect()}} call. Spark is supposed to be a very functional framework with minimal side effects. Why is it only possible to get the desired behavior by triggering some mysterious side effect using {{collect()}}? It seems that pyspark is not keeping consistent track of the filter transformation applied to the RDD, so the object assigned to the dictionary is always the same, even though the RDDs are supposed to be different. The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0. h3. spark_script.py {noformat} from pprint import PrettyPrinter pp = PrettyPrinter(indent=4).pprint logger = sc._jvm.org.apache.log4j logger.LogManager.getLogger("org"). 
setLevel( logger.Level.ERROR ) logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR ) def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False): d = dict() for key_value in key_values: d[key_value] = rdd.filter(lambda row: row[key_field] == key_value) if collect_in_loop: d[key_value].collect() return d def print_results(d): for k in d: print k pp(d[k].collect()) rdd = sc.parallelize([ {'color':'red','size':3}, {'color':'red', 'size':7}, {'color':'red', 'size':8}, {'color':'red', 'size':10}, {'color':'green', 'size':9}, {'color':'green', 'size':5}, {'color':'green', 'size':50}, {'color':'blue', 'size':4}, {'color':'purple', 'size':6}]) key_field = 'color' key_values = ['red', 'green', 'blue', 'purple'] print '### run WITH collect in loop: ' d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=True) print_results(d) print '### run WITHOUT collect in loop: ' d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False) print_results(d) {noformat} h3. Sample run in IPython shell {noformat} In [1]: execfile('spark_script.py') ### run WITH collect in loop: blue [{ 'color': 'blue', 'size': 4}] purple [{ 'color': 'purple', 'size': 6}] green [ { 'color': 'green', 'size': 9}, { 'color': 'green', 'size': 5}, { 'color': 'green', 'size': 50}] red [ { 'color': 'red', 'size': 3}, { 'color': 'red', 'size': 7}, { 'color': 'red', 'size': 8}, { 'color': 'red', 'size': 10}] ### run WITHOUT collect in loop: blue [{ 'color': 'purple', 'size': 6}] purple [{ 'color': 'purple', 'size': 6}] green [{ 'color': 'purple', 'size': 6}] red [{ 'color': 'purple', 'size': 6}] {noformat} was: Below is a simple {{pyspark}} script that tries to split an RDD into a dictionary containing several RDDs. As the *sample run* shows, the script only works if we do a {{collect()}} on the intermediate RDDs as they are created. Of course I would not want to do that in practice, since it doesn't scale. 
What's really strange is, I'm not assigning the intermediate {{collect()}} results to any variable. So the difference in behavior is due solely to a hidden side-effect of the computation triggered by the {{collect()}} call. Spark is supposed to be a very functional framework with minimal side effects. Why is it only possible to get the desired behavior by triggering some mysterious side effect using {{collect()}}? The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0. h3. spark_script.py {noformat} from pprint import PrettyPrinter pp = PrettyPrinter(indent=4).pprint logger = sc._jvm.org.apache.log4j logger.LogManager.getLogger("org"). setLevel( logger.Level.ERROR ) logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR ) def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False): d = dict() for key_value in key_values: d[key_value] = rdd.filter(lambda row: row[key_field] == key_value) if collect_in_loop: d[key_value].collect() return d def print_results(d): for k in d: print k pp(d[k].collect()) rdd = sc.parallelize([
[jira] [Updated] (SPARK-12824) Failure to maintain consistent RDD references in pyspark
[ https://issues.apache.org/jira/browse/SPARK-12824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Shearer updated SPARK-12824: - Description: Below is a simple `pyspark` script that tries to split an RDD into a dictionary containing several RDDs. As the **sample run** shows, the script only works if we do a `collect()` on the intermediate RDDs as they are created. Of course I would not want to do that in practice, since it doesn't scale. What's really strange is, I'm not assigning the intermediate `collect()` results to any variable. So the difference in behavior is due solely to a hidden side-effect of the computation triggered by the `collect()` call. Spark is supposed to be a very functional framework with minimal side effects. Why is it only possible to get the desired behavior by triggering some mysterious side effect using `collect()`? The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0. h3. spark_script.py {noformat} from pprint import PrettyPrinter pp = PrettyPrinter(indent=4).pprint logger = sc._jvm.org.apache.log4j logger.LogManager.getLogger("org"). 
setLevel( logger.Level.ERROR ) logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR ) def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False): d = dict() for key_value in key_values: d[key_value] = rdd.filter(lambda row: row[key_field] == key_value) if collect_in_loop: d[key_value].collect() return d def print_results(d): for k in d: print k pp(d[k].collect()) rdd = sc.parallelize([ {'color':'red','size':3}, {'color':'red', 'size':7}, {'color':'red', 'size':8}, {'color':'red', 'size':10}, {'color':'green', 'size':9}, {'color':'green', 'size':5}, {'color':'green', 'size':50}, {'color':'blue', 'size':4}, {'color':'purple', 'size':6}]) key_field = 'color' key_values = ['red', 'green', 'blue', 'purple'] print '### run WITH collect in loop: ' d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=True) print_results(d) print '### run WITHOUT collect in loop: ' d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False) print_results(d) {noformat} h3. Sample run in IPython shell {noformat} In [1]: execfile('spark_script.py') ### run WITH collect in loop: blue [{ 'color': 'blue', 'size': 4}] purple [{ 'color': 'purple', 'size': 6}] green [ { 'color': 'green', 'size': 9}, { 'color': 'green', 'size': 5}, { 'color': 'green', 'size': 50}] red [ { 'color': 'red', 'size': 3}, { 'color': 'red', 'size': 7}, { 'color': 'red', 'size': 8}, { 'color': 'red', 'size': 10}] ### run WITHOUT collect in loop: blue [{ 'color': 'purple', 'size': 6}] purple [{ 'color': 'purple', 'size': 6}] green [{ 'color': 'purple', 'size': 6}] red [{ 'color': 'purple', 'size': 6}] {noformat} was: Below is a simple `pyspark` script that tries to split an RDD into a dictionary containing several RDDs. As the **sample run** shows, the script only works if we do a `collect()` on the intermediate RDDs as they are created. Of course I would not want to do that in practice, since it doesn't scale. 
What's really strange is, I'm not assigning the intermediate `collect()` results to any variable. So the difference in behavior is due solely to a hidden side-effect of the computation triggered by the `collect()` call. Spark is supposed to be a very functional framework with minimal side effects. Why is it only possible to get the desired behavior by triggering some mysterious side effect using `collect()`? The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0. spark_script.py ``` from pprint import PrettyPrinter pp = PrettyPrinter(indent=4).pprint logger = sc._jvm.org.apache.log4j logger.LogManager.getLogger("org"). setLevel( logger.Level.ERROR ) logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR ) def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False): d = dict() for key_value in key_values: d[key_value] = rdd.filter(lambda row: row[key_field] == key_value) if collect_in_loop: d[key_value].collect() return d def print_results(d): for k in d: print k pp(d[k].collect()) rdd = sc.parallelize([ {'color':'red','size':3}, {'color':'red', 'size':7}, {'color':'red', 'size':8}, {'color':'red', 'size':10}, {'color':'green', 'size':9}, {'color':'green', 'size':5}, {'color':'green', 'size':50},
[jira] [Updated] (SPARK-12824) Failure to maintain consistent RDD references in pyspark
[ https://issues.apache.org/jira/browse/SPARK-12824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Shearer updated SPARK-12824: - Description: Below is a simple {{pyspark}} script that tries to split an RDD into a dictionary containing several RDDs. As the *sample run* shows, the script only works if we do a {{collect()}} on the intermediate RDDs as they are created. Of course I would not want to do that in practice, since it doesn't scale. What's really strange is, I'm not assigning the intermediate {{collect()}} results to any variable. So the difference in behavior is due solely to a hidden side-effect of the computation triggered by the {{collect()}} call. Spark is supposed to be a very functional framework with minimal side effects. Why is it only possible to get the desired behavior by triggering some mysterious side effect using {{collect()}}? It seems that all the keys in the dictionary are referencing the same object even though in the code they are clearly supposed to be different objects. The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0. h3. spark_script.py {noformat} from pprint import PrettyPrinter pp = PrettyPrinter(indent=4).pprint logger = sc._jvm.org.apache.log4j logger.LogManager.getLogger("org"). 
setLevel( logger.Level.ERROR ) logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR ) def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False): d = dict() for key_value in key_values: d[key_value] = rdd.filter(lambda row: row[key_field] == key_value) if collect_in_loop: d[key_value].collect() return d def print_results(d): for k in d: print k pp(d[k].collect()) rdd = sc.parallelize([ {'color':'red','size':3}, {'color':'red', 'size':7}, {'color':'red', 'size':8}, {'color':'red', 'size':10}, {'color':'green', 'size':9}, {'color':'green', 'size':5}, {'color':'green', 'size':50}, {'color':'blue', 'size':4}, {'color':'purple', 'size':6}]) key_field = 'color' key_values = ['red', 'green', 'blue', 'purple'] print '### run WITH collect in loop: ' d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=True) print_results(d) print '### run WITHOUT collect in loop: ' d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False) print_results(d) {noformat} h3. Sample run in IPython shell {noformat} In [1]: execfile('spark_script.py') ### run WITH collect in loop: blue [{ 'color': 'blue', 'size': 4}] purple [{ 'color': 'purple', 'size': 6}] green [ { 'color': 'green', 'size': 9}, { 'color': 'green', 'size': 5}, { 'color': 'green', 'size': 50}] red [ { 'color': 'red', 'size': 3}, { 'color': 'red', 'size': 7}, { 'color': 'red', 'size': 8}, { 'color': 'red', 'size': 10}] ### run WITHOUT collect in loop: blue [{ 'color': 'purple', 'size': 6}] purple [{ 'color': 'purple', 'size': 6}] green [{ 'color': 'purple', 'size': 6}] red [{ 'color': 'purple', 'size': 6}] {noformat} was: Below is a simple {{pyspark}} script that tries to split an RDD into a dictionary containing several RDDs. As the *sample run* shows, the script only works if we do a {{collect()}} on the intermediate RDDs as they are created. Of course I would not want to do that in practice, since it doesn't scale. 
What's really strange is, I'm not assigning the intermediate {{collect()}} results to any variable. So the difference in behavior is due solely to a hidden side-effect of the computation triggered by the {{collect()}} call. Spark is supposed to be a very functional framework with minimal side effects. Why is it only possible to get the desired behavior by triggering some mysterious side effect using {{collect()}}? It seems that pyspark is not keeping consistent track of the filter transformation applied to the RDD, so the object assigned to the dictionary is always the same, even though the RDDs are supposed to be different. The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0. h3. spark_script.py {noformat} from pprint import PrettyPrinter pp = PrettyPrinter(indent=4).pprint logger = sc._jvm.org.apache.log4j logger.LogManager.getLogger("org"). setLevel( logger.Level.ERROR ) logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR ) def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False): d = dict() for key_value in key_values: d[key_value] = rdd.filter(lambda row: row[key_field] == key_value) if collect_in_loop: d[key_value].collect()
[jira] [Resolved] (SPARK-12824) Failure to maintain consistent RDD references in pyspark
[ https://issues.apache.org/jira/browse/SPARK-12824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Shearer resolved SPARK-12824.
--
    Resolution: Not A Problem

This is not so much a Spark issue as a Python gotcha: {{key_value}} binds late. The name is looked up when the lambda is actually called (that is, when the RDD is evaluated), not when the closure is defined. If {{collect()}} is called inside the loop, evaluation happens at the "right" time; if not, it happens after the last iteration of the loop, by which point {{key_value}} holds the loop's final value. A quick hack to force early binding is to add a default argument:
{noformat}
lambda row, key_value=key_value: row[key_field] == key_value
{noformat}
The other way is with {{functools.partial}} (this second way per Nick Chammas).

> Failure to maintain consistent RDD references in pyspark
>
>
> Key: SPARK-12824
> URL: https://issues.apache.org/jira/browse/SPARK-12824
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.5.2
> Environment: Spark 1.5.2, Python 2.7.10, and IPython 4.0.0.
> Reporter: Paul Shearer
>
> Below is a simple {{pyspark}} script that tries to split an RDD into a dictionary containing several RDDs.
> As the *sample run* shows, the script only works if we do a {{collect()}} on the intermediate RDDs as they are created. Of course I would not want to do that in practice, since it doesn't scale.
> What's really strange is, I'm not assigning the intermediate {{collect()}} results to any variable. So the difference in behavior is due solely to a hidden side-effect of the computation triggered by the {{collect()}} call. Spark is supposed to be a very functional framework with minimal side effects. Why is it only possible to get the desired behavior by triggering some mysterious side effect using {{collect()}}?
> It seems that all the keys in the dictionary are referencing the same object even though in the code they are clearly supposed to be different objects.
> The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0.
> h3. spark_script.py
> {noformat}
> from pprint import PrettyPrinter
> pp = PrettyPrinter(indent=4).pprint
> logger = sc._jvm.org.apache.log4j
> logger.LogManager.getLogger("org"). setLevel( logger.Level.ERROR )
> logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR )
>
> def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False):
>     d = dict()
>     for key_value in key_values:
>         d[key_value] = rdd.filter(lambda row: row[key_field] == key_value)
>         if collect_in_loop:
>             d[key_value].collect()
>     return d
> def print_results(d):
>     for k in d:
>         print k
>         pp(d[k].collect())
>
> rdd = sc.parallelize([
>     {'color':'red','size':3},
>     {'color':'red', 'size':7},
>     {'color':'red', 'size':8},
>     {'color':'red', 'size':10},
>     {'color':'green', 'size':9},
>     {'color':'green', 'size':5},
>     {'color':'green', 'size':50},
>     {'color':'blue', 'size':4},
>     {'color':'purple', 'size':6}])
> key_field = 'color'
> key_values = ['red', 'green', 'blue', 'purple']
>
> print '### run WITH collect in loop: '
> d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=True)
> print_results(d)
> print '### run WITHOUT collect in loop: '
> d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False)
> print_results(d)
> {noformat}
> h3. Sample run in IPython shell
> {noformat}
> In [1]: execfile('spark_script.py')
> ### run WITH collect in loop:
> blue
> [{ 'color': 'blue', 'size': 4}]
> purple
> [{ 'color': 'purple', 'size': 6}]
> green
> [ { 'color': 'green', 'size': 9},
>   { 'color': 'green', 'size': 5},
>   { 'color': 'green', 'size': 50}]
> red
> [ { 'color': 'red', 'size': 3},
>   { 'color': 'red', 'size': 7},
>   { 'color': 'red', 'size': 8},
>   { 'color': 'red', 'size': 10}]
> ### run WITHOUT collect in loop:
> blue
> [{ 'color': 'purple', 'size': 6}]
> purple
> [{ 'color': 'purple', 'size': 6}]
> green
> [{ 'color': 'purple', 'size': 6}]
> red
> [{ 'color': 'purple', 'size': 6}]
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
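The late-binding gotcha and both fixes from the resolution can be reproduced in plain Python, with ordinary list filters standing in for RDDs. This is only a sketch: the names {{split_late}}, {{split_early}}, and {{match}} are illustrative, not part of any Spark API, and deferred evaluation of the predicates plays the role of calling {{collect()}} after the loop.

```python
from functools import partial

rows = [{'color': 'red', 'size': 3}, {'color': 'green', 'size': 9},
        {'color': 'blue', 'size': 4}, {'color': 'purple', 'size': 6}]
key_field = 'color'
key_values = ['red', 'green', 'blue', 'purple']

def split_late(key_values):
    # Mirrors split_RDD_by_key: each lambda closes over the loop variable
    # itself, so once the loop has finished every predicate sees
    # key_value's final value ('purple') -- the "collect after the loop" case.
    d = dict()
    for key_value in key_values:
        d[key_value] = lambda row: row[key_field] == key_value
    return d

def split_early(key_values):
    # The default argument snapshots key_value on every iteration,
    # forcing early binding.
    d = dict()
    for key_value in key_values:
        d[key_value] = lambda row, key_value=key_value: row[key_field] == key_value
    return d

def match(key_value, row):
    return row[key_field] == key_value

# functools.partial also binds key_value immediately.
split_partial = {kv: partial(match, kv) for kv in key_values}

def evaluate(preds):
    # Deferred evaluation, analogous to calling collect() after the loop.
    return {k: [r for r in rows if p(r)] for k, p in preds.items()}

late = evaluate(split_late(key_values))
early = evaluate(split_early(key_values))
bound = evaluate(split_partial)
# late['red'] is [{'color': 'purple', 'size': 6}] -- the reported bug;
# early['red'] and bound['red'] are [{'color': 'red', 'size': 3}].
```

Running the sketch shows every entry of the late-bound dictionary holding the 'purple' row, exactly the symptom in the bug report, while both early-binding variants split the rows correctly.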
[jira] [Updated] (SPARK-12519) "Managed memory leak detected" when using distinct on PySpark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-12519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Shearer updated SPARK-12519:
-
    Description:

After running the distinct() method to transform a DataFrame, subsequent actions like count() and show() may report a managed memory leak.

h1. Script
{noformat}
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org"). setLevel( logger.Level.WARN )
logger.LogManager.getLogger("akka").setLevel( logger.Level.WARN )

import string
import random
def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

nrow = 8
ncol = 20
ndrow = 4 # number of distinct rows
tmp = [id_generator() for i in xrange(ndrow*ncol)]
tmp = [tuple(tmp[ncol*(i % ndrow)+0:ncol*(i % ndrow)+ncol]) for i in xrange(nrow)]

dat = sc.parallelize(tmp,1000).toDF()
dat = dat.distinct()
ct = dat.count()
print ct # memory leak warning prints at this point in the code
dat.show()
{noformat}

h1. Output
When this script is run in PySpark (with IPython kernel, using 12G of driver and executor memory), I get this error:
{noformat}
$ pyspark --executor-memory 12G --driver-memory 12G
Python 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015, 09:33:12)
Type "copyright", "credits" or "license" for more information.

IPython 4.0.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/12/24 09:31:52 INFO SparkContext: Running Spark version 1.5.2
15/12/24 09:31:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable
15/12/24 09:31:52 INFO SecurityManager: Changing view acls to: mm86267
15/12/24 09:31:52 INFO SecurityManager: Changing modify acls to: mm86267
15/12/24 09:31:52 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mm86267); users with modify permissions: Set(mm86267)
15/12/24 09:31:53 INFO Slf4jLogger: Slf4jLogger started
15/12/24 09:31:53 INFO Remoting: Starting remoting
15/12/24 09:31:53 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.8:49487]
15/12/24 09:31:53 INFO Utils: Successfully started service 'sparkDriver' on port 49487.
15/12/24 09:31:53 INFO SparkEnv: Registering MapOutputTracker
15/12/24 09:31:53 INFO SparkEnv: Registering BlockManagerMaster
15/12/24 09:31:53 INFO DiskBlockManager: Created local directory at /private/var/folders/yf/56fv4x6j089bnr32mxhf83g0gr/T/blockmgr-b22b2b7b-b3e1-4f33-b1a7-7343a9b88d7d
15/12/24 09:31:53 INFO MemoryStore: MemoryStore started with capacity 6.2 GB
15/12/24 09:31:53 INFO HttpFileServer: HTTP File server directory is /private/var/folders/yf/56fv4x6j089bnr32mxhf83g0gr/T/spark-55616fd4-f43b-411d-b578-9335e87ece86/httpd-82c2c6b2-102c-4cde-ae57-c01a3e38df06
15/12/24 09:31:53 INFO HttpServer: Starting HTTP Server
15/12/24 09:31:53 INFO Server: jetty-8.y.z-SNAPSHOT
15/12/24 09:31:53 INFO AbstractConnector: Started SocketConnector@0.0.0.0:49488
15/12/24 09:31:53 INFO Utils: Successfully started service 'HTTP file server' on port 49488.
15/12/24 09:31:53 INFO SparkEnv: Registering OutputCommitCoordinator
15/12/24 09:31:53 INFO Server: jetty-8.y.z-SNAPSHOT
15/12/24 09:31:53 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
15/12/24 09:31:53 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/12/24 09:31:53 INFO SparkUI: Started SparkUI at http://192.168.1.8:4040
15/12/24 09:31:53 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
15/12/24 09:31:53 INFO Executor: Starting executor ID driver on host localhost
15/12/24 09:31:53 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 49489.
15/12/24 09:31:53 INFO NettyBlockTransferService: Server created on 49489
15/12/24 09:31:53 INFO BlockManagerMaster: Trying to register BlockManager
15/12/24 09:31:53 INFO BlockManagerMasterEndpoint: Registering block manager localhost:49489 with 6.2 GB RAM, BlockManagerId(driver, localhost, 49489)
15/12/24 09:31:53 INFO BlockManagerMaster: Registered BlockManager
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Python version 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015 09:33:12)
SparkContext available as sc, SQLContext available as sqlContext.

In [1]: execfile('bugtest.py')
4
15/12/24 09:33:14 ERROR Executor: Managed memory leak detected; size = 16777216 bytes, TID = 2202
[jira] [Created] (SPARK-12519) "Managed memory leak detected" when using distinct on PySpark DataFrame
Paul Shearer created SPARK-12519:
-

             Summary: "Managed memory leak detected" when using distinct on PySpark DataFrame
                 Key: SPARK-12519
                 URL: https://issues.apache.org/jira/browse/SPARK-12519
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.5.2
         Environment: OS X 10.9.5, Java 1.8.0_66
            Reporter: Paul Shearer

After running the distinct() method to transform a DataFrame, subsequent actions like count() and show() may report a managed memory leak.

# Script
{noformat}
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org"). setLevel( logger.Level.WARN )
logger.LogManager.getLogger("akka").setLevel( logger.Level.WARN )

import string
import random
def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

nrow = 8
ncol = 20
ndrow = 4 # number distinct rows
tmp = [id_generator() for i in xrange(ndrow*ncol)]
tmp = [tuple(tmp[ncol*(i % ndrow)+0:ncol*(i % ndrow)+ncol]) for i in xrange(nrow)]

dat = sc.parallelize(tmp,1000).toDF()
dat = dat.distinct()
ct = dat.count()
print ct # memory leak warning prints at this point in the code
dat.show()
{noformat}

# Output
{noformat}
In [1]: execfile('bugtest.py')
4
15/12/24 09:33:14 ERROR Executor: Managed memory leak detected; size = 16777216 bytes, TID = 2202
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|_1|_2|_3|_4|_5|_6|_7|_8|_9| _10| _11| _12| _13| _14| _15| _16| _17| _18| _19| _20|
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|83I981|09B1ZK|J5UB1A|BPYI80|7JTIMU|HVPQVY|XS4YM2|6N4YO3|AB9GQZ|92RCHR|1N46EU|THPZFH|5IXNR1|KL4LGD|B0S50O|DZH5QP|FKTHHF|MLOCTD|ZVV5BY|D76KRK|
|BNLSVC|CYYYMD|W6ZXF6|Z0QXDT|4JPRX6|YSXIBK|WCB6YD|C86MPS|ZRA42Z|8W8GX8|2DW3AA|ZZ1U0O|EVXX3L|683UOL|5M6TOZ|PI4QX8|6V7SOS|THQVVJ|0ULB14|DJ2LP5|
|IZYG7Q|Q0NCUG|0FSTPN|UVT8Y6|TBAEF6|5CGN50|WNGOSB|NX2Y8R|XWPW7Y|WPTLIV|NPF00K|92YSNO|FP50AU|CW0K3K|8ULT74|SZM6HK|4XPQU9|L109TB|02X1UC|TV8BLZ|
|S7AWK6|7DQ8JP|YSIHVQ|1NKN5G|UOD1TN|ZSL6K4|86SDUW|NHLK9P|Z2ZBFL|QTOA89|D6D1NK|UXUJMG|B0A0ZF|94HB2S|HGLX19|VCVF05|HMAXNE|Y265LD|DHNR78|9L23XR| |U6JCLP|PKJEOB|66C408|HNAUQK|1Q9O2X|NFW976|YLAXD4|0XC334|NMKW62|W297XR|WL9KMG|8K1P94|T5P7LP|WAQ7PT|Q5JYG0|2A9H44|9DOW5P|9SOPFH|M0NNK5|W877FV| |3M39A1|K97EL6|7JFM9G|23I3JT|FIS25Z|HIY6VN|2ORNRG|MTGYMT|32IEH8|RX41EH|EJSSKX|H6QY8J|8G0R0H|AAPYPI|HDEVZ4|WP3VCW|2KNQZ0|U8V254|61T6SH|VJJP4L| |XT3CON|WG8XST|KKJ67T|5RBQB0|OC4LJT|GYSIBI|XGVGUP|8RND4A|38CY23|W3Q26Z|K0ARWU|FLA3O7|I3DGN7|IY080I|HAQW3T|EQDQHD|1Z8E3X|I0J5WN|P4B6IO|1S23KL| |4GMPF8|FFZLKK|Y4UW1Q|AF5J2H|VQ32TO|VMU7PG|WS66ZH|VXSYVK|S0GVCY|OL5I4Q|LFB98K|BCQVZK|XW03W6|F5YGTS|NTYCKZ|JTJ5YY|DR0VSC|KIUJMN|HCPYS4|QG9WYL| |USOIHJ|HPGNXC|DIGTPY|BL0QZ4|2957GI|8A7EC5|GOMEFU|568QPG|6EA6Z2|W7P0Z8|TSP1BF|XXYS8Q|TMN7OA|3ZL2R4|7W1856|DS3LHW|QH32TF|3Y7XPC|EUO5O6|95CIMH| |0CQR4E|ZV8SYE|UZNOLC|19JG2Q|G4RJVC|D2YUGB|HUKQUK|T0HSQH|9K0B9T|EHVBJF|R07A6C|6LS1FL|1NWXKJ|X7TIWZ|MPVWCO|RSO4F9|J5DQG6|AGMXTS|MFFRMX|PEDHGU| |1LQTDY|JV1HJY|7OH9HL|0AUWC7|LQFF5G|DUK4GW|HU6VLJ|PHY36G|BLMOYU|PY7E64|Y6XHYS|3IA38F|RF4LQ5|PIXEM2|0H5GIW|6V3M9C|0VBIUC|U4ZWRH|68M496|6UUVWZ| |B7A7TT|9K5MRI|8CJWX2|YUZ8SY|JLB0MX|3JNIN6|PJP0S5|9W7N5C|LIJSXB|488P8Y|PHWN5N|E6TF76|FGYZQ2|MGDN65|YNLUJE|5D6455|JI4J2K|C3J8K8|BTJ131|D5C7CD| |G9AKQ5|UPEQDN|JAWFI2|I0EKX2|YG8TN4|8NNJBO|X3GMYR|RXG2RX|CRS9US|53VX2Q|S72E08|H5PR14|JRDDMT|Q8G6PR|KOJA0W|1U4AX8|844N9D|SKN5F7|H0C29Z|7U7GHH| |A10ZUQ|HEI32J|VP99PD|44UP47|4W5BPO|X0QE8Q|H3UQVM|47VU9U|3AUPR8|TCGT7L|65WLUU|6PX6IW|5NCTC7|ES2S38|T86EI9|G20RFI|SX2V3V|5XT724|HV8HVS|T3JYJD| |3USH3X|NHXB4D|QPL3QC|8CN92J|MJF9JZ|DFA2IV|XT7C4S|CUB4IJ|4BD3OR|T3EK2S|V81146|LWXTMJ|PCVJ5N|R8H3H6|0W5DLU|GVAO4D|I7SNKJ|6TLMAV|E57PMA|OGCVQM| |FZEDSN|WO4JEN|000HBA|HA2GAN|5ROPXM|5K6NUG|2HWCJ0|OPX5AT|6PT5ZV|HGB74S|FCQT9S|NNODZP|G0ZMSJ|SHIFDQ|MYSHAT|KZDNA4|M25MPR|4XD9J9|JBFZZ0|XLIE31| |6UPQWB|BEV7YK|Q0JHJG|4C91TM|7HBK81|RIKZ9D|ZK96YJ|O4KZ48|GUYUHO|GYYO8P|4O1QUM|74I38Z|CAXQDE|URVY7R|PQ4WM4|4QOQ81|4PPV8B|SWKFCD|S8TC2W|QTJIS6| 
|L2DC9N|N0NQSH|N3SMU0|VRSSPM|7TGRZ9|1FP90A|Z9KB0U|CWOH6I|O2WNSY|IJEUNA|MTJQXG|CAT0VD|5SL8A0|R6SX6H|9ZSVL1|HWPTBR|4SBQPN|4GPD0Z|ZQ72K5|EIVYSE| |X6MH6R|VM5M86|ZV1H22|Z5V1FX|XRZSGC|L39Q1R|1OT5XB|84NY6I|IXKYXQ|KY2U4G|F13S00|CZRR3E|ZIAVU0|DU2BAB|27KBZ8|XWBB7G|09V69R|LTXJ4U|8GP3EM|P3WVAX| |1IPOKL|9EIG2Q|UQJV00|RXJGCK|X20VBH|CZB7SQ|THZ95A|V90YSH|9QTKCW|0RLYJO|WSTNYK|UXZYST|WT8OHL|KE31OO|C0ZKRE|9VSDJF|6Z3JAR|RR0KMB|R3J61U|EPNRZL| +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
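The {{4}} printed just before the leak warning can be verified without Spark: the script builds {{nrow}} rows by cycling over {{ndrow}} distinct column slices, so rows {{i}} and {{i + ndrow}} are identical tuples. Below is a sketch of just the data construction, with {{range}} in place of Python 2's {{xrange}} and a random seed added here only so the sketch is deterministic (neither change is in the original report):

```python
import random
import string

def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    # Random fixed-width ID, as in the bug-report script.
    return ''.join(random.choice(chars) for _ in range(size))

nrow = 8    # total rows
ncol = 20   # columns per row
ndrow = 4   # number of distinct rows

random.seed(0)  # assumption: added for reproducibility, not in the original
tmp = [id_generator() for i in range(ndrow * ncol)]
rows = [tuple(tmp[ncol * (i % ndrow):ncol * (i % ndrow) + ncol]) for i in range(nrow)]

# i % ndrow repeats every ndrow rows, so the 8 rows contain only 4
# distinct tuples -- the count that dat.distinct().count() prints
# right before the managed-memory-leak warning appears.
assert len(rows) == nrow
assert rows[0] == rows[ndrow]
distinct_count = len(set(rows))
```

With these parameters {{distinct_count}} is {{ndrow}} (4), which matches the output shown above and confirms the leak warning fires on a correctly computed result rather than on wrong data.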
[jira] [Updated] (SPARK-12519) "Managed memory leak detected" when using distinct on PySpark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-12519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Shearer updated SPARK-12519:
Description: After running the distinct() method to transform a DataFrame, subsequent actions like count() and show() may report a managed memory leak. Here is a minimal example that reproduces the bug on my machine:

h1. Script
{noformat}
logger = sc._jvm.org.apache.log4j
logger.LogManager.getLogger("org").setLevel(logger.Level.WARN)
logger.LogManager.getLogger("akka").setLevel(logger.Level.WARN)

import string
import random

def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

nrow = 8
ncol = 20
ndrow = 4  # number of distinct rows

tmp = [id_generator() for i in xrange(ndrow*ncol)]
tmp = [tuple(tmp[ncol*(i % ndrow)+0:ncol*(i % ndrow)+ncol]) for i in xrange(nrow)]

dat = sc.parallelize(tmp, 1000).toDF()
dat = dat.distinct()  # if this line is commented out, no memory leak will be reported
# dat = dat.rdd.distinct().toDF()  # if this line is used instead of the above, no leak

ct = dat.count()
print ct

# memory leak warning prints at this point in the code
dat.show()
{noformat}

h1. Output
When this script is run in PySpark (with IPython kernel), I get this error:
{noformat}
$ pyspark --executor-memory 12G --driver-memory 12G
Python 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015, 09:33:12)
Type "copyright", "credits" or "license" for more information.

IPython 4.0.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

<<<... usual loading info...>>>

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Python version 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015 09:33:12)
SparkContext available as sc, SQLContext available as sqlContext.
In [1]: execfile('bugtest.py')
4
15/12/24 09:33:14 ERROR Executor: Managed memory leak detected; size = 16777216 bytes, TID = 2202
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|_1|_2|_3|_4|_5|_6|_7|_8|_9| _10| _11| _12| _13| _14| _15| _16| _17| _18| _19| _20|
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|83I981|09B1ZK|J5UB1A|BPYI80|7JTIMU|HVPQVY|XS4YM2|6N4YO3|AB9GQZ|92RCHR|1N46EU|THPZFH|5IXNR1|KL4LGD|B0S50O|DZH5QP|FKTHHF|MLOCTD|ZVV5BY|D76KRK|
|BNLSVC|CYYYMD|W6ZXF6|Z0QXDT|4JPRX6|YSXIBK|WCB6YD|C86MPS|ZRA42Z|8W8GX8|2DW3AA|ZZ1U0O|EVXX3L|683UOL|5M6TOZ|PI4QX8|6V7SOS|THQVVJ|0ULB14|DJ2LP5|
|IZYG7Q|Q0NCUG|0FSTPN|UVT8Y6|TBAEF6|5CGN50|WNGOSB|NX2Y8R|XWPW7Y|WPTLIV|NPF00K|92YSNO|FP50AU|CW0K3K|8ULT74|SZM6HK|4XPQU9|L109TB|02X1UC|TV8BLZ|
|S7AWK6|7DQ8JP|YSIHVQ|1NKN5G|UOD1TN|ZSL6K4|86SDUW|NHLK9P|Z2ZBFL|QTOA89|D6D1NK|UXUJMG|B0A0ZF|94HB2S|HGLX19|VCVF05|HMAXNE|Y265LD|DHNR78|9L23XR|
|U6JCLP|PKJEOB|66C408|HNAUQK|1Q9O2X|NFW976|YLAXD4|0XC334|NMKW62|W297XR|WL9KMG|8K1P94|T5P7LP|WAQ7PT|Q5JYG0|2A9H44|9DOW5P|9SOPFH|M0NNK5|W877FV|
|3M39A1|K97EL6|7JFM9G|23I3JT|FIS25Z|HIY6VN|2ORNRG|MTGYMT|32IEH8|RX41EH|EJSSKX|H6QY8J|8G0R0H|AAPYPI|HDEVZ4|WP3VCW|2KNQZ0|U8V254|61T6SH|VJJP4L|
|XT3CON|WG8XST|KKJ67T|5RBQB0|OC4LJT|GYSIBI|XGVGUP|8RND4A|38CY23|W3Q26Z|K0ARWU|FLA3O7|I3DGN7|IY080I|HAQW3T|EQDQHD|1Z8E3X|I0J5WN|P4B6IO|1S23KL|
|4GMPF8|FFZLKK|Y4UW1Q|AF5J2H|VQ32TO|VMU7PG|WS66ZH|VXSYVK|S0GVCY|OL5I4Q|LFB98K|BCQVZK|XW03W6|F5YGTS|NTYCKZ|JTJ5YY|DR0VSC|KIUJMN|HCPYS4|QG9WYL|
|USOIHJ|HPGNXC|DIGTPY|BL0QZ4|2957GI|8A7EC5|GOMEFU|568QPG|6EA6Z2|W7P0Z8|TSP1BF|XXYS8Q|TMN7OA|3ZL2R4|7W1856|DS3LHW|QH32TF|3Y7XPC|EUO5O6|95CIMH|
|0CQR4E|ZV8SYE|UZNOLC|19JG2Q|G4RJVC|D2YUGB|HUKQUK|T0HSQH|9K0B9T|EHVBJF|R07A6C|6LS1FL|1NWXKJ|X7TIWZ|MPVWCO|RSO4F9|J5DQG6|AGMXTS|MFFRMX|PEDHGU|
|1LQTDY|JV1HJY|7OH9HL|0AUWC7|LQFF5G|DUK4GW|HU6VLJ|PHY36G|BLMOYU|PY7E64|Y6XHYS|3IA38F|RF4LQ5|PIXEM2|0H5GIW|6V3M9C|0VBIUC|U4ZWRH|68M496|6UUVWZ|
|B7A7TT|9K5MRI|8CJWX2|YUZ8SY|JLB0MX|3JNIN6|PJP0S5|9W7N5C|LIJSXB|488P8Y|PHWN5N|E6TF76|FGYZQ2|MGDN65|YNLUJE|5D6455|JI4J2K|C3J8K8|BTJ131|D5C7CD|
|G9AKQ5|UPEQDN|JAWFI2|I0EKX2|YG8TN4|8NNJBO|X3GMYR|RXG2RX|CRS9US|53VX2Q|S72E08|H5PR14|JRDDMT|Q8G6PR|KOJA0W|1U4AX8|844N9D|SKN5F7|H0C29Z|7U7GHH|
|A10ZUQ|HEI32J|VP99PD|44UP47|4W5BPO|X0QE8Q|H3UQVM|47VU9U|3AUPR8|TCGT7L|65WLUU|6PX6IW|5NCTC7|ES2S38|T86EI9|G20RFI|SX2V3V|5XT724|HV8HVS|T3JYJD|
|3USH3X|NHXB4D|QPL3QC|8CN92J|MJF9JZ|DFA2IV|XT7C4S|CUB4IJ|4BD3OR|T3EK2S|V81146|LWXTMJ|PCVJ5N|R8H3H6|0W5DLU|GVAO4D|I7SNKJ|6TLMAV|E57PMA|OGCVQM|
|FZEDSN|WO4JEN|000HBA|HA2GAN|5ROPXM|5K6NUG|2HWCJ0|OPX5AT|6PT5ZV|HGB74S|FCQT9S|NNODZP|G0ZMSJ|SHIFDQ|MYSHAT|KZDNA4|M25MPR|4XD9J9|JBFZZ0|XLIE31|
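For reference, the row-generation logic in the repro script can be sanity-checked in plain Python 3; this is a sketch, not part of the original report (range replaces Python 2's xrange, and the Spark calls are left as comments because they need a live SparkContext):

```python
import random
import string

def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    # One random 6-character ID drawn from uppercase letters and digits.
    return ''.join(random.choice(chars) for _ in range(size))

nrow = 8    # total rows in the DataFrame
ncol = 20   # columns per row
ndrow = 4   # number of distinct rows

# Generate ndrow * ncol random cells, then tile them so that only
# ndrow distinct row tuples appear among the nrow rows.
cells = [id_generator() for _ in range(ndrow * ncol)]
rows = [tuple(cells[ncol * (i % ndrow): ncol * (i % ndrow) + ncol])
        for i in range(nrow)]

assert len(rows) == nrow
assert len(set(rows)) == ndrow  # distinct() should reduce 8 rows to 4

# With a live SparkContext the repro continues as in the report:
# dat = sc.parallelize(rows, 1000).toDF()
# dat = dat.distinct()             # count() then triggers the leak warning
# dat = dat.rdd.distinct().toDF()  # reported workaround: no leak warning
```

The tiling via `i % ndrow` is what makes distinct() do real work: 8 physical rows collapse to 4 distinct tuples, matching the `4` printed by `ct = dat.count()` in the output above.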