classification.py doctests fails with module name pollution

Josh Rosen (JIRA) Sat, 11 Oct 2014 11:21:55 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168286#comment-14168286
 ]


Josh Rosen commented on SPARK-3910:
-----------------------------------

This seems to work for me.  If my current working directory is $SPARK_HOME and 
I run

{code}
./bin/pyspark python/pyspark/mllib/classification.py
{code}

then I don't see any circular import problems.  Widely-used libraries like 
NumPy declare modules that shadow standard library modules (such as 
{{np.random}}), so I don't think that naming a submodule this way is 
inherently a problem.

Are you trying to run {{classification.py}} from inside of the 
{{python/pyspark/mllib}} directory?

> ./python/pyspark/mllib/classification.py doctests fails with module name 
> pollution
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-3910
>                 URL: https://issues.apache.org/jira/browse/SPARK-3910
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 1.2.0
>         Environment: Mac OS X 10.9.5, Python 2.6.8, Java 1.8.0_20, 
> Jinja2==2.7.3, MarkupSafe==0.23, Pygments==1.6, Sphinx==1.2.3, 
> argparse==1.2.1, docutils==0.12, flake8==2.2.3, mccabe==0.2.1, numpy==1.9.0, 
> pep8==1.5.7, psutil==2.1.3, pyflake8==0.1.9, pyflakes==0.8.1, 
> unittest2==0.5.1, wsgiref==0.1.2
>            Reporter: cocoatomo
>              Labels: pyspark, testing
>
> The ./python/run-tests script runs the doctests in 
> ./pyspark/mllib/classification.py.
> The output is as follows:
> {noformat}
> $ ./python/run-tests
> ...
> Running test: pyspark/mllib/classification.py
> Traceback (most recent call last):
>   File "pyspark/mllib/classification.py", line 20, in <module>
>     import numpy
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/__init__.py", line 170, in <module>
>     from . import add_newdocs
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/add_newdocs.py", line 13, in <module>
>     from numpy.lib import add_newdoc
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/__init__.py", line 8, in <module>
>     from .type_check import *
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/type_check.py", line 11, in <module>
>     import numpy.core.numeric as _nx
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/core/__init__.py", line 46, in <module>
>     from numpy.testing import Tester
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/__init__.py", line 13, in <module>
>     from .utils import *
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/utils.py", line 15, in <module>
>     from tempfile import mkdtemp
>   File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/tempfile.py", line 34, in <module>
>     from random import Random as _Random
>   File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/mllib/random.py", line 24, in <module>
>     from pyspark.rdd import RDD
>   File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/__init__.py", line 51, in <module>
>     from pyspark.context import SparkContext
>   File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/context.py", line 22, in <module>
>     from tempfile import NamedTemporaryFile
> ImportError: cannot import name NamedTemporaryFile
>         0.07 real         0.04 user         0.02 sys
> Had test failures; see logs.
> {noformat}
> The problem is a cyclic import of the tempfile module.
> The cause is that the pyspark.mllib.random module lives in the same 
> directory as the pyspark.mllib.classification module.
> The classification module imports numpy, and numpy in turn imports 
> tempfile.
> Because the first entry of sys.path is the directory 
> "./python/pyspark/mllib" (where the executed file "classification.py" 
> lives), the "from random import ..." statement in tempfile resolves to 
> pyspark/mllib/random.py instead of the standard library "random" module.
> That module imports pyspark, whose import chain reaches tempfile again 
> while tempfile is still only partially initialized, so a cyclic import is 
> formed.
> Summary: classification → numpy → tempfile → pyspark.mllib.random → tempfile 
> → (cyclic import!!)
> Furthermore, stat is also a standard library module, and a 
> pyspark.mllib.stat module exists. This may cause similar trouble.
> commit: 0e8203f4fb721158fb27897680da476174d24c4b
> A fundamental solution is to avoid module names that are used by the 
> standard library (currently "random" and "stat").
> The difficulty with this solution is that it requires renaming 
> pyspark.mllib.random and pyspark.mllib.stat, which may already be in use.


