[ https://issues.apache.org/jira/browse/SPARK-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168786#comment-14168786 ]
cocoatomo commented on SPARK-3910:
----------------------------------

Thank you for the comment. I am trying it at $SPARK_HOME. (Executing the "./bin/run-tests" command shows this.) In addition, it is strange that the command
{noformat}
./bin/pyspark python/pyspark/mllib/classification.py
{noformat}
fails with a numpy ImportError. So my environment has some trouble (sys.path is suspicious), and at least there is some difference between the environments where PySpark runs. I set up my environment using virtualenvwrapper with Python 2.6.8 (the default python executable on Mac OS X 10.9.5). The ImportError mentioned in this issue occurred in that environment. For comparison, I also tested in another environment with Python 2.7.8 and got the same error. Is there some difference between our environments?

> ./python/pyspark/mllib/classification.py doctests fails with module name
> pollution
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-3910
>                 URL: https://issues.apache.org/jira/browse/SPARK-3910
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 1.2.0
>        Environment: Mac OS X 10.9.5, Python 2.6.8, Java 1.8.0_20,
> Jinja2==2.7.3, MarkupSafe==0.23, Pygments==1.6, Sphinx==1.2.3,
> argparse==1.2.1, docutils==0.12, flake8==2.2.3, mccabe==0.2.1, numpy==1.9.0,
> pep8==1.5.7, psutil==2.1.3, pyflake8==0.1.9, pyflakes==0.8.1,
> unittest2==0.5.1, wsgiref==0.1.2
>            Reporter: cocoatomo
>              Labels: pyspark, testing
>
> In the ./python/run-tests script, we run the doctests in
> ./pyspark/mllib/classification.py.
> The output is as follows:
> {noformat}
> $ ./python/run-tests
> ...
> Running test: pyspark/mllib/classification.py
> Traceback (most recent call last):
>   File "pyspark/mllib/classification.py", line 20, in <module>
>     import numpy
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/__init__.py", line 170, in <module>
>     from . import add_newdocs
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/add_newdocs.py", line 13, in <module>
>     from numpy.lib import add_newdoc
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/__init__.py", line 8, in <module>
>     from .type_check import *
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/type_check.py", line 11, in <module>
>     import numpy.core.numeric as _nx
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/core/__init__.py", line 46, in <module>
>     from numpy.testing import Tester
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/__init__.py", line 13, in <module>
>     from .utils import *
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/utils.py", line 15, in <module>
>     from tempfile import mkdtemp
>   File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/tempfile.py", line 34, in <module>
>     from random import Random as _Random
>   File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/mllib/random.py", line 24, in <module>
>     from pyspark.rdd import RDD
>   File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/__init__.py", line 51, in <module>
>     from pyspark.context import SparkContext
>   File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/context.py", line 22, in <module>
>     from tempfile import NamedTemporaryFile
> ImportError: cannot import name NamedTemporaryFile
>         0.07 real         0.04 user         0.02 sys
> Had test failures; see logs.
> {noformat}
> The problem is a cyclic import of the tempfile module.
> The cause is that the pyspark.mllib.random module lives in the same
> directory as the pyspark.mllib.classification module.
> The classification module imports numpy, and numpy in turn imports
> tempfile internally.
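The shadowing step in that chain can be reproduced in isolation. This is a minimal sketch (it uses a throwaway scratch directory rather than the actual Spark tree): a local random.py sitting at sys.path[0] wins over the standard-library random module, which is exactly the module tempfile imports internally.

```python
import os
import sys
import tempfile

# Create a stand-in for python/pyspark/mllib/random.py in a scratch dir.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "random.py"), "w") as f:
    f.write("SHADOWED = True\n")

# Running "python some/pkg/script.py" puts the script's directory at
# sys.path[0]; emulate that here.
sys.path.insert(0, workdir)
sys.modules.pop("random", None)  # drop the cached stdlib module

import random  # now resolves to workdir/random.py, not the stdlib

print(getattr(random, "SHADOWED", False))  # → True
```

Removing (or renaming, as this issue proposes) the local random.py makes the same import resolve back to the standard library.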
> Now the first entry of sys.path is the directory "./python/pyspark/mllib"
> (where the executed file "classification.py" lives), so the tempfile module
> imports the pyspark.mllib.random module, not the standard-library "random"
> module.
> Finally, the import chain reaches tempfile again, and a cyclic import is
> formed.
> Summary: classification → numpy → tempfile → pyspark.mllib.random → tempfile
> → (cyclic import!!)
> Furthermore, stat is a standard-library module, and a pyspark.mllib.stat
> module also exists. This may cause the same kind of trouble.
> commit: 0e8203f4fb721158fb27897680da476174d24c4b
> A fundamental solution is to avoid module names already used by the standard
> library (currently "random" and "stat").
> The difficulty with this solution is that renaming pyspark.mllib.random and
> pyspark.mllib.stat may break code that already uses them.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
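As a footnote to the proposed rename: collisions of this kind can be spotted ahead of time by scanning a package directory for module names that shadow standard-library modules. The helper below is hypothetical (not part of Spark), and its STDLIB_NAMES set is a hand-picked sample rather than an exhaustive list; it is sketched against a throwaway directory that mimics the pyspark/mllib layout described in the issue.

```python
import os
import tempfile

# A few standard-library module names relevant to this issue.
# (Hypothetical sample; a real checker would enumerate the full stdlib.)
STDLIB_NAMES = {"random", "stat", "tempfile", "string", "types"}

def find_collisions(package_dir):
    """Return module names in package_dir that shadow stdlib modules."""
    names = {
        fn[:-3] for fn in os.listdir(package_dir)
        if fn.endswith(".py") and fn != "__init__.py"
    }
    return sorted(names & STDLIB_NAMES)

# Emulate the pyspark/mllib layout from the issue in a scratch dir.
pkg = tempfile.mkdtemp()
for mod in ("__init__.py", "classification.py", "random.py", "stat.py"):
    open(os.path.join(pkg, mod), "w").close()

print(find_collisions(pkg))  # → ['random', 'stat']
```

A check like this could run in ./python/run-tests before the doctests, failing fast with a clear message instead of the confusing cyclic ImportError above.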