Indeed! Here is the output when I run in cluster mode:

Traceback (most recent call last):
  File "pi.py", line 22, in ?
    raise RuntimeError("\n"+str(sys.version_info) +"\n"+
RuntimeError: (2, 4, 3, 'final', 0)
[('PYSPARK_GATEWAY_PORT', '48079'), ('PYTHONPATH', '/scratch2/hadoop/yarn/local/usercache/<username>/filecache/116/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/<user>/spark-1.6.0-bin-hadoop2.4/python:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/py4j-0.9-src.zip'), ('PYTHONUNBUFFERED', 'YES')]
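For reference, this is roughly how I have been setting PYSPARK_PYTHON so far; the python path below is just an illustration, not the exact path on my cluster:

    # conf/spark-env.sh on the submitting machine (path is illustrative)
    export PYSPARK_PYTHON=/usr/bin/python2.7

    # and in the shell, right before calling spark-submit (path is illustrative)
    export PYSPARK_PYTHON=/usr/bin/python2.7
    ./bin/spark-submit --master yarn --deploy-mode cluster \
        --driver-memory 4g --executor-memory 2g --executor-cores 1 \
        ./examples/src/main/python/pi.py 10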
As we suspected, it is using Python 2.4. One thing that surprises me is
that PYSPARK_PYTHON is not showing up in the list, even though I am
setting it and exporting it (roughly as sketched above) both when calling
spark-submit *and* in spark-env.sh. Is there somewhere else I need to set
this variable? Maybe in one of the hadoop conf files in my HADOOP_CONF_DIR?

Andrew

On Thu, Jan 14, 2016 at 1:14 PM, Bryan Cutler <cutl...@gmail.com> wrote:

> It seems like it could be the case that some other Python version is
> being invoked. To make sure, can you add something like this to the top
> of the .py file you are submitting to get some more info about how the
> application master is configured?
>
>     import sys, os
>     raise RuntimeError("\n" + str(sys.version_info) + "\n" +
>                        str([(k, os.environ[k]) for k in os.environ if "PY" in k]))
>
> On Thu, Jan 14, 2016 at 8:37 AM, Andrew Weiner <
> andrewweiner2...@u.northwestern.edu> wrote:
>
>> Hi Bryan,
>>
>> I ran "$> python --version" on every node on the cluster, and it is
>> Python 2.7.8 for every single one.
>>
>> When I try to submit the Python example in client mode
>>
>>     ./bin/spark-submit --master yarn --deploy-mode client \
>>         --driver-memory 4g --executor-memory 2g --executor-cores 1 \
>>         ./examples/src/main/python/pi.py 10
>>
>> that's when I get this error that I mentioned:
>>
>> 16/01/14 10:09:10 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
>> 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
>> Error from python worker:
>>   python: module pyspark.daemon not found
>> PYTHONPATH was:
>>   /scratch5/hadoop/yarn/local/usercache/<username>/filecache/48/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/aqualab/spark-1.6.0-bin-hadoop2.4/python:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/py4j-0.9-src.zip
>> java.io.EOFException
>>     at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>     at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
>>     at [....]
>>
>> followed by several more similar errors that also say:
>>
>> Error from python worker:
>>   python: module pyspark.daemon not found
>>
>> Even though the default python appeared to be correct, I just went ahead
>> and explicitly set PYSPARK_PYTHON in conf/spark-env.sh to the path of the
>> default python binary executable. After making this change I was able to
>> run the job successfully in client mode! That is, this appeared to fix
>> the "pyspark.daemon not found" error when running in client mode.
>>
>> However, when running in cluster mode, I am still getting the same
>> syntax error:
>>
>> Traceback (most recent call last):
>>   File "pi.py", line 24, in ?
>>     from pyspark import SparkContext
>>   File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>                               ^
>> SyntaxError: invalid syntax
>>
>> Is it possible that the PYSPARK_PYTHON environment variable is ignored
>> when jobs are submitted in cluster mode?
>> It seems that Spark or Yarn is going behind my back, so to speak, and
>> using some older version of python I didn't even know was installed.
>>
>> Thanks again for all your help thus far. We are getting close....
>>
>> Andrew
>>
>> On Wed, Jan 13, 2016 at 6:13 PM, Bryan Cutler <cutl...@gmail.com> wrote:
>>
>>> Hi Andrew,
>>>
>>> There are a couple of things to check. First, is Python 2.7 the default
>>> version on all nodes in the cluster, or is it an alternate install?
>>> Meaning, what is the output of the command "$> python --version"? If it
>>> is an alternate install, you could set the environment variable
>>> "PYSPARK_PYTHON" (the Python binary executable to use for PySpark in
>>> both driver and workers; the default is just "python").
>>>
>>> Did you try to submit the Python example under client mode? Otherwise
>>> the command looks fine; you don't use the --class option for submitting
>>> python files:
>>>
>>>     ./bin/spark-submit --master yarn --deploy-mode client \
>>>         --driver-memory 4g --executor-memory 2g --executor-cores 1 \
>>>         ./examples/src/main/python/pi.py 10
>>>
>>> That is a good sign that local jobs and the Java examples work; it is
>>> probably just a small configuration issue :)
>>>
>>> Bryan
>>>
>>> On Wed, Jan 13, 2016 at 3:51 PM, Andrew Weiner <
>>> andrewweiner2...@u.northwestern.edu> wrote:
>>>
>>>> Thanks for your continuing help. Here is some additional info.
>>>>
>>>> *OS/architecture*
>>>> output of *cat /proc/version*:
>>>> Linux version 2.6.18-400.1.1.el5 (mockbu...@x86-012.build.bos.redhat.com)
>>>>
>>>> output of *lsb_release -a*:
>>>> LSB Version:    :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
>>>> Distributor ID: RedHatEnterpriseServer
>>>> Description:    Red Hat Enterprise Linux Server release 5.11 (Tikanga)
>>>> Release:        5.11
>>>> Codename:       Tikanga
>>>>
>>>> *Running a local job*
>>>> I have confirmed that I can successfully run python jobs using
>>>> bin/spark-submit --master local[*]. Specifically, this is the command
>>>> I am using:
>>>>
>>>>     ./bin/spark-submit --master local[8] \
>>>>         ./examples/src/main/python/wordcount.py \
>>>>         file:/home/<username>/spark-1.6.0-bin-hadoop2.4/README.md
>>>>
>>>> And it works!
>>>>
>>>> *Additional info*
>>>> I am also able to successfully run the Java SparkPi example using yarn
>>>> in cluster mode using this command:
>>>>
>>>>     ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
>>>>         --master yarn --deploy-mode cluster --driver-memory 4g \
>>>>         --executor-memory 2g --executor-cores 1 lib/spark-examples*.jar 10
>>>>
>>>> This Java job also runs successfully when I change --deploy-mode to
>>>> client. The fact that I can run Java jobs in cluster mode makes me think
>>>> that everything is installed correctly--is that a valid assumption?
>>>>
>>>> The problem remains that I cannot submit python jobs. Here is the
>>>> command that I am using to try to submit python jobs:
>>>>
>>>>     ./bin/spark-submit --master yarn --deploy-mode cluster \
>>>>         --driver-memory 4g --executor-memory 2g --executor-cores 1 \
>>>>         ./examples/src/main/python/pi.py 10
>>>>
>>>> Does that look like a correct command? I wasn't sure what to put for
>>>> --class, so I omitted it. At any rate, the result of the above command
>>>> is a syntax error, similar to the one I posted in the original email:
>>>>
>>>> Traceback (most recent call last):
>>>>   File "pi.py", line 24, in ?
>>>>     from pyspark import SparkContext
>>>>   File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
>>>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>>>                               ^
>>>> SyntaxError: invalid syntax
>>>>
>>>> This really looks to me like a problem with the python version. Python
>>>> 2.4 would throw this syntax error, but Python 2.7 would not. And yet I
>>>> am using Python 2.7.8. Is there any chance that Spark or Yarn is
>>>> somehow using an older version of Python without my knowledge?
>>>>
>>>> Finally, when I try to run the same command in client mode...
>>>>
>>>>     ./bin/spark-submit --master yarn --deploy-mode client \
>>>>         --driver-memory 4g --executor-memory 2g --executor-cores 1 \
>>>>         ./examples/src/main/python/pi.py 10
>>>>
>>>> ...I get the error I mentioned in the prior email:
>>>>
>>>> Error from python worker:
>>>>   python: module pyspark.daemon not found
>>>>
>>>> Any thoughts?
>>>>
>>>> Best,
>>>> Andrew
>>>>
>>>> On Mon, Jan 11, 2016 at 12:25 PM, Bryan Cutler <cutl...@gmail.com> wrote:
>>>>
>>>>> This could be an environment issue; could you give more details about
>>>>> the OS/architecture that you are using? If you are sure everything is
>>>>> installed correctly on each node, following the guide on "Running Spark
>>>>> on Yarn" http://spark.apache.org/docs/latest/running-on-yarn.html, and
>>>>> that the spark assembly jar is reachable, then I would check to see if
>>>>> you can submit a local job to just run on one node.
>>>>>
>>>>> On Fri, Jan 8, 2016 at 5:22 PM, Andrew Weiner <
>>>>> andrewweiner2...@u.northwestern.edu> wrote:
>>>>>
>>>>>> Now for simplicity I'm testing with wordcount.py from the provided
>>>>>> examples, and using Spark 1.6.0.
>>>>>>
>>>>>> The first error I get is:
>>>>>>
>>>>>> 16/01/08 19:14:46 ERROR lzo.GPLNativeCodeLoader: Could not load native gpl library
>>>>>> java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
>>>>>>     at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
>>>>>>     at [....]
>>>>>>
>>>>>> A bit lower down, I see this error:
>>>>>>
>>>>>> 16/01/08 19:14:48 WARN scheduler.TaskSetManager: Lost task 0.0 in
>>>>>> stage 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
>>>>>> Error from python worker:
>>>>>>   python: module pyspark.daemon not found
>>>>>> PYTHONPATH was:
>>>>>>   /scratch5/hadoop/yarn/local/usercache/<username>/filecache/22/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/pyspark.zip:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/py4j-0.9-src.zip
>>>>>> java.io.EOFException
>>>>>>     at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>>>>>     at [....]
>>>>>>
>>>>>> And then a few more similar pyspark.daemon not found errors...
>>>>>>
>>>>>> Andrew
>>>>>>
>>>>>> On Fri, Jan 8, 2016 at 2:31 PM, Bryan Cutler <cutl...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Andrew,
>>>>>>>
>>>>>>> I know that older versions of Spark could not run PySpark on YARN in
>>>>>>> cluster mode. I'm not sure if that is fixed in 1.6.0 though.
>>>>>>> Can you try setting the deploy-mode option to "client" when calling
>>>>>>> spark-submit?
>>>>>>>
>>>>>>> Bryan
>>>>>>>
>>>>>>> On Thu, Jan 7, 2016 at 2:39 PM, weineran <
>>>>>>> andrewweiner2...@u.northwestern.edu> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> When I try to submit a python job using spark-submit (using
>>>>>>>> --master yarn --deploy-mode cluster), I get the following error:
>>>>>>>>
>>>>>>>> Traceback (most recent call last):
>>>>>>>>   File "loss_rate_by_probe.py", line 15, in ?
>>>>>>>>     from pyspark import SparkContext
>>>>>>>>   File "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/__init__.py", line 41, in ?
>>>>>>>>   File "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/context.py", line 219
>>>>>>>>     with SparkContext._lock:
>>>>>>>>                     ^
>>>>>>>> SyntaxError: invalid syntax
>>>>>>>>
>>>>>>>> This is very similar to this post from 2014
>>>>>>>> <http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-lock-Error-td18233.html>,
>>>>>>>> but unlike that person I am using Python 2.7.8.
>>>>>>>>
>>>>>>>> Here is what I'm using:
>>>>>>>> Spark 1.3.1
>>>>>>>> Hadoop 2.4.0.2.1.5.0-695
>>>>>>>> Python 2.7.8
>>>>>>>>
>>>>>>>> Another clue: I also installed Spark 1.6.0 and tried to submit the
>>>>>>>> same job. I got a similar error:
>>>>>>>>
>>>>>>>> Traceback (most recent call last):
>>>>>>>>   File "loss_rate_by_probe.py", line 15, in ?
>>>>>>>>     from pyspark import SparkContext
>>>>>>>>   File "/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0119/container_1450370639491_0119_01_000001/pyspark.zip/pyspark/__init__.py", line 61
>>>>>>>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>>>>>>>                               ^
>>>>>>>> SyntaxError: invalid syntax
>>>>>>>>
>>>>>>>> Any thoughts?
>>>>>>>>
>>>>>>>> Andrew
>>>>>>>>
>>>>>>>> --
>>>>>>>> View this message in context:
>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-SyntaxError-invalid-syntax-tp25910.html
>>>>>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>