Yeah, I do think it would be worth explicitly stating this in the docs. I was going to try to edit the docs myself and submit a pull request, but I'm having trouble building the docs from GitHub. If anyone else wants to do this, here is approximately what I would say:
(To be added to http://spark.apache.org/docs/latest/configuration.html#environment-variables )

"Note: When running Spark on YARN in cluster mode, environment variables need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName] property in your conf/spark-defaults.conf file. Environment variables that are set in spark-env.sh will not be reflected in the YARN Application Master process in cluster mode. See the YARN-related Spark Properties <http://spark.apache.org/docs/latest/running-on-yarn.html#spark-properties> for more information."
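For example, pointing the YARN Application Master at a specific Python interpreter might look roughly like this (the interpreter path below is only a placeholder, not a path from this thread):

    # conf/spark-defaults.conf
    spark.yarn.appMasterEnv.PYSPARK_PYTHON  /usr/local/bin/python2.7

The same property can also be passed per job on the spark-submit command line:

    ./bin/spark-submit --master yarn --deploy-mode cluster \
      --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/local/bin/python2.7 \
      ./examples/src/main/python/pi.py 10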
I might take another crack at building the docs myself if nobody beats me to this.

Andrew

On Fri, Jan 15, 2016 at 5:01 PM, Bryan Cutler <cutl...@gmail.com> wrote:

> Glad you got it going! It wasn't very obvious what needed to be set, maybe
> it is worth explicitly stating this in the docs since it seems to have come
> up a couple times before too.
>
> Bryan
>
> On Fri, Jan 15, 2016 at 12:33 PM, Andrew Weiner <andrewweiner2...@u.northwestern.edu> wrote:
>
>> Actually, I just found this [https://issues.apache.org/jira/browse/SPARK-1680],
>> which after a bit of googling and reading leads me to believe that the
>> preferred way to change the yarn environment is to edit the
>> spark-defaults.conf file by adding this line:
>> spark.yarn.appMasterEnv.PYSPARK_PYTHON /path/to/python
>>
>> While both this solution and the solution from my prior email work, I
>> believe this is the preferred solution.
>>
>> Sorry for the flurry of emails. Again, thanks for all the help!
>>
>> Andrew
>>
>> On Fri, Jan 15, 2016 at 1:47 PM, Andrew Weiner <andrewweiner2...@u.northwestern.edu> wrote:
>>
>>> I finally got the pi.py example to run in yarn cluster mode. This was
>>> the key insight:
>>> https://issues.apache.org/jira/browse/SPARK-9229
>>>
>>> I had to set SPARK_YARN_USER_ENV in spark-env.sh:
>>> export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/home/aqualab/local/bin/python"
>>>
>>> This caused the PYSPARK_PYTHON environment variable to be used in my
>>> yarn environment in cluster mode.
>>>
>>> Thank you for all your help!
>>>
>>> Best,
>>> Andrew
>>>
>>> On Fri, Jan 15, 2016 at 12:57 PM, Andrew Weiner <andrewweiner2...@u.northwestern.edu> wrote:
>>>
>>>> I tried playing around with my environment variables, and here is an
>>>> update.
>>>>
>>>> When I run in cluster mode, my environment variables do not persist
>>>> throughout the entire job. For example, I tried creating a local copy of
>>>> HADOOP_CONF_DIR in /home/<username>/local/etc/hadoop/conf, and then, in
>>>> spark-env.sh, I set the variable:
>>>> export HADOOP_CONF_DIR=/home/<username>/local/etc/hadoop/conf
>>>>
>>>> Later, when we print the environment variables in the python code, I
>>>> see this:
>>>>
>>>> ('HADOOP_CONF_DIR', '/etc/hadoop/conf')
>>>>
>>>> However, when I run in client mode, I see this:
>>>>
>>>> ('HADOOP_CONF_DIR', '/home/awp066/local/etc/hadoop/conf')
>>>>
>>>> Furthermore, if I omit that environment variable from spark-env.sh
>>>> altogether, I get the expected error in both client and cluster mode:
>>>>
>>>> When running with master 'yarn'
>>>> either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
>>>>
>>>> This suggests that my environment variables are being used when I first
>>>> submit the job, but at some point during the job, my environment variables
>>>> are thrown out and someone's (yarn's?) environment variables are being
>>>> used.
>>>>
>>>> Andrew
>>>>
>>>> On Fri, Jan 15, 2016 at 11:03 AM, Andrew Weiner <andrewweiner2...@u.northwestern.edu> wrote:
>>>>
>>>>> Indeed! Here is the output when I run in cluster mode:
>>>>>
>>>>> Traceback (most recent call last):
>>>>>   File "pi.py", line 22, in ?
>>>>>     raise RuntimeError("\n"+str(sys.version_info) +"\n"+
>>>>> RuntimeError:
>>>>> (2, 4, 3, 'final', 0)
>>>>> [('PYSPARK_GATEWAY_PORT', '48079'), ('PYTHONPATH',
>>>>> '/scratch2/hadoop/yarn/local/usercache/<username>/filecache/116/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/<user>/spark-1.6.0-bin-hadoop2.4/python:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/py4j-0.9-src.zip'),
>>>>> ('PYTHONUNBUFFERED', 'YES')]
>>>>>
>>>>> As we suspected, it is using Python 2.4.
>>>>>
>>>>> One thing that surprises me is that PYSPARK_PYTHON is not showing up in
>>>>> the list, even though I am setting it and exporting it in spark-submit
>>>>> *and* in spark-env.sh. Is there somewhere else I need to set this
>>>>> variable? Maybe in one of the hadoop conf files in my HADOOP_CONF_DIR?
>>>>>
>>>>> Andrew
>>>>>
>>>>> On Thu, Jan 14, 2016 at 1:14 PM, Bryan Cutler <cutl...@gmail.com> wrote:
>>>>>
>>>>>> It seems like it could be the case that some other Python version is
>>>>>> being invoked. To make sure, can you add something like this to the top of
>>>>>> the .py file you are submitting to get some more info about how the
>>>>>> application master is configured?
>>>>>>
>>>>>> import sys, os
>>>>>> raise RuntimeError("\n"+str(sys.version_info) +"\n"+
>>>>>>     str([(k,os.environ[k]) for k in os.environ if "PY" in k]))
>>>>>>
>>>>>> On Thu, Jan 14, 2016 at 8:37 AM, Andrew Weiner <andrewweiner2...@u.northwestern.edu> wrote:
>>>>>>
>>>>>>> Hi Bryan,
>>>>>>>
>>>>>>> I ran "$> python --version" on every node on the cluster, and it is
>>>>>>> Python 2.7.8 for every single one.
>>>>>>>
>>>>>>> When I try to submit the Python example in client mode
>>>>>>> ./bin/spark-submit --master yarn --deploy-mode client --driver-memory 4g --executor-memory 2g --executor-cores 1 ./examples/src/main/python/pi.py 10
>>>>>>> That's when I get this error that I mentioned:
>>>>>>>
>>>>>>> 16/01/14 10:09:10 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
>>>>>>> Error from python worker:
>>>>>>>   python: module pyspark.daemon not found
>>>>>>> PYTHONPATH was:
>>>>>>>   /scratch5/hadoop/yarn/local/usercache/<username>/filecache/48/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/aqualab/spark-1.6.0-bin-hadoop2.4/python:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/py4j-0.9-src.zip
>>>>>>> java.io.EOFException
>>>>>>>   at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>>>>>>   at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
>>>>>>>   at [....]
>>>>>>>
>>>>>>> followed by several more similar errors that also say:
>>>>>>> Error from python worker:
>>>>>>>   python: module pyspark.daemon not found
>>>>>>>
>>>>>>> Even though the default python appeared to be correct, I just went
>>>>>>> ahead and explicitly set PYSPARK_PYTHON in conf/spark-env.sh to the path of
>>>>>>> the default python binary executable. After making this change I was able
>>>>>>> to run the job successfully in client mode! That is, this appeared to fix
>>>>>>> the "pyspark.daemon not found" error when running in client mode.
>>>>>>>
>>>>>>> However, when running in cluster mode, I am still getting the same
>>>>>>> syntax error:
>>>>>>>
>>>>>>> Traceback (most recent call last):
>>>>>>>   File "pi.py", line 24, in ?
>>>>>>>     from pyspark import SparkContext
>>>>>>>   File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
>>>>>>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>>>>>>          ^
>>>>>>> SyntaxError: invalid syntax
>>>>>>>
>>>>>>> Is it possible that the PYSPARK_PYTHON environment variable is ignored
>>>>>>> when jobs are submitted in cluster mode? It seems that Spark or Yarn is
>>>>>>> going behind my back, so to speak, and using some older version of
>>>>>>> python I didn't even know was installed.
>>>>>>>
>>>>>>> Thanks again for all your help thus far. We are getting close....
>>>>>>>
>>>>>>> Andrew
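(For reference, the conf/spark-env.sh change described above amounts to a single line of roughly this form; the path is only a placeholder for whatever the default python resolves to on the nodes:

    # conf/spark-env.sh
    export PYSPARK_PYTHON=/usr/local/bin/python2.7

As the rest of the thread shows, this is honored in client mode but is not reflected in the YARN Application Master in cluster mode, which is what the spark.yarn.appMasterEnv property discussed above addresses.)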
>>>>>>>
>>>>>>> On Wed, Jan 13, 2016 at 6:13 PM, Bryan Cutler <cutl...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Andrew,
>>>>>>>>
>>>>>>>> There are a couple of things to check. First, is Python 2.7 the
>>>>>>>> default version on all nodes in the cluster or is it an alternate
>>>>>>>> install? Meaning, what is the output of the command "$> python --version"?
>>>>>>>> If it is an alternate install, you could set the environment variable
>>>>>>>> PYSPARK_PYTHON to the Python binary executable to use for PySpark in
>>>>>>>> both driver and workers (the default is python).
>>>>>>>>
>>>>>>>> Did you try to submit the Python example under client mode?
>>>>>>>> Otherwise, the command looks fine; you don't use the --class option for
>>>>>>>> submitting python files.
>>>>>>>> ./bin/spark-submit --master yarn --deploy-mode client --driver-memory 4g --executor-memory 2g --executor-cores 1 ./examples/src/main/python/pi.py 10
>>>>>>>>
>>>>>>>> That is a good sign that local jobs and Java examples work,
>>>>>>>> probably just a small configuration issue :)
>>>>>>>>
>>>>>>>> Bryan
>>>>>>>>
>>>>>>>> On Wed, Jan 13, 2016 at 3:51 PM, Andrew Weiner <andrewweiner2...@u.northwestern.edu> wrote:
>>>>>>>>
>>>>>>>>> Thanks for your continuing help. Here is some additional info.
>>>>>>>>>
>>>>>>>>> *OS/architecture*
>>>>>>>>> output of *cat /proc/version*:
>>>>>>>>> Linux version 2.6.18-400.1.1.el5 (mockbu...@x86-012.build.bos.redhat.com)
>>>>>>>>>
>>>>>>>>> output of *lsb_release -a*:
>>>>>>>>> LSB Version: :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
>>>>>>>>> Distributor ID: RedHatEnterpriseServer
>>>>>>>>> Description: Red Hat Enterprise Linux Server release 5.11 (Tikanga)
>>>>>>>>> Release: 5.11
>>>>>>>>> Codename: Tikanga
>>>>>>>>>
>>>>>>>>> *Running a local job*
>>>>>>>>> I have confirmed that I can successfully run python jobs using
>>>>>>>>> bin/spark-submit --master local[*]
>>>>>>>>> Specifically, this is the command I am using:
>>>>>>>>> ./bin/spark-submit --master local[8] ./examples/src/main/python/wordcount.py file:/home/<username>/spark-1.6.0-bin-hadoop2.4/README.md
>>>>>>>>> And it works!
>>>>>>>>>
>>>>>>>>> *Additional info*
>>>>>>>>> I am also able to successfully run the Java SparkPi example using
>>>>>>>>> yarn in cluster mode using this command:
>>>>>>>>> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g --executor-cores 1 lib/spark-examples*.jar 10
>>>>>>>>> This Java job also runs successfully when I change --deploy-mode
>>>>>>>>> to client. The fact that I can run Java jobs in cluster mode makes me
>>>>>>>>> think that everything is installed correctly--is that a valid assumption?
>>>>>>>>>
>>>>>>>>> The problem remains that I cannot submit python jobs. Here is the
>>>>>>>>> command that I am using to try to submit python jobs:
>>>>>>>>> ./bin/spark-submit --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g --executor-cores 1 ./examples/src/main/python/pi.py 10
>>>>>>>>> Does that look like a correct command? I wasn't sure what to put
>>>>>>>>> for --class so I omitted it. At any rate, the result of the above command
>>>>>>>>> is a syntax error, similar to the one I posted in the original email:
>>>>>>>>>
>>>>>>>>> Traceback (most recent call last):
>>>>>>>>>   File "pi.py", line 24, in ?
>>>>>>>>>     from pyspark import SparkContext
>>>>>>>>>   File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
>>>>>>>>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>>>>>>>>          ^
>>>>>>>>> SyntaxError: invalid syntax
>>>>>>>>>
>>>>>>>>> This really looks to me like a problem with the python version.
>>>>>>>>> Python 2.4 would throw this syntax error but Python 2.7 would not. And yet
>>>>>>>>> I am using Python 2.7.8. Is there any chance that Spark or Yarn is somehow
>>>>>>>>> using an older version of Python without my knowledge?
>>>>>>>>>
>>>>>>>>> Finally, when I try to run the same command in client mode...
>>>>>>>>> ./bin/spark-submit --master yarn --deploy-mode client --driver-memory 4g --executor-memory 2g --executor-cores 1 ./examples/src/main/python/pi.py 10
>>>>>>>>> I get the error I mentioned in the prior email:
>>>>>>>>> Error from python worker:
>>>>>>>>>   python: module pyspark.daemon not found
>>>>>>>>>
>>>>>>>>> Any thoughts?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Andrew
>>>>>>>>>
>>>>>>>>> On Mon, Jan 11, 2016 at 12:25 PM, Bryan Cutler <cutl...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> This could be an environment issue, could you give more details
>>>>>>>>>> about the OS/architecture that you are using? If you are sure everything
>>>>>>>>>> is installed correctly on each node following the guide on "Running Spark
>>>>>>>>>> on Yarn" http://spark.apache.org/docs/latest/running-on-yarn.html
>>>>>>>>>> and that the spark assembly jar is reachable, then I would check to see if
>>>>>>>>>> you can submit a local job to just run on one node.
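(A single-node sanity check of that sort could be as simple as the following, reusing the bundled pi.py example; local[1] runs the whole job locally with a single worker thread.)

    ./bin/spark-submit --master local[1] ./examples/src/main/python/pi.py 10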
>>>>>>>>>>
>>>>>>>>>> On Fri, Jan 8, 2016 at 5:22 PM, Andrew Weiner <andrewweiner2...@u.northwestern.edu> wrote:
>>>>>>>>>>
>>>>>>>>>>> Now for simplicity I'm testing with wordcount.py from the
>>>>>>>>>>> provided examples, and using Spark 1.6.0
>>>>>>>>>>>
>>>>>>>>>>> The first error I get is:
>>>>>>>>>>>
>>>>>>>>>>> 16/01/08 19:14:46 ERROR lzo.GPLNativeCodeLoader: Could not load native gpl library
>>>>>>>>>>> java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
>>>>>>>>>>>   at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
>>>>>>>>>>>   at [....]
>>>>>>>>>>>
>>>>>>>>>>> A bit lower down, I see this error:
>>>>>>>>>>>
>>>>>>>>>>> 16/01/08 19:14:48 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
>>>>>>>>>>> Error from python worker:
>>>>>>>>>>>   python: module pyspark.daemon not found
>>>>>>>>>>> PYTHONPATH was:
>>>>>>>>>>>   /scratch5/hadoop/yarn/local/usercache/<username>/filecache/22/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/pyspark.zip:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/py4j-0.9-src.zip
>>>>>>>>>>> java.io.EOFException
>>>>>>>>>>>   at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>>>>>>>>>>   at [....]
>>>>>>>>>>>
>>>>>>>>>>> And then a few more similar pyspark.daemon not found errors...
>>>>>>>>>>>
>>>>>>>>>>> Andrew
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jan 8, 2016 at 2:31 PM, Bryan Cutler <cutl...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>
>>>>>>>>>>>> I know that older versions of Spark could not run PySpark on
>>>>>>>>>>>> YARN in cluster mode. I'm not sure if that is fixed in 1.6.0 though. Can
>>>>>>>>>>>> you try setting the deploy-mode option to "client" when calling spark-submit?
>>>>>>>>>>>>
>>>>>>>>>>>> Bryan
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jan 7, 2016 at 2:39 PM, weineran <andrewweiner2...@u.northwestern.edu> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>
>>>>>>>>>>>>> When I try to submit a python job using spark-submit (using --master yarn
>>>>>>>>>>>>> --deploy-mode cluster), I get the following error:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Traceback (most recent call last):
>>>>>>>>>>>>>   File "loss_rate_by_probe.py", line 15, in ?
>>>>>>>>>>>>>     from pyspark import SparkContext
>>>>>>>>>>>>>   File "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/__init__.py", line 41, in ?
>>>>>>>>>>>>>   File "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/context.py", line 219
>>>>>>>>>>>>>     with SparkContext._lock:
>>>>>>>>>>>>>        ^
>>>>>>>>>>>>> SyntaxError: invalid syntax
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is very similar to this post from 2014
>>>>>>>>>>>>> <http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-lock-Error-td18233.html>,
>>>>>>>>>>>>> but unlike that person I am using Python 2.7.8.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here is what I'm using:
>>>>>>>>>>>>> Spark 1.3.1
>>>>>>>>>>>>> Hadoop 2.4.0.2.1.5.0-695
>>>>>>>>>>>>> Python 2.7.8
>>>>>>>>>>>>>
>>>>>>>>>>>>> Another clue: I also installed Spark 1.6.0 and tried to submit the same
>>>>>>>>>>>>> job. I got a similar error:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Traceback (most recent call last):
>>>>>>>>>>>>>   File "loss_rate_by_probe.py", line 15, in ?
>>>>>>>>>>>>>     from pyspark import SparkContext
>>>>>>>>>>>>>   File "/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0119/container_1450370639491_0119_01_000001/pyspark.zip/pyspark/__init__.py", line 61
>>>>>>>>>>>>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>>>>>>>>>>>>          ^
>>>>>>>>>>>>> SyntaxError: invalid syntax
>>>>>>>>>>>>>
>>>>>>>>>>>>> Any thoughts?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Andrew
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> View this message in context:
>>>>>>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-SyntaxError-invalid-syntax-tp25910.html
>>>>>>>>>>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>>>>>>>>> For additional commands, e-mail: user-h...@spark.apache.org