Actually, I just found this
[https://issues.apache.org/jira/browse/SPARK-1680], which, after a bit of
googling and reading, leads me to believe that the preferred way to change
the YARN environment is to edit the spark-defaults.conf file by adding this
line:
spark.yarn.appMasterEnv.PYSPARK_PYTHON    /path/to/python

While both this solution and the one from my prior email work, I believe
this is the preferred approach.
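
For a one-off job, I believe the same property can also be passed at submit
time with --conf, instead of editing spark-defaults.conf, e.g.:

./bin/spark-submit --master yarn --deploy-mode cluster \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/path/to/python \
    ./examples/src/main/python/pi.py 10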

Sorry for the flurry of emails.  Again, thanks for all the help!

Andrew

On Fri, Jan 15, 2016 at 1:47 PM, Andrew Weiner <
andrewweiner2...@u.northwestern.edu> wrote:

> I finally got the pi.py example to run in yarn cluster mode.  This was the
> key insight:
> https://issues.apache.org/jira/browse/SPARK-9229
>
> I had to set SPARK_YARN_USER_ENV in spark-env.sh:
> export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/home/aqualab/local/bin/python"
>
> This caused the PYSPARK_PYTHON environment variable to be used in my yarn
> environment in cluster mode.
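>
> (If I'm reading the YARN docs correctly, SPARK_YARN_USER_ENV takes a
> comma-separated list of KEY=VALUE pairs, so multiple variables can be set
> in one go, e.g.
> export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/home/aqualab/local/bin/python,FOO=bar"
> where FOO=bar is just a placeholder for any additional variable.)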
>
> Thank you for all your help!
>
> Best,
> Andrew
>
>
>
> On Fri, Jan 15, 2016 at 12:57 PM, Andrew Weiner <
> andrewweiner2...@u.northwestern.edu> wrote:
>
>> I tried playing around with my environment variables, and here is an
>> update.
>>
>> When I run in cluster mode, my environment variables do not persist
>> throughout the entire job.
>> For example, I tried creating a local copy of HADOOP_CONF_DIR in
>> /home/<username>/local/etc/hadoop/conf, and then, in spark-env.sh, I set
>> the variable:
>> export HADOOP_CONF_DIR=/home/<username>/local/etc/hadoop/conf
>>
>> Later, when we print the environment variables in the python code, I see
>> this:
>>
>> ('HADOOP_CONF_DIR', '/etc/hadoop/conf')
>>
>> However, when I run in client mode, I see this:
>>
>> ('HADOOP_CONF_DIR', '/home/awp066/local/etc/hadoop/conf')
>>
>> Furthermore, if I omit that environment variable from spark-env.sh 
>> altogether, I get the expected error in both client and cluster mode:
>>
>> When running with master 'yarn'
>> either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
>>
>> This suggests that my environment variables are being used when I first
>> submit the job, but that at some point during the job they are thrown out
>> and someone else's (YARN's?) environment variables are used instead.
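>>
>> (For reference, a minimal sketch of how to dump the relevant variables at
>> the top of the driver script; the filter is arbitrary:)
>>
>> import os
>> print([(k, os.environ[k]) for k in sorted(os.environ)
>>        if "HADOOP" in k or "PY" in k])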
>>
>> Andrew
>>
>>
>> On Fri, Jan 15, 2016 at 11:03 AM, Andrew Weiner <
>> andrewweiner2...@u.northwestern.edu> wrote:
>>
>>> Indeed!  Here is the output when I run in cluster mode:
>>>
>>> Traceback (most recent call last):
>>>   File "pi.py", line 22, in ?
>>>     raise RuntimeError("\n"+str(sys.version_info) +"\n"+
>>> RuntimeError:
>>> (2, 4, 3, 'final', 0)
>>> [('PYSPARK_GATEWAY_PORT', '48079'), ('PYTHONPATH', 
>>> '/scratch2/hadoop/yarn/local/usercache/<username>/filecache/116/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/<user>/spark-1.6.0-bin-hadoop2.4/python:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/py4j-0.9-src.zip'),
>>>  ('PYTHONUNBUFFERED', 'YES')]
>>>
>>> As we suspected, it is using Python 2.4
>>>
>>> One thing that surprises me is that PYSPARK_PYTHON is not showing up in the 
>>> list, even though I am setting it and exporting it in spark-submit *and* in 
>>> spark-env.sh.  Is there somewhere else I need to set this variable?  Maybe 
>>> in one of the hadoop conf files in my HADOOP_CONF_DIR?
>>>
>>> Andrew
>>>
>>>
>>>
>>> On Thu, Jan 14, 2016 at 1:14 PM, Bryan Cutler <cutl...@gmail.com> wrote:
>>>
>>>> It seems like some other Python version is being invoked.  To make sure,
>>>> can you add something like this to the top of
>>>> the .py file you are submitting to get some more info about how the
>>>> application master is configured?
>>>>
>>>> import sys, os
>>>> raise RuntimeError("\n"+str(sys.version_info) +"\n"+
>>>>     str([(k,os.environ[k]) for k in os.environ if "PY" in k]))
>>>>
>>>> On Thu, Jan 14, 2016 at 8:37 AM, Andrew Weiner <
>>>> andrewweiner2...@u.northwestern.edu> wrote:
>>>>
>>>>> Hi Bryan,
>>>>>
>>>>> I ran "$> python --version" on every node on the cluster, and it is
>>>>> Python 2.7.8 for every single one.
>>>>>
>>>>> When I try to submit the Python example in client mode
>>>>> * ./bin/spark-submit      --master yarn     --deploy-mode client
>>>>> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>>>> ./examples/src/main/python/pi.py     10*
>>>>> That's when I get this error that I mentioned:
>>>>>
>>>>> 16/01/14 10:09:10 WARN scheduler.TaskSetManager: Lost task 0.0 in
>>>>> stage 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
>>>>> Error from python worker:
>>>>>   python: module pyspark.daemon not found
>>>>> PYTHONPATH was:
>>>>>
>>>>> /scratch5/hadoop/yarn/local/usercache/<username>/filecache/48/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/aqualab/spark-1.6.0-bin-hadoop2.4/python:/home/jpr123/hg.pacific/python-common:/home/jp
>>>>>
>>>>> r123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/py4j-0.9-src.zip
>>>>> java.io.EOFException
>>>>>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>>>>         at
>>>>> org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
>>>>>         at [....]
>>>>>
>>>>> followed by several more similar errors that also say:
>>>>> Error from python worker:
>>>>>   python: module pyspark.daemon not found
>>>>>
>>>>>
>>>>> Even though the default python appeared to be correct, I just went
>>>>> ahead and explicitly set PYSPARK_PYTHON in conf/spark-env.sh to the path of
>>>>> the default python binary executable.  After making this change I was able
>>>>> to run the job successfully in client mode!  That is, this appeared to fix the
>>>>> "pyspark.daemon not found" error when running in client mode.
>>>>>
>>>>> However, when running in cluster mode, I am still getting the same
>>>>> syntax error:
>>>>>
>>>>> Traceback (most recent call last):
>>>>>   File "pi.py", line 24, in ?
>>>>>     from pyspark import SparkContext
>>>>>   File 
>>>>> "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", 
>>>>> line 61
>>>>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>>>>                                                   ^
>>>>> SyntaxError: invalid syntax
>>>>>
>>>>> Is it possible that the PYSPARK_PYTHON environment variable is ignored 
>>>>> when jobs are submitted in cluster mode?  It seems that Spark or Yarn is 
>>>>> going behind my back, so to speak, and using some older version of python 
>>>>> I didn't even know was installed.
>>>>>
>>>>> Thanks again for all your help thus far.  We are getting close....
>>>>>
>>>>> Andrew
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Jan 13, 2016 at 6:13 PM, Bryan Cutler <cutl...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Andrew,
>>>>>>
>>>>>> There are a couple of things to check.  First, is Python 2.7 the
>>>>>> default version on all nodes in the cluster, or is it an alternate
>>>>>> install?  That is, what is the output of the command "$> python --version"?
>>>>>> If it is an alternate install, you could set the environment variable
>>>>>> PYSPARK_PYTHON to the Python binary executable that PySpark should use
>>>>>> on both the driver and the workers (the default is python).
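>>>>>> For example, in conf/spark-env.sh (the path here is just a placeholder
>>>>>> for wherever your Python 2.7 binary lives):
>>>>>> export PYSPARK_PYTHON=/path/to/python2.7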
>>>>>>
>>>>>> Did you try to submit the Python example in client mode?
>>>>>> Otherwise, the command looks fine; you don't use the --class option when
>>>>>> submitting Python files:
>>>>>> * ./bin/spark-submit      --master yarn     --deploy-mode client
>>>>>> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>>>>> ./examples/src/main/python/pi.py     10*
>>>>>>
>>>>>> It is a good sign that local jobs and the Java examples work; this is
>>>>>> probably just a small configuration issue :)
>>>>>>
>>>>>> Bryan
>>>>>>
>>>>>> On Wed, Jan 13, 2016 at 3:51 PM, Andrew Weiner <
>>>>>> andrewweiner2...@u.northwestern.edu> wrote:
>>>>>>
>>>>>>> Thanks for your continuing help.  Here is some additional info.
>>>>>>>
>>>>>>> *OS/architecture*
>>>>>>> output of *cat /proc/version*:
>>>>>>> Linux version 2.6.18-400.1.1.el5 (
>>>>>>> mockbu...@x86-012.build.bos.redhat.com)
>>>>>>>
>>>>>>> output of *lsb_release -a*:
>>>>>>> LSB Version:
>>>>>>>  
>>>>>>> :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
>>>>>>> Distributor ID: RedHatEnterpriseServer
>>>>>>> Description:    Red Hat Enterprise Linux Server release 5.11
>>>>>>> (Tikanga)
>>>>>>> Release:        5.11
>>>>>>> Codename:       Tikanga
>>>>>>>
>>>>>>> *Running a local job*
>>>>>>> I have confirmed that I can successfully run python jobs using
>>>>>>> bin/spark-submit --master local[*]
>>>>>>> Specifically, this is the command I am using:
>>>>>>> *./bin/spark-submit --master local[8]
>>>>>>> ./examples/src/main/python/wordcount.py
>>>>>>> file:/home/<username>/spark-1.6.0-bin-hadoop2.4/README.md*
>>>>>>> And it works!
>>>>>>>
>>>>>>> *Additional info*
>>>>>>> I am also able to successfully run the Java SparkPi example using
>>>>>>> yarn in cluster mode using this command:
>>>>>>> * ./bin/spark-submit --class org.apache.spark.examples.SparkPi
>>>>>>> --master yarn     --deploy-mode cluster     --driver-memory 4g
>>>>>>> --executor-memory 2g     --executor-cores 1     lib/spark-examples*.jar
>>>>>>> 10*
>>>>>>> This Java job also runs successfully when I change --deploy-mode to
>>>>>>> client.  The fact that I can run Java jobs in cluster mode makes me
>>>>>>> think
>>>>>>> that everything is installed correctly--is that a valid assumption?
>>>>>>>
>>>>>>> The problem remains that I cannot submit python jobs.  Here is the
>>>>>>> command that I am using to try to submit python jobs:
>>>>>>> * ./bin/spark-submit      --master yarn     --deploy-mode cluster
>>>>>>>   --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>>>>>> ./examples/src/main/python/pi.py     10*
>>>>>>> Does that look like a correct command?  I wasn't sure what to put
>>>>>>> for --class so I omitted it.  At any rate, the result of the above 
>>>>>>> command
>>>>>>> is a syntax error, similar to the one I posted in the original email:
>>>>>>>
>>>>>>> Traceback (most recent call last):
>>>>>>>   File "pi.py", line 24, in ?
>>>>>>>     from pyspark import SparkContext
>>>>>>>   File 
>>>>>>> "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py",
>>>>>>>  line 61
>>>>>>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>>>>>>                                                   ^
>>>>>>> SyntaxError: invalid syntax
>>>>>>>
>>>>>>>
>>>>>>> This really looks to me like a problem with the python version.
>>>>>>> Python 2.4 would throw this syntax error but Python 2.7 would not.  And 
>>>>>>> yet
>>>>>>> I am using Python 2.7.8.  Is there any chance that Spark or Yarn is 
>>>>>>> somehow
>>>>>>> using an older version of Python without my knowledge?
>>>>>>>
>>>>>>> Finally, when I try to run the same command in client mode...
>>>>>>> * ./bin/spark-submit      --master yarn     --deploy-mode client
>>>>>>> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>>>>>> ./examples/src/main/python/pi.py 10*
>>>>>>> I get the error I mentioned in the prior email:
>>>>>>> Error from python worker:
>>>>>>>   python: module pyspark.daemon not found
>>>>>>>
>>>>>>> Any thoughts?
>>>>>>>
>>>>>>> Best,
>>>>>>> Andrew
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jan 11, 2016 at 12:25 PM, Bryan Cutler <cutl...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> This could be an environment issue; could you give more details
>>>>>>>> about the OS/architecture that you are using?  If you are sure everything
>>>>>>>> is installed correctly on each node, following the guide on "Running Spark
>>>>>>>> on Yarn" http://spark.apache.org/docs/latest/running-on-yarn.html,
>>>>>>>> and that the spark assembly jar is reachable, then I would check whether
>>>>>>>> you can submit a local job that just runs on one node.
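>>>>>>>> For example, something along these lines with one of the bundled
>>>>>>>> examples:
>>>>>>>> ./bin/spark-submit --master local[1] ./examples/src/main/python/pi.py 10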
>>>>>>>>
>>>>>>>> On Fri, Jan 8, 2016 at 5:22 PM, Andrew Weiner <
>>>>>>>> andrewweiner2...@u.northwestern.edu> wrote:
>>>>>>>>
>>>>>>>>> Now, for simplicity, I'm testing with wordcount.py from the provided
>>>>>>>>> examples, using Spark 1.6.0.
>>>>>>>>>
>>>>>>>>> The first error I get is:
>>>>>>>>>
>>>>>>>>> 16/01/08 19:14:46 ERROR lzo.GPLNativeCodeLoader: Could not load
>>>>>>>>> native gpl library
>>>>>>>>> java.lang.UnsatisfiedLinkError: no gplcompression in
>>>>>>>>> java.library.path
>>>>>>>>>         at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
>>>>>>>>>         at [....]
>>>>>>>>>
>>>>>>>>> A bit lower down, I see this error:
>>>>>>>>>
>>>>>>>>> 16/01/08 19:14:48 WARN scheduler.TaskSetManager: Lost task 0.0 in
>>>>>>>>> stage 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
>>>>>>>>> Error from python worker:
>>>>>>>>>   python: module pyspark.daemon not found
>>>>>>>>> PYTHONPATH was:
>>>>>>>>>
>>>>>>>>> /scratch5/hadoop/yarn/local/usercache/<username>/filecache/22/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/pyspark.zip:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/py4j-0.9-src.zip
>>>>>>>>> java.io.EOFException
>>>>>>>>>         at
>>>>>>>>> java.io.DataInputStream.readInt(DataInputStream.java:392)
>>>>>>>>>         at [....]
>>>>>>>>>
>>>>>>>>> And then a few more similar pyspark.daemon not found errors...
>>>>>>>>>
>>>>>>>>> Andrew
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Jan 8, 2016 at 2:31 PM, Bryan Cutler <cutl...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Andrew,
>>>>>>>>>>
>>>>>>>>>> I know that older versions of Spark could not run PySpark on YARN
>>>>>>>>>> in cluster mode.  I'm not sure whether that is fixed in 1.6.0, though.
>>>>>>>>>> Can you try setting the deploy-mode option to "client" when calling
>>>>>>>>>> spark-submit?
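>>>>>>>>>> For example (using one of the bundled examples):
>>>>>>>>>> ./bin/spark-submit --master yarn --deploy-mode client ./examples/src/main/python/pi.py 10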
>>>>>>>>>>
>>>>>>>>>> Bryan
>>>>>>>>>>
>>>>>>>>>> On Thu, Jan 7, 2016 at 2:39 PM, weineran <
>>>>>>>>>> andrewweiner2...@u.northwestern.edu> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> When I try to submit a python job using spark-submit (using
>>>>>>>>>>> --master yarn
>>>>>>>>>>> --deploy-mode cluster), I get the following error:
>>>>>>>>>>>
>>>>>>>>>>> /Traceback (most recent call last):
>>>>>>>>>>>   File "loss_rate_by_probe.py", line 15, in ?
>>>>>>>>>>>     from pyspark import SparkContext
>>>>>>>>>>>   File
>>>>>>>>>>>
>>>>>>>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/__init__.py",
>>>>>>>>>>> line 41, in ?
>>>>>>>>>>>   File
>>>>>>>>>>>
>>>>>>>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/context.py",
>>>>>>>>>>> line 219
>>>>>>>>>>>     with SparkContext._lock:
>>>>>>>>>>>                     ^
>>>>>>>>>>> SyntaxError: invalid syntax/
>>>>>>>>>>>
>>>>>>>>>>> This is very similar to this post from 2014
>>>>>>>>>>> <http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-lock-Error-td18233.html>,
>>>>>>>>>>> but unlike that person I am using Python 2.7.8.
>>>>>>>>>>>
>>>>>>>>>>> Here is what I'm using:
>>>>>>>>>>> Spark 1.3.1
>>>>>>>>>>> Hadoop 2.4.0.2.1.5.0-695
>>>>>>>>>>> Python 2.7.8
>>>>>>>>>>>
>>>>>>>>>>> Another clue:  I also installed Spark 1.6.0 and tried to submit
>>>>>>>>>>> the same
>>>>>>>>>>> job.  I got a similar error:
>>>>>>>>>>>
>>>>>>>>>>> /Traceback (most recent call last):
>>>>>>>>>>>   File "loss_rate_by_probe.py", line 15, in ?
>>>>>>>>>>>     from pyspark import SparkContext
>>>>>>>>>>>   File
>>>>>>>>>>>
>>>>>>>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0119/container_1450370639491_0119_01_000001/pyspark.zip/pyspark/__init__.py",
>>>>>>>>>>> line 61
>>>>>>>>>>>     indent = ' ' * (min(len(m) for m in indents) if indents else
>>>>>>>>>>> 0)
>>>>>>>>>>>                                                   ^
>>>>>>>>>>> SyntaxError: invalid syntax/
>>>>>>>>>>>
>>>>>>>>>>> Any thoughts?
>>>>>>>>>>>
>>>>>>>>>>> Andrew
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
