Hi Felix,

Yeah, when I try to build the docs using jekyll build, I get a LoadError
(cannot load such file -- pygments) and I'm having trouble getting past it
at the moment.
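
My guess is that the pygments.rb gem just isn't installed in that Ruby
environment, so the next thing I'll try is something along these lines
(not positive that's the whole story, though):

gem install pygments.rb

and then re-run jekyll build.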

From what I could tell, this does not apply to YARN in client mode.  I was
able to submit jobs in client mode and they would run fine without using
the appMasterEnv property.  I even confirmed that my environment variables
persisted during the job when run in client mode.  There is something about
YARN cluster mode that uses a different environment (the YARN Application
Master environment) and requires the appMasterEnv property for setting
environment variables.
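
To make that concrete, here is roughly what the two settings look like
side by side (the python path below is just a placeholder for wherever
your interpreter actually lives):

# conf/spark-defaults.conf -- what YARN cluster mode needs
spark.yarn.appMasterEnv.PYSPARK_PYTHON   /path/to/python

# conf/spark-env.sh -- this alone was enough for client mode
export PYSPARK_PYTHON=/path/to/python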

On Sun, Jan 17, 2016 at 11:37 PM, Felix Cheung <felixcheun...@hotmail.com>
wrote:

> Do you still need help on the PR?
> btw, does this apply to YARN client mode?
>
> ------------------------------
> From: andrewweiner2...@u.northwestern.edu
> Date: Sun, 17 Jan 2016 17:00:39 -0600
> Subject: Re: SparkContext SyntaxError: invalid syntax
> To: cutl...@gmail.com
> CC: user@spark.apache.org
>
>
> Yeah, I do think it would be worth explicitly stating this in the docs.  I
> was going to try to edit the docs myself and submit a pull request, but I'm
> having trouble building the docs from github.  If anyone else wants to do
> this, here is approximately what I would say:
>
> (To be added to
> http://spark.apache.org/docs/latest/configuration.html#environment-variables
> )
> "Note: When running Spark on YARN in cluster mode, environment variables
> need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName]
> property in your conf/spark-defaults.conf file.  Environment variables
> that are set in spark-env.sh will not be reflected in the YARN
> Application Master process in cluster mode.  See the YARN-related Spark
> Properties
> <http://spark.apache.org/docs/latest/running-on-yarn.html#spark-properties>
> for more information."
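>
> For example, the kind of line the note would be describing (the path is
> just a placeholder) is:
>
> spark.yarn.appMasterEnv.PYSPARK_PYTHON   /path/to/python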
>
> I might take another crack at building the docs myself if nobody beats me
> to this.
>
> Andrew
>
>
> On Fri, Jan 15, 2016 at 5:01 PM, Bryan Cutler <cutl...@gmail.com> wrote:
>
> Glad you got it going!  It wasn't very obvious what needed to be set;
> maybe it is worth explicitly stating this in the docs since it seems to
> have come up a couple of times before too.
>
> Bryan
>
> On Fri, Jan 15, 2016 at 12:33 PM, Andrew Weiner <
> andrewweiner2...@u.northwestern.edu> wrote:
>
> Actually, I just found this [
> https://issues.apache.org/jira/browse/SPARK-1680], which after a bit of
> googling and reading leads me to believe that the preferred way to change
> the yarn environment is to edit the spark-defaults.conf file by adding this
> line:
> spark.yarn.appMasterEnv.PYSPARK_PYTHON    /path/to/python
>
> While both this solution and the solution from my prior email work, I
> believe this is the preferred solution.
>
> Sorry for the flurry of emails.  Again, thanks for all the help!
>
> Andrew
>
> On Fri, Jan 15, 2016 at 1:47 PM, Andrew Weiner <
> andrewweiner2...@u.northwestern.edu> wrote:
>
> I finally got the pi.py example to run in yarn cluster mode.  This was the
> key insight:
> https://issues.apache.org/jira/browse/SPARK-9229
>
> I had to set SPARK_YARN_USER_ENV in spark-env.sh:
> export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/home/aqualab/local/bin/python"
>
> This caused the PYSPARK_PYTHON environment variable to be used in my yarn
> environment in cluster mode.
>
> Thank you for all your help!
>
> Best,
> Andrew
>
>
>
> On Fri, Jan 15, 2016 at 12:57 PM, Andrew Weiner <
> andrewweiner2...@u.northwestern.edu> wrote:
>
> I tried playing around with my environment variables, and here is an
> update.
>
> When I run in cluster mode, my environment variables do not persist
> throughout the entire job.
> For example, I tried creating a local copy of HADOOP_CONF_DIR in
> /home/<username>/local/etc/hadoop/conf, and then, in spark-env.sh, I set
> the variable:
> export HADOOP_CONF_DIR=/home/<username>/local/etc/hadoop/conf
>
> Later, when we print the environment variables in the python code, I see
> this:
>
> ('HADOOP_CONF_DIR', '/etc/hadoop/conf')
>
> However, when I run in client mode, I see this:
>
> ('HADOOP_CONF_DIR', '/home/awp066/local/etc/hadoop/conf')
>
> Furthermore, if I omit that environment variable from spark-env.sh 
> altogether, I get the expected error in both client and cluster mode:
>
> When running with master 'yarn'
> either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
>
> This suggests that my environment variables are being used when I first
> submit the job, but at some point during the job they are thrown out and
> someone else's (YARN's?) environment variables are used instead.
>
> Andrew
>
>
> On Fri, Jan 15, 2016 at 11:03 AM, Andrew Weiner <
> andrewweiner2...@u.northwestern.edu> wrote:
>
> Indeed!  Here is the output when I run in cluster mode:
>
> Traceback (most recent call last):
>   File "pi.py", line 22, in ?
>     raise RuntimeError("\n"+str(sys.version_info) +"\n"+
> RuntimeError:
> (2, 4, 3, 'final', 0)
> [('PYSPARK_GATEWAY_PORT', '48079'), ('PYTHONPATH', 
> '/scratch2/hadoop/yarn/local/usercache/<username>/filecache/116/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/<user>/spark-1.6.0-bin-hadoop2.4/python:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/py4j-0.9-src.zip'),
>  ('PYTHONUNBUFFERED', 'YES')]
>
> As we suspected, it is using Python 2.4
>
> One thing that surprises me is that PYSPARK_PYTHON is not showing up in the 
> list, even though I am setting it and exporting it in spark-submit *and* in 
> spark-env.sh.  Is there somewhere else I need to set this variable?  Maybe in 
> one of the hadoop conf files in my HADOOP_CONF_DIR?
>
> Andrew
>
>
>
> On Thu, Jan 14, 2016 at 1:14 PM, Bryan Cutler <cutl...@gmail.com> wrote:
>
> It seems like it could be the case that some other Python version is being
> invoked.  To make sure, can you add something like this to the top of the
> .py file you are submitting to get some more info about how the application
> master is configured?
>
> import sys, os
> raise RuntimeError("\n"+str(sys.version_info) +"\n"+
>     str([(k,os.environ[k]) for k in os.environ if "PY" in k]))
>
> On Thu, Jan 14, 2016 at 8:37 AM, Andrew Weiner <
> andrewweiner2...@u.northwestern.edu> wrote:
>
> Hi Bryan,
>
> I ran "$> python --version" on every node on the cluster, and it is Python
> 2.7.8 for every single one.
>
> When I try to submit the Python example in client mode
>
> ./bin/spark-submit --master yarn --deploy-mode client \
>     --driver-memory 4g --executor-memory 2g --executor-cores 1 \
>     ./examples/src/main/python/pi.py 10
>
> That's when I get this error that I mentioned:
>
> 16/01/14 10:09:10 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
> 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
> Error from python worker:
>   python: module pyspark.daemon not found
> PYTHONPATH was:
>
> /scratch5/hadoop/yarn/local/usercache/<username>/filecache/48/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/aqualab/spark-1.6.0-bin-hadoop2.4/python:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/py4j-0.9-src.zip
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at
> org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
>         at [....]
>
> followed by several more similar errors that also say:
> Error from python worker:
>   python: module pyspark.daemon not found
>
>
> Even though the default python appeared to be correct, I just went ahead
> and explicitly set PYSPARK_PYTHON in conf/spark-env.sh to the path of the
> default python binary executable.  After making this change I was able to
> run the job successfully in client mode!  That is, this appeared to fix the
> "pyspark.daemon not found" error when running in client mode.
>
> However, when running in cluster mode, I am still getting the same syntax
> error:
>
> Traceback (most recent call last):
>   File "pi.py", line 24, in ?
>     from pyspark import SparkContext
>   File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>                                                   ^
> SyntaxError: invalid syntax
>
> Is it possible that the PYSPARK_PYTHON environment variable is ignored when 
> jobs are submitted in cluster mode?  It seems that Spark or Yarn is going 
> behind my back, so to speak, and using some older version of python I didn't 
> even know was installed.
>
> Thanks again for all your help thus far.  We are getting close....
>
> Andrew
>
>
>
> On Wed, Jan 13, 2016 at 6:13 PM, Bryan Cutler <cutl...@gmail.com> wrote:
>
> Hi Andrew,
>
> There are a couple of things to check.  First, is Python 2.7 the default
> version on all nodes in the cluster, or is it an alternate install?  That
> is, what is the output of the command "$> python --version"?  If it is an
> alternate install, you could set the environment variable "PYSPARK_PYTHON"
> to the Python binary executable that PySpark should use in both the driver
> and the workers (the default is python).
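>
> For example, something along these lines in conf/spark-env.sh, where the
> path is just a placeholder for your Python 2.7 binary:
>
> export PYSPARK_PYTHON=/path/to/python2.7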
>
> Did you try to submit the Python example in client mode?  Otherwise, the
> command looks fine; you don't use the --class option when submitting
> Python files:
>
> ./bin/spark-submit --master yarn --deploy-mode client \
>     --driver-memory 4g --executor-memory 2g --executor-cores 1 \
>     ./examples/src/main/python/pi.py 10
>
> It is a good sign that local jobs and Java examples work; this is probably
> just a small configuration issue :)
>
> Bryan
>
> On Wed, Jan 13, 2016 at 3:51 PM, Andrew Weiner <
> andrewweiner2...@u.northwestern.edu> wrote:
>
> Thanks for your continuing help.  Here is some additional info.
>
> *OS/architecture*
> output of *cat /proc/version*:
> Linux version 2.6.18-400.1.1.el5 (mockbu...@x86-012.build.bos.redhat.com)
>
> output of *lsb_release -a*:
> LSB Version:
>  
> :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
> Distributor ID: RedHatEnterpriseServer
> Description:    Red Hat Enterprise Linux Server release 5.11 (Tikanga)
> Release:        5.11
> Codename:       Tikanga
>
> *Running a local job*
> I have confirmed that I can successfully run python jobs using
> bin/spark-submit --master local[*]
> Specifically, this is the command I am using:
> ./bin/spark-submit --master local[8] \
>     ./examples/src/main/python/wordcount.py \
>     file:/home/<username>/spark-1.6.0-bin-hadoop2.4/README.md
> And it works!
>
> *Additional info*
> I am also able to successfully run the Java SparkPi example using yarn in
> cluster mode using this command:
> ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
>     --master yarn --deploy-mode cluster --driver-memory 4g \
>     --executor-memory 2g --executor-cores 1 lib/spark-examples*.jar 10
> This Java job also runs successfully when I change --deploy-mode to
> client.  The fact that I can run Java jobs in cluster mode makes me think
> that everything is installed correctly--is that a valid assumption?
>
> The problem remains that I cannot submit python jobs.  Here is the command
> that I am using to try to submit python jobs:
> ./bin/spark-submit --master yarn --deploy-mode cluster \
>     --driver-memory 4g --executor-memory 2g --executor-cores 1 \
>     ./examples/src/main/python/pi.py 10
> Does that look like a correct command?  I wasn't sure what to put for
> --class so I omitted it.  At any rate, the result of the above command is a
> syntax error, similar to the one I posted in the original email:
>
> Traceback (most recent call last):
>   File "pi.py", line 24, in ?
>     from pyspark import SparkContext
>   File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>                                                   ^
> SyntaxError: invalid syntax
>
>
> This really looks to me like a problem with the python version.  Python
> 2.4 would throw this syntax error but Python 2.7 would not.  And yet I am
> using Python 2.7.8.  Is there any chance that Spark or Yarn is somehow
> using an older version of Python without my knowledge?
>
> Finally, when I try to run the same command in client mode...
> ./bin/spark-submit --master yarn --deploy-mode client \
>     --driver-memory 4g --executor-memory 2g --executor-cores 1 \
>     ./examples/src/main/python/pi.py 10
> I get the error I mentioned in the prior email:
> Error from python worker:
>   python: module pyspark.daemon not found
>
> Any thoughts?
>
> Best,
> Andrew
>
>
> On Mon, Jan 11, 2016 at 12:25 PM, Bryan Cutler <cutl...@gmail.com> wrote:
>
> This could be an environment issue; could you give more details about the
> OS/architecture that you are using?  If you are sure everything is
> installed correctly on each node, following the guide on "Running Spark on
> YARN" http://spark.apache.org/docs/latest/running-on-yarn.html, and that
> the Spark assembly jar is reachable, then I would check whether you can
> submit a local job to just run on one node.
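>
> For example, something like this with one of the bundled examples should
> exercise PySpark on a single node:
>
> ./bin/spark-submit --master local[*] ./examples/src/main/python/pi.py 10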
>
> On Fri, Jan 8, 2016 at 5:22 PM, Andrew Weiner <
> andrewweiner2...@u.northwestern.edu> wrote:
>
> Now, for simplicity, I'm testing with wordcount.py from the provided
> examples, and using Spark 1.6.0.
>
> The first error I get is:
>
> 16/01/08 19:14:46 ERROR lzo.GPLNativeCodeLoader: Could not load native gpl
> library
> java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
>         at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
>         at [....]
>
> A bit lower down, I see this error:
>
> 16/01/08 19:14:48 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
> 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
> Error from python worker:
>   python: module pyspark.daemon not found
> PYTHONPATH was:
>
> /scratch5/hadoop/yarn/local/usercache/<username>/filecache/22/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/pyspark.zip:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/py4j-0.9-src.zip
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at [....]
>
> And then a few more similar pyspark.daemon not found errors...
>
> Andrew
>
>
>
> On Fri, Jan 8, 2016 at 2:31 PM, Bryan Cutler <cutl...@gmail.com> wrote:
>
> Hi Andrew,
>
> I know that older versions of Spark could not run PySpark on YARN in
> cluster mode.  I'm not sure whether that is fixed in 1.6.0, though.  Can you
> try setting the --deploy-mode option to "client" when calling spark-submit?
>
> Bryan
>
> On Thu, Jan 7, 2016 at 2:39 PM, weineran <
> andrewweiner2...@u.northwestern.edu> wrote:
>
> Hello,
>
> When I try to submit a python job using spark-submit (using --master yarn
> --deploy-mode cluster), I get the following error:
>
> Traceback (most recent call last):
>   File "loss_rate_by_probe.py", line 15, in ?
>     from pyspark import SparkContext
>   File "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/__init__.py", line 41, in ?
>   File "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/context.py", line 219
>     with SparkContext._lock:
>                     ^
> SyntaxError: invalid syntax
>
> This is very similar to this post from 2014
> (http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-lock-Error-td18233.html),
> but unlike that person I am using Python 2.7.8.
>
> Here is what I'm using:
> Spark 1.3.1
> Hadoop 2.4.0.2.1.5.0-695
> Python 2.7.8
>
> Another clue:  I also installed Spark 1.6.0 and tried to submit the same
> job.  I got a similar error:
>
> Traceback (most recent call last):
>   File "loss_rate_by_probe.py", line 15, in ?
>     from pyspark import SparkContext
>   File "/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0119/container_1450370639491_0119_01_000001/pyspark.zip/pyspark/__init__.py", line 61
>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>                                                   ^
> SyntaxError: invalid syntax
>
> Any thoughts?
>
> Andrew
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-SyntaxError-invalid-syntax-tp25910.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>