I have to run this to install the pre-reqs to get the jekyll build to work; you do need the python pygments package (I’m on Ubuntu):
sudo apt-get install ruby ruby-dev make gcc nodejs
sudo gem install jekyll --no-rdoc --no-ri
sudo gem install jekyll-redirect-from
sudo apt-get install python-Pygments
sudo apt-get install python-sphinx
sudo gem install pygments.rb
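Once those are installed, the build itself should just be the usual jekyll invocation. This is a rough sketch from memory (the docs/ location is how I remember the Spark repo being laid out, so double-check against docs/README.md in your checkout):
# from the root of your Spark source checkout; the doc sources live under docs/
cd docs
# build the site; the generated HTML ends up under docs/_site
jekyll build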
Hope that helps! If not, I can try putting together a doc change but I’d rather you could make progress :) On Mon, Jan 18, 2016 at 6:36 AM -0800, "Andrew Weiner" <andrewweiner2...@u.northwestern.edu> wrote: Hi Felix, Yeah, when I try to build the docs using jekyll build, I get a LoadError (cannot load such file -- pygments) and I'm having trouble getting past it at the moment. From what I could tell, this does not apply to YARN in client mode. I was able to submit jobs in client mode and they would run fine without using the appMasterEnv property. I even confirmed that my environment variables persisted during the job when run in client mode. There is something about YARN cluster mode that uses a different environment (the YARN Application Master environment) and requires the appMasterEnv property for setting environment variables. On Sun, Jan 17, 2016 at 11:37 PM, Felix Cheung <felixcheun...@hotmail.com> wrote: > Do you still need help on the PR? > btw, does this apply to YARN client mode? > > ------------------------------ > From: andrewweiner2...@u.northwestern.edu > Date: Sun, 17 Jan 2016 17:00:39 -0600 > Subject: Re: SparkContext SyntaxError: invalid syntax > To: cutl...@gmail.com > CC: user@spark.apache.org > > > Yeah, I do think it would be worth explicitly stating this in the docs. I > was going to try to edit the docs myself and submit a pull request, but I'm > having trouble building the docs from github. If anyone else wants to do > this, here is approximately what I would say: > > (To be added to > http://spark.apache.org/docs/latest/configuration.html#environment-variables > ) > "Note: When running Spark on YARN in cluster mode, environment variables > need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName] > property in your conf/spark-defaults.conf file. Environment variables > that are set in spark-env.sh will not be reflected in the YARN > Application Master process in cluster mode. See the YARN-related Spark > Properties > <http://spark.apache.org/docs/latest/running-on-yarn.html#spark-properties> > for more information." > > I might take another crack at building the docs myself if nobody beats me > to this. > > Andrew > > > On Fri, Jan 15, 2016 at 5:01 PM, Bryan Cutler <cutl...@gmail.com> wrote: > > Glad you got it going! It wasn't very obvious what needed to be set; > maybe it is worth explicitly stating this in the docs since it seems to > have come up a couple times before too. > > Bryan > > On Fri, Jan 15, 2016 at 12:33 PM, Andrew Weiner < > andrewweiner2...@u.northwestern.edu> wrote: > > Actually, I just found this [ > https://issues.apache.org/jira/browse/SPARK-1680], which after a bit of > googling and reading leads me to believe that the preferred way to change > the yarn environment is to edit the spark-defaults.conf file by adding this > line: > spark.yarn.appMasterEnv.PYSPARK_PYTHON /path/to/python > > While both this solution and the solution from my prior email work, I > believe this is the preferred solution. > > Sorry for the flurry of emails. Again, thanks for all the help! > > Andrew > > On Fri, Jan 15, 2016 at 1:47 PM, Andrew Weiner < > andrewweiner2...@u.northwestern.edu> wrote: > > I finally got the pi.py example to run in yarn cluster mode.
This was the > key insight: > https://issues.apache.org/jira/browse/SPARK-9229 > > I had to set SPARK_YARN_USER_ENV in spark-env.sh: > export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/home/aqualab/local/bin/python" > > This caused the PYSPARK_PYTHON environment variable to be used in my yarn > environment in cluster mode. > > Thank you for all your help! > > Best, > Andrew > > > > On Fri, Jan 15, 2016 at 12:57 PM, Andrew Weiner < > andrewweiner2...@u.northwestern.edu> wrote: > > I tried playing around with my environment variables, and here is an > update. > > When I run in cluster mode, my environment variables do not persist > throughout the entire job. > For example, I tried creating a local copy of HADOOP_CONF_DIR in > /home/<username>/local/etc/hadoop/conf, and then, in spark-env.sh I set the > variable: > export HADOOP_CONF_DIR=/home/<username>/local/etc/hadoop/conf > > Later, when we print the environment variables in the python code, I see > this: > > ('HADOOP_CONF_DIR', '/etc/hadoop/conf') > > However, when I run in client mode, I see this: > > ('HADOOP_CONF_DIR', '/home/awp066/local/etc/hadoop/conf') > > Furthermore, if I omit that environment variable from spark-env.sh > altogether, I get the expected error in both client and cluster mode: > > When running with master 'yarn' > either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment. > > This suggests that my environment variables are being used when I first > submit the job, but at some point during the job, my environment variables > are thrown out and someone's (yarn's?) environment variables are being used. > > Andrew > > > On Fri, Jan 15, 2016 at 11:03 AM, Andrew Weiner < > andrewweiner2...@u.northwestern.edu> wrote: > > Indeed! Here is the output when I run in cluster mode: > > Traceback (most recent call last): > File "pi.py", line 22, in ? > raise RuntimeError("\n"+str(sys.version_info) +"\n"+ > RuntimeError: > (2, 4, 3, 'final', 0) > [('PYSPARK_GATEWAY_PORT', '48079'), ('PYTHONPATH', > '/scratch2/hadoop/yarn/local/usercache/<username>/filecache/116/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/<user>/spark-1.6.0-bin-hadoop2.4/python:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/py4j-0.9-src.zip'), > ('PYTHONUNBUFFERED', 'YES')] > > As we suspected, it is using Python 2.4. > > One thing that surprises me is that PYSPARK_PYTHON is not showing up in the > list, even though I am setting it and exporting it in spark-submit *and* in > spark-env.sh. Is there somewhere else I need to set this variable? Maybe in > one of the hadoop conf files in my HADOOP_CONF_DIR? > > Andrew > > > > On Thu, Jan 14, 2016 at 1:14 PM, Bryan Cutler <cutl...@gmail.com> wrote: > > It seems like it could be the case that some other Python version is being > invoked. To make sure, can you add something like this to the top of the > .py file you are submitting to get some more info about how the application > master is configured? > > import sys, os > raise RuntimeError("\n"+str(sys.version_info) +"\n"+ > str([(k,os.environ[k]) for k in os.environ if "PY" in k])) > > On Thu, Jan 14, 2016 at 8:37 AM, Andrew Weiner < > andrewweiner2...@u.northwestern.edu> wrote: > > Hi Bryan, > > I ran "$> python --version" on every node on the cluster, and it is Python > 2.7.8 for every single one.
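> For reference, that per-node check was nothing fancy, just a loop along these lines (the hostnames here are placeholders for our actual nodes):
> # prints each node's hostname and the default python version on that node
> for h in node01 node02 node03; do ssh "$h" 'hostname; python --version'; done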
> > When I try to submit the Python example in client mode > * ./bin/spark-submit --master yarn --deploy-mode client > --driver-memory 4g --executor-memory 2g --executor-cores 1 > ./examples/src/main/python/pi.py 10* > That's when I get this error that I mentioned: > > 16/01/14 10:09:10 WARN scheduler.TaskSetManager: Lost task 0.0 in stage > 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException: > Error from python worker: > python: module pyspark.daemon not found > PYTHONPATH was: > > /scratch5/hadoop/yarn/local/usercache/<username>/filecache/48/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/aqualab/spark-1.6.0-bin-hadoop2.4/python:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/py4j-0.9-src.zip > java.io.EOFException > at java.io.DataInputStream.readInt(DataInputStream.java:392) > at > org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164) > at [....] > > followed by several more similar errors that also say: > Error from python worker: > python: module pyspark.daemon not found > > > Even though the default python appeared to be correct, I just went ahead > and explicitly set PYSPARK_PYTHON in conf/spark-env.sh to the path of the > default python binary executable. After making this change I was able to > run the job successfully in client mode! That is, this appeared to fix the > "pyspark.daemon not found" error when running in client mode. > > However, when running in cluster mode, I am still getting the same syntax > error: > > Traceback (most recent call last): > File "pi.py", line 24, in ? > from pyspark import SparkContext > File > "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line > 61 > indent = ' ' * (min(len(m) for m in indents) if indents else 0) > ^ > SyntaxError: invalid syntax > > Is it possible that the PYSPARK_PYTHON environment variable is ignored when > jobs are submitted in cluster mode? It seems that Spark or Yarn is going > behind my back, so to speak, and using some older version of python I didn't > even know was installed. > > Thanks again for all your help thus far. We are getting close... > > Andrew > > > > On Wed, Jan 13, 2016 at 6:13 PM, Bryan Cutler <cutl...@gmail.com> wrote: > > Hi Andrew, > > There are a couple of things to check. First, is Python 2.7 the default > version on all nodes in the cluster or is it an alternate install? Meaning, > what is the output of this command: "$> python --version"? If it is an > alternate install, you could set the environment variable "PYSPARK_PYTHON" > to the Python binary executable to use for PySpark in both driver and workers > (the default is python). > > Did you try to submit the Python example under client mode?
Otherwise, > the command looks fine; you don't use the --class option when submitting > Python files: > * ./bin/spark-submit --master yarn --deploy-mode client > --driver-memory 4g --executor-memory 2g --executor-cores 1 > ./examples/src/main/python/pi.py 10* > > That is a good sign that local jobs and Java examples work, probably just > a small configuration issue :) > > Bryan > > On Wed, Jan 13, 2016 at 3:51 PM, Andrew Weiner < > andrewweiner2...@u.northwestern.edu> wrote: > > Thanks for your continuing help. Here is some additional info. > > *OS/architecture* > output of *cat /proc/version*: > Linux version 2.6.18-400.1.1.el5 (mockbu...@x86-012.build.bos.redhat.com) > > output of *lsb_release -a*: > LSB Version: > > :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch > Distributor ID: RedHatEnterpriseServer > Description: Red Hat Enterprise Linux Server release 5.11 (Tikanga) > Release: 5.11 > Codename: Tikanga > > *Running a local job* > I have confirmed that I can successfully run python jobs using > bin/spark-submit --master local[*] > Specifically, this is the command I am using: > *./bin/spark-submit --master local[8] > ./examples/src/main/python/wordcount.py > file:/home/<username>/spark-1.6.0-bin-hadoop2.4/README.md* > And it works! > > *Additional info* > I am also able to successfully run the Java SparkPi example using yarn in > cluster mode using this command: > * ./bin/spark-submit --class org.apache.spark.examples.SparkPi > --master yarn --deploy-mode cluster --driver-memory 4g > --executor-memory 2g --executor-cores 1 lib/spark-examples*.jar > 10* > This Java job also runs successfully when I change --deploy-mode to > client. The fact that I can run Java jobs in cluster mode makes me think > that everything is installed correctly--is that a valid assumption? > > The problem remains that I cannot submit python jobs. Here is the command > that I am using to try to submit python jobs: > * ./bin/spark-submit --master yarn --deploy-mode cluster > --driver-memory 4g --executor-memory 2g --executor-cores 1 > ./examples/src/main/python/pi.py 10* > Does that look like a correct command? I wasn't sure what to put for > --class so I omitted it. At any rate, the result of the above command is a > syntax error, similar to the one I posted in the original email: > > Traceback (most recent call last): > File "pi.py", line 24, in ? > from pyspark import SparkContext > File > "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line > 61 > indent = ' ' * (min(len(m) for m in indents) if indents else 0) > ^ > SyntaxError: invalid syntax > > > This really looks to me like a problem with the python version. Python > 2.4 would throw this syntax error but Python 2.7 would not. And yet I am > using Python 2.7.8. Is there any chance that Spark or Yarn is somehow > using an older version of Python without my knowledge? > > Finally, when I try to run the same command in client mode... > * ./bin/spark-submit --master yarn --deploy-mode client > --driver-memory 4g --executor-memory 2g --executor-cores 1 > ./examples/src/main/python/pi.py 10* > I get the error I mentioned in the prior email: > Error from python worker: > python: module pyspark.daemon not found > > Any thoughts?
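> P.S. One more data point on the version theory: the failing line, indent = ' ' * (min(len(m) for m in indents) if indents else 0), uses a conditional expression ("x if cond else y"), which was only added in Python 2.5, so a 2.4 interpreter cannot even parse the file. A quick sanity check (just a sketch; substitute whichever python binary YARN is actually using):
> # raises SyntaxError on Python 2.4, prints "yes" on Python 2.5 and later
> python -c "print('yes' if True else 'no')"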
> > Best, > Andrew > > > On Mon, Jan 11, 2016 at 12:25 PM, Bryan Cutler <cutl...@gmail.com> wrote: > > This could be an environment issue; could you give more details about the > OS/architecture that you are using? If you are sure everything is > installed correctly on each node following the guide on "Running Spark on > Yarn" http://spark.apache.org/docs/latest/running-on-yarn.html and that > the spark assembly jar is reachable, then I would check to see if you can > submit a local job to just run on one node. > > On Fri, Jan 8, 2016 at 5:22 PM, Andrew Weiner < > andrewweiner2...@u.northwestern.edu> wrote: > > Now for simplicity I'm testing with wordcount.py from the provided > examples, and using Spark 1.6.0. > > The first error I get is: > > 16/01/08 19:14:46 ERROR lzo.GPLNativeCodeLoader: Could not load native gpl > library > java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path > at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864) > at [....] > > A bit lower down, I see this error: > > 16/01/08 19:14:48 WARN scheduler.TaskSetManager: Lost task 0.0 in stage > 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException: > Error from python worker: > python: module pyspark.daemon not found > PYTHONPATH was: > > /scratch5/hadoop/yarn/local/usercache/<username>/filecache/22/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/pyspark.zip:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/py4j-0.9-src.zip > java.io.EOFException > at java.io.DataInputStream.readInt(DataInputStream.java:392) > at [....] > > And then a few more similar pyspark.daemon not found errors... > > Andrew > > > > On Fri, Jan 8, 2016 at 2:31 PM, Bryan Cutler <cutl...@gmail.com> wrote: > > Hi Andrew, > > I know that older versions of Spark could not run PySpark on YARN in > cluster mode. I'm not sure if that is fixed in 1.6.0 though. Can you try > setting the deploy-mode option to "client" when calling spark-submit? > > Bryan > > On Thu, Jan 7, 2016 at 2:39 PM, weineran < > andrewweiner2...@u.northwestern.edu> wrote: > > Hello, > > When I try to submit a python job using spark-submit (using --master yarn > --deploy-mode cluster), I get the following error: > > Traceback (most recent call last): > File "loss_rate_by_probe.py", line 15, in ? > from pyspark import SparkContext > File > > "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/__init__.py", > line 41, in ? > File > > "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/context.py", > line 219 > with SparkContext._lock: > ^ > SyntaxError: invalid syntax > > This is very similar to this post from 2014 > <http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-lock-Error-td18233.html>, but unlike that person I am using Python 2.7.8. > > Here is what I'm using: > Spark 1.3.1 > Hadoop 2.4.0.2.1.5.0-695 > Python 2.7.8 > > Another clue: I also installed Spark 1.6.0 and tried to submit the same > job. I got a similar error: > > Traceback (most recent call last): > File "loss_rate_by_probe.py", line 15, in ?
> from pyspark import SparkContext > File > > "/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0119/container_1450370639491_0119_01_000001/pyspark.zip/pyspark/__init__.py", > line 61 > indent = ' ' * (min(len(m) for m in indents) if indents else 0) > ^ > SyntaxError: invalid syntax > > Any thoughts? > > Andrew > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-SyntaxError-invalid-syntax-tp25910.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > --------------------------------------------------------------------- > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org