RE: SparkContext SyntaxError: invalid syntax

Felix Cheung Sun, 17 Jan 2016 21:38:04 -0800

Do you still need help on the PR?
btw, does this apply to YARN client mode?
 
From: andrewweiner2...@u.northwestern.edu
Date: Sun, 17 Jan 2016 17:00:39 -0600
Subject: Re: SparkContext SyntaxError: invalid syntax
To: cutl...@gmail.com
CC: user@spark.apache.org


Yeah, I do think it would be worth explicitly stating this in the docs.  I was 
going to try to edit the docs myself and submit a pull request, but I'm having 
trouble building the docs from github.  If anyone else wants to do this, here 
is approximately what I would say:
(To be added to
http://spark.apache.org/docs/latest/configuration.html#environment-variables)

"Note: When running Spark on YARN in cluster mode, environment variables need
to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName] property
in your conf/spark-defaults.conf file.  Environment variables that are set in
spark-env.sh will not be reflected in the YARN Application Master process in
cluster mode.  See the YARN-related Spark Properties for more information."

I might take another crack at building the docs myself if nobody beats me to 
this.
Andrew
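
As a minimal sketch of what the proposed note describes, the helper below renders spark-defaults.conf lines for Application Master environment variables. The function name app_master_env_lines is hypothetical; only the spark.yarn.appMasterEnv.[EnvironmentVariableName] property and the whitespace-separated "key value" layout come from this thread.

```python
def app_master_env_lines(env):
    """Render spark-defaults.conf lines that set YARN Application Master env vars.

    `env` maps environment variable names to values. This helper is
    hypothetical; only the spark.yarn.appMasterEnv.* prefix is from the thread.
    """
    lines = []
    for name in sorted(env):
        # spark-defaults.conf holds one whitespace-separated "key value" pair per line.
        lines.append("spark.yarn.appMasterEnv.%s    %s" % (name, env[name]))
    return lines


if __name__ == "__main__":
    # /path/to/python is the placeholder used in the thread, not a real path.
    for line in app_master_env_lines({"PYSPARK_PYTHON": "/path/to/python"}):
        print(line)
```

The printed lines would be appended to conf/spark-defaults.conf by hand; the sketch deliberately does not write to any file.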

On Fri, Jan 15, 2016 at 5:01 PM, Bryan Cutler <cutl...@gmail.com> wrote:
Glad you got it going!  It wasn't very obvious what needed to be set, so maybe
it is worth explicitly stating this in the docs, since it seems to have come up
a couple of times before too.

Bryan

On Fri, Jan 15, 2016 at 12:33 PM, Andrew Weiner 
<andrewweiner2...@u.northwestern.edu> wrote:
Actually, I just found this [https://issues.apache.org/jira/browse/SPARK-1680],
which after a bit of googling and reading leads me to believe that the
preferred way to change the yarn environment is to edit the spark-defaults.conf
file by adding this line:

spark.yarn.appMasterEnv.PYSPARK_PYTHON    /path/to/python

While both this solution and the solution from my prior email work, I believe 
this is the preferred solution.
Sorry for the flurry of emails.  Again, thanks for all the help!
Andrew
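
Spark properties can generally also be passed to spark-submit as --conf key=value, so the same setting can presumably be supplied per job instead of in spark-defaults.conf. The thread does not test that route, so treat the following as an untested sketch; the helper name spark_submit_args and its parameters are hypothetical.

```python
def spark_submit_args(script, app_args=(), app_master_env=None,
                      master="yarn", deploy_mode="cluster"):
    """Build an argument list for spark-submit (hypothetical helper).

    Each entry of `app_master_env` becomes
    --conf spark.yarn.appMasterEnv.NAME=value.
    """
    args = ["./bin/spark-submit", "--master", master,
            "--deploy-mode", deploy_mode]
    for name in sorted(app_master_env or {}):
        args.append("--conf")
        args.append("spark.yarn.appMasterEnv.%s=%s" % (name, app_master_env[name]))
    args.append(script)
    args.extend([str(a) for a in app_args])
    return args


if __name__ == "__main__":
    # Prints the command only; nothing is executed.
    print(" ".join(spark_submit_args(
        "./examples/src/main/python/pi.py", [10],
        {"PYSPARK_PYTHON": "/path/to/python"})))
```

The list form can be handed to subprocess.call without shell quoting concerns, which is why the sketch returns a list rather than a string.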
On Fri, Jan 15, 2016 at 1:47 PM, Andrew Weiner 
<andrewweiner2...@u.northwestern.edu> wrote:
I finally got the pi.py example to run in yarn cluster mode.  This was the key
insight: https://issues.apache.org/jira/browse/SPARK-9229

I had to set SPARK_YARN_USER_ENV in spark-env.sh:

export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/home/aqualab/local/bin/python"

This caused the PYSPARK_PYTHON environment variable to be used in my yarn
environment in cluster mode.

Thank you for all your help!

Best,
Andrew


On Fri, Jan 15, 2016 at 12:57 PM, Andrew Weiner 
<andrewweiner2...@u.northwestern.edu> wrote:
I tried playing around with my environment variables, and here is an update.

When I run in cluster mode, my environment variables do not persist throughout
the entire job.  For example, I tried creating a local copy of HADOOP_CONF_DIR
in /home/<username>/local/etc/hadoop/conf, and then, in spark-env.sh, I set the
variable:

export HADOOP_CONF_DIR=/home/<username>/local/etc/hadoop/conf

Later, when we print the environment variables in the python code, I see this:

('HADOOP_CONF_DIR', '/etc/hadoop/conf')

However, when I run in client mode, I see this:

('HADOOP_CONF_DIR', '/home/awp066/local/etc/hadoop/conf')

Furthermore, if I omit that environment variable from spark-env.sh altogether,
I get the expected error in both client and cluster mode:

When running with master 'yarn'
either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.

This suggests that my environment variables are being used when I first submit
the job, but at some point during the job, my environment variables are thrown
out and someone's (yarn's?) environment variables are being used.

Andrew
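
The comparison described in this message can be sketched as a small diff between the values set at submit time and the values the running job actually sees. The helper names env_snapshot and env_diff are hypothetical, and the example values are the ones quoted in the message.

```python
import os


def env_snapshot(names, environ=None):
    """Return {name: value or None} for the requested variable names."""
    if environ is None:
        environ = os.environ
    return dict([(name, environ.get(name)) for name in names])


def env_diff(submitted, observed):
    """Return {name: (submitted_value, observed_value)} for values that differ."""
    diff = {}
    for name in sorted(set(submitted) | set(observed)):
        if submitted.get(name) != observed.get(name):
            diff[name] = (submitted.get(name), observed.get(name))
    return diff


if __name__ == "__main__":
    # Value exported in spark-env.sh versus value printed by the cluster-mode job.
    submitted = {"HADOOP_CONF_DIR": "/home/<username>/local/etc/hadoop/conf"}
    observed = {"HADOOP_CONF_DIR": "/etc/hadoop/conf"}
    for name, (want, got) in env_diff(submitted, observed).items():
        print("%s: submitted %r, observed %r" % (name, want, got))
```

In a real run, `observed` would come from calling env_snapshot inside the submitted job, since that is the process whose environment YARN controls in cluster mode.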
On Fri, Jan 15, 2016 at 11:03 AM, Andrew Weiner 
<andrewweiner2...@u.northwestern.edu> wrote:
Indeed!  Here is the output when I run in cluster mode:

Traceback (most recent call last):
  File "pi.py", line 22, in ?
    raise RuntimeError("\n"+str(sys.version_info) +"\n"+ 
RuntimeError: 
(2, 4, 3, 'final', 0)
[('PYSPARK_GATEWAY_PORT', '48079'), ('PYTHONPATH', 
'/scratch2/hadoop/yarn/local/usercache/<username>/filecache/116/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/<user>/spark-1.6.0-bin-hadoop2.4/python:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/py4j-0.9-src.zip'),
 ('PYTHONUNBUFFERED', 'YES')]

As we suspected, it is using Python 2.4.

One thing that surprises me is that PYSPARK_PYTHON is not showing up in the
list, even though I am setting it and exporting it in spark-submit and in
spark-env.sh.  Is there somewhere else I need to set this variable?  Maybe in
one of the hadoop conf files in my HADOOP_CONF_DIR?

Andrew

On Thu, Jan 14, 2016 at 1:14 PM, Bryan Cutler <cutl...@gmail.com> wrote:
It seems like it could be the case that some other Python version is being 
invoked.  To make sure, can you add something like this to the top of the .py 
file you are submitting to get some more info about how the application master 
is configured?

import sys, os
raise RuntimeError("\n"+str(sys.version_info) +"\n"+ 
    str([(k,os.environ[k]) for k in os.environ if "PY" in k]))
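
A variant of the snippet above, sketched under the assumption that writing to stderr (which should end up in the YARN container log) is acceptable instead of aborting the job. The function name python_env_report is hypothetical. The code sticks to syntax that very old interpreters accept (no conditional expressions, no with statement), because the interpreter under suspicion may be an old one.

```python
import os
import sys


def python_env_report(environ=None):
    """Describe the running interpreter and every env var whose name contains "PY"."""
    if environ is None:
        environ = os.environ
    lines = ["executable: %s" % sys.executable,
             "version: %s" % (tuple(sys.version_info),)]
    for key in sorted(environ.keys()):
        if "PY" in key:
            lines.append("%s=%s" % (key, environ[key]))
    return "\n".join(lines)


if __name__ == "__main__":
    # Report without raising, so the rest of the job still runs.
    sys.stderr.write(python_env_report() + "\n")
```

Unlike the raise-based version, this also reports sys.executable, which shows directly which Python binary was launched.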

On Thu, Jan 14, 2016 at 8:37 AM, Andrew Weiner 
<andrewweiner2...@u.northwestern.edu> wrote:
Hi Bryan,

I ran "$> python --version" on every node on the cluster, and it is Python
2.7.8 for every single one.

When I try to submit the Python example in client mode:

./bin/spark-submit --master yarn --deploy-mode client --driver-memory 4g --executor-memory 2g --executor-cores 1 ./examples/src/main/python/pi.py 10

That's when I get this error that I mentioned:

16/01/14 10:09:10 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0
(TID 0, mundonovo-priv): org.apache.spark.SparkException:
Error from python worker:
  python: module pyspark.daemon not found
PYTHONPATH was:
  /scratch5/hadoop/yarn/local/usercache/<username>/filecache/48/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/aqualab/spark-1.6.0-bin-hadoop2.4/python:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/py4j-0.9-src.zip
java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
        at [....]

followed by several more similar errors that also say:

Error from python worker:
  python: module pyspark.daemon not found

Even though the default python appeared to be correct, I just went ahead and
explicitly set PYSPARK_PYTHON in conf/spark-env.sh to the path of the default
python binary executable.  After making this change I was able to run the job
successfully in client mode!  That is, this appeared to fix the "pyspark.daemon
not found" error when running in client mode.

However, when running in cluster mode, I am still getting the same syntax
error:

Traceback (most recent call last):
  File "pi.py", line 24, in ?
    from pyspark import SparkContext
  File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
    indent = ' ' * (min(len(m) for m in indents) if indents else 0)
                                                  ^
SyntaxError: invalid syntax

Is it possible that the PYSPARK_PYTHON environment variable is ignored when
jobs are submitted in cluster mode?  It seems that Spark or Yarn is going
behind my back, so to speak, and using some older version of python I didn't
even know was installed.

Thanks again for all your help thus far.  We are getting close....

Andrew
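
One way to turn a confusing SyntaxError into an explicit message is a version guard placed at the very top of the submitted script, before "from pyspark import SparkContext". This is only a sketch: the name version_problem is hypothetical, and the (2, 6) minimum is an assumption based on the Spark 1.6 documentation listing Python 2.6+ as supported. The guard itself avoids newer syntax so that an old interpreter can still compile and run it. The "in ?" in the traceback is another hint of an old interpreter, since newer ones print "in <module>" for module-level code.

```python
import sys

# Assumption: Spark 1.6 documents Python 2.6+ as its supported range.
MIN_VERSION = (2, 6)


def version_problem(version_info, minimum=MIN_VERSION):
    """Return an error message if version_info is older than minimum, else None."""
    if tuple(version_info[:2]) < minimum:
        return "Python %d.%d or newer required, but running %d.%d" % (
            minimum[0], minimum[1], version_info[0], version_info[1])
    return None


if __name__ == "__main__":
    problem = version_problem(sys.version_info)
    if problem is not None:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit(problem)
```

With the (2, 4, 3, 'final', 0) interpreter reported elsewhere in this thread, the guard would stop the job with a readable message instead of a syntax error inside pyspark.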


On Wed, Jan 13, 2016 at 6:13 PM, Bryan Cutler <cutl...@gmail.com> wrote:
Hi Andrew,

There are a couple of things to check.  First, is Python 2.7 the default
version on all nodes in the cluster, or is it an alternate install?  Meaning,
what is the output of the command "$>  python --version"?  If it is an
alternate install, you could set the environment variable "PYSPARK_PYTHON":

    Python binary executable to use for PySpark in both driver and workers
(default is python).

Did you try to submit the Python example under client mode?  Otherwise, the
command looks fine; you don't use the --class option for submitting python
files:

./bin/spark-submit --master yarn --deploy-mode client --driver-memory 4g --executor-memory 2g --executor-cores 1 ./examples/src/main/python/pi.py 10

It is a good sign that local jobs and Java examples work; this is probably
just a small configuration issue :)

Bryan

On Wed, Jan 13, 2016 at 3:51 PM, Andrew Weiner 
<andrewweiner2...@u.northwestern.edu> wrote:
Thanks for your continuing help.  Here is some additional info.

OS/architecture

output of cat /proc/version:
Linux version 2.6.18-400.1.1.el5 (mockbu...@x86-012.build.bos.redhat.com)

output of lsb_release -a:
LSB Version:    :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
Distributor ID: RedHatEnterpriseServer
Description:    Red Hat Enterprise Linux Server release 5.11 (Tikanga)
Release:        5.11
Codename:       Tikanga

Running a local job

I have confirmed that I can successfully run python jobs using
bin/spark-submit --master local[*].  Specifically, this is the command I am
using:

./bin/spark-submit --master local[8] ./examples/src/main/python/wordcount.py file:/home/<username>/spark-1.6.0-bin-hadoop2.4/README.md

And it works!

Additional info

I am also able to successfully run the Java SparkPi example using yarn in
cluster mode using this command:

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g --executor-cores 1 lib/spark-examples*.jar 10

This Java job also runs successfully when I change --deploy-mode to client.
The fact that I can run Java jobs in cluster mode makes me think that
everything is installed correctly--is that a valid assumption?

The problem remains that I cannot submit python jobs.  Here is the command that
I am using to try to submit python jobs:

./bin/spark-submit --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g --executor-cores 1 ./examples/src/main/python/pi.py 10

Does that look like a correct command?  I wasn't sure what to put for --class
so I omitted it.  At any rate, the result of the above command is a syntax
error, similar to the one I posted in the original email:

Traceback (most recent call last):
  File "pi.py", line 24, in ?
    from pyspark import SparkContext
  File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
    indent = ' ' * (min(len(m) for m in indents) if indents else 0)
                                                  ^
SyntaxError: invalid syntax

This really looks to me like a problem with the python version.  Python 2.4
would throw this syntax error but Python 2.7 would not.  And yet I am using
Python 2.7.8.  Is there any chance that Spark or Yarn is somehow using an older
version of Python without my knowledge?

Finally, when I try to run the same command in client mode...

./bin/spark-submit --master yarn --deploy-mode client --driver-memory 4g --executor-memory 2g --executor-cores 1 ./examples/src/main/python/pi.py 10

I get the error I mentioned in the prior email:

Error from python worker:
  python: module pyspark.daemon not found

Any thoughts?

Best,
Andrew
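
The caret in the traceback sits at the "if" of a conditional expression (x if cond else y), a form added to the language in Python 2.5, so a Python 2.4 interpreter rejects that line when it compiles the module. Purely as an illustration of which construct is at fault, and not as a suggested patch to PySpark, the same logic can be written without a conditional expression. Both function names below are hypothetical.

```python
def indent_with_conditional(indents):
    """The form used on the failing line; needs Python 2.5 or newer to compile."""
    return ' ' * (min(len(m) for m in indents) if indents else 0)


def indent_without_conditional(indents):
    """Equivalent logic using a plain if statement, which older syntax allows."""
    if indents:
        width = min([len(m) for m in indents])
    else:
        width = 0
    return ' ' * width


if __name__ == "__main__":
    sample = ["    ", "  ", "      "]
    # Both forms agree; only the first one uses the newer syntax.
    print(indent_with_conditional(sample) == indent_without_conditional(sample))
```

Because the failure happens at compile time, no setting inside the script can avoid it; the interpreter that YARN launches has to be a newer one.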

On Mon, Jan 11, 2016 at 12:25 PM, Bryan Cutler <cutl...@gmail.com> wrote:
This could be an environment issue.  Could you give more details about the
OS/architecture that you are using?  If you are sure everything is installed
correctly on each node following the guide on "Running Spark on Yarn"
(http://spark.apache.org/docs/latest/running-on-yarn.html) and that the spark
assembly jar is reachable, then I would check to see if you can submit a local
job to just run on one node.

On Fri, Jan 8, 2016 at 5:22 PM, Andrew Weiner 
<andrewweiner2...@u.northwestern.edu> wrote:
Now for simplicity I'm testing with wordcount.py from the provided examples,
and using Spark 1.6.0.

The first error I get is:

16/01/08 19:14:46 ERROR lzo.GPLNativeCodeLoader: Could not load native gpl library
java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
        at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
        at [....]

A bit lower down, I see this error:

16/01/08 19:14:48 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0
(TID 0, mundonovo-priv): org.apache.spark.SparkException:
Error from python worker:
  python: module pyspark.daemon not found
PYTHONPATH was:
  /scratch5/hadoop/yarn/local/usercache/<username>/filecache/22/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/pyspark.zip:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/py4j-0.9-src.zip
java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at [....]

And then a few more similar pyspark.daemon not found errors...

Andrew


On Fri, Jan 8, 2016 at 2:31 PM, Bryan Cutler <cutl...@gmail.com> wrote:
Hi Andrew,

I know that older versions of Spark could not run PySpark on YARN in cluster 
mode.  I'm not sure if that is fixed in 1.6.0 though.  Can you try setting 
deploy-mode option to "client" when calling spark-submit?

Bryan

On Thu, Jan 7, 2016 at 2:39 PM, weineran <andrewweiner2...@u.northwestern.edu> 
wrote:
Hello,

When I try to submit a python job using spark-submit (using --master yarn
--deploy-mode cluster), I get the following error:

Traceback (most recent call last):
  File "loss_rate_by_probe.py", line 15, in ?
    from pyspark import SparkContext
  File "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/__init__.py", line 41, in ?
  File "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/context.py", line 219
    with SparkContext._lock:
                    ^
SyntaxError: invalid syntax

This is very similar to this post from 2014
<http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-lock-Error-td18233.html>,
but unlike that person I am using Python 2.7.8.

Here is what I'm using:
Spark 1.3.1
Hadoop 2.4.0.2.1.5.0-695
Python 2.7.8

Another clue:  I also installed Spark 1.6.0 and tried to submit the same
job.  I got a similar error:

Traceback (most recent call last):
  File "loss_rate_by_probe.py", line 15, in ?
    from pyspark import SparkContext
  File "/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0119/container_1450370639491_0119_01_000001/pyspark.zip/pyspark/__init__.py", line 61
    indent = ' ' * (min(len(m) for m in indents) if indents else 0)
                                                  ^
SyntaxError: invalid syntax

Any thoughts?

Andrew







