Hi Felix,
Yeah, when I try to build the docs using jekyll build, I get a
LoadError (cannot load such file -- pygments) and I'm having trouble
getting past it at the moment.
From what I could tell, this does not apply to YARN in client mode. I
was able to submit jobs in client mode and they would run fine without
using the appMasterEnv property. I even confirmed that my environment
variables persisted during the job when run in client mode. There is
something about YARN cluster mode that uses a different environment
(the YARN Application Master environment) and requires the
appMasterEnv property for setting environment variables.
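For the record, here is roughly what the working configuration looks like. This is just a sketch; the python path is a placeholder, and MY_VAR stands in for whatever other variable you need to propagate:

```
# conf/spark-defaults.conf
# In YARN cluster mode, only spark.yarn.appMasterEnv.* reaches the
# Application Master process; exports in spark-env.sh do not.
spark.yarn.appMasterEnv.PYSPARK_PYTHON  /path/to/python
spark.yarn.appMasterEnv.MY_VAR          my_value
```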
On Sun, Jan 17, 2016 at 11:37 PM, Felix Cheung <felixcheun...@hotmail.com> wrote:
Do you still need help on the PR?
btw, does this apply to YARN client mode?
------------------------------------------------------------------------
From: andrewweiner2...@u.northwestern.edu
Date: Sun, 17 Jan 2016 17:00:39 -0600
Subject: Re: SparkContext SyntaxError: invalid syntax
To: cutl...@gmail.com
CC: user@spark.apache.org
Yeah, I do think it would be worth explicitly stating this in the
docs. I was going to try to edit the docs myself and submit a
pull request, but I'm having trouble building the docs from
github. If anyone else wants to do this, here is approximately
what I would say:
(To be added to
http://spark.apache.org/docs/latest/configuration.html#environment-variables)
"Note: When running Spark on YARN in cluster mode, environment variables need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName] property in your conf/spark-defaults.conf file. Environment variables that are set in spark-env.sh will not be reflected in the YARN Application Master process in cluster mode. See the YARN-related Spark Properties (http://spark.apache.org/docs/latest/running-on-yarn.html#spark-properties) for more information."
I might take another crack at building the docs myself if nobody
beats me to this.
Andrew
On Fri, Jan 15, 2016 at 5:01 PM, Bryan Cutler <cutl...@gmail.com> wrote:
Glad you got it going! It wasn't very obvious what needed to be set; maybe it is worth explicitly stating this in the docs, since it seems to have come up a couple of times before too.
Bryan
On Fri, Jan 15, 2016 at 12:33 PM, Andrew Weiner <andrewweiner2...@u.northwestern.edu> wrote:
Actually, I just found this [https://issues.apache.org/jira/browse/SPARK-1680], which after a bit of googling and reading leads me to believe that the preferred way to change the YARN environment is to edit the conf/spark-defaults.conf file by adding this line:

spark.yarn.appMasterEnv.PYSPARK_PYTHON /path/to/python

While both this solution and the solution from my prior email work, I believe this is the preferred solution.
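As an aside, the same property can also be passed per-job on the spark-submit command line via --conf, which avoids editing spark-defaults.conf. A sketch, with placeholder paths:

```
./bin/spark-submit \
  --master yarn --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/path/to/python \
  ./examples/src/main/python/pi.py 10
```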
Sorry for the flurry of emails. Again, thanks for all the
help!
Andrew
On Fri, Jan 15, 2016 at 1:47 PM, Andrew Weiner <andrewweiner2...@u.northwestern.edu> wrote:
I finally got the pi.py example to run in yarn cluster mode. This was the key insight: https://issues.apache.org/jira/browse/SPARK-9229

I had to set SPARK_YARN_USER_ENV in spark-env.sh:

export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/home/aqualab/local/bin/python"

This caused the PYSPARK_PYTHON environment variable to be used in my yarn environment in cluster mode.
Thank you for all your help!
Best,
Andrew
On Fri, Jan 15, 2016 at 12:57 PM, Andrew Weiner <andrewweiner2...@u.northwestern.edu> wrote:
I tried playing around with my environment variables, and here is an update.

When I run in cluster mode, my environment variables do not persist throughout the entire job. For example, I tried creating a local copy of HADOOP_CONF_DIR in /home/<username>/local/etc/hadoop/conf, and then, in spark-env.sh, I set the variable:

export HADOOP_CONF_DIR=/home/<username>/local/etc/hadoop/conf
Later, when we print the environment variables in the Python code, I see this:

('HADOOP_CONF_DIR', '/etc/hadoop/conf')

However, when I run in client mode, I see this:

('HADOOP_CONF_DIR', '/home/awp066/local/etc/hadoop/conf')

Furthermore, if I omit that environment variable from spark-env.sh altogether, I get the expected error in both client and cluster mode:

When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
This suggests that my environment variables are being used when I first submit the job, but at some point during the job my environment variables are thrown out and someone else's (YARN's?) environment variables are being used.
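For reference, the way I was dumping the environment inside the job was essentially this. A minimal sketch, assuming you only care about a few variable-name prefixes; running it both locally and inside the submitted job shows which variables survive into the YARN container:

```python
import os

def env_snapshot(prefixes=("HADOOP", "PYSPARK", "PYTHON")):
    """Return (name, value) pairs for environment variables whose
    names start with any of the given prefixes, sorted by name."""
    return sorted(
        (name, value)
        for name, value in os.environ.items()
        if name.startswith(tuple(prefixes))
    )

if __name__ == "__main__":
    # Compare this output on the driver vs. inside a cluster-mode job.
    for name, value in env_snapshot():
        print((name, value))
```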
Andrew
On Fri, Jan 15, 2016 at 11:03 AM, Andrew Weiner <andrewweiner2...@u.northwestern.edu> wrote:
Indeed! Here is the output when I run in cluster mode:

Traceback (most recent call last):
  File "pi.py", line 22, in ?
    raise RuntimeError("\n"+str(sys.version_info)+"\n"+
RuntimeError:
(2, 4, 3, 'final', 0)
[('PYSPARK_GATEWAY_PORT', '48079'), ('PYTHONPATH',
'/scratch2/hadoop/yarn/local/usercache/<username>/filecache/116/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/<user>/spark-1.6.0-bin-hadoop2.4/python:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/py4j-0.9-src.zip'),
('PYTHONUNBUFFERED', 'YES')]
As we suspected, it is using Python 2.4.
One thing that surprises me is that PYSPARK_PYTHON is not showing up in the list, even though I am setting it and exporting it in spark-submit and in spark-env.sh. Is there somewhere else I need to set this variable? Maybe in one of the Hadoop conf files in my HADOOP_CONF_DIR?
Andrew
On Thu, Jan 14, 2016 at 1:14 PM, Bryan Cutler <cutl...@gmail.com> wrote:
It seems like it could be the case that some other Python version is being invoked. To make sure, can you add something like this to the top of the .py file you are submitting, to get some more info about how the application master is configured?

import sys, os
raise RuntimeError("\n" + str(sys.version_info) + "\n" +
    str([(k, os.environ[k]) for k in os.environ if "PY" in k]))
On Thu, Jan 14, 2016 at 8:37 AM, Andrew Weiner <andrewweiner2...@u.northwestern.edu> wrote:
Hi Bryan,

I ran "$> python --version" on every node on the cluster, and it is Python 2.7.8 for every single one.

When I try to submit the Python example in client mode

./bin/spark-submit --master yarn --deploy-mode client --driver-memory 4g --executor-memory 2g --executor-cores 1 ./examples/src/main/python/pi.py 10

that's when I get this error that I mentioned:
16/01/14 10:09:10 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
Error from python worker:
  python: module pyspark.daemon not found
PYTHONPATH was:
  /scratch5/hadoop/yarn/local/usercache/<username>/filecache/48/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/aqualab/spark-1.6.0-bin-hadoop2.4/python:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/py4j-0.9-src.zip
java.io.EOFException
  at java.io.DataInputStream.readInt(DataInputStream.java:392)
  at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
  at [....]

followed by several more similar errors that also say:

Error from python worker:
  python: module pyspark.daemon not found
Even though the default python appeared to be correct, I just went ahead and explicitly set PYSPARK_PYTHON in conf/spark-env.sh to the path of the default python binary executable. After making this change I was able to run the job successfully in client mode! That is, this appeared to fix the "pyspark.daemon not found" error when running in client mode.

However, when running in cluster mode, I am still getting the same syntax error:
Traceback (most recent call last):
  File "pi.py", line 24, in ?
    from pyspark import SparkContext
  File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
    indent = ' ' * (min(len(m) for m in indents) if indents else 0)
                               ^
SyntaxError: invalid syntax
Is it possible that the PYSPARK_PYTHON environment variable is ignored when jobs are submitted in cluster mode? It seems that Spark or YARN is going behind my back, so to speak, and using some older version of Python I didn't even know was installed.

Thanks again for all your help thus far. We are getting close....
Andrew
On Wed, Jan 13, 2016 at 6:13 PM, Bryan Cutler <cutl...@gmail.com> wrote:
Hi Andrew,

There are a couple of things to check. First, is Python 2.7 the default version on all nodes in the cluster, or is it an alternate install? Meaning, what is the output of this command: "$> python --version"? If it is an alternate install, you could set the environment variable "PYSPARK_PYTHON", the Python binary executable to use for PySpark in both driver and workers (the default is python).

Did you try to submit the Python example under client mode? Otherwise, the command looks fine; you don't use the --class option for submitting python files.

./bin/spark-submit --master yarn --deploy-mode client --driver-memory 4g --executor-memory 2g --executor-cores 1 ./examples/src/main/python/pi.py 10

It is a good sign that local jobs and Java examples work; this is probably just a small configuration issue :)

Bryan
On Wed, Jan 13, 2016 at 3:51 PM, Andrew Weiner <andrewweiner2...@u.northwestern.edu> wrote:
Thanks for your continuing help. Here is some additional info.

_OS/architecture_

output of cat /proc/version:
Linux version 2.6.18-400.1.1.el5 (mockbu...@x86-012.build.bos.redhat.com)

output of lsb_release -a:
LSB Version: :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
Distributor ID: RedHatEnterpriseServer
Description: Red Hat Enterprise Linux Server release 5.11 (Tikanga)
Release: 5.11
Codename: Tikanga
_Running a local job_

I have confirmed that I can successfully run python jobs using bin/spark-submit --master local[*]

Specifically, this is the command I am using:

./bin/spark-submit --master local[8] ./examples/src/main/python/wordcount.py file:/home/<username>/spark-1.6.0-bin-hadoop2.4/README.md

And it works!
_Additional info_

I am also able to successfully run the Java SparkPi example using yarn in cluster mode using this command:

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g --executor-cores 1 lib/spark-examples*.jar 10

This Java job also runs successfully when I change --deploy-mode to client. The fact that I can run Java jobs in cluster mode makes me think that everything is installed correctly--is that a valid assumption?
The problem remains that I cannot submit python jobs. Here is the command that I am using to try to submit python jobs:

./bin/spark-submit --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g --executor-cores 1 ./examples/src/main/python/pi.py 10

Does that look like a correct command? I wasn't sure what to put for --class, so I omitted it. At any rate, the result of the above command is a syntax error, similar to the one I posted in the original email:
Traceback (most recent call last):
  File "pi.py", line 24, in ?
    from pyspark import SparkContext
  File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
    indent = ' ' * (min(len(m) for m in indents) if indents else 0)
                               ^
SyntaxError: invalid syntax
This really looks to me like a problem with the python version. Python 2.4 would throw this syntax error but Python 2.7 would not. And yet I am using Python 2.7.8. Is there any chance that Spark or Yarn is somehow using an older version of Python without my knowledge?
Finally, when I try to run the same command in client mode...

./bin/spark-submit --master yarn --deploy-mode client --driver-memory 4g --executor-memory 2g --executor-cores 1 ./examples/src/main/python/pi.py 10

I get the error I mentioned in the prior email:

Error from python worker:
  python: module pyspark.daemon not found

Any thoughts?

Best,
Andrew
On Mon, Jan 11, 2016 at 12:25 PM, Bryan Cutler <cutl...@gmail.com> wrote:
This could be an environment issue. Could you give more details about the OS/architecture that you are using? If you are sure everything is installed correctly on each node, following the guide on "Running Spark on Yarn" (http://spark.apache.org/docs/latest/running-on-yarn.html), and that the spark assembly jar is reachable, then I would check to see if you can submit a local job to just run on one node.
On Fri, Jan 8, 2016 at 5:22 PM, Andrew Weiner <andrewweiner2...@u.northwestern.edu> wrote:
Now for simplicity I'm testing with wordcount.py from the provided examples, and using Spark 1.6.0.

The first error I get is:

16/01/08 19:14:46 ERROR lzo.GPLNativeCodeLoader: Could not load native gpl library
java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
  at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
  at [....]
A bit lower down, I see this error:

16/01/08 19:14:48 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
Error from python worker:
  python: module pyspark.daemon not found
PYTHONPATH was:
  /scratch5/hadoop/yarn/local/usercache/<username>/filecache/22/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/pyspark.zip:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/py4j-0.9-src.zip
java.io.EOFException
  at java.io.DataInputStream.readInt(DataInputStream.java:392)
  at [....]

And then a few more similar pyspark.daemon not found errors...

Andrew
On Fri, Jan 8, 2016 at 2:31 PM, Bryan Cutler <cutl...@gmail.com> wrote:
Hi Andrew,

I know that older versions of Spark could not run PySpark on YARN in cluster mode. I'm not sure if that is fixed in 1.6.0, though. Can you try setting the deploy-mode option to "client" when calling spark-submit?

Bryan
On Thu, Jan 7, 2016 at 2:39 PM, weineran <andrewweiner2...@u.northwestern.edu> wrote:
Hello,

When I try to submit a python job using spark-submit (using --master yarn --deploy-mode cluster), I get the following error:

Traceback (most recent call last):
  File "loss_rate_by_probe.py", line 15, in ?
    from pyspark import SparkContext
  File "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/__init__.py", line 41, in ?
  File "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/context.py", line 219
    with SparkContext._lock:
                    ^
SyntaxError: invalid syntax

This is very similar to this post from 2014 (http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-lock-Error-td18233.html), but unlike that person I am using Python 2.7.8.

Here is what I'm using:
Spark 1.3.1
Hadoop 2.4.0.2.1.5.0-695
Python 2.7.8
Another clue: I also installed Spark 1.6.0 and tried to submit the same job. I got a similar error:

Traceback (most recent call last):
  File "loss_rate_by_probe.py", line 15, in ?
    from pyspark import SparkContext
  File "/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0119/container_1450370639491_0119_01_000001/pyspark.zip/pyspark/__init__.py", line 61
    indent = ' ' * (min(len(m) for m in indents) if indents else 0)
                               ^
SyntaxError: invalid syntax

Any thoughts?

Andrew
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-SyntaxError-invalid-syntax-tp25910.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org