Indeed!  Here is the output when I run in cluster mode:

Traceback (most recent call last):
  File "pi.py", line 22, in ?
    raise RuntimeError("\n"+str(sys.version_info) +"\n"+
RuntimeError:
(2, 4, 3, 'final', 0)
[('PYSPARK_GATEWAY_PORT', '48079'), ('PYTHONPATH',
'/scratch2/hadoop/yarn/local/usercache/<username>/filecache/116/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/<user>/spark-1.6.0-bin-hadoop2.4/python:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/py4j-0.9-src.zip'),
('PYTHONUNBUFFERED', 'YES')]

As we suspected, it is using Python 2.4.

One thing that surprises me is that PYSPARK_PYTHON is not showing up
in the list, even though I am setting and exporting it both when I run
spark-submit *and* in spark-env.sh.  Is there somewhere else I need to
set this variable?  Maybe in one of the Hadoop conf files in my
HADOOP_CONF_DIR?
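
For reference, this is roughly what I am running (the python path below is
just a placeholder for the actual Python 2.7 binary on my cluster, and
conf/spark-env.sh exports the same variable):

export PYSPARK_PYTHON=/path/to/python2.7   # placeholder path
./bin/spark-submit --master yarn --deploy-mode cluster \
    --driver-memory 4g --executor-memory 2g --executor-cores 1 \
    ./examples/src/main/python/pi.py 10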

Andrew



On Thu, Jan 14, 2016 at 1:14 PM, Bryan Cutler <cutl...@gmail.com> wrote:

> It seems like some other Python version is being invoked.  To make sure,
> can you add something like this to the top of the .py file you are
> submitting, to get some more info about how the application master is
> configured?
>
> import sys, os
> raise RuntimeError("\n"+str(sys.version_info) +"\n"+
>     str([(k,os.environ[k]) for k in os.environ if "PY" in k]))
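>
> If the job is submitted in cluster mode, the RuntimeError output should end
> up in the YARN logs for that application (e.g. "yarn logs -applicationId
> <app id>" should include it).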
>
> On Thu, Jan 14, 2016 at 8:37 AM, Andrew Weiner <
> andrewweiner2...@u.northwestern.edu> wrote:
>
>> Hi Bryan,
>>
>> I ran "$> python --version" on every node on the cluster, and it is
>> Python 2.7.8 for every single one.
>>
>> When I try to submit the Python example in client mode
>> * ./bin/spark-submit      --master yarn     --deploy-mode client
>> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>> ./examples/src/main/python/pi.py     10*
>> That's when I get this error that I mentioned:
>>
>> 16/01/14 10:09:10 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
>> 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
>> Error from python worker:
>>   python: module pyspark.daemon not found
>> PYTHONPATH was:
>>
>> /scratch5/hadoop/yarn/local/usercache/<username>/filecache/48/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/aqualab/spark-1.6.0-bin-hadoop2.4/python:/home/jpr123/hg.pacific/python-common:/home/jp
>>
>> r123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/py4j-0.9-src.zip
>> java.io.EOFException
>>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>         at
>> org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
>>         at [....]
>>
>> followed by several more similar errors that also say:
>> Error from python worker:
>>   python: module pyspark.daemon not found
>>
>>
>> Even though the default python appeared to be correct, I just went ahead
>> and explicitly set PYSPARK_PYTHON in conf/spark-env.sh to the path of the
>> default python binary executable.  After making this change I was able to
>> run the job successfully in client mode!  That is, this appeared to fix
>> the "pyspark.daemon not found" error when running in client mode.
>>
>> However, when running in cluster mode, I am still getting the same syntax
>> error:
>>
>> Traceback (most recent call last):
>>   File "pi.py", line 24, in ?
>>     from pyspark import SparkContext
>>   File 
>> "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", 
>> line 61
>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>                                                   ^
>> SyntaxError: invalid syntax
>>
>> Is it possible that the PYSPARK_PYTHON environment variable is ignored when 
>> jobs are submitted in cluster mode?  It seems that Spark or Yarn is going 
>> behind my back, so to speak, and using some older version of python I didn't 
>> even know was installed.
>>
>> Thanks again for all your help thus far.  We are getting close....
>>
>> Andrew
>>
>>
>>
>> On Wed, Jan 13, 2016 at 6:13 PM, Bryan Cutler <cutl...@gmail.com> wrote:
>>
>>> Hi Andrew,
>>>
>>> There are a couple of things to check.  First, is Python 2.7 the default
>>> version on all nodes in the cluster, or is it an alternate install?  Meaning,
>>> what is the output of the command "$> python --version"?  If it is an
>>> alternate install, you could set the environment variable PYSPARK_PYTHON
>>> to the Python binary executable to use for PySpark in both the driver and
>>> workers (the default is python).
>>>
>>> Did you try to submit the Python example in client mode?  Otherwise the
>>> command looks fine; you don't use the --class option when submitting
>>> Python files:
>>> * ./bin/spark-submit      --master yarn     --deploy-mode client
>>> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>> ./examples/src/main/python/pi.py     10*
>>>
>>> It is a good sign that local jobs and Java examples work; this is probably
>>> just a small configuration issue :)
>>>
>>> Bryan
>>>
>>> On Wed, Jan 13, 2016 at 3:51 PM, Andrew Weiner <
>>> andrewweiner2...@u.northwestern.edu> wrote:
>>>
>>>> Thanks for your continuing help.  Here is some additional info.
>>>>
>>>> *OS/architecture*
>>>> output of *cat /proc/version*:
>>>> Linux version 2.6.18-400.1.1.el5 (
>>>> mockbu...@x86-012.build.bos.redhat.com)
>>>>
>>>> output of *lsb_release -a*:
>>>> LSB Version:
>>>>  
>>>> :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
>>>> Distributor ID: RedHatEnterpriseServer
>>>> Description:    Red Hat Enterprise Linux Server release 5.11 (Tikanga)
>>>> Release:        5.11
>>>> Codename:       Tikanga
>>>>
>>>> *Running a local job*
>>>> I have confirmed that I can successfully run python jobs using
>>>> bin/spark-submit --master local[*]
>>>> Specifically, this is the command I am using:
>>>> *./bin/spark-submit --master local[8]
>>>> ./examples/src/main/python/wordcount.py
>>>> file:/home/<username>/spark-1.6.0-bin-hadoop2.4/README.md*
>>>> And it works!
>>>>
>>>> *Additional info*
>>>> I am also able to successfully run the Java SparkPi example using yarn
>>>> in cluster mode using this command:
>>>> * ./bin/spark-submit --class org.apache.spark.examples.SparkPi
>>>> --master yarn     --deploy-mode cluster     --driver-memory 4g
>>>> --executor-memory 2g     --executor-cores 1     lib/spark-examples*.jar
>>>> 10*
>>>> This Java job also runs successfully when I change --deploy-mode to
>>>> client.  The fact that I can run Java jobs in cluster mode makes me think
>>>> that everything is installed correctly--is that a valid assumption?
>>>>
>>>> The problem remains that I cannot submit python jobs.  Here is the
>>>> command that I am using to try to submit python jobs:
>>>> * ./bin/spark-submit      --master yarn     --deploy-mode cluster
>>>> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>>> ./examples/src/main/python/pi.py     10*
>>>> Does that look like a correct command?  I wasn't sure what to put for
>>>> --class so I omitted it.  At any rate, the result of the above command is a
>>>> syntax error, similar to the one I posted in the original email:
>>>>
>>>> Traceback (most recent call last):
>>>>   File "pi.py", line 24, in ?
>>>>     from pyspark import SparkContext
>>>>   File 
>>>> "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", 
>>>> line 61
>>>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>>>                                                   ^
>>>> SyntaxError: invalid syntax
>>>>
>>>>
>>>> This really looks to me like a problem with the python version.  Python
>>>> 2.4 would throw this syntax error but Python 2.7 would not.  And yet I am
>>>> using Python 2.7.8.  Is there any chance that Spark or Yarn is somehow
>>>> using an older version of Python without my knowledge?
>>>>
>>>> Finally, when I try to run the same command in client mode...
>>>> * ./bin/spark-submit      --master yarn     --deploy-mode client
>>>> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>>> ./examples/src/main/python/pi.py 10*
>>>> I get the error I mentioned in the prior email:
>>>> Error from python worker:
>>>>   python: module pyspark.daemon not found
>>>>
>>>> Any thoughts?
>>>>
>>>> Best,
>>>> Andrew
>>>>
>>>>
>>>> On Mon, Jan 11, 2016 at 12:25 PM, Bryan Cutler <cutl...@gmail.com>
>>>> wrote:
>>>>
>>>>> This could be an environment issue; could you give more details about
>>>>> the OS/architecture that you are using?  If you are sure everything is
>>>>> installed correctly on each node following the guide on "Running Spark on
>>>>> Yarn" http://spark.apache.org/docs/latest/running-on-yarn.html and
>>>>> that the spark assembly jar is reachable, then I would check to see if you
>>>>> can submit a local job to just run on one node.
>>>>>
>>>>> On Fri, Jan 8, 2016 at 5:22 PM, Andrew Weiner <
>>>>> andrewweiner2...@u.northwestern.edu> wrote:
>>>>>
>>>>>> Now for simplicity I'm testing with wordcount.py from the provided
>>>>>> examples, and using Spark 1.6.0
>>>>>>
>>>>>> The first error I get is:
>>>>>>
>>>>>> 16/01/08 19:14:46 ERROR lzo.GPLNativeCodeLoader: Could not load
>>>>>> native gpl library
>>>>>> java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
>>>>>>         at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
>>>>>>         at [....]
>>>>>>
>>>>>> A bit lower down, I see this error:
>>>>>>
>>>>>> 16/01/08 19:14:48 WARN scheduler.TaskSetManager: Lost task 0.0 in
>>>>>> stage 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
>>>>>> Error from python worker:
>>>>>>   python: module pyspark.daemon not found
>>>>>> PYTHONPATH was:
>>>>>>
>>>>>> /scratch5/hadoop/yarn/local/usercache/<username>/filecache/22/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/pyspark.zip:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/py4j-0.9-src.zip
>>>>>> java.io.EOFException
>>>>>>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>>>>>         at [....]
>>>>>>
>>>>>> And then a few more similar pyspark.daemon not found errors...
>>>>>>
>>>>>> Andrew
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Jan 8, 2016 at 2:31 PM, Bryan Cutler <cutl...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Andrew,
>>>>>>>
>>>>>>> I know that older versions of Spark could not run PySpark on YARN in
>>>>>>> cluster mode.  I'm not sure if that is fixed in 1.6.0 though.  Can you
>>>>>>> try setting the deploy-mode option to "client" when calling spark-submit?
>>>>>>>
>>>>>>> Bryan
>>>>>>>
>>>>>>> On Thu, Jan 7, 2016 at 2:39 PM, weineran <
>>>>>>> andrewweiner2...@u.northwestern.edu> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> When I try to submit a python job using spark-submit (using
>>>>>>>> --master yarn
>>>>>>>> --deploy-mode cluster), I get the following error:
>>>>>>>>
>>>>>>>> /Traceback (most recent call last):
>>>>>>>>   File "loss_rate_by_probe.py", line 15, in ?
>>>>>>>>     from pyspark import SparkContext
>>>>>>>>   File
>>>>>>>>
>>>>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/__init__.py",
>>>>>>>> line 41, in ?
>>>>>>>>   File
>>>>>>>>
>>>>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/context.py",
>>>>>>>> line 219
>>>>>>>>     with SparkContext._lock:
>>>>>>>>                     ^
>>>>>>>> SyntaxError: invalid syntax/
>>>>>>>>
>>>>>>>> This is very similar to  this post from 2014
>>>>>>>> <
>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-lock-Error-td18233.html
>>>>>>>> >
>>>>>>>> , but unlike that person I am using Python 2.7.8.
>>>>>>>>
>>>>>>>> Here is what I'm using:
>>>>>>>> Spark 1.3.1
>>>>>>>> Hadoop 2.4.0.2.1.5.0-695
>>>>>>>> Python 2.7.8
>>>>>>>>
>>>>>>>> Another clue:  I also installed Spark 1.6.0 and tried to submit the
>>>>>>>> same
>>>>>>>> job.  I got a similar error:
>>>>>>>>
>>>>>>>> /Traceback (most recent call last):
>>>>>>>>   File "loss_rate_by_probe.py", line 15, in ?
>>>>>>>>     from pyspark import SparkContext
>>>>>>>>   File
>>>>>>>>
>>>>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0119/container_1450370639491_0119_01_000001/pyspark.zip/pyspark/__init__.py",
>>>>>>>> line 61
>>>>>>>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>>>>>>>                                                   ^
>>>>>>>> SyntaxError: invalid syntax/
>>>>>>>>
>>>>>>>> Any thoughts?
>>>>>>>>
>>>>>>>> Andrew
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> View this message in context:
>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-SyntaxError-invalid-syntax-tp25910.html
>>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>>> Nabble.com.
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
