Hi Felix,
Yeah, when I try to build the docs using jekyll build, I get a
LoadError (cannot load such file -- pygments) and I'm having trouble
getting past it at the moment.
From what I could tell, this does not apply to YARN in client mode. I
was able to submit jobs in client mode and they would run fine without
using the appMasterEnv property. I even confirmed that my environment
variables persisted during the job when run in client mode. There is
something about YARN cluster mode that uses a different environment
(the YARN Application Master environment) and requires the
appMasterEnv property for setting environment variables.
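For the record, here is roughly what the working configuration looks like. This is just a sketch; the python path is a placeholder, and MY_VAR stands in for whatever other variable you need to propagate:

```
# conf/spark-defaults.conf
# In YARN cluster mode, only spark.yarn.appMasterEnv.* reaches the
# Application Master process; exports in spark-env.sh do not.
spark.yarn.appMasterEnv.PYSPARK_PYTHON  /path/to/python
spark.yarn.appMasterEnv.MY_VAR          my_value
```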
On Sun, Jan 17, 2016 at 11:37 PM, Felix Cheung <felixcheun...@hotmail.com> wrote:
Do you still need help on the PR?
btw, does this apply to YARN client mode?
------------------------------------------------------------------------
From: andrewweiner2...@u.northwestern.edu
Date: Sun, 17 Jan 2016 17:00:39 -0600
Subject: Re: SparkContext SyntaxError: invalid syntax
To: cutl...@gmail.com
CC: user@spark.apache.org
Yeah, I do think it would be worth explicitly stating this in the
docs. I was going to try to edit the docs myself and submit a
pull request, but I'm having trouble building the docs from
github. If anyone else wants to do this, here is approximately
what I would say:
(To be added to
http://spark.apache.org/docs/latest/configuration.html#environment-variables)
"Note: When running Spark on YARN in cluster mode, environment variables need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName] property in your conf/spark-defaults.conf file. Environment variables that are set in spark-env.sh will not be reflected in the YARN Application Master process in cluster mode. See the YARN-related Spark Properties (http://spark.apache.org/docs/latest/running-on-yarn.html#spark-properties) for more information."
I might take another crack at building the docs myself if nobody
beats me to this.
Andrew
On Fri, Jan 15, 2016 at 5:01 PM, Bryan Cutler <cutl...@gmail.com> wrote:
Glad you got it going! It wasn't very obvious what needed to be set; maybe it is worth explicitly stating this in the docs, since it seems to have come up a couple of times before too.
Bryan
On Fri, Jan 15, 2016 at 12:33 PM, Andrew Weiner <andrewweiner2...@u.northwestern.edu> wrote:
Actually, I just found this [https://issues.apache.org/jira/browse/SPARK-1680], which after a bit of googling and reading leads me to believe that the preferred way to change the YARN environment is to edit the conf/spark-defaults.conf file by adding this line:

spark.yarn.appMasterEnv.PYSPARK_PYTHON /path/to/python

While both this solution and the solution from my prior email work, I believe this is the preferred solution.
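As an aside, the same property can also be passed per-job on the spark-submit command line via --conf, which avoids editing spark-defaults.conf. A sketch, with placeholder paths:

```
./bin/spark-submit \
  --master yarn --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/path/to/python \
  ./examples/src/main/python/pi.py 10
```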
Sorry for the flurry of emails. Again, thanks for all the
help!
Andrew
On Fri, Jan 15, 2016 at 1:47 PM, Andrew Weiner <andrewweiner2...@u.northwestern.edu> wrote:
I finally got the pi.py example to run in yarn cluster mode. This was the key insight: https://issues.apache.org/jira/browse/SPARK-9229

I had to set SPARK_YARN_USER_ENV in spark-env.sh:

export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/home/aqualab/local/bin/python"

This caused the PYSPARK_PYTHON environment variable to be used in my yarn environment in cluster mode.
Thank you for all your help!
Best,
Andrew
On Fri, Jan 15, 2016 at 12:57 PM, Andrew Weiner <andrewweiner2...@u.northwestern.edu> wrote:
I tried playing around with my environment variables, and here is an update.

When I run in cluster mode, my environment variables do not persist throughout the entire job. For example, I tried creating a local copy of HADOOP_CONF_DIR in /home/<username>/local/etc/hadoop/conf, and then, in spark-env.sh, I set the variable:

export HADOOP_CONF_DIR=/home/<username>/local/etc/hadoop/conf
Later, when we print the environment variables in the Python code, I see this:

('HADOOP_CONF_DIR', '/etc/hadoop/conf')

However, when I run in client mode, I see this:

('HADOOP_CONF_DIR', '/home/awp066/local/etc/hadoop/conf')

Furthermore, if I omit that environment variable from spark-env.sh altogether, I get the expected error in both client and cluster mode:

When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
This suggests that my environment variables are being used when I first submit the job, but at some point during the job my environment variables are thrown out and someone else's (YARN's?) environment variables are being used.
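For reference, the way I was dumping the environment inside the job was essentially this. A minimal sketch, assuming you only care about a few variable-name prefixes; running it both locally and inside the submitted job shows which variables survive into the YARN container:

```python
import os

def env_snapshot(prefixes=("HADOOP", "PYSPARK", "PYTHON")):
    """Return (name, value) pairs for environment variables whose
    names start with any of the given prefixes, sorted by name."""
    return sorted(
        (name, value)
        for name, value in os.environ.items()
        if name.startswith(tuple(prefixes))
    )

if __name__ == "__main__":
    # Compare this output on the driver vs. inside a cluster-mode job.
    for name, value in env_snapshot():
        print((name, value))
```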
Andrew
On Fri, Jan 15, 2016 at 11:03 AM, Andrew Weiner <andrewweiner2...@u.northwestern.edu> wrote:
Indeed! Here is the output when I run in cluster mode:

Traceback (most recent call last):
  File "pi.py", line 22, in ?
    raise RuntimeError("\n"+str(sys.version_info)+"\n"+
RuntimeError:
(2, 4, 3, 'final', 0)
[('PYSPARK_GATEWAY_PORT', '48079'), ('PYTHONPATH',
'/scratch2/hadoop/yarn/local/usercache/<username>/filecache/116/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/<user>/spark-1.6.0-bin-hadoop2.4/python:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/py4j-0.9-src.zip'),
('PYTHONUNBUFFERED', 'YES')]
As we suspected, it is using Python 2.4.
One thing that surprises me is that PYSPARK_PYTHON is not showing up in the list, even though I am setting it and exporting it in spark-submit and in spark-env.sh. Is there somewhere else I need to set this variable? Maybe in one of the Hadoop conf files in my HADOOP_CONF_DIR?
Andrew
On Thu, Jan 14, 2016 at 1:14 PM, Bryan Cutler <cutl...@gmail.com> wrote:
It seems like it could be the case that some other Python version is being invoked. To make sure, can you add something like this to the top of the .py file you are submitting, to get some more info about how the application master is configured?

import sys, os
raise RuntimeError("\n" + str(sys.version_info) + "\n" +
    str([(k, os.environ[k]) for k in os.environ if "PY" in k]))
On Thu, Jan 14, 2016 at 8:37 AM, Andrew Weiner <andrewweiner2...@u.northwestern.edu> wrote:
Hi Bryan,

I ran "$> python --version" on every node on the cluster, and it is Python 2.7.8 for every single one.

When I try to submit the Python example in client mode

./bin/spark-submit --master yarn --deploy-mode client --driver-memory 4g --executor-memory 2g --executor-cores 1 ./examples/src/main/python/pi.py 10

that's when I get this error that I mentioned:
16/01/14 10:09:10 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
Error from python worker:
  python: module pyspark.daemon not found
PYTHONPATH was:
  /scratch5/hadoop/yarn/local/usercache/<username>/filecache/48/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/aqualab/spark-1.6.0-bin-hadoop2.4/python:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/py4j-0.9-src.zip
java.io.EOFException
  at java.io.DataInputStream.readInt(DataInputStream.java:392)
  at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
  at [....]

followed by several more similar errors that also say:

Error from python worker:
  python: module pyspark.daemon not found
Even though the default python appeared to be correct, I just went ahead and explicitly set PYSPARK_PYTHON in conf/spark-env.sh to the path of the default python binary executable. After making this change I was able to run the job successfully in client mode! That is, this appeared to fix the "pyspark.daemon not found" error when running in client mode.

However, when running in cluster mode, I am still getting the same syntax error:
Traceback (most recent call last):
  File "pi.py", line 24, in ?
    from pyspark import SparkContext
  File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
    indent = ' ' * (min(len(m) for m in indents) if indents else 0)
                               ^
SyntaxError: invalid syntax
Is it possible that the PYSPARK_PYTHON environment variable is ignored when jobs are submitted in cluster mode? It seems that Spark or YARN is going behind my back, so to speak, and using some older version of Python I didn't even know was installed.

Thanks again for all your help thus far. We are getting close....
Andrew
On Wed, Jan 13, 2016 at 6:13 PM, Bryan Cutler <cutl...@gmail.com> wrote:
Hi Andrew,

There are a couple of things to check. First, is Python 2.7 the default version on all nodes in the cluster, or is it an alternate install? Meaning, what is the output of this command: "$> python --version"? If it is an alternate install, you could set the environment variable "PYSPARK_PYTHON", the Python binary executable to use for PySpark in both driver and workers (the default is python).

Did you try to submit the Python example under client mode? Otherwise, the command looks fine; you don't use the --class option for submitting python files.

./bin/spark-submit --master yarn --deploy-mode client --driver-memory 4g --executor-memory 2g --executor-cores 1 ./examples/src/main/python/pi.py 10

It is a good sign that local jobs and Java examples work; this is probably just a small configuration issue :)

Bryan
On Wed, Jan 13, 2016 at 3:51 PM, Andrew Weiner <andrewweiner2...@u.northwestern.edu> wrote:
Thanks for your continuing help. Here is some additional info.

_OS/architecture_

output of cat /proc/version:
Linux version 2.6.18-400.1.1.el5 (mockbu...@x86-012.build.bos.redhat.com)

output of lsb_release -a:
LSB Version: :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
Distributor ID: RedHatEnterpriseServer
Description: Red Hat Enterprise Linux Server release 5.11 (Tikanga)
Release: 5.11
Codename: Tikanga
_Running a local job_

I have confirmed that I can successfully run python jobs using bin/spark-submit --master local[*]

Specifically, this is the command I am using:

./bin/spark-submit --master local[8] ./examples/src/main/python/wordcount.py file:/home/<username>/spark-1.6.0-bin-hadoop2.4/README.md

And it works!
_Additional info_

I am also able to successfully run the Java SparkPi example using yarn in cluster mode using this command:

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g --executor-cores 1 lib/spark-examples*.jar 10

This Java job also runs successfully when I change --deploy-mode to client. The fact that I can run Java jobs in cluster mode makes me think that everything is installed correctly--is that a valid assumption?
The problem remains that I cannot submit python jobs. Here is the command that I am using to try to submit python jobs:

./bin/spark-submit --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g --executor-cores 1 ./examples/src/main/python/pi.py 10

Does that look like a correct command? I wasn't sure what to put for --class, so I omitted it. At any rate, the result of the above command is a syntax error, similar to the one I posted in the original email:
Traceback (most recent call last):
  File "pi.py", line 24, in ?
    from pyspark import SparkContext
  File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
    indent = ' ' * (min(len(m) for m in indents) if indents else 0)
                               ^
SyntaxError: invalid syntax
This really looks to me like a problem with the python version. Python 2.4 would throw this syntax error but Python 2.7 would not. And yet I am using Python 2.7.8. Is there any chance that Spark or Yarn is somehow using an older version of Python without my knowledge?
Finally, when I try to run the same command in client mode...

./bin/spark-submit --master yarn --deploy-mode client --driver-memory 4g --executor-memory 2g --executor-cores 1 ./examples/src/main/python/pi.py 10

I get the error I mentioned in the prior email:

Error from python worker:
  python: module pyspark.daemon not found

Any thoughts?

Best,
Andrew
On Mon, Jan 11, 2016 at 12:25 PM, Bryan Cutler <cutl...@gmail.com> wrote:
This could be an environment issue. Could you give more details about the OS/architecture that you are using? If you are sure everything is installed correctly on each node, following the guide on "Running Spark on Yarn" (http://spark.apache.org/docs/latest/running-on-yarn.html), and that the spark assembly jar is reachable, then I would check to see if you can submit a local job to just run on one node.
On Fri, Jan 8, 2016 at 5:22 PM, Andrew Weiner <andrewweiner2...@u.northwestern.edu> wrote:
Now for simplicity I'm testing with wordcount.py from the provided examples, and using Spark 1.6.0.

The first error I get is:

16/01/08 19:14:46 ERROR lzo.GPLNativeCodeLoader: Could not load native gpl library
java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
  at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
  at [....]
A bit lower down, I see this error:

16/01/08 19:14:48 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
Error from python worker:
  python: module pyspark.daemon not found
PYTHONPATH was:
  /scratch5/hadoop/yarn/local/usercache/<username>/filecache/22/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/pyspark.zip:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/py4j-0.9-src.zip
java.io.EOFException
  at java.io.DataInputStream.readInt(DataInputStream.java:392)
  at [....]

And then a few more similar pyspark.daemon not found errors...

Andrew
On Fri, Jan 8, 2016 at 2:31 PM, Bryan Cutler <cutl...@gmail.com> wrote:
Hi Andrew,

I know that older versions of Spark could not run PySpark on YARN in cluster mode. I'm not sure if that is fixed in 1.6.0, though. Can you try setting the deploy-mode option to "client" when calling spark-submit?

Bryan
On Thu, Jan 7, 2016 at 2:39 PM, weineran <andrewweiner2...@u.northwestern.edu> wrote:
Hello,

When I try to submit a python job using spark-submit (using --master yarn --deploy-mode cluster), I get the following error:

Traceback (most recent call last):
  File "loss_rate_by_probe.py", line 15, in ?
    from pyspark import SparkContext
  File "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/__init__.py", line 41, in ?
  File "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/context.py", line 219
    with SparkContext._lock:
                    ^
SyntaxError: invalid syntax

This is very similar to this post from 2014 (http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-lock-Error-td18233.html), but unlike that person I am using Python 2.7.8.

Here is what I'm using:
Spark 1.3.1
Hadoop 2.4.0.2.1.5.0-695
Python 2.7.8
Another clue: I also installed Spark 1.6.0 and tried to submit the same job. I got a similar error:

Traceback (most recent call last):
  File "loss_rate_by_probe.py", line 15, in ?
    from pyspark import SparkContext
  File "/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0119/container_1450370639491_0119_01_000001/pyspark.zip/pyspark/__init__.py", line 61
    indent = ' ' * (min(len(m) for m in indents) if indents else 0)
                               ^
SyntaxError: invalid syntax

Any thoughts?

Andrew
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-SyntaxError-invalid-syntax-tp25910.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org