Thanks Tim,

There's a little more to it, in fact: if I use the pre-built-with-Hadoop-2.6 binaries, all is good (with correctly named tarballs in HDFS). Using the build for user-provided Hadoop (including setting SPARK_DIST_CLASSPATH in spark-env.sh), I get the JNI exception.

Aha - I've found the minimal set of changes that fixes it. I can use the user-provided Hadoop tarballs, but I _have_ to add spark-env.sh to them (which I wasn't expecting - I don't recall seeing this anywhere in the docs, so I assumed everything would be set up by Spark/Mesos from the client config).

FWIW, spark-env.sh:
export SPARK_DIST_CLASSPATH=$(/opt/hadoop/bin/hadoop classpath)
#export MESOS_NATIVE_JAVA_LIBRARY=/usr/lib/libmesos.so
export SPARK_EXECUTOR_URI=hdfs:///apps/spark/spark15.tgz
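In case it's useful to anyone, bundling that spark-env.sh into the executor tarball looks roughly like this (a sketch with names from my setup - the distribution name and HDFS path are assumptions, adjust to yours):

```shell
# Drop conf/spark-env.sh into the distribution directory, repack the
# tarball under the same top-level directory name, then re-upload.
# (Names below are from my setup - adjust to yours.)
set -e
DIST=spark-1.5.0-bin-os1
mkdir -p "$DIST/conf"
cat > "$DIST/conf/spark-env.sh" <<'EOF'
export SPARK_DIST_CLASSPATH=$(/opt/hadoop/bin/hadoop classpath)
export SPARK_EXECUTOR_URI=hdfs:///apps/spark/spark-1.5.0-bin-os1.tgz
EOF
tar -czf "$DIST.tgz" "$DIST"
# Re-upload (needs an HDFS client on this box):
# /opt/hadoop/bin/hadoop fs -put -f "$DIST.tgz" /apps/spark/
```

Then point spark.executor.uri / SPARK_EXECUTOR_URI at the uploaded file.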

Leaving out SPARK_DIST_CLASSPATH leads to org.apache.hadoop.fs.FSDataInputStream class errors (as you'd expect). Leaving out MESOS_NATIVE_JAVA_LIBRARY seems to have no consequences at the moment (it is set on the client).
Leaving out SPARK_EXECUTOR_URI stops the job from starting at all.

spark-defaults.conf isn't required to be in the tarball; on the client it's set to:

spark.master        mesos://zk://mesos-1.example.net:2181,mesos-2.example.net:2181,mesos-3.example.net:2181/mesos
spark.executor.uri  hdfs:///apps/spark/spark15.tgz

I guess this is the way forward for us right now - a bit uncomfortable, as I like to understand why :-)
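FWIW, my best guess at why the naming matters (an assumption on my part - I haven't fully traced the Spark/Mesos code): the executor seems to cd into a directory derived from the executor URI's file name, so the tarball has to extract to a matching top-level folder:

```shell
# Sketch of the naming rule as I understand it (assumption, not traced
# through the Spark source): the directory the executor changes into is
# derived from the URI's file name, so the tarball's top-level folder
# must match the file name minus its extension.
URI=hdfs:///apps/spark/spark-1.5.0-bin-os1.tgz
BASE=$(basename "$URI")   # spark-1.5.0-bin-os1.tgz
DIR=${BASE%.tgz}          # folder the executor expects after extraction
echo "$DIR"
```

That would explain why spark15.tgz (which extracted to spark-1.5.0-bin-os1/) failed while the matching name worked.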

On 09/09/2015 18:43, Tim Chen wrote:
Hi Adrian,

Spark is expecting a specific naming of the tgz and also of the folder inside it, as generated by running make-distribution.sh --tgz in the Spark source folder.

If you use a Spark 1.4 tgz generated with that script (keeping the name it produces), upload it to HDFS again and fix the URI, then it should work.

Tim

On Wed, Sep 9, 2015 at 8:18 AM, Adrian Bridgett <adr...@opensignal.com> wrote:

    5mins later...

    Trying 1.5 with a fairly plain build:
    ./make-distribution.sh --tgz --name os1 -Phadoop-2.6

    and on my first attempt stderr showed:
    I0909 15:16:49.392144  1619 fetcher.cpp:441] Fetched
    'hdfs:///apps/spark/spark15.tgz' to
    
'/tmp/mesos/slaves/20150826-133446-3217621258-5050-4064-S1/frameworks/20150826-133446-3217621258-5050-4064-211204/executors/20150826-133446-3217621258-5050-4064-S1/runs/43026ba8-6624-4817-912c-3d7573433102/spark15.tgz'
    sh: 1: cd: can't cd to spark15.tgz
    sh: 1: ./bin/spark-class: not found

    Aha, let's rename the file in hdfs (and the two configs) from
    spark15.tgz to spark-1.5.0-bin-os1.tgz...
    Success!!!
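    (For the record, the rename amounted to something like this - paths from my setup; the guard just keeps the snippet harmless on a box without an HDFS client:)

```shell
# Rename the tarball in HDFS to the name make-distribution.sh produced,
# then update spark.executor.uri and SPARK_EXECUTOR_URI to match.
HADOOP=/opt/hadoop/bin/hadoop
SRC=/apps/spark/spark15.tgz
DST=/apps/spark/spark-1.5.0-bin-os1.tgz
if [ -x "$HADOOP" ]; then
    "$HADOOP" fs -mv "$SRC" "$DST"
else
    echo "hadoop client not found; would run: hadoop fs -mv $SRC $DST"
fi
```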

    The same trick with 1.4 doesn't work, but now that I have
    something that does I can make progress.

    Hopefully this helps someone else :-)

    Adrian


    On 09/09/2015 16:59, Adrian Bridgett wrote:
    I'm trying to run Spark (1.4.1) on top of Mesos (0.23).  I've
    followed the instructions (uploaded the Spark tarball to HDFS, set
    the executor URI in both places, etc.) and yet on the slaves it's
    failing to launch even the SparkPi example, with a JNI error.  It
    does run with a local master.  A day of debugging later and it's
    time to ask for help!

    bin/spark-submit --master mesos://10.1.201.191:5050 \
      --class org.apache.spark.examples.SparkPi /tmp/examples.jar

    (I'm putting the jar outside HDFS - on both the client box and the
    slave (I turned the other slaves off for debugging) - due to
    http://apache-spark-user-list.1001560.n3.nabble.com/Remote-jar-file-td20649.html.
    I should note that I had the same JNI errors when using the Mesos
    cluster dispatcher.)

    I'm using Oracle Java 8 (no other Java - not even OpenJDK - is installed).

    As you can see, the slave downloads the framework fine (you can
    even see it extracted on the slave).  Can anyone shed some light
    on what's going on - e.g. how does it attempt to run the executor?

    I'm going to try a different JVM (and a custom Spark
    distribution), but I suspect the problem is much more basic.
    Maybe it can't find the Hadoop native libs?

    Any light would be much appreciated :)  I've included the slave's
    stderr below:

    I0909 14:14:01.405185 32132 logging.cpp:177] Logging to STDERR
    I0909 14:14:01.405256 32132 fetcher.cpp:409] Fetcher Info:
    
{"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/20150826-133446-3217621258-5050-4064-S0\/ubuntu","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"hdfs:\/\/\/apps\/spark\/spark.tgz"}}],"sandbox_directory":"\/tmp\/mesos\/slaves\/20150826-133446-3217621258-5050-4064-S0\/frameworks\/20150826-133446-3217621258-5050-4064-211198\/executors\/20150826-133446-3217621258-5050-4064-S0\/runs\/38077da2-553e-4888-bfa3-ece2ab2119f3","user":"ubuntu"}
    I0909 14:14:01.406332 32132 fetcher.cpp:364] Fetching URI
    'hdfs:///apps/spark/spark.tgz'
    I0909 14:14:01.406344 32132 fetcher.cpp:238] Fetching directly
    into the sandbox directory
    I0909 14:14:01.406358 32132 fetcher.cpp:176] Fetching URI
    'hdfs:///apps/spark/spark.tgz'
    I0909 14:14:01.679055 32132 fetcher.cpp:104] Downloading resource
    with Hadoop client from 'hdfs:///apps/spark/spark.tgz' to
    
'/tmp/mesos/slaves/20150826-133446-3217621258-5050-4064-S0/frameworks/20150826-133446-3217621258-5050-4064-211198/executors/20150826-133446-3217621258-5050-4064-S0/runs/38077da2-553e-4888-bfa3-ece2ab2119f3/spark.tgz'
    I0909 14:14:05.492626 32132 fetcher.cpp:76] Extracting with
    command: tar -C
    
'/tmp/mesos/slaves/20150826-133446-3217621258-5050-4064-S0/frameworks/20150826-133446-3217621258-5050-4064-211198/executors/20150826-133446-3217621258-5050-4064-S0/runs/38077da2-553e-4888-bfa3-ece2ab2119f3'
    -xf
    
'/tmp/mesos/slaves/20150826-133446-3217621258-5050-4064-S0/frameworks/20150826-133446-3217621258-5050-4064-211198/executors/20150826-133446-3217621258-5050-4064-S0/runs/38077da2-553e-4888-bfa3-ece2ab2119f3/spark.tgz'
    I0909 14:14:07.489753 32132 fetcher.cpp:84] Extracted
    
'/tmp/mesos/slaves/20150826-133446-3217621258-5050-4064-S0/frameworks/20150826-133446-3217621258-5050-4064-211198/executors/20150826-133446-3217621258-5050-4064-S0/runs/38077da2-553e-4888-bfa3-ece2ab2119f3/spark.tgz'
    into
    
'/tmp/mesos/slaves/20150826-133446-3217621258-5050-4064-S0/frameworks/20150826-133446-3217621258-5050-4064-211198/executors/20150826-133446-3217621258-5050-4064-S0/runs/38077da2-553e-4888-bfa3-ece2ab2119f3'
    W0909 14:14:07.489784 32132 fetcher.cpp:260] Copying instead of
    extracting resource from URI with 'extract' flag, because it does
    not seem to be an archive: hdfs:///apps/spark/spark.tgz
    I0909 14:14:07.489791 32132 fetcher.cpp:441] Fetched
    'hdfs:///apps/spark/spark.tgz' to
    
'/tmp/mesos/slaves/20150826-133446-3217621258-5050-4064-S0/frameworks/20150826-133446-3217621258-5050-4064-211198/executors/20150826-133446-3217621258-5050-4064-S0/runs/38077da2-553e-4888-bfa3-ece2ab2119f3/spark.tgz'
    Error: A JNI error has occurred, please check your installation
    and try again
    Exception in thread "main" java.lang.NoClassDefFoundError:
    org/slf4j/Logger
        at java.lang.Class.getDeclaredMethods0(Native Method)
        at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
        at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
        at java.lang.Class.getMethod0(Class.java:3018)
        at java.lang.Class.getMethod(Class.java:1784)
        at
    sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
        at
    sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
    Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 7 more



    --
    *Adrian Bridgett* | Sysadmin Engineer, OpenSignal <http://www.opensignal.com>
    _____________________________________________________
    Office: First Floor, Scriptor Court, 155-157 Farringdon Road, Clerkenwell, London, EC1R 3AD
    Phone #: +44 777-377-8251
    Skype: abridgett | @adrianbridgett <http://twitter.com/adrianbridgett> | LinkedIn <https://uk.linkedin.com/in/abridgett>
    _____________________________________________________


