Hi Stephen,

You probably didn't run the Spark driver/shell as root. The Mesos scheduler
picks up your local user, and the slave then tries to impersonate that same
user and chown the work directory to it before executing any task.

Running the Spark driver as root should resolve the problem. Disabling user
switching (the --[no-]switch_user slave flag) also works, since the slave
then won't try to switch to your user at all.
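
For reference, a minimal sketch of what disabling user switching looks like
on a slave; the master address and work_dir below are placeholders, and the
flag is the same --[no-]switch_user option discussed in this thread:

    # Start the slave with user switching disabled, so tasks run as the
    # user that owns the slave process instead of the submitting user.
    mesos-slave --master=1.1.1.1:5050 \
                --work_dir=/tmp/mesos \
                --no-switch_user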

Tim

On Wed, May 13, 2015 at 10:50 AM, Stephen Carman <scar...@coldlight.com>
wrote:

> Sander,
>
> I eventually solved this problem via the --[no-]switch_user flag, which is
> set to true by default. I set it to false, so the user that owns the slave
> process runs the job; otherwise the job ran as my username (scarman), which
> failed because that username obviously didn't exist on the slaves. When run
> as root, it ran totally fine with no problems whatsoever.
>
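> (If you want to confirm you are hitting the same failure mode, a quick,
> rough check is whether the driver-side user exists on the slaves at all;
> scarman is just the username from my case:)
>
>     # On a slave: if the submitting user is unknown there, the chown /
>     # user switch of the sandbox will fail exactly like this.
>     id scarman || echo "user scarman does not exist on this slave"
>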
> Hopefully this works for you too,
>
> Steve
> > On May 13, 2015, at 11:45 AM, Sander van Dijk <sgvand...@gmail.com>
> wrote:
> >
> > Hey all,
> >
> > I seem to be experiencing the same thing as Stephen. I run Spark 1.2.1
> with Mesos 0.22.1, with Spark coming from the spark-1.2.1-bin-hadoop2.4.tgz
> prebuilt package, and Mesos installed from the Mesosphere repositories. I
> have been running Spark standalone successfully for a while and am now
> trying to set up Mesos. Mesos is up and running; the UI at port 5050 reports
> all slaves alive. I then run the Spark shell with: `spark-shell --master
> mesos://1.1.1.1:5050` (with 1.1.1.1 the master's ip address), which
> starts up fine, with output:
> >
> >     I0513 15:02:45.340287 28804 sched.cpp:448] Framework registered with 20150512-150459-2618695596-5050-3956-0009
> >     15/05/13 15:02:45 INFO mesos.MesosSchedulerBackend: Registered as framework ID 20150512-150459-2618695596-5050-3956-0009
> >
> > and the framework shows up in the Mesos UI. Then, when I try to run
> something (e.g. `val rdd = sc.textFile("path"); rdd.count`), it fails with
> lost executors. In /var/log/mesos-slave.ERROR on the slave instances there
> are entries like:
> >
> >     E0513 14:57:01.198995 13077 slave.cpp:3112] Container 'eaf33d36-dde5-498a-9ef1-70138810a38c' for executor '20150512-145720-2618695596-5050-3082-S10' of framework '20150512-150459-2618695596-5050-3956-0009' failed to start: Failed to execute mesos-fetcher: Failed to chown work directory
> >
> > From what I can find, the work directory is in /tmp/mesos, where indeed
> I see a directory structure with executor and framework IDs, with stdout
> and stderr files of size 0 at the leaves. Everything there is owned by
> root, but I assume the processes also run as root, so any chowning in
> there should be possible.
> >
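> > (A rough way to inspect this, assuming the default /tmp/mesos work_dir
> > and the standard sandbox layout; the IDs are whatever your slave created:)
> >
> >     # List the sandbox directories and their owners; 'runs/latest' is
> >     # the symlink Mesos keeps to the most recent container attempt.
> >     ls -lR /tmp/mesos/slaves/*/frameworks/*/executors/*/runs/latest
> >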
> > I was thinking maybe it fails to fetch the Spark executor package? I
> uploaded spark-1.2.1-bin-hadoop2.4.tgz to HDFS, SPARK_EXECUTOR_URI is set
> in spark-env.sh, and in the Environment section of the web UI I see this
> picked up in the spark.executor.uri parameter. I checked and the URI is
> reachable by the slaves: an `hdfs dfs -stat $SPARK_EXECUTOR_URI` is
> successful.
> >
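> > (For completeness, this is roughly how the URI is wired up on my side;
> > the HDFS path and master address are of course specific to my setup:)
> >
> >     # conf/spark-env.sh on the machine running the driver/shell:
> >     export SPARK_EXECUTOR_URI=hdfs:///spark/spark-1.2.1-bin-hadoop2.4.tgz
> >
> >     # or equivalently per invocation:
> >     spark-shell --master mesos://1.1.1.1:5050 \
> >       --conf spark.executor.uri=hdfs:///spark/spark-1.2.1-bin-hadoop2.4.tgz
> >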
> > Any pointers?
> >
> > Many thanks,
> > Sander
> >
> > On Fri, May 1, 2015 at 8:35 AM Tim Chen <t...@mesosphere.io> wrote:
> > Hi Stephen,
> >
> > It looks like the Mesos slave was most likely not able to launch some of
> the Mesos helper processes (probably the fetcher?).
> >
> > How did you install Mesos? Did you build from source yourself?
> >
> > Please install Mesos through a package, or if you built from source, run
> make install and run from the installed binaries.
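> >
> > Something along these lines, as a rough sketch (version and paths are
> > examples; a distro package from the Mesosphere repos works just as well):
> >
> >     # Build and install Mesos so the helper binaries (mesos-fetcher,
> >     # mesos-executor, ...) end up in the installed libexec directory.
> >     tar xzf mesos-0.22.1.tar.gz && cd mesos-0.22.1
> >     mkdir build && cd build
> >     ../configure && make
> >     sudo make install
> >     # then run the slave from the installed location:
> >     mesos-slave --master=1.1.1.1:5050 --work_dir=/tmp/mesos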
> >
> > Tim
> >
> > On Mon, Apr 27, 2015 at 11:11 AM, Stephen Carman <scar...@coldlight.com>
> wrote:
> > So I installed Spark 1.3.1 built with Hadoop 2.6 on each of the slaves; I
> just basically got the pre-built package from the Spark website…
> >
> > I placed those compiled spark installs on each slave at /opt/spark
> >
> > My spark properties seem to be getting picked up on my side fine…
> >
> > <Screen Shot 2015-04-27 at 10.30.01 AM.png>
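> >
> > (Roughly what I'm toggling, for reference; the master URL and HDFS path
> > are placeholders, and the property for pointing at a pre-installed Spark
> > varies by Spark version, so treat this as illustrative:)
> >
> >     # fetch the executor package from HDFS:
> >     spark-shell --master mesos://1.1.1.1:5050 \
> >       --conf spark.executor.uri=hdfs:///spark/spark-1.3.1-bin-hadoop2.6.tgz
> >
> >     # or point executors at the Spark installed at /opt/spark on every
> >     # slave (newer versions use spark.mesos.executor.home for this):
> >     spark-shell --master mesos://1.1.1.1:5050 \
> >       --conf spark.mesos.executor.home=/opt/spark
> >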
> > The framework is registered in Mesos and shows up just fine. It doesn't
> matter whether I turn the executor URI off or not, I always get the same
> error…
> >
> > org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 0.0 failed 4 times, most recent failure: Lost task 6.3 in stage 0.0 (TID 23, 10.253.1.117): ExecutorLostFailure (executor 20150424-104711-1375862026-5050-20113-S1 lost)
> > Driver stacktrace:
> > at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
> > at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
> > at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
> > at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> > at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
> > at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
> > at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
> > at scala.Option.foreach(Option.scala:236)
> > at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
> > at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
> > at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
> > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> >
> > These boxes are totally open to one another, so they shouldn't have any
> firewall issues. Everything seems to show up in Mesos and Spark just fine,
> but actually running anything totally blows up.
> >
> > There is nothing in stderr or stdout; it downloads the package and untars
> it, but doesn't seem to do much after that. Any insights?
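> >
> > (In case it helps the next person: this is roughly where to dig on a
> > slave; /tmp/mesos is the default work_dir and the log location depends on
> > how Mesos was installed:)
> >
> >     # sandbox output of the lost executor, if any was written:
> >     cat /tmp/mesos/slaves/*/frameworks/*/executors/*/runs/latest/stderr
> >     # the slave's own log usually says why the container failed:
> >     tail -n 100 /var/log/mesos/mesos-slave.INFO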
> >
> > Steve
> >
> >
> >> On Apr 24, 2015, at 5:50 PM, Yang Lei <genia...@gmail.com> wrote:
> >>
> >> SPARK_PUBLIC_DNS, SPARK_LOCAL_IP, SPARK_LOCAL_HOST
> >
> >
>
>
