Mesos Spark Tasks - Lost

2015-05-18 Thread Panagiotis Garefalakis
Hello all,

For the last couple of days I have been facing a weird issue running Spark on top
of Mesos and I need your help. I am running Mesos in a private cluster and have
managed to deploy HDFS, Cassandra, Marathon and Play successfully, but
Spark is not working for some reason. So far I have tried:
different Java versions (1.6 and 1.7, Oracle and OpenJDK), different
spark-env configurations, different Spark versions (from 0.8.8 to 1.3.1),
different HDFS versions (Hadoop 5.1 and 4.6), and updating pom dependencies.

More specifically, while local tasks complete fine, in cluster mode all the
tasks get lost (with both spark-shell and spark-submit).
From the worker log I see something like this:

---
I0519 02:36:30.475064 12863 fetcher.cpp:214] Fetching URI
'hdfs:/:8020/spark-1.1.0-bin-2.0.0-cdh4.7.0.tgz'
I0519 02:36:30.747372 12863 fetcher.cpp:99] Fetching URI
'hdfs://X:8020/spark-1.1.0-bin-2.0.0-cdh4.7.0.tgz' using Hadoop
Client
I0519 02:36:30.747546 12863 fetcher.cpp:109] Downloading resource from
'hdfs://:8020/spark-1.1.0-bin-2.0.0-cdh4.7.0.tgz' to
'/tmp/mesos/slaves/20150515-164602-2877535122-5050-32131-S2/frameworks/20150517-162701-2877535122-5050-28705-0084/executors/20150515-164602-2877535122-5050-32131-S2/runs/660d78ec-e2f4-4d38-881b-7209cbd3c5c3/spark-1.1.0-bin-2.0.0-cdh4.7.0.tgz'
I0519 02:36:34.205878 12863 fetcher.cpp:78] Extracted resource
'/tmp/mesos/slaves/20150515-164602-2877535122-5050-32131-S2/frameworks/20150517-162701-2877535122-5050-28705-0084/executors/20150515-164602-2877535122-5050-32131-S2/runs/660d78ec-e2f4-4d38-881b-7209cbd3c5c3/spark-1.1.0-bin-2.0.0-cdh4.7.0.tgz'
into
'/tmp/mesos/slaves/20150515-164602-2877535122-5050-32131-S2/frameworks/20150517-162701-2877535122-5050-28705-0084/executors/20150515-164602-2877535122-5050-32131-S2/runs/660d78ec-e2f4-4d38-881b-7209cbd3c5c3'
*Error: Could not find or load main class two*

---

And from the Spark Terminal:

---
15/05/19 02:36:39 INFO scheduler.TaskSchedulerImpl: Cancelling stage 0
15/05/19 02:36:39 INFO scheduler.TaskSchedulerImpl: Stage 0 was cancelled
15/05/19 02:36:39 INFO scheduler.DAGScheduler: Failed to run reduce at
SparkPi.scala:35
15/05/19 02:36:39 INFO scheduler.DAGScheduler: Failed to run reduce at
SparkPi.scala:35
Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure:
Lost task 7.3 in stage 0.0 (TID 26, wombat27.doc.res.ic.ac.uk):
ExecutorLostFailure (executor lost)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
..
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

---
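
For reference, my Mesos-related Spark settings boil down to pointing the executors
at the tarball on HDFS and submitting against the Mesos master, roughly like this
sketch (the hostnames, ZooKeeper path and examples jar name below are placeholders,
not my real values):

# conf/spark-defaults.conf (sketch)
spark.executor.uri   hdfs://namenode:8020/spark-1.1.0-bin-2.0.0-cdh4.7.0.tgz
spark.master         mesos://zk://zkhost:2181/mesos

# submitting the SparkPi example (sketch)
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://zk://zkhost:2181/mesos \
  lib/spark-examples-1.1.0-hadoop2.0.0-cdh4.7.0.jar 100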

Any help will be greatly appreciated!

Regards,
Panagiotis


Re: cluster confusion after zookeeper blip

2015-05-18 Thread Jeff Schroeder
Not that this is super helpful for your issue, but I ran into an identical
problem this morning with Aurora on top of Mesos, where the scheduler was
inoperable because my ZK ensemble lost quorum and was generally acting badly.
However, as soon as I fixed the quorum, things immediately recovered. I
believe it had to do with the replicated log that Aurora uses.

On Monday, May 18, 2015, Dick Davies  wrote:

> We run a 3-node Marathon cluster on top of 3 Mesos masters + 6 slaves
> (Mesos 0.21.0, Marathon 0.7.5).
>
> This morning we had a network outage long enough for everything to
> lose ZooKeeper.
> Now our Marathon UI is empty (all 3 Marathons think someone else is the
> master, and
> Marathon's 'proxy to leader' feature means the REST API is toast).
>
> The odd thing is that, at the Mesos level, the
> Mesos master UI shows no tasks running (the logs mention orphaned tasks),
> but if I click into the 'slaves' tab and dig down, the slave view details
> tasks
> that are in fact active.
>
> Is there any way to bring order to this without needing to kill those tasks? We
> have no actual outage from a user point of view, but the cluster
> itself is pretty confused and our service discovery relies on the
> Marathon API, which is timing out.
>
> Although Mesos has checkpointing enabled, Marathon isn't running with
> checkpointing on (it's the default now but apparently doesn't apply to existing
> frameworks, and we started this around Marathon 0.4.x).
>
> Would enabling checkpointing help with this kind of issue? If so, how
> do I enable it for an existing framework?
>


-- 
Text by Jeff, typos by iPhone


Re: mesos slave doesn't pick up tasks after restart

2015-05-18 Thread Grzegorz Graczyk
Thanks a lot! :) I couldn’t find any corresponding issue.
> On 18 May 2015, at 19:37, Cody Maloney  wrote:
> 
> Running the mesos slave inside of a Docker container with working slave 
> task recovery isn't supported at the moment. See: 
> https://issues.apache.org/jira/browse/MESOS-2115 
> 
> 
> On Mon, May 18, 2015 at 4:47 AM, Grzegorz Graczyk wrote:
> 3-node cluster
> CoreOS 675.0.0
> Mesos 0.22.1
> Marathon 0.8.2-RC2
> 
> Everything is run in containers; the mesos slave is run using this command: 
> /usr/bin/docker run \
> --rm \
> --net=host \
> --pid=host \
> --name slave \
> -v /data/server/mesos-slave:/data/mesos-slave \
> -v /root/.dockercfg:/etc/.dockercfg \
> --privileged \
> -v /usr/lib64/libdevmapper.so.1.02:/usr/lib/libdevmapper.so.1.02 \
> -v /var/run/docker.sock:/var/run/docker.sock \
> -v /usr/bin/docker:/usr/local/bin/docker \
> -v /sys/fs/cgroup:/host/sys/fs/cgroup \
> -e GLOG_v=1 \
> mesosphere/mesos-slave:0.22.1-1.0.ubuntu1404 --containerizers=docker,mesos 
> --master=zk://`/get-zookeeper-peers.sh`/mesos 
> --hostname=private.`hostname` --ip=${ENS224_IPV4} 
> --resources=\"ports(*):[31000-32000]\" 
> --cgroups_hierarchy=/host/sys/fs/cgroup --work_dir=/data/mesos-slave 
> --logging_level=INFO
> 
> After a slave restart it successfully re-registers with the mesos master, then it 
> kills all tasks and starts them again. 
> The same happens when using the mesos containerizer.
> 
> Full logs: https://gist.github.com/gregory90/dd6930495fd655cf6691 
> 
> 
> Any help appreciated.
> 



Re: mesos slave doesn't pick up tasks after restart

2015-05-18 Thread Cody Maloney
Running the mesos slave inside of a Docker container with working slave
task recovery isn't supported at the moment. See:
https://issues.apache.org/jira/browse/MESOS-2115
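
The usual workaround while that ticket is open is to run the slave directly on the
host, so executors can survive a slave restart. A rough sketch only (the ZooKeeper
addresses below are placeholders, and task recovery also needs the framework, e.g.
Marathon, to register with checkpointing enabled):

/usr/sbin/mesos-slave \
  --master=zk://zk1:2181,zk2:2181,zk3:2181/mesos \
  --containerizers=docker,mesos \
  --work_dir=/data/mesos-slave \
  --recover=reconnect \
  --strict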

On Mon, May 18, 2015 at 4:47 AM, Grzegorz Graczyk 
wrote:

> 3-node cluster
> CoreOS 675.0.0
> Mesos 0.22.1
> Marathon 0.8.2-RC2
>
> Everything is run in containers; the mesos slave is run using this command:
> /usr/bin/docker run \
> --rm \
> --net=host \
> --pid=host \
> --name slave \
> -v /data/server/mesos-slave:/data/mesos-slave \
> -v /root/.dockercfg:/etc/.dockercfg \
> --privileged \
> -v /usr/lib64/libdevmapper.so.1.02:/usr/lib/libdevmapper.so.1.02 \
> -v /var/run/docker.sock:/var/run/docker.sock \
> -v /usr/bin/docker:/usr/local/bin/docker \
> -v /sys/fs/cgroup:/host/sys/fs/cgroup \
> -e GLOG_v=1 \
> mesosphere/mesos-slave:0.22.1-1.0.ubuntu1404 --containerizers=docker,mesos
> --master=zk://`/get-zookeeper-peers.sh`/mesos
> --hostname=private.`hostname` --ip=${ENS224_IPV4}
> --resources=\"ports(*):[31000-32000]\"
> --cgroups_hierarchy=/host/sys/fs/cgroup --work_dir=/data/mesos-slave
> --logging_level=INFO
>
> After a slave restart it successfully re-registers with the mesos master, then it
> kills all tasks and starts them again.
> The same happens when using the mesos containerizer.
>
> Full logs: https://gist.github.com/gregory90/dd6930495fd655cf6691
>
> Any help appreciated.
>


Re: make[3]: *** [check-local] Aborted (core dumped) in make test

2015-05-18 Thread haosdent
@Joerg Maurer I could not reproduce your problem on CentOS. According to this
ticket [https://issues.apache.org/jira/browse/MESOS-2744], @Colin Williams
also could not reproduce it on Ubuntu with kernel
3.13.0-35-generic. Could you check whether the problem still exists in the latest
code? Thank you

On Sun, May 17, 2015 at 6:24 PM, haosdent  wrote:

> Thank you for your reply. I filed this issue:
> https://issues.apache.org/jira/browse/MESOS-2744
>
> On Sun, May 17, 2015 at 5:08 AM, Joerg Maurer  wrote:
>
>> Hello haosdent,
>>
>> See (1) and (2), just executed in that order.
>>
>> From a black-box point of view, the results make no sense to me at all. My
>> two cents/theory: the tests themselves (i.e. the frameworks they use) seem to
>> affect each other.
>>
>> I will file an issue in your JIRA. Please provide info on accessing/using
>> your JIRA, e.g. is this email enough information as a description for your
>> investigation?
>>
>> (1)
>>
>> joma@kopernikus-u:~/dev/programme/mesos/build/mesos/build$ make check
>> GTEST_FILTER="MasterAuthorizationTest.SlaveRemoved" GTEST_REPEAT=1000
>> GTEST_BREAK_ON_FAILURE=1
>> ...
>> Repeating all tests (iteration 1000) . . .
>>
>> Note: Google Test filter =
>> MasterAuthorizationTest.SlaveRemoved-DockerContainerizerTest.ROOT_DOCKER_Launch_Executor:DockerContainerizerTest.ROOT_DOCKER_Launch_Executor_Bridged:DockerContainerizerTest.ROOT_DOCKER_Launch:DockerContainerizerTest.ROOT_DOCKER_Kill:DockerContainerizerTest.ROOT_DOCKER_Usage:DockerContainerizerTest.ROOT_DOCKER_Update:DockerContainerizerTest.DISABLED_ROOT_DOCKER_Recover:DockerContainerizerTest.ROOT_DOCKER_SkipRecoverNonDocker:DockerContainerizerTest.ROOT_DOCKER_Logs:DockerContainerizerTest.ROOT_DOCKER_Default_CMD:DockerContainerizerTest.ROOT_DOCKER_Default_CMD_Override:DockerContainerizerTest.ROOT_DOCKER_Default_CMD_Args:DockerContainerizerTest.ROOT_DOCKER_SlaveRecoveryTaskContainer:DockerContainerizerTest.DISABLED_ROOT_DOCKER_SlaveRecoveryExecutorContainer:DockerContainerizerTest.ROOT_DOCKER_PortMapping:DockerContainerizerTest.ROOT_DOCKER_LaunchSandboxWithColon:DockerContainerizerTest.ROOT_DOCKER_DestroyWhileFetching:DockerContainerizerTest.ROOT_DOCKER_DestroyWhilePulling:DockerTest.ROOT_DOCKER_interface:DockerTest.ROOT_DOCKER_CheckCommandWithShell:DockerTest.ROOT_DOCKER_CheckPortResource:DockerTest.ROOT_DOCKER_CancelPull:CpuIsolatorTest/1.UserCpuUsage:CpuIsolatorTest/1.SystemCpuUsage:LimitedCpuIsolatorTest.ROOT_CGROUPS_Cfs:LimitedCpuIsolatorTest.ROOT_CGROUPS_Cfs_Big_Quota:MemIsolatorTest/0.MemUsage:MemIsolatorTest/1.MemUsage:MemIsolatorTest/2.MemUsage:PerfEventIsolatorTest.ROOT_CGROUPS_Sample:SharedFilesystemIsolatorTest.ROOT_RelativeVolume:SharedFilesystemIsolatorTest.ROOT_AbsoluteVolume:NamespacesPidIsolatorTest.ROOT_PidNamespace:UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup:UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup:UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup:MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PerfRollForward:MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward:MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceBackward:SlaveTest.ROOT_RunTaskWithCommandInfoWithoutUser:SlaveTest.DISABLED_ROOT_RunTaskWithCommandInfoWithUser:ContainerizerTest.ROOT_CGROUPS_BalloonFramework:CgroupsAnyHierarchyTest.ROOT_CGROUPS_Enabled:CgroupsAnyHierarchyTest.ROOT_CGROUPS_Subsystems:CgroupsAnyHierarchyTest.ROOT_CGROUPS_Mounted:CgroupsAnyHierarchyTest.ROOT_CGROUPS_Get:CgroupsAnyHierarchyTest.ROOT_CGROUPS_NestedCgroups:CgroupsAnyHierarchyTest.ROOT_CGROUPS_Tasks:CgroupsAnyHierarchyTest.ROOT_CGROUPS_Read:CgroupsAnyHierarchyTest.ROOT_CGROUPS_Write:CgroupsAnyHierarchyTest.ROOT_CGROUPS_Cfs_Big_Quota:CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_Busy:CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_SubsystemsHierarchy:CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_FindCgroupSubsystems:CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_MountedSubsystems:CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_CreateRemove:CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_Listen:CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_FreezeNonFreezer:CgroupsNoHierarchyTest.ROOT_CGROUPS_NOHIERARCHY_MountUnmountHierarchy:CgroupsAnyHierarchyWithCpuAcctMemoryTest.ROOT_CGROUPS_Stat:CgroupsAnyHierarchyWithFreezerTest.ROOT_CGROUPS_Freeze:CgroupsAnyHierarchyWithFreezerTest.ROOT_CGROUPS_Kill:CgroupsAnyHierarchyWithFreezerTest.ROOT_CGROUPS_Destroy:CgroupsAnyHierarchyWithFreezerTest.ROOT_CGROUPS_AssignThreads:CgroupsAnyHierarchyWithFreezerTest.ROOT_CGROUPS_DestroyStoppedProcess:CgroupsAnyHierarchyWithFreezerTest.ROOT_CGROUPS_DestroyTracedProcess:CgroupsAnyHierarchyWithPerfEventTest.ROOT_CGROUPS_Perf:NsTest.ROOT_setns:NsTest.ROOT_setnsMultipleThreads:NsTest.ROOT_getns:NsTest.ROOT_destroy:PerfTest.ROOT_Events:PerfTest.ROOT_SampleInit:SlaveCount/Registrar_BENCHMARK_Test.performance/0:SlaveCount/Registrar_BENCHMARK_Test.performance/1:SlaveCount/Registrar_BENCHMARK_Test.performance/2:SlaveCou

Re: cluster confusion after zookeeper blip

2015-05-18 Thread Dick Davies
Thanks Nikolay - I checked that the frameworkId in ZooKeeper
(/marathon/state/frameworkId) matched the
one attached to the running tasks, gave the old Marathon leader a
restart, and everything reconnected OK.

(We did have to disable our service discovery pieces to avoid getting
empty JSON back when Marathon
first booted, but other than that everything is peachy.)
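
(Roughly how I checked it, for anyone hitting the same thing - a sketch assuming
the stock zkCli.sh from the ZooKeeper distribution, with placeholder hostnames:)

# read the framework ID Marathon stored in ZooKeeper
zkCli.sh -server zk1:2181 get /marathon/state/frameworkId

# compare with the framework ID attached to the running tasks, e.g. from the
# Mesos master's /state.json, before restarting the old Marathon leader
curl -s http://mesos-master:5050/state.json | grep -o '"framework_id":"[^"]*"' | sort -u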


On 18 May 2015 at 15:31, Nikolay Borodachev  wrote:
> Have you tried restarting the Marathon and Mesos processes? Once you restart them,
> they should pick up ZooKeeper, elect leaders, etc.
> If you're using Docker containers, they should reattach themselves to the
> respective slaves.
>
> Thanks
> Nikolay
>
> -Original Message-
> From: rasput...@gmail.com [mailto:rasput...@gmail.com] On Behalf Of Dick 
> Davies
> Sent: Monday, May 18, 2015 5:26 AM
> To: user@mesos.apache.org
> Subject: cluster confusion after zookeeper blip
>
> We run a 3-node Marathon cluster on top of 3 Mesos masters + 6 slaves
> (Mesos 0.21.0, Marathon 0.7.5).
>
> This morning we had a network outage long enough for everything to lose
> ZooKeeper.
> Now our Marathon UI is empty (all 3 Marathons think someone else is the master,
> and Marathon's 'proxy to leader' feature means the REST API is toast).
>
> The odd thing is that, at the Mesos level, the
> Mesos master UI shows no tasks running (the logs mention orphaned tasks), but if
> I click into the 'slaves' tab and dig down, the slave view details tasks that
> are in fact active.
>
> Is there any way to bring order to this without needing to kill those tasks? We have
> no actual outage from a user point of view, but the cluster itself is pretty
> confused and our service discovery relies on the Marathon API, which is timing
> out.
>
> Although Mesos has checkpointing enabled, Marathon isn't running with
> checkpointing on (it's the default now but apparently doesn't apply to existing
> frameworks, and we started this around Marathon 0.4.x).
>
> Would enabling checkpointing help with this kind of issue? If so, how do I
> enable it for an existing framework?


Re: Writing outside the sandbox

2015-05-18 Thread John Omernik
So I did some testing today: I was able to recreate the exact ID string on
a server with access to the share. (Remember that the ID string the Marathon
task runs with is different from the standard user... for some reason the only groups
that show up under Marathon are the user's group and root(0).) I recreated
that exact same setup and was still able to create the files running
directly (not through Mesos/Marathon). The containerizer setting I am using is
docker,mesos - would that play a role here?

Any other thoughts on what could be blocking the write?
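
(Roughly what I mean by recreating the ID string, as a sketch; the throwaway user
name is made up, and the uid/gid/groups match the id output quoted further down:)

# a throwaway user with the same uid, gid and supplementary groups as the task
# (-o allows reusing uid 1000 if it is already taken on the test host)
useradd -o -u 1000 -g 1000 -G 0 writetest
su writetest -c 'id; touch /mapr/brewpot/mesos/storm/test/test1/testing.manual'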

On Tue, May 12, 2015 at 3:09 PM, John Omernik  wrote:

> Root IS able to write to the share outside of Mesos. I am working with
> MapR to understand the NFS component better.
>
>
>
> On Tue, May 12, 2015 at 11:28 AM, Bjoern Metzdorf 
> wrote:
>
>> Is there anything in the NFS server log files? Maybe it squashes root by
>> default and darkness's root group membership falls under that?
>>
>> Regards,
>> Bjoern
>>
>> On May 12, 2015, at 5:53 AM, John Omernik  wrote:
>>
>> So I tried su darkness and su - darkness and both allowed a file write
>> with no issues.  On the group thing, while it is "weird", would it
>> actually hurt to contain that group?  Even if I set the directory to 777,
>> I still get a failure on a create within it.  I am guessing this is
>> something more to do with MapR's NFS than Mesos at this point, but if anyone
>> has any other tips on troubleshooting to confirm that, I'd
>> appreciate it.
>>
>> John
>>
>> On Mon, May 11, 2015 at 5:18 PM, Marco Massenzio 
>> wrote:
>>
>>> It looks to me like, while the 'uid' is 1000:
>>> uid=1000(darkness) gid=1000(darkness) groups=1000(darkness),0(root)
>>>
>>> this is still root's environment when run from Mesos (it is also weird that groups
>>> contains 0(root)):
>>> USER=root
>>>
>>> Again - I'm not sure how we su to a different user, but this usually happens
>>> if one does `su darkness` (instead of `su - darkness`) from the shell, at
>>> any rate.
>>>
>>> *Marco Massenzio*
>>> *Distributed Systems Engineer*
>>>
>>> On Mon, May 11, 2015 at 6:54 AM, John Omernik  wrote:
>>>
 Paul: I checked in multiple places and I don't see rootsquash being
 used. I am using the MapR NFS server, and I do not believe that is a common
 option in the default setup (I will follow up more closely on that).

 Adam and Maxime: I included the output of both id (instead of
 whoami) and env (as seen below), and I believe your ideas may be
 getting somewhere. There are a number of things that strike me as odd in
 the outputs, and I'd like your thoughts on them. First of all, remember
 that the permissions on the folders are 775 right now, so with the primary
 group set (which it appears to be, based on id) and the user set, it still
 should have write access. That said, the SUed process doesn't have any of
 the other groups (and I want to test whether any of those controls access,
 especially with MapR). At the risk of exposing too much information about my
 test network in a public forum, I left all the details in the ENV to see if
 there is anything others may spot that could be causing me issues.

 Thanks for the replies so far!





 *New Script:*

 #!/bin/bash

 echo "Writing id information to stderr for one stop logging" 1>&2

 id 1>&2


 echo "" 1>&2


 echo "Printing out the env command to std err for one stop loggins" 1>&
 2

 env 1>&2


 mkdir /mapr/brewpot/mesos/storm/test/test1

 touch /mapr/brewpot/mesos/storm/test/test1/testing.go





 *Run within Mesos:*

 I0511 08:41:02.804448  8048 exec.cpp:132] Version: 0.21.0
 I0511 08:41:02.814324  8059 exec.cpp:206] Executor registered on slave
 20150505-145508-1644210368-5050-8608-S2
 Writing id information to stderr for one stop logging
 uid=1000(darkness) gid=1000(darkness) groups=1000(darkness),0(root)

 Printing out the env command to std err for one stop loggins
 LIBPROCESS_IP=192.168.0.98
 HOST=hadoopmapr3.brewingintel.com
 SHELL=/bin/bash
 TERM=unknown
 PORT_10005=31783

 MESOS_DIRECTORY=/tmp/mesos/slaves/20150505-145508-1644210368-5050-8608-S2/frameworks/20150302-094409-1644210368-5050-2134-0003/executors/permtest.5f822976-f7e3-11e4-a22d-56847afe9799/runs/e53dc010-dd3c-4993-8f39-f8b532e5cf8b
 PORT0=31783
 MESOS_TASK_ID=permtest.5f822976-f7e3-11e4-a22d-56847afe9799
 USER=root
 LD_LIBRARY_PATH=:/usr/local/lib
 SUDO_USER=darkness
 MESOS_EXECUTOR_ID=permtest.5f822976-f7e3-11e4-a22d-56847afe9799
 SUDO_UID=1000
 USERNAME=root

 PATH=/home/darkness:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
 MAIL=/var/mail/root

 PWD=/opt/mapr/mesos/tmp/slave/slaves/20150505-145508-1644210368-5050-8608-S2/frameworks/20150302-094409-1644210368-5050-2134-0003/executors/permtest.5f822976-f7e3-11e4-a22d-56

RE: cluster confusion after zookeeper blip

2015-05-18 Thread Nikolay Borodachev
Have you tried restarting the Marathon and Mesos processes? Once you restart them,
they should pick up ZooKeeper, elect leaders, etc.
If you're using Docker containers, they should reattach themselves to the
respective slaves.

Thanks
Nikolay

-Original Message-
From: rasput...@gmail.com [mailto:rasput...@gmail.com] On Behalf Of Dick Davies
Sent: Monday, May 18, 2015 5:26 AM
To: user@mesos.apache.org
Subject: cluster confusion after zookeeper blip

We run a 3-node Marathon cluster on top of 3 Mesos masters + 6 slaves
(Mesos 0.21.0, Marathon 0.7.5).

This morning we had a network outage long enough for everything to lose
ZooKeeper.
Now our Marathon UI is empty (all 3 Marathons think someone else is the master,
and Marathon's 'proxy to leader' feature means the REST API is toast).

The odd thing is that, at the Mesos level, the
Mesos master UI shows no tasks running (the logs mention orphaned tasks), but if I
click into the 'slaves' tab and dig down, the slave view details tasks that are
in fact active.

Is there any way to bring order to this without needing to kill those tasks? We have no
actual outage from a user point of view, but the cluster itself is pretty
confused and our service discovery relies on the Marathon API, which is timing
out.

Although Mesos has checkpointing enabled, Marathon isn't running with
checkpointing on (it's the default now but apparently doesn't apply to existing
frameworks, and we started this around Marathon 0.4.x).

Would enabling checkpointing help with this kind of issue? If so, how do I
enable it for an existing framework?


mesos slave doesn't pick up tasks after restart

2015-05-18 Thread Grzegorz Graczyk
3-node cluster
CoreOS 675.0.0
Mesos 0.22.1
Marathon 0.8.2-RC2

Everything is run in containers; the mesos slave is run using this command: 
/usr/bin/docker run \
--rm \
--net=host \
--pid=host \
--name slave \
-v /data/server/mesos-slave:/data/mesos-slave \
-v /root/.dockercfg:/etc/.dockercfg \
--privileged \
-v /usr/lib64/libdevmapper.so.1.02:/usr/lib/libdevmapper.so.1.02 \
-v /var/run/docker.sock:/var/run/docker.sock \
-v /usr/bin/docker:/usr/local/bin/docker \
-v /sys/fs/cgroup:/host/sys/fs/cgroup \
-e GLOG_v=1 \
mesosphere/mesos-slave:0.22.1-1.0.ubuntu1404 --containerizers=docker,mesos 
--master=zk://`/get-zookeeper-peers.sh`/mesos --hostname=private.`hostname` 
--ip=${ENS224_IPV4} --resources=\"ports(*):[31000-32000]\" 
--cgroups_hierarchy=/host/sys/fs/cgroup --work_dir=/data/mesos-slave 
--logging_level=INFO

After a slave restart it successfully re-registers with the mesos master, then it kills 
all tasks and starts them again. 
The same happens when using the mesos containerizer.

Full logs: https://gist.github.com/gregory90/dd6930495fd655cf6691 


Any help appreciated.

cluster confusion after zookeeper blip

2015-05-18 Thread Dick Davies
We run a 3-node Marathon cluster on top of 3 Mesos masters + 6 slaves
(Mesos 0.21.0, Marathon 0.7.5).

This morning we had a network outage long enough for everything to
lose ZooKeeper.
Now our Marathon UI is empty (all 3 Marathons think someone else is the
master, and
Marathon's 'proxy to leader' feature means the REST API is toast).

The odd thing is that, at the Mesos level, the
Mesos master UI shows no tasks running (the logs mention orphaned tasks),
but if I click into the 'slaves' tab and dig down, the slave view details tasks
that are in fact active.

Is there any way to bring order to this without needing to kill those tasks? We
have no actual outage from a user point of view, but the cluster
itself is pretty confused and our service discovery relies on the
Marathon API, which is timing out.

Although Mesos has checkpointing enabled, Marathon isn't running with
checkpointing on (it's the default now but apparently doesn't apply to existing
frameworks, and we started this around Marathon 0.4.x).

Would enabling checkpointing help with this kind of issue? If so, how
do I enable it for an existing framework?
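
(For what it's worth, my understanding is that the flag itself is just passed when
the scheduler starts, something like the sketch below with placeholder ZooKeeper
addresses; whether an already-registered framework actually picks it up is exactly
what I'm unsure about.)

marathon --master zk://zk1:2181,zk2:2181,zk3:2181/mesos --checkpoint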


Re: Medallia powered by Mesos

2015-05-18 Thread Adam Bordelon
I have added Medallia to the Mesos adopters list. It will show up in the
next website update.
Thanks for using Mesos! See you at MesosCon?

On Sun, May 17, 2015 at 4:56 PM, Anirudha Jadhav  wrote:

> +1
>
> On Mon, May 18, 2015 at 2:15 AM, Mauricio Garavaglia <
> mauri...@medallia.com> wrote:
>
>> Hello guys,
>>
>> At Medallia we are using Mesos as the foundation for our new
>> microservices architecture; we just started with a small QA cluster of 64
>> nodes (about 2,000 CPUs and 15 TB of RAM).
>>
>> Please add us to the powered-by-mesos page, Thanks!
>>
>>
>> Mauricio Garavaglia
>> Senior Software Engineer
>> Medallia - www.medallia.com
>>
>
>
>
> --
> Anirudha P. Jadhav
>