Re: [VOTE] Release Apache Mesos 1.6.1 (rc2)

2018-07-25 Thread Stephan Erb
The vote for 1.6.1 appears to have passed. Any chance we can get this released 
soon?

Thanks!


On 19.07.18, 01:11, "Gastón Kleiman"  wrote:

+1 (binding)

Tested on our internal CI. All green!
Also tested on CentOS 7, where the following tests failed:

[  FAILED  ] DockerContainerizerTest.ROOT_DOCKER_Launch_Executor
[  FAILED  ] CgroupsIsolatorTest.ROOT_CGROUPS_CFS_EnableCfs
[  FAILED  ] CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_Listen
[  FAILED  ] NvidiaGpuTest.ROOT_INTERNET_CURL_CGROUPS_NVIDIA_GPU_NvidiaDockerImage
[  FAILED  ] bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0, where GetParam() = true

They are all known to be flaky.
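For anyone who wants to double-check one of these locally, an individual test
can be re-run straight from the build tree; a minimal sketch (test binary path
as in an autotools build directory; ROOT_-prefixed tests have to run as root,
and the Docker one additionally needs a working Docker daemon):

# Re-run a single (flaky) test in isolation from the Mesos build directory.
sudo ./src/mesos-tests --gtest_filter='DockerContainerizerTest.ROOT_DOCKER_Launch_Executor'
# Optionally repeat it a few times to gauge flakiness.
sudo ./src/mesos-tests --gtest_filter='CgroupsIsolatorTest.ROOT_CGROUPS_CFS_EnableCfs' --gtest_repeat=10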

On Wed, Jul 11, 2018 at 6:15 PM Greg Mann  wrote:

> Hi all,
>
> Please vote on releasing the following candidate as Apache Mesos 1.6.1.
>
>
> 1.6.1 includes the following:
>
> *Announce major features here*
> *Announce major bug fixes here*
>
> The CHANGELOG for the release is available at:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.6.1-rc2
>
> The candidate for Mesos 1.6.1 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc2/mesos-1.6.1.tar.gz
>
> The tag to be voted on is 1.6.1-rc2:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.6.1-rc2
>
> The SHA512 checksum of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc2/mesos-1.6.1.tar.gz.sha512
>
> The signature of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc2/mesos-1.6.1.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1230
>
> Please vote on releasing this package as Apache Mesos 1.6.1!
>
> The vote is open until Mon Jul 16 18:15:00 PDT 2018 and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Mesos 1.6.1
> [ ] -1 Do not release this package because ...
>
> Thanks,
> Greg
>
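For reference, the artifacts listed above can be checked with the usual tools;
a minimal sketch, assuming the tarball, the .sha512 and .asc files, and the
KEYS file have been downloaded into the current directory:

# Compare the locally computed digest against the published .sha512 file.
sha512sum mesos-1.6.1.tar.gz
cat mesos-1.6.1.tar.gz.sha512

# Import the release keys and verify the detached signature.
gpg --import KEYS
gpg --verify mesos-1.6.1.tar.gz.asc mesos-1.6.1.tar.gz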




Re: Mesos replicated log fills disk with logging output

2018-01-10 Thread Stephan Erb
Thanks for the hint! The cluster is using ext4, and judging from the linked 
thread this could indeed have been caused by a stalling hypervisor.

From: Jie Yu <yujie@gmail.com>
Reply-To: "user@mesos.apache.org" <user@mesos.apache.org>
Date: Monday, 8. January 2018 at 23:36
To: user <user@mesos.apache.org>
Subject: Re: Mesos replicated log fills disk with logging output

Stephan,

I haven't seen that before. A quick Google search suggests that it might be 
related to leveldb; the following thread looks relevant:
https://groups.google.com/d/msg/leveldb/lRrbv4Y0YgU/AtfRTfQXNoYJ

What is the filesystem you're using?

- Jie

On Mon, Jan 8, 2018 at 2:28 PM, Stephan Erb <stephan@blue-yonder.com> wrote:
Hi everyone,

A few days ago we bumped into an interesting issue that we had not seen 
before. Essentially, one of our toy clusters dissolved itself:

·  3 masters, each running Mesos (1.2.1), Aurora (0.19.0), and ZooKeeper 
(3.4.5) for leader election
·  Master 1 and Master 2 had 100% disk usage, because 
/var/lib/mesos/replicated_log/LOG had grown to about 170 GB
·  The replicated log of both Master 1 and 2 was corrupted. A process restart 
did not fix it.
·  The ZooKeeper on Master 2 was corrupted as well. Logs indicated this was 
caused by the full disk.
·  Master 3 was the leading Mesos master and healthy. Its disk usage was normal.


The content of /var/lib/mesos/replicated_log/LOG was an endless stream of:

2018/01/04-12:30:56.776466 7f65aae877c0 Recovering log #1753
2018/01/04-12:30:56.776577 7f65aae877c0 Level-0 table #1756: started
2018/01/04-12:30:56.778885 7f65aae877c0 Level-0 table #1756: 7526 bytes OK
2018/01/04-12:30:56.782433 7f65aae877c0 Delete type=0 #1753
2018/01/04-12:30:56.782484 7f65aae877c0 Delete type=3 #1751
2018/01/04-12:30:56.782642 7f6597fff700 Level-0 table #1759: started
2018/01/04-12:30:56.782686 7f6597fff700 Level-0 table #1759: 0 bytes OK
2018/01/04-12:30:56.783242 7f6597fff700 Delete type=0 #1757
2018/01/04-12:30:56.783312 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.783499 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.783538 7f6597fff700 Delete type=2 #1760
2018/01/04-12:30:56.783563 7f6597fff700 Compaction error: IO error: 
/var/lib/mesos/replicated_log/001735.sst: No such file or directory
2018/01/04-12:30:56.783598 7f6597fff700 Manual compaction at level-0 from 
(begin) .. (end); will stop at '003060' @ 9423 : 1
2018/01/04-12:30:56.783607 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.783698 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.783728 7f6597fff700 Delete type=2 #1761
2018/01/04-12:30:56.783749 7f6597fff700 Compaction error: IO error: 
/var/lib/mesos/replicated_log/001735.sst: No such file or directory
2018/01/04-12:30:56.783770 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.783900 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.783929 7f6597fff700 Delete type=2 #1762
2018/01/04-12:30:56.783950 7f6597fff700 Compaction error: IO error: 
/var/lib/mesos/replicated_log/001735.sst: No such file or directory
2018/01/04-12:30:56.783970 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.784312 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.785547 7f6597fff700 Delete type=2 #1763

Content of the associated folder:

/var/lib/mesos/replicated_log.corrupted# ls -la
total 964480
drwxr-xr-x 2 mesos mesos  4096 Jan  5 10:12 .
drwxr-xr-x 4 mesos mesos  4096 Jan  5 10:27 ..
-rw-r--r-- 1 mesos mesos   724 Dec 14 16:22 001735.ldb
-rw-r--r-- 1 mesos mesos  7393 Dec 14 16:45 001737.sst
-rw-r--r-- 1 mesos mesos 22129 Jan  3 12:53 001742.sst
-rw-r--r-- 1 mesos mesos 14967 Jan  3 13:00 001747.sst
-rw-r--r-- 1 mesos mesos  7526 Jan  4 12:30 001756.sst
-rw-r--r-- 1 mesos mesos 15113 Jan  5 10:08 001765.sst
-rw-r--r-- 1 mesos mesos 65536 Jan  5 10:09 001767.log
-rw-r--r-- 1 mesos mesos 16 Jan  5 10:08 CURRENT
-rw-r--r-- 1 mesos mesos 0 Aug 25  2015 LOCK
-rw-r--r-- 1 mesos mesos 178303865220 Jan  5 10:12 LOG
-rw-r--r-- 1 mesos mesos 463093282 Jan  5 10:08 LOG.old
-rw-r--r-- 1 mesos mesos 65536 Jan  5 10:08 MANIFEST-001764

Monitoring indicates that the disk usage started to grow shortly after a badly 
coordinated deployment of a configuration change:

·  Master 1 was leading and restarted after a few hours of uptime
·  Master 2 was now leading. After a few seconds (30s-60s or so) it got 
restarted as well
·  Master 3 was now leading (and continued to do so)

I have to admit I am a bit surprised that the restart scenario could lead to 
the issues described above. Has anyone else seen similar issues?
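For reference, recovering a single corrupted replica usually amounts to moving
the bad log directory aside and letting the restarted master catch up from the
remaining quorum (presumably how the replicated_log.corrupted directory above
came about). A sketch, using the paths from above and a hypothetical systemd
setup:

# Only safe while the other masters still form a healthy quorum.
systemctl stop mesos-master
mv /var/lib/mesos/replicated_log /var/lib/mesos/replicated_log.corrupted
systemctl start mesos-master   # the empty replica re-syncs its state from the quorum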

Thanks and best regards,
Stephan



Mesos replicated log fills disk with logging output

2018-01-08 Thread Stephan Erb
Hi everyone,

A few days ago we bumped into an interesting issue that we had not seen 
before. Essentially, one of our toy clusters dissolved itself:


  *   3 masters, each running Mesos (1.2.1), Aurora (0.19.0), and ZooKeeper 
(3.4.5) for leader election
  *   Master 1 and Master 2 had 100% disk usage, because 
/var/lib/mesos/replicated_log/LOG had grown to about 170 GB
  *   The replicated log of both Master 1 and 2 was corrupted. A process 
restart did not fix it.
  *   The ZooKeeper on Master 2 was corrupted as well. Logs indicated this was 
caused by the full disk.
  *   Master 3 was the leading Mesos master and healthy. Its disk usage was 
normal.


The content of /var/lib/mesos/replicated_log/LOG was an endless stream of:

2018/01/04-12:30:56.776466 7f65aae877c0 Recovering log #1753
2018/01/04-12:30:56.776577 7f65aae877c0 Level-0 table #1756: started
2018/01/04-12:30:56.778885 7f65aae877c0 Level-0 table #1756: 7526 bytes OK
2018/01/04-12:30:56.782433 7f65aae877c0 Delete type=0 #1753
2018/01/04-12:30:56.782484 7f65aae877c0 Delete type=3 #1751
2018/01/04-12:30:56.782642 7f6597fff700 Level-0 table #1759: started
2018/01/04-12:30:56.782686 7f6597fff700 Level-0 table #1759: 0 bytes OK
2018/01/04-12:30:56.783242 7f6597fff700 Delete type=0 #1757
2018/01/04-12:30:56.783312 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.783499 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.783538 7f6597fff700 Delete type=2 #1760
2018/01/04-12:30:56.783563 7f6597fff700 Compaction error: IO error: 
/var/lib/mesos/replicated_log/001735.sst: No such file or directory
2018/01/04-12:30:56.783598 7f6597fff700 Manual compaction at level-0 from 
(begin) .. (end); will stop at '003060' @ 9423 : 1
2018/01/04-12:30:56.783607 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.783698 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.783728 7f6597fff700 Delete type=2 #1761
2018/01/04-12:30:56.783749 7f6597fff700 Compaction error: IO error: 
/var/lib/mesos/replicated_log/001735.sst: No such file or directory
2018/01/04-12:30:56.783770 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.783900 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.783929 7f6597fff700 Delete type=2 #1762
2018/01/04-12:30:56.783950 7f6597fff700 Compaction error: IO error: 
/var/lib/mesos/replicated_log/001735.sst: No such file or directory
2018/01/04-12:30:56.783970 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.784312 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.785547 7f6597fff700 Delete type=2 #1763

Content of the associated folder:

/var/lib/mesos/replicated_log.corrupted# ls -la
total 964480
drwxr-xr-x 2 mesos mesos  4096 Jan  5 10:12 .
drwxr-xr-x 4 mesos mesos  4096 Jan  5 10:27 ..
-rw-r--r-- 1 mesos mesos   724 Dec 14 16:22 001735.ldb
-rw-r--r-- 1 mesos mesos  7393 Dec 14 16:45 001737.sst
-rw-r--r-- 1 mesos mesos 22129 Jan  3 12:53 001742.sst
-rw-r--r-- 1 mesos mesos 14967 Jan  3 13:00 001747.sst
-rw-r--r-- 1 mesos mesos  7526 Jan  4 12:30 001756.sst
-rw-r--r-- 1 mesos mesos 15113 Jan  5 10:08 001765.sst
-rw-r--r-- 1 mesos mesos 65536 Jan  5 10:09 001767.log
-rw-r--r-- 1 mesos mesos 16 Jan  5 10:08 CURRENT
-rw-r--r-- 1 mesos mesos 0 Aug 25  2015 LOCK
-rw-r--r-- 1 mesos mesos 178303865220 Jan  5 10:12 LOG
-rw-r--r-- 1 mesos mesos 463093282 Jan  5 10:08 LOG.old
-rw-r--r-- 1 mesos mesos 65536 Jan  5 10:08 MANIFEST-001764

Monitoring indicates that the disk usage started to grow shortly after a badly 
coordinated deployment of a configuration change:


  *   Master 1 was leading and restarted after a few hours of uptime
  *   Master 2 was now leading. After a few seconds (30s-60s or so) it got 
restarted as well
  *   Master 3 was now leading (and continued to do so)

I have to admit I am a bit surprised that the restart scenario could lead to 
the issues described above. Has anyone else seen similar issues?

Thanks and best regards,
Stephan


Re: Problems with OOM

2014-10-07 Thread Stephan Erb
Seems like there is a workaround: I can emulate my desired configuration 
(preventing swap usage) by disabling swap on the host and starting the 
slave without --cgroups_limit_swap. Then everything works as expected, 
i.e., a misbehaving task is killed immediately.
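Concretely, the workaround looks like this; a sketch, with the ZooKeeper URL
left as a placeholder:

# Disable swap on the host entirely ...
swapoff -a
# ... and start the slave with cgroups isolation but without --cgroups_limit_swap.
/usr/local/sbin/mesos-slave \
    --master=zk://... \
    --isolation='cgroups/cpu,cgroups/mem'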


However, I still don't know why 'cgroups_limit_swap' is not working as 
advertised.


Best Regards,
Stephan

On 07.10.2014 12:29, Stephan Erb wrote:
Ok, here is something odd. My kernel is booted using 
cgroup_enable=memory swapaccount=1 in order to enable cgroup accounting.


The log for starting a new container:
I1007 11:38:25.881882  3698 slave.cpp:1222] Queuing task 
'1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1' 
for executor 
thermos-1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1
 of framework '20140919-174559-16842879-5050-27194-
I1007 11:38:25.891448  3696 cpushare.cpp:338] Updated 'cpu.shares' to 1280 
(cpus 1.25) for container 866af1d4-14df-4e55-be5d-a54e2a573cd7
I1007 11:38:25.892354  3695 mem.cpp:479] Started listening for OOM events for 
container 866af1d4-14df-4e55-be5d-a54e2a573cd7
I1007 11:38:25.894224  3695 mem.cpp:293] Updated 'memory.soft_limit_in_bytes' 
to 628MB for container 866af1d4-14df-4e55-be5d-a54e2a573cd7
I1007 11:38:25.897894  3695 mem.cpp:347] Updated 'memory.memsw.limit_in_bytes' 
to 628MB for container 866af1d4-14df-4e55-be5d-a54e2a573cd7
I1007 11:38:25.901499  3693 linux_launcher.cpp:191] Cloning child process with 
flags = 0
I1007 11:38:25.982059  3693 containerizer.cpp:678] Checkpointing executor's 
forked pid 3985 to 
'/var/lib/mesos/meta/slaves/20141007-113221-16842879-5050-2279-0/frameworks/20140919-174559-16842879-5050-27194-/executors/thermos-1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1/runs/866af1d4-14df-4e55-be5d-a54e2a573cd7/pids/forked.pid'
I1007 11:38:26.170440  3696 containerizer.cpp:510] Fetching URIs for container 
'866af1d4-14df-4e55-be5d-a54e2a573cd7' using command 
'/usr/local/libexec/mesos/mesos-fetcher'
I1007 11:38:26.796327  3692 slave.cpp:2538] Monitoring executor 
'thermos-1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1'
 of framework '20140919-174559-16842879-5050-27194-' in container 
'866af1d4-14df-4e55-be5d-a54e2a573cd7'
I1007 11:38:27.611901  3691 slave.cpp:1733] Got registration for executor 
'thermos-1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1'
 of framework 20140919-174559-16842879-5050-27194- from 
executor(1)@127.0.1.1:39709
I1007 11:38:27.612476  3691 slave.cpp:1819] Checkpointing executor pid 
'executor(1)@127.0.1.1:39709' to 
'/var/lib/mesos/meta/slaves/20141007-113221-16842879-5050-2279-0/frameworks/20140919-174559-16842879-5050-27194-/executors/thermos-1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1/runs/866af1d4-14df-4e55-be5d-a54e2a573cd7/pids/libprocess.pid'
I1007 11:38:27.614302  3691 slave.cpp:1853] Flushing queued task 
1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1 for 
executor 
'thermos-1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1'
 of framework 20140919-174559-16842879-5050-27194-
I1007 11:38:27.615567  3697 cpushare.cpp:338] Updated 'cpu.shares' to 1280 
(cpus 1.25) for container 866af1d4-14df-4e55-be5d-a54e2a573cd7
I1007 11:38:27.615622  3694 mem.cpp:293] Updated 'memory.soft_limit_in_bytes' 
to 628MB for container 866af1d4-14df-4e55-be5d-a54e2a573cd7
I1007 11:38:27.630520  3694 slave.cpp:2088] Handling status update 
TASK_STARTING (UUID: 177f83dd-6669-4ead-8e42-95030e5723e4) for task 
1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1 of 
framework 20140919-174559-16842879-5050-27194- from 
executor(1)@127.0.1.1:39709

But when inspecting the limits of my container, they are not enforced 
as expected:


# cat 866af1d4-14df-4e55-be5d-a54e2a573cd7/memory.soft_limit_in_bytes
658505728
# cat 866af1d4-14df-4e55-be5d-a54e2a573cd7/memory.limit_in_bytes
9223372036854775807
# cat 866af1d4-14df-4e55-be5d-a54e2a573cd7/memory.memsw.limit_in_bytes
9223372036854775807

Shouldn't the memsw.limit_in_bytes be set as well?
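One way to double-check that swap accounting is really active on the running
kernel; a sketch (the cgroup mount point may differ per distribution):

# The kernel command line should contain swapaccount=1 ...
grep -o 'swapaccount=1' /proc/cmdline
# ... and the memsw control files should then exist in the memory cgroup hierarchy.
ls /sys/fs/cgroup/memory/memory.memsw.limit_in_bytes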

Best Regards,
Stephan


On 06.10.2014 18:56, Stephan Erb wrote:

Hello,

I am still facing the same issue:

  * My process keeps allocating memory until all available system
memory is used, but it is never killed. Its sandbox is limited to
x00 MB but it ends up using several GB.
  * There is no OOM or cgroup related entry in dmesg (besides the
initialization, i.e., Initializing cgroup subsys memory...)
  * The slave log contains nothing suspicious (see the attached logfile)

Updating my Debian kernel from 3.2 to a backported 3.16 kernel did 
not help. The system is more responsive under load, but the OOM 
killer is still not triggered. I haven't tried running kernelshark on 
any of these kernels, yet.


The slave command line in use: /usr/local/sbin/mesos-slave 
--master=zk

Problems with OOM

2014-09-26 Thread Stephan Erb

Hi everyone,

I am having issues with the cgroups isolation of Mesos. It seems like 
tasks are prevented from allocating more memory than their limit. 
However, they are never killed.


 * My scheduled task allocates memory in a tight loop. According to
   'ps', once it exceeds its memory limit it is not killed,
   but ends up in the state D (uninterruptible sleep (usually IO)).
 * The task is still considered running by Mesos.
 * There is no indication of an OOM in dmesg.
 * There is neither an OOM notice nor any other output related to the
   task in the slave log.
 * According to htop, the system load is increased with a significant
   portion of CPU time spent in the kernel. Commonly the load is so
   high that all ZooKeeper connections time out.

I am running Aurora and Mesos 0.20.1 using the cgroups isolation on 
Debian 7 (kernel 3.2.60-1+deb7u3).
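For what it's worth, the misbehaving workload boils down to a trivial memory
hog along these lines (a sketch, not the actual task):

# Allocate memory in a tight loop until the limit should kick in.
python -c '
buf = []
while True:
    buf.append(" " * 10**7)   # grab roughly 10 MB per iteration
'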


Sorry for the somewhat unspecific error description. Still, does anyone have 
an idea what might be wrong here?


Thanks and Best Regards,
Stephan


Mesos.interface python package

2014-09-26 Thread Stephan Erb

Hello,

could the owner of https://pypi.python.org/pypi/mesos.interface please 
be so kind as to upload the latest version, 0.20.1, to PyPI?


Otherwise the (awesome) egg-files by Mesosphere cannot be installed.
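For context, the missing piece is simply being able to resolve the matching
interface package from PyPI; a sketch of what should work once the upload has
happened (hypothetical until then):

# Fails today because 0.20.1 is not yet on PyPI; the Mesosphere mesos.native
# egg of the same version needs it.
pip install mesos.interface==0.20.1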

Thanks very much!
Stephan



Re: Problems with OOM

2014-09-26 Thread Stephan Erb
@Tomas: I am currently only running a single slave in a VM. It uses the 
cgroups isolator and its logs are clean.

@Tom: Thanks for the interesting hint! I will look into it.

Best Regards,
Stephan

On Fri 26 Sep 2014 16:53:22 CEST, Tom Arnfeld wrote:

I'm not sure if this is at all related to the issue you're seeing, but we
ran into this fun issue (or at least this seems to be the cause),
helpfully documented in this blog article:
http://blog.nitrous.io/2014/03/10/stability-and-a-linux-oom-killer-bug.html.

TLDR: OOM killer getting into an infinite loop, causing the CPU to
spin out of control on our VMs.

More details in this commit message to the OOM killer earlier this
year:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=0c740d0afc3bff0a097ad03a1c8df92757516f5c

Hope this helps somewhat...

On 26 September 2014 14:15, Tomas Barton barton.to...@gmail.com wrote:

Just to make sure, all slaves are running with:

--isolation='cgroups/cpu,cgroups/mem'

Is there something suspicious in mesos slave logs?

On 26 September 2014 13:20, Stephan Erb stephan@blue-yonder.com wrote:

Hi everyone,

I am having issues with the cgroups isolation of Mesos. It
seems like tasks are prevented from allocating more memory
than their limit. However, they are never killed.

  * My scheduled task allocates memory in a tight loop.
According to 'ps', once it exceeds its memory limit it is
not killed, but ends up in the state D
(uninterruptible sleep (usually IO)).
  * The task is still considered running by Mesos.
  * There is no indication of an OOM in dmesg.
  * There is neither an OOM notice nor any other output
related to the task in the slave log.
  * According to htop, the system load is increased with a
significant portion of CPU time spent in the kernel.
Commonly the load is so high that all ZooKeeper
connections time out.

I am running Aurora and Mesos 0.20.1 using the cgroups
isolation on Debian 7 (kernel 3.2.60-1+deb7u3).

Sorry for the somewhat unspecific error description. Still,
does anyone have an idea what might be wrong here?

Thanks and Best Regards,
Stephan







Re: Mesos 12.04 Python2.7 Egg

2014-09-16 Thread Stephan Erb

Did you find a solution to your problem?

I am currently having similar issues when trying to run the thermos 
executor on Debian 7, which doesn't ship GLIBC 2.16 either. Seems like 
we have to patch the Aurora build process (probably in 
3rdparty/python/BUILD) to download the correct eggs from mesosphere.io 
instead of using the default ones on PyPI.


Does anyone have experience with how to do this?

Thanks,
Stephan


On Sat 30 Aug 2014 08:08:24 CEST, Joe Smith wrote:

Howdy all,

I'm migrating Apache Aurora
http://aurora.incubator.apache.org/ to Mesos 0.20.0 [1][2], but am
having an issue using the published dist on PyPI
https://pypi.python.org/pypi/mesos.native/0.20.0. I also tried the
mesosphere-provided (thank you!) egg
http://mesosphere.io/downloads/#apache-mesos-0.20.0 for Ubuntu
12.04, and am getting the same stack trace:

vagrant@192:~$
PYTHONPATH=/home/vagrant/.pex/install/mesos.native-0.20.0-py2.7-linux-x86_64.egg.be6632b790cd03172f858e7f875cdab4ef415ca5/mesos.native-0.20.0-py2.7-linux-x86_64.egg/mesos/
python2.7
Python 2.7.3 (default, Feb 27 2014, 19:58:35)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import mesos
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named mesos
>>> import native
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File
"/home/vagrant/.pex/install/mesos.native-0.20.0-py2.7-linux-x86_64.egg.be6632b790cd03172f858e7f875cdab4ef415ca5/mesos.native-0.20.0-py2.7-linux-x86_64.egg/mesos/native/__init__.py",
line 17, in <module>
from ._mesos import MesosExecutorDriverImpl
ImportError: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.16' not
found (required by
/home/vagrant/.pex/install/mesos.native-0.20.0-py2.7-linux-x86_64.egg.be6632b790cd03172f858e7f875cdab4ef415ca5/mesos.native-0.20.0-py2.7-linux-x86_64.egg/mesos/native/_mesos.so)


It looks like the issue is that it was built against a newer glibc than the system provides (if
I'm following right):

vagrant@192:~/mesos-0.20.0$ /lib/x86_64-linux-gnu/libc.so.6 | grep
release\ version
GNU C Library (Ubuntu EGLIBC 2.15-0ubuntu10) stable release version
2.15, by Roland McGrath et al.
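A quick way to confirm which glibc symbol versions the egg's native extension
actually requires; a sketch, with the path taken from the traceback above and
the hash directory shortened to a glob:

# List the GLIBC_* versions referenced by the bundled _mesos.so.
objdump -T /home/vagrant/.pex/install/mesos.native-0.20.0-*/mesos.native-0.20.0-py2.7-linux-x86_64.egg/mesos/native/_mesos.so \
  | grep -o 'GLIBC_[0-9.]*' | sort -u -V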

Any feedback or suggestions would be greatly appreciated!

Thanks,
Joe

[1] https://reviews.apache.org/r/25208/
[2] https://issues.apache.org/jira/browse/AURORA-674







Pitfalls when writing custom Frameworks

2014-08-31 Thread Stephan Erb
Hi everybody,

I would like to assess the effort required to write a custom framework.

Background: We have an application where we can start a flexible number
of long-running worker processes performing number-crunching. The more
processes the better. However, we have multiple users, each running an
instance of the application and therefore competing for resources (as
each tries to run as many worker processes as possible). 

For various reasons, we would like to run our application instances on
top of Mesos. There seem to be two ways to achieve this:

 A. Write a custom framework for our application that spawns the
worker processes on demand. Each user gets to run one framework
instance. We also need preemption of workers to achieve equality
among frameworks. We could achieve this using an external entity
monitoring all frameworks and telling the worst offenders to
scale down a little.
 B. Instead of writing a framework, use a service scheduler like
Marathon, Aurora or Singularity to spawn the worker processes.
Instead of just performing the scale-down, the external entity
would dictate the number of worker processes for each
application depending on its demand.


The first choice seems to be the natural fit for Mesos. However,
existing frameworks like Aurora seem to be battle-tested in regard to
high availability, race conditions, and issues like state reconciliation,
where the world views of scheduler and slaves drift apart.

So this question boils down to: when considering writing a custom
framework, which pitfalls do I have to be aware of? Can I get away with
blindly implementing the scheduler API? Or do I always have to implement
things like custom state reconciliation in order to prevent orphaned
tasks on slaves (for example, when my framework scheduler crashes or is
temporarily unavailable)?

Thanks for your input!

Best Regards,
Stephan