Re: [VOTE] Release Apache Mesos 1.6.1 (rc2)
The vote for 1.6.1 appears to have passed. Any chance we can get this released soon? Thanks!

On 19.07.18, 01:11, "Gastón Kleiman" wrote:

+1 (binding)

Tested on our internal CI. All green!

Tested on CentOS 7 and the following tests failed:

[ FAILED ] DockerContainerizerTest.ROOT_DOCKER_Launch_Executor
[ FAILED ] CgroupsIsolatorTest.ROOT_CGROUPS_CFS_EnableCfs
[ FAILED ] CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_Listen
[ FAILED ] NvidiaGpuTest.ROOT_INTERNET_CURL_CGROUPS_NVIDIA_GPU_NvidiaDockerImage
[ FAILED ] bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0, where GetParam() = true

They are all known to be flaky.

On Wed, Jul 11, 2018 at 6:15 PM Greg Mann wrote:

> Hi all,
>
> Please vote on releasing the following candidate as Apache Mesos 1.6.1.
>
> 1.6.1 includes the following:
>
> *Announce major features here*
> *Announce major bug fixes here*
>
> The CHANGELOG for the release is available at:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.6.1-rc2
>
> The candidate for Mesos 1.6.1 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc2/mesos-1.6.1.tar.gz
>
> The tag to be voted on is 1.6.1-rc2:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.6.1-rc2
>
> The SHA512 checksum of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc2/mesos-1.6.1.tar.gz.sha512
>
> The signature of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc2/mesos-1.6.1.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1230
>
> Please vote on releasing this package as Apache Mesos 1.6.1!
>
> The vote is open until Mon Jul 16 18:15:00 PDT 2018 and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Mesos 1.6.1
> [ ] -1 Do not release this package because ...
>
> Thanks,
> Greg
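Voting on a release candidate usually involves checking the artifacts against the published SHA512 file. A minimal sketch of that check in Python — the filenames are illustrative, the tarball and its .sha512 file are assumed to be downloaded already, and a full verification would additionally run gpg --verify against the KEYS file:

```python
import hashlib

def sha512_of(path, chunk_size=1 << 20):
    """Stream a file and return its SHA-512 hex digest."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def tarball_matches_checksum(tarball, checksum_file):
    """Compare a tarball against a .sha512 file; the first
    whitespace-separated field is assumed to be the hex digest."""
    with open(checksum_file) as f:
        expected = f.read().split()[0].lower()
    return sha512_of(tarball) == expected
```

Streaming in chunks keeps memory flat even for large release tarballs.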
Re: Mesos replicated log fills disk with logging output
Thanks for the hint! The cluster is using ext4, and judging from the linked thread this could indeed have been caused by a stalling hypervisor.

From: Jie Yu <yujie@gmail.com>
Reply-To: "user@mesos.apache.org" <user@mesos.apache.org>
Date: Monday, 8. January 2018 at 23:36
To: user <user@mesos.apache.org>
Subject: Re: Mesos replicated log fills disk with logging output

Stephan,

I haven't seen that before. A quick Google search suggests that it might be related to leveldb. The following thread might be related:
https://groups.google.com/d/msg/leveldb/lRrbv4Y0YgU/AtfRTfQXNoYJ

What is the filesystem you're using?

- Jie

On Mon, Jan 8, 2018 at 2:28 PM, Stephan Erb <stephan@blue-yonder.com> wrote:

Hi everyone,

a few days ago, we bumped into an interesting issue that we had not seen before. Essentially, one of our toy clusters dissolved itself:

* 3 masters, each running Mesos (1.2.1), Aurora (0.19.0), and ZooKeeper (3.4.5) for leader election
* Master 1 and Master 2 had 100% disk usage, because /var/lib/mesos/replicated_log/LOG had grown to about 170 GB
* The replicated log of both Master 1 and 2 was corrupted. A process restart did not fix it.
* The ZooKeeper on Master 2 was corrupted as well. Logs indicated this was caused by the full disk.
* Master 3 was the leading Mesos master and healthy. Its disk usage was normal.

The content of /var/lib/mesos/replicated_log/LOG was an endless stream of:

2018/01/04-12:30:56.776466 7f65aae877c0 Recovering log #1753
2018/01/04-12:30:56.776577 7f65aae877c0 Level-0 table #1756: started
2018/01/04-12:30:56.778885 7f65aae877c0 Level-0 table #1756: 7526 bytes OK
2018/01/04-12:30:56.782433 7f65aae877c0 Delete type=0 #1753
2018/01/04-12:30:56.782484 7f65aae877c0 Delete type=3 #1751
2018/01/04-12:30:56.782642 7f6597fff700 Level-0 table #1759: started
2018/01/04-12:30:56.782686 7f6597fff700 Level-0 table #1759: 0 bytes OK
2018/01/04-12:30:56.783242 7f6597fff700 Delete type=0 #1757
2018/01/04-12:30:56.783312 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.783499 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.783538 7f6597fff700 Delete type=2 #1760
2018/01/04-12:30:56.783563 7f6597fff700 Compaction error: IO error: /var/lib/mesos/replicated_log/001735.sst: No such file or directory
2018/01/04-12:30:56.783598 7f6597fff700 Manual compaction at level-0 from (begin) .. (end); will stop at '003060' @ 9423 : 1
2018/01/04-12:30:56.783607 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.783698 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.783728 7f6597fff700 Delete type=2 #1761
2018/01/04-12:30:56.783749 7f6597fff700 Compaction error: IO error: /var/lib/mesos/replicated_log/001735.sst: No such file or directory
2018/01/04-12:30:56.783770 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.783900 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.783929 7f6597fff700 Delete type=2 #1762
2018/01/04-12:30:56.783950 7f6597fff700 Compaction error: IO error: /var/lib/mesos/replicated_log/001735.sst: No such file or directory
2018/01/04-12:30:56.783970 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.784312 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.785547 7f6597fff700 Delete type=2 #1763

Content of the associated folder:

/var/lib/mesos/replicated_log.corrupted# ls -la
total 964480
drwxr-xr-x 2 mesos mesos         4096 Jan  5 10:12 .
drwxr-xr-x 4 mesos mesos         4096 Jan  5 10:27 ..
-rw-r--r-- 1 mesos mesos          724 Dec 14 16:22 001735.ldb
-rw-r--r-- 1 mesos mesos         7393 Dec 14 16:45 001737.sst
-rw-r--r-- 1 mesos mesos        22129 Jan  3 12:53 001742.sst
-rw-r--r-- 1 mesos mesos        14967 Jan  3 13:00 001747.sst
-rw-r--r-- 1 mesos mesos         7526 Jan  4 12:30 001756.sst
-rw-r--r-- 1 mesos mesos        15113 Jan  5 10:08 001765.sst
-rw-r--r-- 1 mesos mesos        65536 Jan  5 10:09 001767.log
-rw-r--r-- 1 mesos mesos           16 Jan  5 10:08 CURRENT
-rw-r--r-- 1 mesos mesos            0 Aug 25  2015 LOCK
-rw-r--r-- 1 mesos mesos 178303865220 Jan  5 10:12 LOG
-rw-r--r-- 1 mesos mesos    463093282 Jan  5 10:08 LOG.old
-rw-r--r-- 1 mesos mesos        65536 Jan  5 10:08 MANIFEST-001764

Monitoring indicates that the disk usage started to grow shortly after a badly coordinated configuration deployment change:

* Master 1 was leading and restarted after a few hours of uptime
* Master 2 was now leading. After a few seconds (30s-60s or so) it got restarted as well
* Master 3 was now leading (and continued to do so)

I have to admit I am a bit surprised that the restart scenario could lead to the issues described above. Has anyone seen similar issues as well?

Thanks and best regards,
Stephan
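Until the underlying leveldb behaviour is understood, a small disk watchdog on the replicated log directory could catch this failure mode before a full disk also takes ZooKeeper down. A sketch — the path is the one from the report, but the threshold and the idea of alerting on it are illustrative suggestions, not Mesos features:

```python
import os

def dir_size_bytes(path):
    """Sum the sizes of all regular files under path, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total

def replicated_log_healthy(path="/var/lib/mesos/replicated_log",
                           limit_bytes=10 * 1024 ** 3):
    """True while the replicated log directory stays below the limit;
    False means it is time to alert before the disk fills up."""
    return dir_size_bytes(path) < limit_bytes
```

Run periodically from cron or a monitoring agent, this would have flagged the 170 GB LOG file long before 100% disk usage.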
Mesos replicated log fills disk with logging output
Hi everyone,

a few days ago, we bumped into an interesting issue that we had not seen before. Essentially, one of our toy clusters dissolved itself:

* 3 masters, each running Mesos (1.2.1), Aurora (0.19.0), and ZooKeeper (3.4.5) for leader election
* Master 1 and Master 2 had 100% disk usage, because /var/lib/mesos/replicated_log/LOG had grown to about 170 GB
* The replicated log of both Master 1 and 2 was corrupted. A process restart did not fix it.
* The ZooKeeper on Master 2 was corrupted as well. Logs indicated this was caused by the full disk.
* Master 3 was the leading Mesos master and healthy. Its disk usage was normal.

The content of /var/lib/mesos/replicated_log/LOG was an endless stream of:

2018/01/04-12:30:56.776466 7f65aae877c0 Recovering log #1753
2018/01/04-12:30:56.776577 7f65aae877c0 Level-0 table #1756: started
2018/01/04-12:30:56.778885 7f65aae877c0 Level-0 table #1756: 7526 bytes OK
2018/01/04-12:30:56.782433 7f65aae877c0 Delete type=0 #1753
2018/01/04-12:30:56.782484 7f65aae877c0 Delete type=3 #1751
2018/01/04-12:30:56.782642 7f6597fff700 Level-0 table #1759: started
2018/01/04-12:30:56.782686 7f6597fff700 Level-0 table #1759: 0 bytes OK
2018/01/04-12:30:56.783242 7f6597fff700 Delete type=0 #1757
2018/01/04-12:30:56.783312 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.783499 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.783538 7f6597fff700 Delete type=2 #1760
2018/01/04-12:30:56.783563 7f6597fff700 Compaction error: IO error: /var/lib/mesos/replicated_log/001735.sst: No such file or directory
2018/01/04-12:30:56.783598 7f6597fff700 Manual compaction at level-0 from (begin) .. (end); will stop at '003060' @ 9423 : 1
2018/01/04-12:30:56.783607 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.783698 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.783728 7f6597fff700 Delete type=2 #1761
2018/01/04-12:30:56.783749 7f6597fff700 Compaction error: IO error: /var/lib/mesos/replicated_log/001735.sst: No such file or directory
2018/01/04-12:30:56.783770 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.783900 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.783929 7f6597fff700 Delete type=2 #1762
2018/01/04-12:30:56.783950 7f6597fff700 Compaction error: IO error: /var/lib/mesos/replicated_log/001735.sst: No such file or directory
2018/01/04-12:30:56.783970 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.784312 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.785547 7f6597fff700 Delete type=2 #1763

Content of the associated folder:

/var/lib/mesos/replicated_log.corrupted# ls -la
total 964480
drwxr-xr-x 2 mesos mesos         4096 Jan  5 10:12 .
drwxr-xr-x 4 mesos mesos         4096 Jan  5 10:27 ..
-rw-r--r-- 1 mesos mesos          724 Dec 14 16:22 001735.ldb
-rw-r--r-- 1 mesos mesos         7393 Dec 14 16:45 001737.sst
-rw-r--r-- 1 mesos mesos        22129 Jan  3 12:53 001742.sst
-rw-r--r-- 1 mesos mesos        14967 Jan  3 13:00 001747.sst
-rw-r--r-- 1 mesos mesos         7526 Jan  4 12:30 001756.sst
-rw-r--r-- 1 mesos mesos        15113 Jan  5 10:08 001765.sst
-rw-r--r-- 1 mesos mesos        65536 Jan  5 10:09 001767.log
-rw-r--r-- 1 mesos mesos           16 Jan  5 10:08 CURRENT
-rw-r--r-- 1 mesos mesos            0 Aug 25  2015 LOCK
-rw-r--r-- 1 mesos mesos 178303865220 Jan  5 10:12 LOG
-rw-r--r-- 1 mesos mesos    463093282 Jan  5 10:08 LOG.old
-rw-r--r-- 1 mesos mesos        65536 Jan  5 10:08 MANIFEST-001764

Monitoring indicates that the disk usage started to grow shortly after a badly coordinated configuration deployment change:

* Master 1 was leading and restarted after a few hours of uptime
* Master 2 was now leading. After a few seconds (30s-60s or so) it got restarted as well
* Master 3 was now leading (and continued to do so)

I have to admit I am a bit surprised that the restart scenario could lead to the issues described above. Has anyone seen similar issues as well?

Thanks and best regards,
Stephan
Re: Problems with OOM
Seems like there is a workaround: I can emulate my desired configuration to prevent swap usage by disabling swap on the host and starting the slave without --cgroups_limit_swap. Then everything works as expected, i.e., a misbehaving task is killed immediately.

However, I still don't know why '--cgroups_limit_swap' is not working as advertised.

Best Regards,
Stephan

On 07.10.2014 12:29, Stephan Erb wrote:

Ok, here is something odd. My kernel is booted using "cgroup_enable=memory swapaccount=1" in order to enable cgroup accounting.

The log for starting a new container:

I1007 11:38:25.881882 3698 slave.cpp:1222] Queuing task '1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1' for executor thermos-1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1 of framework '20140919-174559-16842879-5050-27194-
I1007 11:38:25.891448 3696 cpushare.cpp:338] Updated 'cpu.shares' to 1280 (cpus 1.25) for container 866af1d4-14df-4e55-be5d-a54e2a573cd7
I1007 11:38:25.892354 3695 mem.cpp:479] Started listening for OOM events for container 866af1d4-14df-4e55-be5d-a54e2a573cd7
I1007 11:38:25.894224 3695 mem.cpp:293] Updated 'memory.soft_limit_in_bytes' to 628MB for container 866af1d4-14df-4e55-be5d-a54e2a573cd7
I1007 11:38:25.897894 3695 mem.cpp:347] Updated 'memory.memsw.limit_in_bytes' to 628MB for container 866af1d4-14df-4e55-be5d-a54e2a573cd7
I1007 11:38:25.901499 3693 linux_launcher.cpp:191] Cloning child process with flags = 0
I1007 11:38:25.982059 3693 containerizer.cpp:678] Checkpointing executor's forked pid 3985 to '/var/lib/mesos/meta/slaves/20141007-113221-16842879-5050-2279-0/frameworks/20140919-174559-16842879-5050-27194-/executors/thermos-1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1/runs/866af1d4-14df-4e55-be5d-a54e2a573cd7/pids/forked.pid'
I1007 11:38:26.170440 3696 containerizer.cpp:510] Fetching URIs for container '866af1d4-14df-4e55-be5d-a54e2a573cd7' using command '/usr/local/libexec/mesos/mesos-fetcher'
I1007 11:38:26.796327 3692 slave.cpp:2538] Monitoring executor 'thermos-1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1' of framework '20140919-174559-16842879-5050-27194-' in container '866af1d4-14df-4e55-be5d-a54e2a573cd7'
I1007 11:38:27.611901 3691 slave.cpp:1733] Got registration for executor 'thermos-1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1' of framework 20140919-174559-16842879-5050-27194- from executor(1)@127.0.1.1:39709
I1007 11:38:27.612476 3691 slave.cpp:1819] Checkpointing executor pid 'executor(1)@127.0.1.1:39709' to '/var/lib/mesos/meta/slaves/20141007-113221-16842879-5050-2279-0/frameworks/20140919-174559-16842879-5050-27194-/executors/thermos-1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1/runs/866af1d4-14df-4e55-be5d-a54e2a573cd7/pids/libprocess.pid'
I1007 11:38:27.614302 3691 slave.cpp:1853] Flushing queued task 1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1 for executor 'thermos-1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1' of framework 20140919-174559-16842879-5050-27194-
I1007 11:38:27.615567 3697 cpushare.cpp:338] Updated 'cpu.shares' to 1280 (cpus 1.25) for container 866af1d4-14df-4e55-be5d-a54e2a573cd7
I1007 11:38:27.615622 3694 mem.cpp:293] Updated 'memory.soft_limit_in_bytes' to 628MB for container 866af1d4-14df-4e55-be5d-a54e2a573cd7
I1007 11:38:27.630520 3694 slave.cpp:2088] Handling status update TASK_STARTING (UUID: 177f83dd-6669-4ead-8e42-95030e5723e4) for task 1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1 of framework 20140919-174559-16842879-5050-27194- from executor(1)@127.0.1.1:39709

But when inspecting the limits of my container, they are not enforced as expected:

# cat 866af1d4-14df-4e55-be5d-a54e2a573cd7/memory.soft_limit_in_bytes
658505728
# cat 866af1d4-14df-4e55-be5d-a54e2a573cd7/memory.limit_in_bytes
9223372036854775807
# cat 866af1d4-14df-4e55-be5d-a54e2a573cd7/memory.memsw.limit_in_bytes
9223372036854775807

Shouldn't memsw.limit_in_bytes be set as well?

Best Regards,
Stephan

On 06.10.2014 18:56, Stephan Erb wrote:

Hello,

I am still facing the same issue:

* My process keeps allocating memory until all available system memory is used, but it is never killed. Its sandbox is limited to x00 MB but it ends up using several GB.
* There is no OOM or cgroup related entry in dmesg (besides the initialization, i.e., "Initializing cgroup subsys memory...")
* The slave log contains nothing suspicious (see the attached logfile)

Updating my Debian kernel from 3.2 to a backported 3.16 kernel did not help. The system is more responsive under load, but the OOM killer is still not triggered. I haven't tried running kernelshark on any of these kernels yet.

My used slave command line: /usr/local/sbin/mesos-slave --master=zk
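The discrepancy between the slave log and the actual cgroup files can be checked mechanically. A small sketch that inspects a container's memory cgroup, assuming the cgroup-v1 file names shown in the output above (the cgroup mount point and container directory are illustrative and vary by distro):

```python
import os

# The value the kernel reports when no limit was set (2**63 - 1),
# matching the 9223372036854775807 seen in the output above.
CGROUP_UNLIMITED = 9223372036854775807

def read_limit(cgroup_dir, name):
    """Read one cgroup memory limit file and return its value in bytes."""
    with open(os.path.join(cgroup_dir, name)) as f:
        return int(f.read().strip())

def memsw_limit_applied(cgroup_dir):
    """True if memory.memsw.limit_in_bytes was actually set, False if it
    was left at the kernel's 'unlimited' sentinel."""
    return read_limit(cgroup_dir, "memory.memsw.limit_in_bytes") < CGROUP_UNLIMITED
```

Run against the container directory from the output above, this would report that the memsw limit was never applied, despite the "Updated 'memory.memsw.limit_in_bytes'" log line.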
Problems with OOM
Hi everyone,

I am having issues with the cgroups isolation of Mesos. It seems like tasks are prevented from allocating more memory than their limit. However, they are never killed.

* My scheduled task allocates memory in a tight loop. According to 'ps', once its memory requirements are exceeded it is not killed, but ends up in state D (uninterruptible sleep, usually IO).
* The task is still considered running by Mesos.
* There is no indication of an OOM in dmesg.
* There is neither an OOM notice nor any other output related to the task in the slave log.
* According to htop, the system load is increased, with a significant portion of CPU time spent within the kernel. Commonly the load is so high that all ZooKeeper connections time out.

I am running Aurora and Mesos 0.20.1 using the cgroups isolation on Debian 7 (kernel 3.2.60-1+deb7u3).

Sorry for the somewhat unspecific error description. Still, anyone an idea what might be wrong here?

Thanks and Best Regards,
Stephan
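A task that "allocates memory in a tight loop" is easy to reproduce for testing. A tiny memory hog along these lines could be scheduled as the task: under a correctly enforced cgroup hard limit it should be OOM-killed almost immediately. The chunk size and the safety cap are illustrative; the cap only exists so the function also terminates when run outside a limited cgroup:

```python
def allocate_and_hold(chunk_mb=64, cap_mb=1024):
    """Allocate and retain memory chunk_mb at a time. Under a working cgroup
    hard limit the process should be OOM-killed long before reaching cap_mb."""
    chunks = []
    allocated_mb = 0
    while allocated_mb < cap_mb:
        # Building the bytes object writes every byte, so the pages are
        # actually touched and charged to the cgroup, not lazily mapped.
        chunks.append(b"x" * (chunk_mb * 1024 * 1024))
        allocated_mb += chunk_mb
    return allocated_mb
```

Watching this process in 'ps' while it runs makes the difference between "killed by the OOM killer" and "stuck in state D" directly observable.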
Mesos.interface python package
Hello,

could the owner of https://pypi.python.org/pypi/mesos.interface please be so kind and upload the latest version 0.20.1 to PyPI? Otherwise the (awesome) egg files by Mesosphere cannot be installed.

Thanks very much!
Stephan
Re: Problems with OOM
@Tomas: I am currently only running a single slave in a VM. It uses the isolator and the logs are clean.

@Tom: Thanks for the interesting hint! I will look into it.

Best Regards,
Stephan

On Fr 26 Sep 2014 16:53:22 CEST, Tom Arnfeld wrote:

I'm not sure if this is at all related to the issue you're seeing, but we ran into this fun issue (or at least this seems to be the cause), helpfully documented in this blog article:
http://blog.nitrous.io/2014/03/10/stability-and-a-linux-oom-killer-bug.html

TL;DR: the OOM killer got into an infinite loop, causing the CPU to spin out of control on our VMs. More details in this commit message to the OOM killer earlier this year:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=0c740d0afc3bff0a097ad03a1c8df92757516f5c

Hope this helps somewhat...

On 26 September 2014 14:15, Tomas Barton <barton.to...@gmail.com> wrote:

Just to make sure, all slaves are running with:

--isolation='cgroups/cpu,cgroups/mem'

Is there something suspicious in the mesos slave logs?

On 26 September 2014 13:20, Stephan Erb <stephan@blue-yonder.com> wrote:

Hi everyone,

I am having issues with the cgroups isolation of Mesos. It seems like tasks are prevented from allocating more memory than their limit. However, they are never killed.

* My scheduled task allocates memory in a tight loop. According to 'ps', once its memory requirements are exceeded it is not killed, but ends up in state D (uninterruptible sleep, usually IO).
* The task is still considered running by Mesos.
* There is no indication of an OOM in dmesg.
* There is neither an OOM notice nor any other output related to the task in the slave log.
* According to htop, the system load is increased, with a significant portion of CPU time spent within the kernel. Commonly the load is so high that all ZooKeeper connections time out.

I am running Aurora and Mesos 0.20.1 using the cgroups isolation on Debian 7 (kernel 3.2.60-1+deb7u3).

Sorry for the somewhat unspecific error description. Still, anyone an idea what might be wrong here?

Thanks and Best Regards,
Stephan
Re: Mesos 12.04 Python2.7 Egg
Did you find a solution for your question? I am currently having similar issues when trying to run the thermos executor on Debian 7, which doesn't ship GLIBC 2.16 either.

Seems like we have to patch the Aurora build process (probably in 3rdparty/python/BUILD) to download the correct eggs from mesosphere.io instead of using the default ones on PyPI. Does anyone have experience in how to do this?

Thanks,
Stephan

On Sa 30 Aug 2014 08:08:24 CEST, Joe Smith wrote:

Howdy all,

I'm migrating Apache Aurora (http://aurora.incubator.apache.org/) to Mesos 0.20.0 [1][2], but am having an issue using the published dist on PyPI (https://pypi.python.org/pypi/mesos.native/0.20.0). I also tried the Mesosphere-provided (thank you!) egg for Ubuntu 12.04 (http://mesosphere.io/downloads/#apache-mesos-0.20.0), and am getting the same stack trace:

vagrant@192:~$ PYTHONPATH=/home/vagrant/.pex/install/mesos.native-0.20.0-py2.7-linux-x86_64.egg.be6632b790cd03172f858e7f875cdab4ef415ca5/mesos.native-0.20.0-py2.7-linux-x86_64.egg/mesos/ python2.7
Python 2.7.3 (default, Feb 27 2014, 19:58:35)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import mesos
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named mesos
>>> import native
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/vagrant/.pex/install/mesos.native-0.20.0-py2.7-linux-x86_64.egg.be6632b790cd03172f858e7f875cdab4ef415ca5/mesos.native-0.20.0-py2.7-linux-x86_64.egg/mesos/native/__init__.py", line 17, in <module>
    from ._mesos import MesosExecutorDriverImpl
ImportError: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.16' not found (required by /home/vagrant/.pex/install/mesos.native-0.20.0-py2.7-linux-x86_64.egg.be6632b790cd03172f858e7f875cdab4ef415ca5/mesos.native-0.20.0-py2.7-linux-x86_64.egg/mesos/native/_mesos.so)

It looks like the issue is it was built with a non-standard glibc (if I'm following right):

vagrant@192:~/mesos-0.20.0$ /lib/x86_64-linux-gnu/libc.so.6 | grep release\ version
GNU C Library (Ubuntu EGLIBC 2.15-0ubuntu10) stable release version 2.15, by Roland McGrath et al.

Any feedback or suggestions would be greatly appreciated!

Thanks,
Joe

[1] https://reviews.apache.org/r/25208/
[2] https://issues.apache.org/jira/browse/AURORA-674
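One way to fail earlier, with a clearer message than the ImportError above, is to compare the host's glibc against the version the egg needs before importing it. A standard-library sketch; the required version "2.16" is taken from the traceback above, and the check conservatively treats an unknown or non-glibc libc as too old:

```python
import platform

def glibc_at_least(required, libc=None):
    """Check the running glibc against a required "major.minor" string.

    libc defaults to platform.libc_ver(), which returns e.g.
    ('glibc', '2.15') on Linux and ('', '') on other platforms.
    """
    name, version = libc if libc is not None else platform.libc_ver()
    if name != "glibc" or not version:
        return False  # unknown libc: assume incompatible
    have = tuple(int(part) for part in version.split("."))
    need = tuple(int(part) for part in required.split("."))
    return have >= need
```

Calling glibc_at_least("2.16") at startup would let a launcher print "this egg needs GLIBC 2.16, host has 2.15" instead of dying inside _mesos.so.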
Pitfalls when writing custom Frameworks
Hi everybody,

I would like to assess the effort required to write a custom framework.

Background: We have an application where we can start a flexible number of long-running worker processes performing number-crunching. The more processes the better. However, we have multiple users, each running an instance of the application and therefore competing for resources (as each tries to run as many worker processes as possible). For various reasons, we would like to run our application instances on top of Mesos.

There seem to be two ways to achieve this:

A. Write a custom framework for our application that spawns the worker processes on demand. Each user gets to run one framework instance. We also need preemption of workers to achieve equality among frameworks. We could achieve this using an external entity monitoring all frameworks and telling the worst offenders to scale down a little.

B. Instead of writing a framework, use a service scheduler like Marathon, Aurora or Singularity to spawn the worker processes. Instead of just performing the scale-down, the external entity would dictate the number of worker processes for each application depending on its demand.

The first choice seems to be the natural fit for Mesos. However, existing frameworks like Aurora seem to be battle-tested with regard to high availability, race conditions, and issues like state reconciliation, where the world views of scheduler and slaves drift apart.

So this question boils down to: When considering writing a custom framework, which pitfalls do I have to be aware of? Can I get away with blindly implementing the scheduler API? Or do I always have to implement things like custom state reconciliation in order to prevent orphaned tasks on slaves (for example, when my framework scheduler crashes or is temporarily unavailable)?

Thanks for your input!

Best Regards,
Stephan
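To make the reconciliation pitfall concrete: at its core, detecting orphans after a scheduler failover is a set comparison between the tasks the scheduler's persisted state claims to own and the tasks the master reports as running. A framework-agnostic sketch (the function and parameter names are illustrative; a real scheduler would obtain the reported set via the scheduler API's explicit task reconciliation and would also account for in-flight launches):

```python
def classify_tasks(expected_tasks, reported_tasks):
    """Split tasks into orphans and missing after a scheduler failover.

    expected_tasks: task ids the scheduler's persisted state says should run.
    reported_tasks: task ids the master currently reports as running.
    Returns (orphans, missing): orphans run on slaves but are unknown to the
    scheduler (candidates for killing); missing were believed to be running
    but are gone (candidates for relaunch).
    """
    expected = set(expected_tasks)
    reported = set(reported_tasks)
    return reported - expected, expected - reported
```

The bookkeeping itself is trivial; the hard part a custom framework has to get right is making "expected" durable across scheduler crashes and tolerating stale or partial "reported" views, which is exactly where battle-tested frameworks like Aurora have already paid the cost.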