Hi Haosdent,

Thanks for your reply.

Regarding executor_shutdown_grace_period: how would this enable the task (MongoDB) to terminate gracefully? (BTW: I am fairly certain that the mongo STDOUT as captured by Mesos shows that it received signal 15 just before it said good-bye.) My naive understanding of this grace period is that it simply delays the termination of the executor.
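To make my question concrete, here is my (possibly naive) mental model of the flag; the 60secs value below is only an example, not something we actually run:

    # On each agent at launch (Mesos duration syntax):
    mesos-slave --executor_shutdown_grace_period=60secs ...

    # My reading: when the agent asks an executor to shut down, the task
    # receives SIGTERM (15), which matches what I saw in mongo's STDOUT,
    # and anything still running when the grace period expires is killed
    # with SIGKILL (9). If that is right, a longer period only helps if
    # mongod needs more than the default 5 seconds to exit cleanly.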
The following snippet is from /var/log/syslog. I believe it shows the stack trace (largely in the kernel) that led to mesos-master being blocked for more than 120 seconds. Please note that immediately above (before) the blocked mesos-master is a blocked jbd2/dm task, and immediately below (after) it is a blocked java task. I'm not sure what the java task is. This took place on the mesos-master node, and none of our applications runs there; it runs only master, Marathon, and ZK. Maybe the java task is Marathon or ZK?

Thanks again.

-Paul

May 16 20:06:53 71 kernel: [193339.890848] INFO: task mesos-master:4013 blocked for more than 120 seconds.
May 16 20:06:53 71 kernel: [193339.890873] Not tainted 3.13.0-32-generic #57-Ubuntu
May 16 20:06:53 71 kernel: [193339.890889] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 16 20:06:53 71 kernel: [193339.890912] mesos-master D ffff88013fd94440 0 4013 1 0x00000000
May 16 20:06:53 71 kernel: [193339.890914] ffff880137429a28 0000000000000002 ffff880135778000 ffff880137429fd8
May 16 20:06:53 71 kernel: [193339.890916] 0000000000014440 0000000000014440 ffff880135778000 ffff88013fd94cd8
May 16 20:06:53 71 kernel: [193339.890918] ffff88013ffd34b0 0000000000000002 ffffffff81284630 ffff880137429aa0
May 16 20:06:53 71 kernel: [193339.890919] Call Trace:
May 16 20:06:53 71 kernel: [193339.890922] [<ffffffff81284630>] ? start_this_handle+0x590/0x590
May 16 20:06:53 71 kernel: [193339.890924] [<ffffffff817203fd>] io_schedule+0x9d/0x140
May 16 20:06:53 71 kernel: [193339.890925] [<ffffffff8128463e>] sleep_on_shadow_bh+0xe/0x20
May 16 20:06:53 71 kernel: [193339.890927] [<ffffffff81720882>] __wait_on_bit+0x62/0x90
May 16 20:06:53 71 kernel: [193339.890929] [<ffffffff81284630>] ? start_this_handle+0x590/0x590
May 16 20:06:53 71 kernel: [193339.890930] [<ffffffff81720927>] out_of_line_wait_on_bit+0x77/0x90
May 16 20:06:53 71 kernel: [193339.890932] [<ffffffff810aaf40>] ? autoremove_wake_function+0x40/0x40
May 16 20:06:53 71 kernel: [193339.890934] [<ffffffff810aab15>] ? wake_up_bit+0x25/0x30
May 16 20:06:53 71 kernel: [193339.890936] [<ffffffff81285a3d>] do_get_write_access+0x2ad/0x4f0
May 16 20:06:53 71 kernel: [193339.890938] [<ffffffff811ef90d>] ? __getblk+0x2d/0x2e0
May 16 20:06:53 71 kernel: [193339.890939] [<ffffffff81285ca7>] jbd2_journal_get_write_access+0x27/0x40
May 16 20:06:53 71 kernel: [193339.890942] [<ffffffff8126c37b>] __ext4_journal_get_write_access+0x3b/0x80
May 16 20:06:53 71 kernel: [193339.890946] [<ffffffff81242200>] ext4_reserve_inode_write+0x70/0xa0
May 16 20:06:53 71 kernel: [193339.890948] [<ffffffff81245490>] ? ext4_dirty_inode+0x40/0x60
May 16 20:06:53 71 kernel: [193339.890949] [<ffffffff81242274>] ext4_mark_inode_dirty+0x44/0x1f0
May 16 20:06:53 71 kernel: [193339.890951] [<ffffffff81245490>] ext4_dirty_inode+0x40/0x60
May 16 20:06:53 71 kernel: [193339.890953] [<ffffffff811e65aa>] __mark_inode_dirty+0x10a/0x2d0
May 16 20:06:53 71 kernel: [193339.890956] [<ffffffff811d7761>] update_time+0x81/0xd0
May 16 20:06:53 71 kernel: [193339.890957] [<ffffffff811d7970>] file_update_time+0x80/0xd0
May 16 20:06:53 71 kernel: [193339.890961] [<ffffffff8114fc40>] __generic_file_aio_write+0x180/0x3d0
May 16 20:06:53 71 kernel: [193339.890963] [<ffffffff8114fee8>] generic_file_aio_write+0x58/0xa0
May 16 20:06:53 71 kernel: [193339.890965] [<ffffffff81239ea9>] ext4_file_write+0x99/0x400
May 16 20:06:53 71 kernel: [193339.890967] [<ffffffff8109a7f0>] ? wake_up_state+0x10/0x20
May 16 20:06:53 71 kernel: [193339.890970] [<ffffffff810d7c56>] ? wake_futex+0x66/0x90
May 16 20:06:53 71 kernel: [193339.890972] [<ffffffff810d8c01>] ? futex_wake+0x1b1/0x1d0
May 16 20:06:53 71 kernel: [193339.890974] [<ffffffff811bc3da>] do_sync_write+0x5a/0x90
May 16 20:06:53 71 kernel: [193339.890976] [<ffffffff811bcb64>] vfs_write+0xb4/0x1f0
May 16 20:06:53 71 kernel: [193339.890978] [<ffffffff811bd599>] SyS_write+0x49/0xa0
May 16 20:06:53 71 kernel: [193339.890980] [<ffffffff8172c87f>] tracesys+0xe1/0xe6
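For concreteness, the two remediation steps I describe in my quoted message below amounted to roughly the following. This is a sketch from memory; the sysctl persistence step and the master flag spellings are my reconstruction, so please correct me if they are off.

    # Flush dirty pages earlier and in smaller batches (new values 5 and 10):
    sudo sysctl -w vm.dirty_background_ratio=5
    sudo sysctl -w vm.dirty_ratio=10
    # (plus matching entries in /etc/sysctl.conf so the settings survive reboots)

    # Tolerate about 10 minutes of missed pings (20 pings x 30 seconds)
    # before the master declares an agent lost; mesos-master flags:
    mesos-master --slave_ping_timeout=30secs --max_slave_ping_timeouts=20 ...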
On Wed, May 18, 2016 at 2:33 AM, haosdent <haosd...@gmail.com> wrote:

> >Is there some way to be given control (a callback, or an "exit" routine) so that the container about to be nuked can be given a chance to exit gracefully?
>
> The default value of executor_shutdown_grace_period is 5 seconds; you could change it by specifying the `--executor_shutdown_grace_period` flag when launching the mesos agent.
>
> >Are there other steps I can take to avoid this mildly calamitous occurrence?
> >mesos-slaves get shut down
>
> Do you know where your mesos-master got stuck when this happens? Any error log or related log about this? In addition, is there any log from when mesos-slave shut down?
>
> On Wed, May 18, 2016 at 6:12 AM, Paul Bell <arach...@gmail.com> wrote:
>
>> Hi All,
>>
>> I probably have the following account partly wrong, but let me present it just the same, and those who know better can correct me as needed.
>>
>> I've an application that runs several MongoDB shards, each a Dockerized container, each on a distinct node (VM); in fact, some of the VMs are on separate ESXi hosts.
>>
>> I've lately seen situations where, because of very slow disks for the database, the following sequence occurs (I think):
>>
>> 1. The Linux (Ubuntu 14.04 LTS) virtual memory manager hits the thresholds defined by vm.dirty_background_ratio and/or vm.dirty_ratio (probably both).
>> 2. Synchronous flushing of many, many pages occurs, writing to a slow disk.
>> 3. (Around this time one might see in /var/log/syslog "task X blocked for more than 120 seconds" for all kinds of tasks, including mesos-master.)
>> 4. The mesos-slaves get shut down (this is the part I'm unclear about, but I am quite certain that on 2 nodes the executors and their in-flight MongoDB tasks got zapped, because I can see that Marathon restarted them).
>>
>> The consequence of this is a corrupt MongoDB database. In the case at hand, the job had run for over 50 hours, processing close to 120 million files.
>>
>> Steps I've taken so far to remedy this include:
>>
>> - Tune vm.dirty_background_ratio and vm.dirty_ratio down, respectively, to 5 and 10 (from 10 and 20). The intent here is to tolerate more frequent, smaller flushes and thus avoid less frequent massive flushes that suspend threads for very long periods.
>> - Increase the agent ping timeout to 10 minutes (every 30 seconds, 20 times).
>>
>> So the questions are:
>>
>> - Is there some way to be given control (a callback, or an "exit" routine) so that the container about to be nuked can be given a chance to exit gracefully?
>> - Are there other steps I can take to avoid this mildly calamitous occurrence?
>> - (Also, I'd be grateful for more clarity on anything in steps 1-4 above that is a bit hand-wavy!)
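(Re the first question just above: one pattern I am considering, sketched below and untested, is a wrapper entrypoint that traps the SIGTERM delivered at the start of the grace period and forwards it to mongod so it can shut down cleanly. The paths and flags are placeholders, not our actual configuration.)

    #!/bin/sh
    # Hypothetical container entrypoint. Run mongod in the background,
    # forward SIGTERM to it, and wait for it to exit before the executor
    # shutdown grace period runs out.
    trap 'kill -TERM "$child"' TERM INT
    mongod --config /etc/mongod.conf &
    child=$!
    wait "$child"   # returns early when the trapped signal arrives...
    wait "$child"   # ...so wait again for mongod to actually exit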
>>
>> As always, thanks.
>>
>> -Paul
>>
>
> --
> Best Regards,
> Haosdent Huang