> On May 18, 2016, at 10:44 AM, haosdent <haosd...@gmail.com> wrote:
> 
> >In re executor_shutdown_grace_period: how would this enable the task 
> >(MongoDB) to terminate gracefully? (BTW: I am fairly certain that the mongo 
> >STDOUT as captured by Mesos shows that it received signal 15 just before it 
> >said good-bye). My naive understanding of this grace period is that it 
> >simply delays the termination of the executor.

I'm not 100% sure this is related or helpful, but be aware that we believe 
there is a bug in the Docker containerizer's handling of logs during shutdown:

https://issues.apache.org/jira/browse/MESOS-5195

We spent a lot of time debugging why our application was not shutting down as 
we expected, only to find that the real problem was that Mesos was losing all 
logs sent during shutdown.

> 
> If you use the DockerContainerizer, Mesos uses executor_shutdown_grace_period 
> as the graceful-shutdown timeout for the task as well. If you use the 
> MesosContainerizer, it sends SIGTERM (15) first. After 3 seconds, if the 
> task is still alive, Mesos sends SIGKILL (9) to the task.
> 
> >I'm not sure what the java task is. This took place on the mesos-master node 
> >and none of our applications runs there. It runs master, Marathon, and ZK. 
> >Maybe the java task is Marathon or ZK?
> 
> Not sure about this; maybe others have had a similar experience. Were 
> Marathon or ZooKeeper behaving abnormally at that time? Could you also 
> provide the mesos-master/mesos-slave logs from when the incident happened?
> 
> 
> On Wed, May 18, 2016 at 7:11 PM, Paul Bell <arach...@gmail.com> wrote:
> Hi Haosdent,
> 
> Thanks for your reply.
> 
> In re executor_shutdown_grace_period: how would this enable the task 
> (MongoDB) to terminate gracefully? (BTW: I am fairly certain that the mongo 
> STDOUT as captured by Mesos shows that it received signal 15 just before it 
> said good-bye). My naive understanding of this grace period is that it simply 
> delays the termination of the executor.
> 
> The following snippet is from /var/log/syslog. I believe it shows the stack 
> trace (largely in the kernel) that led to mesos-master being blocked for more 
> than 120 seconds. Please note that immediately above (before) the blocked 
> mesos-master is a blocked jbd2/dm. Immediately below (after) the blocked 
> mesos-master is a blocked java task. I'm not sure what the java task is. This 
> took place on the mesos-master node and none of our applications runs there. 
> It runs master, Marathon, and ZK. Maybe the java task is Marathon or ZK?
> 
> Thanks again.
> 
> -Paul
> May 16 20:06:53 71 kernel: [193339.890848] INFO: task mesos-master:4013 blocked for more than 120 seconds.
> May 16 20:06:53 71 kernel: [193339.890873]       Not tainted 3.13.0-32-generic #57-Ubuntu
> May 16 20:06:53 71 kernel: [193339.890889] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> May 16 20:06:53 71 kernel: [193339.890912] mesos-master    D ffff88013fd94440     0  4013      1 0x00000000
> May 16 20:06:53 71 kernel: [193339.890914]  ffff880137429a28 0000000000000002 ffff880135778000 ffff880137429fd8
> May 16 20:06:53 71 kernel: [193339.890916]  0000000000014440 0000000000014440 ffff880135778000 ffff88013fd94cd8
> May 16 20:06:53 71 kernel: [193339.890918]  ffff88013ffd34b0 0000000000000002 ffffffff81284630 ffff880137429aa0
> May 16 20:06:53 71 kernel: [193339.890919] Call Trace:
> May 16 20:06:53 71 kernel: [193339.890922]  [<ffffffff81284630>] ? start_this_handle+0x590/0x590
> May 16 20:06:53 71 kernel: [193339.890924]  [<ffffffff817203fd>] io_schedule+0x9d/0x140
> May 16 20:06:53 71 kernel: [193339.890925]  [<ffffffff8128463e>] sleep_on_shadow_bh+0xe/0x20
> May 16 20:06:53 71 kernel: [193339.890927]  [<ffffffff81720882>] __wait_on_bit+0x62/0x90
> May 16 20:06:53 71 kernel: [193339.890929]  [<ffffffff81284630>] ? start_this_handle+0x590/0x590
> May 16 20:06:53 71 kernel: [193339.890930]  [<ffffffff81720927>] out_of_line_wait_on_bit+0x77/0x90
> May 16 20:06:53 71 kernel: [193339.890932]  [<ffffffff810aaf40>] ? autoremove_wake_function+0x40/0x40
> May 16 20:06:53 71 kernel: [193339.890934]  [<ffffffff810aab15>] ? wake_up_bit+0x25/0x30
> May 16 20:06:53 71 kernel: [193339.890936]  [<ffffffff81285a3d>] do_get_write_access+0x2ad/0x4f0
> May 16 20:06:53 71 kernel: [193339.890938]  [<ffffffff811ef90d>] ? __getblk+0x2d/0x2e0
> May 16 20:06:53 71 kernel: [193339.890939]  [<ffffffff81285ca7>] jbd2_journal_get_write_access+0x27/0x40
> May 16 20:06:53 71 kernel: [193339.890942]  [<ffffffff8126c37b>] __ext4_journal_get_write_access+0x3b/0x80
> May 16 20:06:53 71 kernel: [193339.890946]  [<ffffffff81242200>] ext4_reserve_inode_write+0x70/0xa0
> May 16 20:06:53 71 kernel: [193339.890948]  [<ffffffff81245490>] ? ext4_dirty_inode+0x40/0x60
> May 16 20:06:53 71 kernel: [193339.890949]  [<ffffffff81242274>] ext4_mark_inode_dirty+0x44/0x1f0
> May 16 20:06:53 71 kernel: [193339.890951]  [<ffffffff81245490>] ext4_dirty_inode+0x40/0x60
> May 16 20:06:53 71 kernel: [193339.890953]  [<ffffffff811e65aa>] __mark_inode_dirty+0x10a/0x2d0
> May 16 20:06:53 71 kernel: [193339.890956]  [<ffffffff811d7761>] update_time+0x81/0xd0
> May 16 20:06:53 71 kernel: [193339.890957]  [<ffffffff811d7970>] file_update_time+0x80/0xd0
> May 16 20:06:53 71 kernel: [193339.890961]  [<ffffffff8114fc40>] __generic_file_aio_write+0x180/0x3d0
> May 16 20:06:53 71 kernel: [193339.890963]  [<ffffffff8114fee8>] generic_file_aio_write+0x58/0xa0
> May 16 20:06:53 71 kernel: [193339.890965]  [<ffffffff81239ea9>] ext4_file_write+0x99/0x400
> May 16 20:06:53 71 kernel: [193339.890967]  [<ffffffff8109a7f0>] ? wake_up_state+0x10/0x20
> May 16 20:06:53 71 kernel: [193339.890970]  [<ffffffff810d7c56>] ? wake_futex+0x66/0x90
> May 16 20:06:53 71 kernel: [193339.890972]  [<ffffffff810d8c01>] ? futex_wake+0x1b1/0x1d0
> May 16 20:06:53 71 kernel: [193339.890974]  [<ffffffff811bc3da>] do_sync_write+0x5a/0x90
> May 16 20:06:53 71 kernel: [193339.890976]  [<ffffffff811bcb64>] vfs_write+0xb4/0x1f0
> May 16 20:06:53 71 kernel: [193339.890978]  [<ffffffff811bd599>] SyS_write+0x49/0xa0
> May 16 20:06:53 71 kernel: [193339.890980]  [<ffffffff8172c87f>] tracesys+0xe1/0xe6
> 
> 
> 
> 
> On Wed, May 18, 2016 at 2:33 AM, haosdent <haosd...@gmail.com> wrote:
> >Is there some way to be given control (a callback, or an "exit" routine) so 
> >that the container about to be nuked can be given a chance to exit 
> >gracefully?
> The default value of executor_shutdown_grace_period is 5 seconds; you can 
> change it by specifying the `--executor_shutdown_grace_period` flag when 
> launching the Mesos agent.
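A sketch of what that looks like on the agent command line, assuming the agent is started directly (the master address is a placeholder; the duration syntax, e.g. `60secs`, follows Mesos flag conventions):

```shell
# Give executors and their tasks 60 seconds to shut down cleanly
# before they are forcibly killed (the default is 5secs).
mesos-slave --master=zk://zk1:2181/mesos \
            --executor_shutdown_grace_period=60secs
```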
> 
> >Are there other steps I can take to avoid this mildly calamitous occurrence?
> >mesos-slaves get shutdown
> Do you know where your mesos-master was stuck when this happened? Is there 
> any error log or related log about it? In addition, is there any log from 
> when the mesos-slave shut down?
> 
> On Wed, May 18, 2016 at 6:12 AM, Paul Bell <arach...@gmail.com> wrote:
> Hi All,
> 
> I probably have the following account partly wrong, but let me present it 
> just the same and those who know better can correct me as needed.
> 
> I've an application that runs several MongoDB shards, each a Dockerized 
> container, each on a distinct node (VM); in fact, some of the VMs are on 
> separate ESXi hosts.
> 
> I've lately seen situations where, because of very slow disks for the 
> database, the following sequence occurs (I think):
>       • Linux (Ubuntu 14.04 LTS) virtual memory manager hits thresholds 
> defined by vm.dirty_background_ratio and/or vm.dirty_ratio (probably both)
>       • Synchronous flushing of many, many pages occurs, writing to a slow 
> disk
>       • (Around this time one might see in /var/log/syslog "task X blocked 
> for more than 120 seconds" for all kinds of tasks, including mesos-master)
>       • mesos-slaves get shut down (this is the part I'm unclear about, but 
> I am quite certain that on 2 nodes the executors and their in-flight MongoDB 
> tasks got zapped, because I can see that Marathon restarted them).
> The consequence of this is a corrupt MongoDB database. In the case at hand, 
> the job had run for over 50 hours, processing close to 120 million files.
> 
> Steps I've taken so far to remedy include:
>       • tune vm.dirty_background_ratio and vm.dirty_ratio down to 5 and 10, 
> respectively (from 10 and 20). The intent here is to trigger more frequent, 
> smaller flushes and thus avoid the less frequent massive flushes that 
> suspend threads for very long periods.
>       • increase agent ping timeout to 10 minutes (every 30 seconds, 20 times)
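The two steps above could be applied persistently along these lines (a sketch, not a tested recipe: the sysctl file name is arbitrary, and the ping flags shown are the mesos-master flags that correspond to "every 30 seconds, 20 times"):

```shell
# Step 1: persist the lowered dirty-page thresholds across reboots.
cat > /etc/sysctl.d/60-dirty-pages.conf <<'EOF'
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
EOF
sysctl --system    # reload sysctl settings from all config files

# Step 2: on the master, tolerate roughly 10 minutes of missed agent
# pings before declaring the agent lost (20 pings x 30 seconds).
mesos-master --slave_ping_timeout=30secs \
             --max_slave_ping_timeouts=20
```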
> So the questions are:
>       • Is there some way to be given control (a callback, or an "exit" 
> routine) so that the container about to be nuked can be given a chance to 
> exit gracefully?
>       • Are there other steps I can take to avoid this mildly calamitous 
> occurrence?
>       • (Also, I'd be grateful for more clarity on anything in steps 1-4 
> above that is a bit hand-wavy!)
> As always, thanks.
> 
> -Paul
> 
> 
> 
> 
> 
> --
> Best Regards,
> Haosdent Huang
> 
> 
> 
> 
> --
> Best Regards,
> Haosdent Huang
