Re: [ceph-users] deep scrubbing causes osd down
On 13 April 2015 at 16:00, Christian Balzer ch...@gol.com wrote:

However the vast majority of people with production clusters will be running something stable, mostly Firefly at this moment.

Sorry, 0.87 is giant. BTW, you could also set osd_scrub_sleep on your cluster. Ceph will then sleep for the time you define after it has scrubbed a chunk of objects. But I am not sure whether it will work well for you.

Yeah, that bit is backported to Firefly and can definitely help. However, the suggested initial value is too small for most people who have scrub issues; starting with 0.5 seconds and seeing how it goes seems to work better.

Thanks xinze, Christian. Yah, I'm on 0.87 in production - I can wait for the next release :)

In the meantime, from the prior msgs I've set this:

[osd]
osd_scrub_chunk_min = 1
osd_scrub_chunk_max = 5
osd_scrub_sleep = 0.5

Do the values look ok? Is the [osd] section the right spot?

Thanks - Lindsay

--
Lindsay
___
ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
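For reference, those scrub knobs do belong in the [osd] section of ceph.conf (they are read by the OSD daemons), and whether a running OSD has actually picked them up can be checked through its admin socket. A minimal sketch, assuming osd.0 and the default socket path, both of which will differ per host:

    # /etc/ceph/ceph.conf -- applies to every OSD started on hosts with this file
    [osd]
    osd_scrub_chunk_min = 1
    osd_scrub_chunk_max = 5
    osd_scrub_sleep = 0.5

    # after restarting the OSD (or injecting the values), confirm they are live
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep osd_scrub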
Re: [ceph-users] deep scrubbing causes osd down
Sorry, I am not sure whether those values will look ok in your production environment. Maybe you could use the command: ceph tell osd.0 injectargs '--osd_scrub_sleep 0.5'. This command affects only one osd. If it works fine for some days, you could set it for all osds. This is just a suggestion.

2015-04-13 14:34 GMT+08:00 Lindsay Mathieson lindsay.mathie...@gmail.com:

On 13 April 2015 at 16:00, Christian Balzer ch...@gol.com wrote:

However the vast majority of people with production clusters will be running something stable, mostly Firefly at this moment.

Sorry, 0.87 is giant. BTW, you could also set osd_scrub_sleep on your cluster. Ceph will then sleep for the time you define after it has scrubbed a chunk of objects. But I am not sure whether it will work well for you.

Yeah, that bit is backported to Firefly and can definitely help. However, the suggested initial value is too small for most people who have scrub issues; starting with 0.5 seconds and seeing how it goes seems to work better.

Thanks xinze, Christian. Yah, I'm on 0.87 in production - I can wait for the next release :)

In the meantime, from the prior msgs I've set this:

[osd]
osd_scrub_chunk_min = 1
osd_scrub_chunk_max = 5
osd_scrub_sleep = 0.5

Do the values look ok? Is the [osd] section the right spot?

Thanks - Lindsay

--
Lindsay

--
Regards,
xinze
___
ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
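A short sketch of that try-one-then-roll-out approach; note that values injected this way are not persistent across a daemon restart, so anything that proves useful should also be written into ceph.conf:

    # test the new sleep value on a single OSD first
    ceph tell osd.0 injectargs '--osd_scrub_sleep 0.5'

    # if the cluster behaves for a few days, push it to all OSDs
    ceph tell osd.* injectargs '--osd_scrub_sleep 0.5'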
Re: [ceph-users] deep scrubbing causes osd down
hi, Loic: Do you think the patch https://github.com/ceph/ceph/pull/3318 is worth backporting to firefly and giant?

2015-04-13 14:00 GMT+08:00 Christian Balzer ch...@gol.com:

On Mon, 13 Apr 2015 13:42:39 +0800 池信泽 wrote:

I knew the scheduler was in the pipeline, good to see it made it in. However the vast majority of people with production clusters will be running something stable, mostly Firefly at this moment.

Sorry, 0.87 is giant. BTW, you could also set osd_scrub_sleep on your cluster. Ceph will then sleep for the time you define after it has scrubbed a chunk of objects. But I am not sure whether it will work well for you.

Yeah, that bit is backported to Firefly and can definitely help. However, the suggested initial value is too small for most people who have scrub issues; starting with 0.5 seconds and seeing how it goes seems to work better.

Christian

Thanks.

2015-04-13 13:30 GMT+08:00 池信泽 xmdx...@gmail.com:

hi, you could restrict scrub to certain times of day based on https://github.com/ceph/ceph/pull/3318. You could set osd_scrub_begin_hour and osd_scrub_end_hour to values which suit you. This feature is available since 0.93. But it has not been backported to 0.87 (hammer).

2015-04-13 12:55 GMT+08:00 Lindsay Mathieson lindsay.mathie...@gmail.com:

On 13 April 2015 at 11:02, Christian Balzer ch...@gol.com wrote:

Yeah, that's a request/question that comes up frequently. And so far there's no option in Ceph to do that (AFAIK). It would be really nice, along with scheduling options (don't scrub during peak hours), which have also been talked about.

I was just about to post a question on that ... :) Just had devs and support bitching to me that all their VMs were running like dogs (they were). Ceph decided to run a deep scrub Monday afternoon.

It boggles me that there are no options for controlling the schedule, considering how critically important the timing is. I'd be happy to run a deep scrub every night at 1am. Anytime during the day is a disaster. So currently I have noscrub and nodeep-scrub set, which is really less than ideal.

--
Lindsay
___
ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Regards,
xinze

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/

--
Regards,
xinze
___
ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
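For clusters already on a release that contains that patch (0.93 or later, as noted above), a minimal sketch of confining scrubs to a nightly window; the 1am-6am window is only an example:

    [osd]
    osd_scrub_begin_hour = 1
    osd_scrub_end_hour = 6

    # or injected on a test OSD at runtime
    ceph tell osd.0 injectargs '--osd_scrub_begin_hour 1 --osd_scrub_end_hour 6'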
Re: [ceph-users] deep scrubbing causes osd down
On Mon, 13 Apr 2015 13:42:39 +0800 池信泽 wrote:

I knew the scheduler was in the pipeline, good to see it made it in. However the vast majority of people with production clusters will be running something stable, mostly Firefly at this moment.

Sorry, 0.87 is giant. BTW, you could also set osd_scrub_sleep on your cluster. Ceph will then sleep for the time you define after it has scrubbed a chunk of objects. But I am not sure whether it will work well for you.

Yeah, that bit is backported to Firefly and can definitely help. However, the suggested initial value is too small for most people who have scrub issues; starting with 0.5 seconds and seeing how it goes seems to work better.

Christian

Thanks.

2015-04-13 13:30 GMT+08:00 池信泽 xmdx...@gmail.com:

hi, you could restrict scrub to certain times of day based on https://github.com/ceph/ceph/pull/3318. You could set osd_scrub_begin_hour and osd_scrub_end_hour to values which suit you. This feature is available since 0.93. But it has not been backported to 0.87 (hammer).

2015-04-13 12:55 GMT+08:00 Lindsay Mathieson lindsay.mathie...@gmail.com:

On 13 April 2015 at 11:02, Christian Balzer ch...@gol.com wrote:

Yeah, that's a request/question that comes up frequently. And so far there's no option in Ceph to do that (AFAIK). It would be really nice, along with scheduling options (don't scrub during peak hours), which have also been talked about.

I was just about to post a question on that ... :) Just had devs and support bitching to me that all their VMs were running like dogs (they were). Ceph decided to run a deep scrub Monday afternoon.

It boggles me that there are no options for controlling the schedule, considering how critically important the timing is. I'd be happy to run a deep scrub every night at 1am. Anytime during the day is a disaster. So currently I have noscrub and nodeep-scrub set, which is really less than ideal.

--
Lindsay
___
ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Regards,
xinze

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] deep scrubbing causes osd down
On 13 April 2015 at 11:02, Christian Balzer ch...@gol.com wrote:

Yeah, that's a request/question that comes up frequently. And so far there's no option in Ceph to do that (AFAIK). It would be really nice, along with scheduling options (don't scrub during peak hours), which have also been talked about.

I was just about to post a question on that ... :) Just had devs and support bitching to me that all their VMs were running like dogs (they were). Ceph decided to run a deep scrub Monday afternoon.

It boggles me that there are no options for controlling the schedule, considering how critically important the timing is. I'd be happy to run a deep scrub every night at 1am. Anytime during the day is a disaster. So currently I have noscrub and nodeep-scrub set, which is really less than ideal.

--
Lindsay
___
ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
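Until such scheduling options exist, one workaround that comes up later in this thread is to keep nodeep-scrub set and drive deep scrubs from cron during quiet hours. A rough sketch using a hypothetical helper script; note that ceph osd deep-scrub queues a deep scrub of the PGs that OSD is primary for, so a full pass can still take a while on a large cluster:

    #!/bin/sh
    # /usr/local/sbin/deep-scrub-all.sh (hypothetical name) -- ask every OSD to deep scrub
    for osd in $(ceph osd ls); do
        ceph osd deep-scrub "$osd"
    done

    # /etc/cron.d/ceph-deep-scrub -- kick it off at 1am every night
    0 1 * * * root /usr/local/sbin/deep-scrub-all.sh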
Re: [ceph-users] deep scrubbing causes osd down
Hello,

On Sun, 12 Apr 2015 22:01:06 +0100 (BST) Andrei Mikhailovsky wrote:

JC, Thanks. I think the max scrub option that you refer to is a value per osd and not per cluster. So, the default is not to run more than 1 scrub per osd. So, if you have 100 osds, by default it will not run more than 100 scrub processes at the same time. However, I want to limit the number on a cluster basis rather than on an osd basis.

Yeah, that's a request/question that comes up frequently. And so far there's no option in Ceph to do that (AFAIK). It would be really nice, along with scheduling options (don't scrub during peak hours), which have also been talked about.

So for you right now, the manual scheduling is the best solution, as pointed out by Jean-Charles. If your cluster is fast enough to finish a full deep-scrub in a night, you can even forgo the cron jobs and setting the cluster to nodeep-scrub.

I have a cluster where I kicked off a manual deep-scrub of all OSDs on a Saturday morning at 0:05 and it finished at 3:25. So now, with the default scrub intervals, all subsequent deep scrubs happen in the same time frame (weekly) and all normal scrubs in the same time frame as well (daily, unless it's deep scrub day).

Of course ultimately you really want to get your cluster into a shape where scrubbing doesn't make it unstable, by improving your hardware and scale.

Christian

Andrei

- Original Message - From: Jean-Charles Lopez jelo...@redhat.com To: Andrei Mikhailovsky and...@arhont.com Cc: ceph-users@lists.ceph.com Sent: Sunday, 12 April, 2015 5:17:10 PM Subject: Re: [ceph-users] deep scrubbing causes osd down

Hi andrei

There is one parameter, osd_max_scrub I think, that controls the number of scrubs per OSD. But the default is 1 if I'm correct. Can you check on one of your OSDs with the admin socket? Then it remains the option of scheduling the deep scrubs via a cron job after setting nodeep-scrub to prevent automatic deep scrubbing. Dan Van Der Ster had a post on this ML regarding this.

JC

While moving. Excuse unintended typos.

On Apr 12, 2015, at 05:21, Andrei Mikhailovsky and...@arhont.com wrote:

JC, the restart of the osd servers seems to have stabilised the cluster. It has been a few hours since the restart and I haven't seen a single osd disconnect.

Is there a way to limit the total number of scrub and/or deep-scrub processes running at the same time? For instance, I do not want to have more than 1 or 2 scrub/deep-scrubs running at the same time on my cluster. How do I implement this?

Thanks

Andrei

- Original Message - From: Andrei Mikhailovsky and...@arhont.com To: LOPEZ Jean-Charles jelo...@redhat.com Cc: ceph-users@lists.ceph.com Sent: Sunday, 12 April, 2015 9:02:05 AM Subject: Re: [ceph-users] deep scrubbing causes osd down

JC, I've implemented the following changes to the ceph.conf and restarted mons and osds.

osd_scrub_chunk_min = 1
osd_scrub_chunk_max = 5

Things have become considerably worse after the changes. Shortly after doing that, the majority of osd processes started taking up over 100% cpu and the cluster has considerably slowed down. All my vms are reporting high IO wait (between 30-80%), even vms which are pretty idle and don't do much. I have tried restarting all osds, but shortly after the restart the cpu usage goes up.
The osds are showing the following logs: 2015-04-12 08:39:28.853860 7f96f81dd700 0 log_channel(default) log [WRN] : slow request 60.277590 seconds old, received at 2015-04-12 08:38:28.576168: osd_op(client.69637439.0:290325926 rbd_data.265f967a5f7514.4a00 [set-alloc-hint object_size 4194304 write_size 4194304,write 1249280~4096] 5.cb2620e0 snapc ac=[ac] ack+ondisk+write+known_if_redirected e74834) currently waiting for missing object 2015-04-12 08:39:28.853863 7f96f81dd700 0 log_channel(default) log [WRN] : slow request 60.246943 seconds old, received at 2015-04-12 08:38:28.606815: osd_op(client.69637439.0:290325927 rbd_data.265f967a5f7514.4a00 [set-alloc-hint object_size 4194304 write_size 4194304,write 1310720~4096] 5.cb2620e0 snapc ac=[ac] ack+ondisk+write+known_if_redirected e74834) currently waiting for missing object 2015-04-12 08:39:36.855180 7f96f81dd700 0 log_channel(default) log [WRN] : 7 slow requests, 1 included below; oldest blocked for 68.278951 secs 2015-04-12 08:39:36.855191 7f96f81dd700 0 log_channel(default) log [WRN] : slow request 30.268450 seconds old, received at 2015-04-12 08:39:06.586669: osd_op(client.64965167.0:1607510 rbd_data.1f264b2ae8944a.0228 [set-alloc
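Following on from Christian's point above about all subsequent deep scrubs landing in the same time frame, a quick way to see how the deep-scrub timestamps are currently distributed is to pull them out of ceph pg dump. The column layout differs between releases, so this is only a sketch; check the header line first and adjust the field index:

    # print the header to locate the deep_scrub_stamp column
    ceph pg dump 2>/dev/null | head -2

    # on releases where deep_scrub_stamp is the last column, count PGs per deep-scrub date
    ceph pg dump 2>/dev/null | grep '^[0-9]' | awk '{print $(NF-1)}' | sort | uniq -c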
Re: [ceph-users] deep scrubbing causes osd down
Sorry, 0.87 is giant. BTW, you could also set osd_scrub_sleep on your cluster. Ceph will then sleep for the time you define after it has scrubbed a chunk of objects. But I am not sure whether it will work well for you.

Thanks.

2015-04-13 13:30 GMT+08:00 池信泽 xmdx...@gmail.com:

hi, you could restrict scrub to certain times of day based on https://github.com/ceph/ceph/pull/3318. You could set osd_scrub_begin_hour and osd_scrub_end_hour to values which suit you. This feature is available since 0.93. But it has not been backported to 0.87 (hammer).

2015-04-13 12:55 GMT+08:00 Lindsay Mathieson lindsay.mathie...@gmail.com:

On 13 April 2015 at 11:02, Christian Balzer ch...@gol.com wrote:

Yeah, that's a request/question that comes up frequently. And so far there's no option in Ceph to do that (AFAIK). It would be really nice, along with scheduling options (don't scrub during peak hours), which have also been talked about.

I was just about to post a question on that ... :) Just had devs and support bitching to me that all their VMs were running like dogs (they were). Ceph decided to run a deep scrub Monday afternoon.

It boggles me that there are no options for controlling the schedule, considering how critically important the timing is. I'd be happy to run a deep scrub every night at 1am. Anytime during the day is a disaster. So currently I have noscrub and nodeep-scrub set, which is really less than ideal.

--
Lindsay
___
ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Regards,
xinze

--
Regards,
xinze
___
ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] deep scrubbing causes osd down
JC, I've implemented the following changes to the ceph.conf and restarted mons and osds. osd_scrub_chunk_min = 1 osd_scrub_chunk_max =5 Things have become considerably worse after the changes. Shortly after doing that, majority of osd processes started taking up over 100% cpu and the cluster has considerably slowed down. All my vms are reporting high IO wait (between 30-80%), even vms which are pretty idle and don't do much. i have tried restarting all osds, but shortly after the restart the cpu usage goes up. The osds are showing the following logs: 2015-04-12 08:39:28.853860 7f96f81dd700 0 log_channel(default) log [WRN] : slow request 60.277590 seconds old, received at 2015-04-12 08:38:28.576168: osd_op(client.69637439.0:290325926 rbd_data.265f967a5f7514.4a00 [set-alloc-hint object_size 4194304 write_size 4194304,write 1249280~4096] 5.cb2620e0 snapc ac=[ac] ack+ondisk+write+known_if_redirected e74834) currently waiting for missing object 2015-04-12 08:39:28.853863 7f96f81dd700 0 log_channel(default) log [WRN] : slow request 60.246943 seconds old, received at 2015-04-12 08:38:28.606815: osd_op(client.69637439.0:290325927 rbd_data.265f967a5f7514.4a00 [set-alloc-hint object_size 4194304 write_size 4194304,write 1310720~4096] 5.cb2620e0 snapc ac=[ac] ack+ondisk+write+known_if_redirected e74834) currently waiting for missing object 2015-04-12 08:39:36.855180 7f96f81dd700 0 log_channel(default) log [WRN] : 7 slow requests, 1 included below; oldest blocked for 68.278951 secs 2015-04-12 08:39:36.855191 7f96f81dd700 0 log_channel(default) log [WRN] : slow request 30.268450 seconds old, received at 2015-04-12 08:39:06.586669: osd_op(client.64965167.0:1607510 rbd_data.1f264b2ae8944a.0228 [set-alloc-hint object_size 4194304 write_size 4194304,write 3584000~69632] 5.30418007 ack+ondisk+write+known_if_redirected e74834) currently waiting for subops from 9 2015-04-12 08:40:43.570004 7f96dd693700 0 cls cls/rgw/cls_rgw.cc:1458: gc_iterate_entries end_key=1_01428824443.569998000 [In total i've got around 40,000 slow request entries accumulated overnight ((( ] On top of that, I have reports of osds going down and back up as frequently as every 10-20 minutes. This effects all osds and not a particular set of osds. I will restart the osd servers to see if it makes a difference, otherwise, I will need to revert back to the default settings as the cluster as it currently is is not functional. Andrei - Original Message - From: LOPEZ Jean-Charles jelo...@redhat.com To: Andrei Mikhailovsky and...@arhont.com Cc: LOPEZ Jean-Charles jelo...@redhat.com, ceph-users@lists.ceph.com Sent: Saturday, 11 April, 2015 7:54:18 PM Subject: Re: [ceph-users] deep scrubbing causes osd down Hi Andrei, 1) what ceph version are you running? 2) what distro and version are you running? 3) have you checked the disk elevator for the OSD devices to be set to cfq? 4) Have have you considered exploring the following parameters to further tune - osd_scrub_chunk_min lower the default value of 5. e.g. = 1 - osd_scrub_chunk_max lower the default value of 25. e.g. = 5 - osd_deep_scrub_stride If you have lowered parameters above, you can play with this one to fit best your physical disk behaviour. - osd_scrub_sleep introduce a half second sleep between 2 scrubs; e.g. = 0.5 to start with a half second delay Cheers JC On 10 Apr 2015, at 12:01, Andrei Mikhailovsky and...@arhont.com wrote: Hi guys, I was wondering if anyone noticed that the deep scrubbing process causes some osd to go down? 
I have been keeping an eye on a few remaining stability issues in my test cluster. One of the unsolved issues is the occasional reporting of osd(s) going down and coming back up after about 20-30 seconds. This happens to various osds throughout the cluster. I have a small cluster of just 2 osd servers with 9 osds each. The common trend that i see week after week is that whenever there is a long deep scrubbing activity on the cluster it triggers one or more osds to go down for a short period of time. After the osd is marked down, it goes back up after about 20 seconds. Obviously there is a repair process that kicks in which causes more load on the cluster. While looking at the logs, i've not seen the osds being marked down when the cluster is not deep scrubbing. It _always_ happens when there is a deep scrub activity. I am seeing the reports of osds going down about 3-4 times a week. The latest happened just recently with the following log entries: 2015-04-10 19:32:48.330430 mon.0 192.168.168.13:6789/0 3441533 : cluster [INF] pgmap v50849466: 8508 pgs: 8506 active+clean, 2 active+clean+scrubbing+deep; 13213 GB data, 26896 GB used, 23310 GB / 50206 GB avail; 1005 B/s rd, 1005 B/s wr, 0 op/s 2015-04-10 19:32:52.950633 mon.0 192.168.168.13:6789/0 3441542 : cluster [INF] osd.6
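For reference, rolling the two chunk settings mentioned above back to their defaults (5 and 25, the values Jean-Charles quotes elsewhere in this thread) can be done at runtime without another round of restarts; the matching lines should also be removed from ceph.conf so a later restart does not reapply them:

    ceph tell osd.* injectargs '--osd_scrub_chunk_min 5 --osd_scrub_chunk_max 25'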
Re: [ceph-users] deep scrubbing causes osd down
JC, the restart of the osd servers seems to have stabilised the cluster. It has been a few hours since the restart and I haven't not seen a single osd disconnect. Is there a way to limit the total number of scrub and/or deep-scrub processes running at the same time? For instance, I do not want to have more than 1 or 2 scrub/deep-scrubs running at the same time on my cluster. How do I implement this? Thanks Andrei - Original Message - From: Andrei Mikhailovsky and...@arhont.com To: LOPEZ Jean-Charles jelo...@redhat.com Cc: ceph-users@lists.ceph.com Sent: Sunday, 12 April, 2015 9:02:05 AM Subject: Re: [ceph-users] deep scrubbing causes osd down JC, I've implemented the following changes to the ceph.conf and restarted mons and osds. osd_scrub_chunk_min = 1 osd_scrub_chunk_max =5 Things have become considerably worse after the changes. Shortly after doing that, majority of osd processes started taking up over 100% cpu and the cluster has considerably slowed down. All my vms are reporting high IO wait (between 30-80%), even vms which are pretty idle and don't do much. i have tried restarting all osds, but shortly after the restart the cpu usage goes up. The osds are showing the following logs: 2015-04-12 08:39:28.853860 7f96f81dd700 0 log_channel(default) log [WRN] : slow request 60.277590 seconds old, received at 2015-04-12 08:38:28.576168: osd_op(client.69637439.0:290325926 rbd_data.265f967a5f7514.4a00 [set-alloc-hint object_size 4194304 write_size 4194304,write 1249280~4096] 5.cb2620e0 snapc ac=[ac] ack+ondisk+write+known_if_redirected e74834) currently waiting for missing object 2015-04-12 08:39:28.853863 7f96f81dd700 0 log_channel(default) log [WRN] : slow request 60.246943 seconds old, received at 2015-04-12 08:38:28.606815: osd_op(client.69637439.0:290325927 rbd_data.265f967a5f7514.4a00 [set-alloc-hint object_size 4194304 write_size 4194304,write 1310720~4096] 5.cb2620e0 snapc ac=[ac] ack+ondisk+write+known_if_redirected e74834) currently waiting for missing object 2015-04-12 08:39:36.855180 7f96f81dd700 0 log_channel(default) log [WRN] : 7 slow requests, 1 included below; oldest blocked for 68.278951 secs 2015-04-12 08:39:36.855191 7f96f81dd700 0 log_channel(default) log [WRN] : slow request 30.268450 seconds old, received at 2015-04-12 08:39:06.586669: osd_op(client.64965167.0:1607510 rbd_data.1f264b2ae8944a.0228 [set-alloc-hint object_size 4194304 write_size 4194304,write 3584000~69632] 5.30418007 ack+ondisk+write+known_if_redirected e74834) currently waiting for subops from 9 2015-04-12 08:40:43.570004 7f96dd693700 0 cls cls/rgw/cls_rgw.cc:1458: gc_iterate_entries end_key=1_01428824443.569998000 [In total i've got around 40,000 slow request entries accumulated overnight ((( ] On top of that, I have reports of osds going down and back up as frequently as every 10-20 minutes. This effects all osds and not a particular set of osds. I will restart the osd servers to see if it makes a difference, otherwise, I will need to revert back to the default settings as the cluster as it currently is is not functional. Andrei - Original Message - From: LOPEZ Jean-Charles jelo...@redhat.com To: Andrei Mikhailovsky and...@arhont.com Cc: LOPEZ Jean-Charles jelo...@redhat.com, ceph-users@lists.ceph.com Sent: Saturday, 11 April, 2015 7:54:18 PM Subject: Re: [ceph-users] deep scrubbing causes osd down Hi Andrei, 1) what ceph version are you running? 2) what distro and version are you running? 3) have you checked the disk elevator for the OSD devices to be set to cfq? 
4) Have have you considered exploring the following parameters to further tune - osd_scrub_chunk_min lower the default value of 5. e.g. = 1 - osd_scrub_chunk_max lower the default value of 25. e.g. = 5 - osd_deep_scrub_stride If you have lowered parameters above, you can play with this one to fit best your physical disk behaviour. - osd_scrub_sleep introduce a half second sleep between 2 scrubs; e.g. = 0.5 to start with a half second delay Cheers JC On 10 Apr 2015, at 12:01, Andrei Mikhailovsky and...@arhont.com wrote: Hi guys, I was wondering if anyone noticed that the deep scrubbing process causes some osd to go down? I have been keeping an eye on a few remaining stability issues in my test cluster. One of the unsolved issues is the occasional reporting of osd(s) going down and coming back up after about 20-30 seconds. This happens to various osds throughout the cluster. I have a small cluster of just 2 osd servers with 9 osds each. The common trend that i see week after week is that whenever there is a long deep scrubbing activity on the cluster it triggers one or more osds to go down for a short period of time. After the osd is marked down, it goes back
Re: [ceph-users] deep scrubbing causes osd down
Hi andrei There is one parameter, osd_max_scrub I think, that controls the number of scrubs per OSD. But the default is 1 if I'm correct. Can you check on one of your OSDs with the admin socket? Then it remains the option of scheduling the deep scrubs via a cron job after setting nodeep-scrub to prevent automatic deep scrubbing. Dan Van Der Ster had a post on this ML regarding this. JC While moving. Excuse unintended typos. On Apr 12, 2015, at 05:21, Andrei Mikhailovsky and...@arhont.com wrote: JC, the restart of the osd servers seems to have stabilised the cluster. It has been a few hours since the restart and I haven't not seen a single osd disconnect. Is there a way to limit the total number of scrub and/or deep-scrub processes running at the same time? For instance, I do not want to have more than 1 or 2 scrub/deep-scrubs running at the same time on my cluster. How do I implement this? Thanks Andrei From: Andrei Mikhailovsky and...@arhont.com To: LOPEZ Jean-Charles jelo...@redhat.com Cc: ceph-users@lists.ceph.com Sent: Sunday, 12 April, 2015 9:02:05 AM Subject: Re: [ceph-users] deep scrubbing causes osd down JC, I've implemented the following changes to the ceph.conf and restarted mons and osds. osd_scrub_chunk_min = 1 osd_scrub_chunk_max =5 Things have become considerably worse after the changes. Shortly after doing that, majority of osd processes started taking up over 100% cpu and the cluster has considerably slowed down. All my vms are reporting high IO wait (between 30-80%), even vms which are pretty idle and don't do much. i have tried restarting all osds, but shortly after the restart the cpu usage goes up. The osds are showing the following logs: 2015-04-12 08:39:28.853860 7f96f81dd700 0 log_channel(default) log [WRN] : slow request 60.277590 seconds old, received at 2015-04-12 08:38:28.576168: osd_op(client.69637439.0:290325926 rbd_data.265f967a5f7514.4a00 [set-alloc-hint object_size 4194304 write_size 4194304,write 1249280~4096] 5.cb2620e0 snapc ac=[ac] ack+ondisk+write+known_if_redirected e74834) currently waiting for missing object 2015-04-12 08:39:28.853863 7f96f81dd700 0 log_channel(default) log [WRN] : slow request 60.246943 seconds old, received at 2015-04-12 08:38:28.606815: osd_op(client.69637439.0:290325927 rbd_data.265f967a5f7514.4a00 [set-alloc-hint object_size 4194304 write_size 4194304,write 1310720~4096] 5.cb2620e0 snapc ac=[ac] ack+ondisk+write+known_if_redirected e74834) currently waiting for missing object 2015-04-12 08:39:36.855180 7f96f81dd700 0 log_channel(default) log [WRN] : 7 slow requests, 1 included below; oldest blocked for 68.278951 secs 2015-04-12 08:39:36.855191 7f96f81dd700 0 log_channel(default) log [WRN] : slow request 30.268450 seconds old, received at 2015-04-12 08:39:06.586669: osd_op(client.64965167.0:1607510 rbd_data.1f264b2ae8944a.0228 [set-alloc-hint object_size 4194304 write_size 4194304,write 3584000~69632] 5.30418007 ack+ondisk+write+known_if_redirected e74834) currently waiting for subops from 9 2015-04-12 08:40:43.570004 7f96dd693700 0 cls cls/rgw/cls_rgw.cc:1458: gc_iterate_entries end_key=1_01428824443.569998000 [In total i've got around 40,000 slow request entries accumulated overnight ((( ] On top of that, I have reports of osds going down and back up as frequently as every 10-20 minutes. This effects all osds and not a particular set of osds. I will restart the osd servers to see if it makes a difference, otherwise, I will need to revert back to the default settings as the cluster as it currently is is not functional. 
Andrei From: LOPEZ Jean-Charles jelo...@redhat.com To: Andrei Mikhailovsky and...@arhont.com Cc: LOPEZ Jean-Charles jelo...@redhat.com, ceph-users@lists.ceph.com Sent: Saturday, 11 April, 2015 7:54:18 PM Subject: Re: [ceph-users] deep scrubbing causes osd down Hi Andrei, 1) what ceph version are you running? 2) what distro and version are you running? 3) have you checked the disk elevator for the OSD devices to be set to cfq? 4) Have have you considered exploring the following parameters to further tune - osd_scrub_chunk_min lower the default value of 5. e.g. = 1 - osd_scrub_chunk_max lower the default value of 25. e.g. = 5 - osd_deep_scrub_stride If you have lowered parameters above, you can play with this one to fit best your physical disk behaviour. - osd_scrub_sleep introduce a half second sleep between 2 scrubs; e.g. = 0.5 to start with a half second delay Cheers JC On 10 Apr 2015, at 12:01, Andrei Mikhailovsky and...@arhont.com wrote: Hi guys, I was wondering if anyone noticed that the deep scrubbing process causes some osd to go down? I have been keeping an eye on a few remaining stability issues in my test cluster. One of the unsolved issues is the occasional reporting of osd(s) going down
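The admin-socket check Jean-Charles suggests looks roughly like this; the parameter is spelled osd_max_scrubs, and the socket path and OSD id below are examples to adjust:

    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep osd_max_scrubs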
Re: [ceph-users] deep scrubbing causes osd down
JC, Thanks. I think the max scrub option that you refer to is a value per osd and not per cluster. So, the default is not to run more than 1 scrub per osd. So, if you have 100 osds, by default it will not run more than 100 scrub processes at the same time. However, I want to limit the number on a cluster basis rather than on an osd basis.

Andrei

- Original Message - From: Jean-Charles Lopez jelo...@redhat.com To: Andrei Mikhailovsky and...@arhont.com Cc: ceph-users@lists.ceph.com Sent: Sunday, 12 April, 2015 5:17:10 PM Subject: Re: [ceph-users] deep scrubbing causes osd down

Hi andrei

There is one parameter, osd_max_scrub I think, that controls the number of scrubs per OSD. But the default is 1 if I'm correct. Can you check on one of your OSDs with the admin socket? Then it remains the option of scheduling the deep scrubs via a cron job after setting nodeep-scrub to prevent automatic deep scrubbing. Dan Van Der Ster had a post on this ML regarding this.

JC

While moving. Excuse unintended typos.

On Apr 12, 2015, at 05:21, Andrei Mikhailovsky and...@arhont.com wrote:

JC, the restart of the osd servers seems to have stabilised the cluster. It has been a few hours since the restart and I haven't seen a single osd disconnect.

Is there a way to limit the total number of scrub and/or deep-scrub processes running at the same time? For instance, I do not want to have more than 1 or 2 scrub/deep-scrubs running at the same time on my cluster. How do I implement this?

Thanks

Andrei

- Original Message - From: Andrei Mikhailovsky and...@arhont.com To: LOPEZ Jean-Charles jelo...@redhat.com Cc: ceph-users@lists.ceph.com Sent: Sunday, 12 April, 2015 9:02:05 AM Subject: Re: [ceph-users] deep scrubbing causes osd down

JC, I've implemented the following changes to the ceph.conf and restarted mons and osds.

osd_scrub_chunk_min = 1
osd_scrub_chunk_max = 5

Things have become considerably worse after the changes. Shortly after doing that, the majority of osd processes started taking up over 100% cpu and the cluster has considerably slowed down. All my vms are reporting high IO wait (between 30-80%), even vms which are pretty idle and don't do much. I have tried restarting all osds, but shortly after the restart the cpu usage goes up.
The osds are showing the following logs: 2015-04-12 08:39:28.853860 7f96f81dd700 0 log_channel(default) log [WRN] : slow request 60.277590 seconds old, received at 2015-04-12 08:38:28.576168: osd_op(client.69637439.0:290325926 rbd_data.265f967a5f7514.4a00 [set-alloc-hint object_size 4194304 write_size 4194304,write 1249280~4096] 5.cb2620e0 snapc ac=[ac] ack+ondisk+write+known_if_redirected e74834) currently waiting for missing object 2015-04-12 08:39:28.853863 7f96f81dd700 0 log_channel(default) log [WRN] : slow request 60.246943 seconds old, received at 2015-04-12 08:38:28.606815: osd_op(client.69637439.0:290325927 rbd_data.265f967a5f7514.4a00 [set-alloc-hint object_size 4194304 write_size 4194304,write 1310720~4096] 5.cb2620e0 snapc ac=[ac] ack+ondisk+write+known_if_redirected e74834) currently waiting for missing object 2015-04-12 08:39:36.855180 7f96f81dd700 0 log_channel(default) log [WRN] : 7 slow requests, 1 included below; oldest blocked for 68.278951 secs 2015-04-12 08:39:36.855191 7f96f81dd700 0 log_channel(default) log [WRN] : slow request 30.268450 seconds old, received at 2015-04-12 08:39:06.586669: osd_op(client.64965167.0:1607510 rbd_data.1f264b2ae8944a.0228 [set-alloc-hint object_size 4194304 write_size 4194304,write 3584000~69632] 5.30418007 ack+ondisk+write+known_if_redirected e74834) currently waiting for subops from 9 2015-04-12 08:40:43.570004 7f96dd693700 0 cls cls/rgw/cls_rgw.cc:1458: gc_iterate_entries end_key=1_01428824443.569998000 [In total i've got around 40,000 slow request entries accumulated overnight ((( ] On top of that, I have reports of osds going down and back up as frequently as every 10-20 minutes. This effects all osds and not a particular set of osds. I will restart the osd servers to see if it makes a difference, otherwise, I will need to revert back to the default settings as the cluster as it currently is is not functional. Andrei - Original Message - From: LOPEZ Jean-Charles jelo...@redhat.com To: Andrei Mikhailovsky and...@arhont.com Cc: LOPEZ Jean-Charles jelo...@redhat.com , ceph-users@lists.ceph.com Sent: Saturday, 11 April, 2015 7:54:18 PM Subject: Re: [ceph-users] deep scrubbing causes osd down Hi Andrei
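There is no cluster-wide cap to set (osd_max_scrubs is per OSD, as discussed above), but the number of PGs scrubbing at any instant can at least be watched. A rough sketch; the pgs_brief dump includes the PG state column:

    # count PGs currently in a scrubbing or deep-scrubbing state
    ceph pg dump pgs_brief 2>/dev/null | grep -c scrubbing

    # or just read the pgmap summary
    ceph status | grep scrubbing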
Re: [ceph-users] deep scrubbing causes osd down
Hi Andrei, 1) what ceph version are you running? 2) what distro and version are you running? 3) have you checked the disk elevator for the OSD devices to be set to cfq? 4) Have have you considered exploring the following parameters to further tune - osd_scrub_chunk_min lower the default value of 5. e.g. = 1 - osd_scrub_chunk_max lower the default value of 25. e.g. = 5 - osd_deep_scrub_stride If you have lowered parameters above, you can play with this one to fit best your physical disk behaviour. - osd_scrub_sleep introduce a half second sleep between 2 scrubs; e.g. = 0.5 to start with a half second delay Cheers JC On 10 Apr 2015, at 12:01, Andrei Mikhailovsky and...@arhont.com wrote: Hi guys, I was wondering if anyone noticed that the deep scrubbing process causes some osd to go down? I have been keeping an eye on a few remaining stability issues in my test cluster. One of the unsolved issues is the occasional reporting of osd(s) going down and coming back up after about 20-30 seconds. This happens to various osds throughout the cluster. I have a small cluster of just 2 osd servers with 9 osds each. The common trend that i see week after week is that whenever there is a long deep scrubbing activity on the cluster it triggers one or more osds to go down for a short period of time. After the osd is marked down, it goes back up after about 20 seconds. Obviously there is a repair process that kicks in which causes more load on the cluster. While looking at the logs, i've not seen the osds being marked down when the cluster is not deep scrubbing. It _always_ happens when there is a deep scrub activity. I am seeing the reports of osds going down about 3-4 times a week. The latest happened just recently with the following log entries: 2015-04-10 19:32:48.330430 mon.0 192.168.168.13:6789/0 3441533 : cluster [INF] pgmap v50849466: 8508 pgs: 8506 active+clean, 2 active+clean+scrubbing+deep; 13213 GB data, 26896 GB used, 23310 GB / 50206 GB avail; 1005 B/s rd, 1005 B/s wr, 0 op/s 2015-04-10 19:32:52.950633 mon.0 192.168.168.13:6789/0 3441542 : cluster [INF] osd.6 192.168.168.200:6816/3738 failed (5 reports from 5 peers after 60.747890 = grace 46.701350) 2015-04-10 19:32:53.121904 mon.0 192.168.168.13:6789/0 3441544 : cluster [INF] osdmap e74309: 18 osds: 17 up, 18 in 2015-04-10 19:32:53.231730 mon.0 192.168.168.13:6789/0 3441545 : cluster [INF] pgmap v50849467: 8508 pgs: 599 stale+active+clean, 7907 active+clean, 1 stale+active+clean+scrubbing+deep, 1 active+clean+scrubbing+deep; 13213 GB data, 26896 GB used, 23310 GB / 50206 GB avail; 375 B/s rd, 0 op/s osd.6 logs around the same time are: 2015-04-10 19:16:29.110617 7fad6d5ec700 0 log_channel(default) log [INF] : 5.3d7 deep-scrub ok 2015-04-10 19:27:47.561389 7fad6bde9700 0 log_channel(default) log [INF] : 5.276 deep-scrub ok 2015-04-10 19:31:11.611321 7fad6d5ec700 0 log_channel(default) log [INF] : 5.287 deep-scrub ok 2015-04-10 19:31:53.339881 7fad7ce0b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fad735f8700' had timed out after 15 2015-04-10 19:31:53.339887 7fad7ce0b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fad745fa700' had timed out after 15 2015-04-10 19:31:53.339890 7fad7ce0b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fad705f2700' had timed out after 15 2015-04-10 19:31:53.340050 7fad7e60e700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fad735f8700' had timed out after 15 2015-04-10 19:31:53.340053 7fad7e60e700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fad745fa700' had timed out after 
15 [.] 2015-04-10 19:32:53.010609 7fad7e60e700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fad86132700' had timed out after 60 2015-04-10 19:32:53.010611 7fad7e60e700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fad88937700' had timed out after 60 2015-04-10 19:32:53.111470 7fad66ed2700 0 -- 192.168.168.200:6817/3738 192.168.168.201:6837/4409 pipe(0x2b793b80 sd=179 :6817 s=2 pgs=5 cs=1 l=0 c=0x21e8b420).fault with nothing to send, going to standby 2015-04-10 19:32:53.111496 7fad6329d700 0 -- 192.168.168.200:6817/3738 192.168.168.201:6827/4208 pipe(0x2b793600 sd=172 :6817 s=2 pgs=7 cs=1 l=0 c=0x1791ab00).fault with nothing to send, going to standby 2015-04-10 19:32:53.111463 7fad55bd0700 0 -- 192.168.168.200:6817/3738 192.168.168.201:6822/3910 pipe(0x2cb55dc0 sd=262 :6817 s=2 pgs=8 cs=1 l=0 c=0xe7802c0).fault with nothing to send, going to standby 2015-04-10 19:32:53.121815 7fad6218c700 0 -- 192.168.168.200:6817/3738 192.168.168.201:6807/3575 pipe(0x2cf8e080 sd=294 :6817 s=2 pgs=4 cs=1 l=0 c=0x138669a0).fault with nothing to send, going to standby 2015-04-10 19:32:53.121856 7fad67bdf700 0 -- 192.168.168.200:6817/3738 192.168.168.201:6842/4442 pipe(0x2b792580 sd=190 :6817 s=2 pgs=9 cs=1 l=0 c=0x138922c0).fault with nothing to send, going
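On the disk elevator question above, a quick way to check and change the scheduler per device; sdb here stands in for one of the OSD data disks, and the echo is not persistent across reboots (use udev rules or boot parameters for that):

    cat /sys/block/sdb/queue/scheduler           # active scheduler is shown in brackets
    echo cfq > /sys/block/sdb/queue/scheduler    # switch that disk to cfq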
Re: [ceph-users] deep scrubbing causes osd down
Hi JC, I am running ceph 0.87.1 on Ubuntu 12.04 LTS server with latest patches. I am however running kernel version 3.19.3 and not the stock distro one. I am running cfq on all spindles and noop on all ssds (used for journals). I've not done any scrub specific options, but will try and see if it makes a difference. Thanks for your feedback Andrei - Original Message - From: LOPEZ Jean-Charles jelo...@redhat.com To: Andrei Mikhailovsky and...@arhont.com Cc: LOPEZ Jean-Charles jelo...@redhat.com, ceph-users@lists.ceph.com Sent: Saturday, 11 April, 2015 7:54:18 PM Subject: Re: [ceph-users] deep scrubbing causes osd down Hi Andrei, 1) what ceph version are you running? 2) what distro and version are you running? 3) have you checked the disk elevator for the OSD devices to be set to cfq? 4) Have have you considered exploring the following parameters to further tune - osd_scrub_chunk_min lower the default value of 5. e.g. = 1 - osd_scrub_chunk_max lower the default value of 25. e.g. = 5 - osd_deep_scrub_stride If you have lowered parameters above, you can play with this one to fit best your physical disk behaviour. - osd_scrub_sleep introduce a half second sleep between 2 scrubs; e.g. = 0.5 to start with a half second delay Cheers JC On 10 Apr 2015, at 12:01, Andrei Mikhailovsky and...@arhont.com wrote: Hi guys, I was wondering if anyone noticed that the deep scrubbing process causes some osd to go down? I have been keeping an eye on a few remaining stability issues in my test cluster. One of the unsolved issues is the occasional reporting of osd(s) going down and coming back up after about 20-30 seconds. This happens to various osds throughout the cluster. I have a small cluster of just 2 osd servers with 9 osds each. The common trend that i see week after week is that whenever there is a long deep scrubbing activity on the cluster it triggers one or more osds to go down for a short period of time. After the osd is marked down, it goes back up after about 20 seconds. Obviously there is a repair process that kicks in which causes more load on the cluster. While looking at the logs, i've not seen the osds being marked down when the cluster is not deep scrubbing. It _always_ happens when there is a deep scrub activity. I am seeing the reports of osds going down about 3-4 times a week. 
The latest happened just recently with the following log entries: 2015-04-10 19:32:48.330430 mon.0 192.168.168.13:6789/0 3441533 : cluster [INF] pgmap v50849466: 8508 pgs: 8506 active+clean, 2 active+clean+scrubbing+deep; 13213 GB data, 26896 GB used, 23310 GB / 50206 GB avail; 1005 B/s rd, 1005 B/s wr, 0 op/s 2015-04-10 19:32:52.950633 mon.0 192.168.168.13:6789/0 3441542 : cluster [INF] osd.6 192.168.168.200:6816/3738 failed (5 reports from 5 peers after 60.747890 = grace 46.701350) 2015-04-10 19:32:53.121904 mon.0 192.168.168.13:6789/0 3441544 : cluster [INF] osdmap e74309: 18 osds: 17 up, 18 in 2015-04-10 19:32:53.231730 mon.0 192.168.168.13:6789/0 3441545 : cluster [INF] pgmap v50849467: 8508 pgs: 599 stale+active+clean, 7907 active+clean, 1 stale+active+clean+scrubbing+deep, 1 active+clean+scrubbing+deep; 13213 GB data, 26896 GB used, 23310 GB / 50206 GB avail; 375 B/s rd, 0 op/s osd.6 logs around the same time are: 2015-04-10 19:16:29.110617 7fad6d5ec700 0 log_channel(default) log [INF] : 5.3d7 deep-scrub ok 2015-04-10 19:27:47.561389 7fad6bde9700 0 log_channel(default) log [INF] : 5.276 deep-scrub ok 2015-04-10 19:31:11.611321 7fad6d5ec700 0 log_channel(default) log [INF] : 5.287 deep-scrub ok 2015-04-10 19:31:53.339881 7fad7ce0b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fad735f8700' had timed out after 15 2015-04-10 19:31:53.339887 7fad7ce0b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fad745fa700' had timed out after 15 2015-04-10 19:31:53.339890 7fad7ce0b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fad705f2700' had timed out after 15 2015-04-10 19:31:53.340050 7fad7e60e700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fad735f8700' had timed out after 15 2015-04-10 19:31:53.340053 7fad7e60e700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fad745fa700' had timed out after 15 [.] 2015-04-10 19:32:53.010609 7fad7e60e700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fad86132700' had timed out after 60 2015-04-10 19:32:53.010611 7fad7e60e700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fad88937700' had timed out after 60 2015-04-10 19:32:53.111470 7fad66ed2700 0 -- 192.168.168.200:6817/3738 192.168.168.201:6837/4409 pipe(0x2b793b80 sd=179 :6817 s=2 pgs=5 cs=1 l=0 c=0x21e8b420).fault with nothing to send, going to standby 2015-04-10 19:32:53.111496 7fad6329d700 0 -- 192.168.168.200:6817/3738 192.168.168.201:6827
Re: [ceph-users] deep scrubbing causes osd down
It looks like deep scrub cause the disk busy and some threads blocking on this. Maybe you could lower the scrub related configurations and see the disk util when deep-scrubing. On Sat, Apr 11, 2015 at 3:01 AM, Andrei Mikhailovsky and...@arhont.com wrote: Hi guys, I was wondering if anyone noticed that the deep scrubbing process causes some osd to go down? I have been keeping an eye on a few remaining stability issues in my test cluster. One of the unsolved issues is the occasional reporting of osd(s) going down and coming back up after about 20-30 seconds. This happens to various osds throughout the cluster. I have a small cluster of just 2 osd servers with 9 osds each. The common trend that i see week after week is that whenever there is a long deep scrubbing activity on the cluster it triggers one or more osds to go down for a short period of time. After the osd is marked down, it goes back up after about 20 seconds. Obviously there is a repair process that kicks in which causes more load on the cluster. While looking at the logs, i've not seen the osds being marked down when the cluster is not deep scrubbing. It _always_ happens when there is a deep scrub activity. I am seeing the reports of osds going down about 3-4 times a week. The latest happened just recently with the following log entries: 2015-04-10 19:32:48.330430 mon.0 192.168.168.13:6789/0 3441533 : cluster [INF] pgmap v50849466: 8508 pgs: 8506 active+clean, 2 active+clean+scrubbing+deep; 13213 GB data, 26896 GB used, 23310 GB / 50206 GB avail; 1005 B/s rd, 1005 B/s wr, 0 op/s 2015-04-10 19:32:52.950633 mon.0 192.168.168.13:6789/0 3441542 : cluster [INF] osd.6 192.168.168.200:6816/3738 failed (5 reports from 5 peers after 60.747890 = grace 46.701350) 2015-04-10 19:32:53.121904 mon.0 192.168.168.13:6789/0 3441544 : cluster [INF] osdmap e74309: 18 osds: 17 up, 18 in 2015-04-10 19:32:53.231730 mon.0 192.168.168.13:6789/0 3441545 : cluster [INF] pgmap v50849467: 8508 pgs: 599 stale+active+clean, 7907 active+clean, 1 stale+active+clean+scrubbing+deep, 1 active+clean+scrubbing+deep; 13213 GB data, 26896 GB used, 23310 GB / 50206 GB avail; 375 B/s rd, 0 op/s osd.6 logs around the same time are: 2015-04-10 19:16:29.110617 7fad6d5ec700 0 log_channel(default) log [INF] : 5.3d7 deep-scrub ok 2015-04-10 19:27:47.561389 7fad6bde9700 0 log_channel(default) log [INF] : 5.276 deep-scrub ok 2015-04-10 19:31:11.611321 7fad6d5ec700 0 log_channel(default) log [INF] : 5.287 deep-scrub ok 2015-04-10 19:31:53.339881 7fad7ce0b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fad735f8700' had timed out after 15 2015-04-10 19:31:53.339887 7fad7ce0b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fad745fa700' had timed out after 15 2015-04-10 19:31:53.339890 7fad7ce0b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fad705f2700' had timed out after 15 2015-04-10 19:31:53.340050 7fad7e60e700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fad735f8700' had timed out after 15 2015-04-10 19:31:53.340053 7fad7e60e700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fad745fa700' had timed out after 15 [.] 
2015-04-10 19:32:53.010609 7fad7e60e700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fad86132700' had timed out after 60 2015-04-10 19:32:53.010611 7fad7e60e700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fad88937700' had timed out after 60 2015-04-10 19:32:53.111470 7fad66ed2700 0 -- 192.168.168.200:6817/3738 192.168.168.201:6837/4409 pipe(0x2b793b80 sd=179 :6817 s=2 pgs=5 cs=1 l=0 c=0x21e8b420).fault with nothing to send, going to standby 2015-04-10 19:32:53.111496 7fad6329d700 0 -- 192.168.168.200:6817/3738 192.168.168.201:6827/4208 pipe(0x2b793600 sd=172 :6817 s=2 pgs=7 cs=1 l=0 c=0x1791ab00).fault with nothing to send, going to standby 2015-04-10 19:32:53.111463 7fad55bd0700 0 -- 192.168.168.200:6817/3738 192.168.168.201:6822/3910 pipe(0x2cb55dc0 sd=262 :6817 s=2 pgs=8 cs=1 l=0 c=0xe7802c0).fault with nothing to send, going to standby 2015-04-10 19:32:53.121815 7fad6218c700 0 -- 192.168.168.200:6817/3738 192.168.168.201:6807/3575 pipe(0x2cf8e080 sd=294 :6817 s=2 pgs=4 cs=1 l=0 c=0x138669a0).fault with nothing to send, going to standby 2015-04-10 19:32:53.121856 7fad67bdf700 0 -- 192.168.168.200:6817/3738 192.168.168.201:6842/4442 pipe(0x2b792580 sd=190 :6817 s=2 pgs=9 cs=1 l=0 c=0x138922c0).fault with nothing to send, going to standby 2015-04-10 19:32:53.123545 7fad651bc700 0 -- 192.168.168.200:6817/3738 192.168.168.201:6801/3053 pipe(0x15e538c0 sd=260 :6817 s=2 pgs=1 cs=1 l=0 c=0x16bf09a0).fault with nothing to send, going to standby 2015-04-10 19:32:53.128729 7fad53eb3700 0 -- 192.168.168.200:6817/3738 192.168.168.201:6832/4257 pipe(0x37dcb80 sd=311 :6817 s=2 pgs=3 cs=1 l=0 c=0x1131f420).fault with nothing to send, going to standby 2015-04-10 19:32:53.132691 7fad53fb4700 0 -- 192.168.168.200:6817/3738
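To confirm that deep scrub is indeed what is saturating the disks, watching per-device utilisation while a deep scrub runs is usually enough; iostat comes with the sysstat package:

    # extended device stats every 5 seconds; sustained %util near 100 on an OSD disk
    # during deep scrub suggests the chunk/stride/sleep values need to come down further
    iostat -x 5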