Re: [ceph-users] reproducable rbd-nbd crashes

2019-07-29 Thread Marc Schöchlin
Hello Jason,

I updated the ticket: https://tracker.ceph.com/issues/40822

On 24.07.19 at 19:20, Jason Dillaman wrote:
> On Wed, Jul 24, 2019 at 12:47 PM Marc Schöchlin  wrote:
>>
>> Testing with a 12.2.5 librbd/rbd-nbd is currently not that easy for me,
>> because the Ceph apt source does not contain that version.
>> Do you know a package source?
> All the upstream packages should be available here [1], including 12.2.5.
Ah okay, I will test this tomorrow.
> Did you pull the OSD blocked ops stats to figure out what is going on
> with the OSDs?
Yes, see the referenced data in the ticket:
https://tracker.ceph.com/issues/40822#note-15
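
For reference, blocked-op data of this kind can be pulled from the OSD admin
socket roughly as follows (a sketch; osd.17 is just an example id, taken from
the log excerpt later in this thread, and the daemon commands must be run on
the OSD's host):

# find OSDs with blocked/slow requests, then dump their op state
ceph health detail
ceph daemon osd.17 dump_blocked_ops
ceph daemon osd.17 dump_ops_in_flight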

Regards
Marc



Re: [ceph-users] reproducable rbd-nbd crashes

2019-07-24 Thread Jason Dillaman
On Wed, Jul 24, 2019 at 12:47 PM Marc Schöchlin  wrote:
>
> Hi Jason,
>
> I installed kernel 4.4.0-154.181 (from the Ubuntu package sources) and performed
> the crash reproduction.
> The problem also reappeared with that kernel release.
>
> A run with 10 parallel gunzip processes threw 1600 write and 330 read IOPS
> against the cluster / the rbd_ec volume with a transfer rate of 290 MB/s for
> 10 minutes.
> After that the same problem reappeared.
>
> What should we do now?
>
> Testing with a 12.2.5 librbd/rbd-nbd is currently not that easy for me,
> because the Ceph apt source does not contain that version.
> Do you know a package source?

All the upstream packages should be available here [1], including 12.2.5.
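
Once that repository is configured, a hedged sketch of installing the older
release (the exact version string is an assumption, following the
12.2.x-1xenial pattern of the packages listed later in this thread):

# list the available versions, then pin the 12.2.5 build
# (add --allow-downgrades if a newer version is already installed)
apt-cache madison rbd-nbd librbd1
apt-get install rbd-nbd=12.2.5-1xenial librbd1=12.2.5-1xenial ceph-common=12.2.5-1xenial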

> How can I support you?

Did you pull the OSD blocked ops stats to figure out what is going on
with the OSDs?

> Regards
> Marc
>
> On 24.07.19 at 07:55, Marc Schöchlin wrote:
> > Hi Jason,
> >
> > On 24.07.19 at 00:40, Jason Dillaman wrote:
> >>> Sure, which kernel do you prefer?
> >> You said you have never had an issue w/ rbd-nbd 12.2.5 in your Xen 
> >> environment. Can you use a matching kernel version?
> >
> > That's true; the virtual machines in our Xen environments run completely on
> > rbd-nbd devices.
> > Every host runs dozens of rbd-nbd maps which are visible as Xen disks in
> > the virtual systems.
> > (https://github.com/vico-research-and-consulting/RBDSR)
> >
> > It seems that XenServer handles device timings specially, because 1.5 years
> > ago we had an outage of 1.5 hours of our Ceph cluster which blocked all
> > write requests
> > (overfull disks due to huge usage growth). In that situation all virtual
> > machines continued their work without problems after the cluster came back.
> > We haven't set any timeouts using nbd_set_timeout.c on these systems.
> >
> > We never experienced problems with these rbd-nbd instances.
> >
> > [root@xen-s31 ~]# rbd nbd ls
> > pid   pool   image  
> >   snap device
> > 10405 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> > RBD-72f4e61d-acb9-4679-9b1d-fe0324cb5436 -/dev/nbd3
> > 12731 RBD_XenStorage-PROD-SSD-2-edcf45e6-ca5b-43f9-bafe-c553b1e5dd84 
> > RBD-88f8889a-05dc-49ab-a7de-8b5f3961f9c9 -/dev/nbd4
> > 13123 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 
> > RBD-37243066-54b0-453a-8bf3-b958153a680d -/dev/nbd5
> > 15342 RBD_XenStorage-PROD-SSD-1-cb933ab7-a006-4046-a012-5cbe0c5fbfb5 
> > RBD-2bee9bf7-4fed-4735-a749-2d4874181686 -/dev/nbd6
> > 15702 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 
> > RBD-5b93eb93-ebe7-4711-a16a-7893d24c1bbf -/dev/nbd7
> > 27568 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 
> > RBD-616a74b5-3f57-4123-9505-dbd4c9aa9be3 -/dev/nbd8
> > 21112 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> > RBD-5c673a73-7827-44cc-802c-8d626da2f401 -/dev/nbd9
> > 15726 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> > RBD-1069a275-d97f-48fd-9c52-aed1d8ac9eab -/dev/nbd10
> > 4368  RBD_XenStorage-PROD-SSD-2-edcf45e6-ca5b-43f9-bafe-c553b1e5dd84 
> > RBD-23b72184-0914-4924-8f7f-10868af7c0ab -/dev/nbd11
> > 4642  RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> > RBD-bf13cf77-6115-466e-85c5-aa1d69a570a0 -/dev/nbd12
> > 9438  RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> > RBD-a2071aa0-5f63-4425-9f67-1713851fc1ca -/dev/nbd13
> > 29191 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 
> > RBD-fd9a299f-dad9-4ab9-b6c9-2e9650cda581 -/dev/nbd14
> > 4493  RBD_XenStorage-PROD-SSD-2-edcf45e6-ca5b-43f9-bafe-c553b1e5dd84 
> > RBD-1bbb4135-e9ed-4720-a41a-a49b998faf42 -/dev/nbd15
> > 4683  RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> > RBD-374cadac-d969-49eb-8269-aa125cba82d8 -/dev/nbd16
> > 1736  RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> > RBD-478a20cc-58dd-4cd9-b8b1-6198014e21b1 -/dev/nbd17
> > 3648  RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> > RBD-6e28ec15-747a-43c9-998d-e9f2a600f266 -/dev/nbd18
> > 9993  RBD_XenStorage-PROD-SSD-2-edcf45e6-ca5b-43f9-bafe-c553b1e5dd84 
> > RBD-61ae5ef3-9efb-4fe6-8882-45d54558313e -/dev/nbd19
> > 10324 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> > RBD-f7d27673-c268-47b9-bd58-46dcd4626bbb -/dev/nbd20
> > 19330 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 
> > RBD-0d4e5568-ac93-4f27-b24f-6624f2fa4a2b -/dev/nbd21
> > 14942 RBD_XenStorage-PROD-SSD-1-cb933ab7-a006-4046-a012-5cbe0c5fbfb5 
> > RBD-69832522-fd68-49f9-810f-485947ff5e44 -/dev/nbd22
> > 20859 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 
> > RBD-5025b066-723e-48f5-bc4e-9b8bdc1e9326 -/dev/nbd23
> > 19247 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> > 

Re: [ceph-users] reproducable rbd-nbd crashes

2019-07-24 Thread Marc Schöchlin
Hi Jason,

I installed kernel 4.4.0-154.181 (from the Ubuntu package sources) and performed
the crash reproduction.
The problem also reappeared with that kernel release.

A run with 10 parallel gunzip processes threw 1600 write and 330 read IOPS against
the cluster / the rbd_ec volume with a transfer rate of 290 MB/s for 10 minutes.
After that the same problem reappeared.
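
The 10-process run looked roughly like this (a sketch; the exact invocation may
have differed):

# decompress the test data with up to 10 gunzip processes in parallel
find /srv_ec -type f -name "*.sql.gz" -print0 | xargs -0 -P 10 -n 1 gunzip -v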

What should we do now?

Testing with a 12.2.5 librbd/rbd-nbd is currently not that easy for me,
because the Ceph apt source does not contain that version.
Do you know a package source?

How can I support you?

Regards
Marc

On 24.07.19 at 07:55, Marc Schöchlin wrote:
> Hi Jason,
>
> On 24.07.19 at 00:40, Jason Dillaman wrote:
>>> Sure, which kernel do you prefer?
>> You said you have never had an issue w/ rbd-nbd 12.2.5 in your Xen 
>> environment. Can you use a matching kernel version? 
>
> That's true; the virtual machines in our Xen environments run completely on
> rbd-nbd devices.
> Every host runs dozens of rbd-nbd maps which are visible as Xen disks in the
> virtual systems.
> (https://github.com/vico-research-and-consulting/RBDSR)
>
> It seems that XenServer handles device timings specially, because 1.5 years
> ago we had an outage of 1.5 hours of our Ceph cluster which blocked all
> write requests
> (overfull disks due to huge usage growth). In that situation all virtual
> machines continued their work without problems after the cluster came back.
> We haven't set any timeouts using nbd_set_timeout.c on these systems.
>
> We never experienced problems with these rbd-nbd instances.
>
> [root@xen-s31 ~]# rbd nbd ls
> pid   pool   image
>     snap device
> 10405 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> RBD-72f4e61d-acb9-4679-9b1d-fe0324cb5436 -    /dev/nbd3 
> 12731 RBD_XenStorage-PROD-SSD-2-edcf45e6-ca5b-43f9-bafe-c553b1e5dd84 
> RBD-88f8889a-05dc-49ab-a7de-8b5f3961f9c9 -    /dev/nbd4 
> 13123 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 
> RBD-37243066-54b0-453a-8bf3-b958153a680d -    /dev/nbd5 
> 15342 RBD_XenStorage-PROD-SSD-1-cb933ab7-a006-4046-a012-5cbe0c5fbfb5 
> RBD-2bee9bf7-4fed-4735-a749-2d4874181686 -    /dev/nbd6 
> 15702 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 
> RBD-5b93eb93-ebe7-4711-a16a-7893d24c1bbf -    /dev/nbd7 
> 27568 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 
> RBD-616a74b5-3f57-4123-9505-dbd4c9aa9be3 -    /dev/nbd8 
> 21112 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> RBD-5c673a73-7827-44cc-802c-8d626da2f401 -    /dev/nbd9 
> 15726 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> RBD-1069a275-d97f-48fd-9c52-aed1d8ac9eab -    /dev/nbd10
> 4368  RBD_XenStorage-PROD-SSD-2-edcf45e6-ca5b-43f9-bafe-c553b1e5dd84 
> RBD-23b72184-0914-4924-8f7f-10868af7c0ab -    /dev/nbd11
> 4642  RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> RBD-bf13cf77-6115-466e-85c5-aa1d69a570a0 -    /dev/nbd12
> 9438  RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> RBD-a2071aa0-5f63-4425-9f67-1713851fc1ca -    /dev/nbd13
> 29191 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 
> RBD-fd9a299f-dad9-4ab9-b6c9-2e9650cda581 -    /dev/nbd14
> 4493  RBD_XenStorage-PROD-SSD-2-edcf45e6-ca5b-43f9-bafe-c553b1e5dd84 
> RBD-1bbb4135-e9ed-4720-a41a-a49b998faf42 -    /dev/nbd15
> 4683  RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> RBD-374cadac-d969-49eb-8269-aa125cba82d8 -    /dev/nbd16
> 1736  RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> RBD-478a20cc-58dd-4cd9-b8b1-6198014e21b1 -    /dev/nbd17
> 3648  RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> RBD-6e28ec15-747a-43c9-998d-e9f2a600f266 -    /dev/nbd18
> 9993  RBD_XenStorage-PROD-SSD-2-edcf45e6-ca5b-43f9-bafe-c553b1e5dd84 
> RBD-61ae5ef3-9efb-4fe6-8882-45d54558313e -    /dev/nbd19
> 10324 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> RBD-f7d27673-c268-47b9-bd58-46dcd4626bbb -    /dev/nbd20
> 19330 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 
> RBD-0d4e5568-ac93-4f27-b24f-6624f2fa4a2b -    /dev/nbd21
> 14942 RBD_XenStorage-PROD-SSD-1-cb933ab7-a006-4046-a012-5cbe0c5fbfb5 
> RBD-69832522-fd68-49f9-810f-485947ff5e44 -    /dev/nbd22
> 20859 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 
> RBD-5025b066-723e-48f5-bc4e-9b8bdc1e9326 -    /dev/nbd23
> 19247 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> RBD-095292a0-6cc2-4112-95bf-15cb3dd33e9a -    /dev/nbd24
> 22356 RBD_XenStorage-PROD-SSD-2-edcf45e6-ca5b-43f9-bafe-c553b1e5dd84 
> RBD-f8229ea0-ad7b-4034-9cbe-7353792a2b7c -    /dev/nbd25
> 22537 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> RBD-e8c0b841-50ec-4765-a3cb-30c78a4b9162 -    /dev/nbd26
> 15105 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> 

Re: [ceph-users] reproducable rbd-nbd crashes

2019-07-24 Thread Mike Christie
On 07/23/2019 12:28 AM, Marc Schöchlin wrote:
>>> For testing purposes I set the timeout to unlimited ("nbd_set_ioctl
>>> /dev/nbd0 0", on an already mounted device).
>>> >> I re-executed the problem procedure and discovered that the
>>> >> compression procedure does not crash at the same file, but 30
>>> >> seconds later, with the same crash behavior.
>>> >>
>> > 0 will cause the default timeout of 30 secs to be used.
> Okay, then the usage description of
> https://github.com/OnApp/nbd-kernel_mod/blob/master/nbd_set_timeout.c does not
> seem to be correct :-)

It is correct for older kernels:

With older kernels you could turn it off by setting it to 0, and it was
off by default.

With newer kernels, it's on by default, and there is no way to turn it off.

So with older kernels, you could have been hitting similar slowdowns,
but you would have never seen the timeouts, I/O errors, etc.
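
Since it cannot be disabled on newer kernels, the practical workaround is a
very long but finite timeout. A sketch using the nbd_set_timeout tool
referenced earlier in this thread (binary name and argument order are
assumptions: device, then seconds):

# set a one-hour per-command timeout on the already mapped device
./nbd_set_timeout /dev/nbd0 3600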


Re: [ceph-users] reproducable rbd-nbd crashes

2019-07-23 Thread Marc Schöchlin
Hi Jason,

On 24.07.19 at 00:40, Jason Dillaman wrote:
>
>> Sure, which kernel do you prefer?
> You said you have never had an issue w/ rbd-nbd 12.2.5 in your Xen 
> environment. Can you use a matching kernel version? 


That's true; the virtual machines in our Xen environments run completely on
rbd-nbd devices.
Every host runs dozens of rbd-nbd maps which are visible as Xen disks in the
virtual systems.
(https://github.com/vico-research-and-consulting/RBDSR)

It seems that XenServer handles device timings specially, because 1.5 years
ago we had an outage of 1.5 hours of our Ceph cluster which blocked all
write requests
(overfull disks due to huge usage growth). In that situation all virtual
machines continued their work without problems after the cluster came back.
We haven't set any timeouts using nbd_set_timeout.c on these systems.

We never experienced problems with these rbd-nbd instances.

[root@xen-s31 ~]# rbd nbd ls
pid   pool   image  
  snap device
10405 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
RBD-72f4e61d-acb9-4679-9b1d-fe0324cb5436 -    /dev/nbd3 
12731 RBD_XenStorage-PROD-SSD-2-edcf45e6-ca5b-43f9-bafe-c553b1e5dd84 
RBD-88f8889a-05dc-49ab-a7de-8b5f3961f9c9 -    /dev/nbd4 
13123 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 
RBD-37243066-54b0-453a-8bf3-b958153a680d -    /dev/nbd5 
15342 RBD_XenStorage-PROD-SSD-1-cb933ab7-a006-4046-a012-5cbe0c5fbfb5 
RBD-2bee9bf7-4fed-4735-a749-2d4874181686 -    /dev/nbd6 
15702 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 
RBD-5b93eb93-ebe7-4711-a16a-7893d24c1bbf -    /dev/nbd7 
27568 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 
RBD-616a74b5-3f57-4123-9505-dbd4c9aa9be3 -    /dev/nbd8 
21112 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
RBD-5c673a73-7827-44cc-802c-8d626da2f401 -    /dev/nbd9 
15726 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
RBD-1069a275-d97f-48fd-9c52-aed1d8ac9eab -    /dev/nbd10
4368  RBD_XenStorage-PROD-SSD-2-edcf45e6-ca5b-43f9-bafe-c553b1e5dd84 
RBD-23b72184-0914-4924-8f7f-10868af7c0ab -    /dev/nbd11
4642  RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
RBD-bf13cf77-6115-466e-85c5-aa1d69a570a0 -    /dev/nbd12
9438  RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
RBD-a2071aa0-5f63-4425-9f67-1713851fc1ca -    /dev/nbd13
29191 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 
RBD-fd9a299f-dad9-4ab9-b6c9-2e9650cda581 -    /dev/nbd14
4493  RBD_XenStorage-PROD-SSD-2-edcf45e6-ca5b-43f9-bafe-c553b1e5dd84 
RBD-1bbb4135-e9ed-4720-a41a-a49b998faf42 -    /dev/nbd15
4683  RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
RBD-374cadac-d969-49eb-8269-aa125cba82d8 -    /dev/nbd16
1736  RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
RBD-478a20cc-58dd-4cd9-b8b1-6198014e21b1 -    /dev/nbd17
3648  RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
RBD-6e28ec15-747a-43c9-998d-e9f2a600f266 -    /dev/nbd18
9993  RBD_XenStorage-PROD-SSD-2-edcf45e6-ca5b-43f9-bafe-c553b1e5dd84 
RBD-61ae5ef3-9efb-4fe6-8882-45d54558313e -    /dev/nbd19
10324 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
RBD-f7d27673-c268-47b9-bd58-46dcd4626bbb -    /dev/nbd20
19330 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 
RBD-0d4e5568-ac93-4f27-b24f-6624f2fa4a2b -    /dev/nbd21
14942 RBD_XenStorage-PROD-SSD-1-cb933ab7-a006-4046-a012-5cbe0c5fbfb5 
RBD-69832522-fd68-49f9-810f-485947ff5e44 -    /dev/nbd22
20859 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 
RBD-5025b066-723e-48f5-bc4e-9b8bdc1e9326 -    /dev/nbd23
19247 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
RBD-095292a0-6cc2-4112-95bf-15cb3dd33e9a -    /dev/nbd24
22356 RBD_XenStorage-PROD-SSD-2-edcf45e6-ca5b-43f9-bafe-c553b1e5dd84 
RBD-f8229ea0-ad7b-4034-9cbe-7353792a2b7c -    /dev/nbd25
22537 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
RBD-e8c0b841-50ec-4765-a3cb-30c78a4b9162 -    /dev/nbd26
15105 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
RBD-6d3d3503-2b45-45e9-a17b-30ab65c2be3d -    /dev/nbd27
28192 RBD_XenStorage-PROD-SSD-2-edcf45e6-ca5b-43f9-bafe-c553b1e5dd84 
RBD-e04ec9e6-da4c-4b7a-b257-2cf7022ac59f -    /dev/nbd28
28507 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
RBD-e6d213b3-89d6-4c09-bc65-18ed7992149d -    /dev/nbd29
23206 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 
RBD-638ef476-843e-4c26-8202-377f185d9d26 -    /dev/nbd30


[root@xen-s31 ~]# uname -a
Linux xen-s31 4.4.0+10 #1 SMP Wed Dec 6 13:56:09 UTC 2017 x86_64 x86_64 x86_64 
GNU/Linux

[root@xen-s31 ~]# rpm -qa|grep -P "ceph|rbd"
librbd1-12.2.5-0.el7.x86_64
python-rbd-12.2.5-0.el7.x86_64
ceph-common-12.2.5-0.el7.x86_64
python-cephfs-12.2.5-0.el7.x86_64
rbd-fuse-12.2.5-0.el7.x86_64
libcephfs2-12.2.5-0.el7.x86_64

Re: [ceph-users] reproducable rbd-nbd crashes

2019-07-23 Thread Marc Schöchlin
Hi Jason,

On 23.07.19 at 14:41, Jason Dillaman wrote:
> Can you please test a consistent Ceph release w/ a known working
> kernel release? It sounds like you have changed two variables, so it's
> hard to know which one is broken. We need *you* to isolate what
> specific Ceph or kernel release causes the break.
Sure, let's find the origin of this problem. :-)
>
> We really haven't made many changes to rbd-nbd, but the kernel has had
> major changes to the nbd driver. As Mike pointed out on the tracker
> ticket, one of those major changes effectively capped the number of
> devices at 256. Can you repeat this with a single device? 


Definitely - the problematic rbd-nbd runs on a virtual system which utilizes
only a single nbd device and a single krbd device.

To be clear:

# lsb_release -d
Description:    Ubuntu 16.04.5 LTS

# uname -a
Linux archiv-001 4.15.0-45-generic #48~16.04.1-Ubuntu SMP Tue Jan 29 18:03:48 
UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

# rbd nbd ls
pid    pool    image snap device   
626931 rbd_hdd archiv-001-srv_ec -    /dev/nbd0 

# rbd showmapped
id pool    image  snap device   
0  rbd_hdd archiv-001_srv -    /dev/rbd0 

# df -h|grep -P "File|nbd|rbd"
Filesystem    Size  Used Avail Use% Mounted on
/dev/rbd0  32T   31T  1.8T  95% /srv
/dev/nbd0 3.0T  1.3T  1.8T  42% /srv_ec

#  mount|grep -P "nbd|rbd"
/dev/rbd0 on /srv type xfs 
(rw,relatime,attr2,largeio,inode64,allocsize=4096k,logbufs=8,logbsize=256k,sunit=8192,swidth=8192,noquota,_netdev)
/dev/nbd0 on /srv_ec type xfs 
(rw,relatime,attr2,discard,largeio,inode64,allocsize=4096k,logbufs=8,logbsize=256k,noquota,_netdev)

# dpkg -l|grep -P "rbd|ceph"
ii  ceph-common   12.2.11-1xenial   
 amd64    common utilities to mount and interact with a ceph storage 
cluster
ii  libcephfs2    12.2.11-1xenial   
 amd64    Ceph distributed file system client library
ii  librbd1   12.2.11-1xenial   
 amd64    RADOS block device client library
ii  python-cephfs 12.2.11-1xenial   
 amd64    Python 2 libraries for the Ceph libcephfs library
ii  python-rbd    12.2.11-1xenial   
 amd64    Python 2 libraries for the Ceph librbd library
ii  rbd-nbd   12.2.12-1xenial   
 amd64    NBD-based rbd client for the Ceph distributed file system

More details regarding the problem environment can be found in my initial
mail below the heading "Environment".
> Can you
> repeat this on Ceph rbd-nbd 12.2.11 with an older kernel?

Sure, which kernel do you prefer?

I can test with following releases:

# apt-cache search linux-image-4.*.*.*-*-generic 2>&1|sed 
'~s,\.[0-9]*-[0-9]*-*-generic - .*,,;~s,linux-image-,,'|sort -u
4.10
4.11
4.13
4.15
4.4
4.8

We can also perform tests using another filesystem (e.g. ext4); a sketch of such a run follows.
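
A sketch of such a comparison run (device, pool, image and mount point taken
from the environment above; note that mkfs.ext4 would wipe the current test
data on the volume):

# reformat the test volume as ext4 and repeat the workload
rbd-nbd map rbd_hdd/archiv-001-srv_ec
mkfs.ext4 /dev/nbd0
mount /dev/nbd0 /srv_ec
find /srv_ec -type f -name "*.sql" -exec ionice -c3 nice -n 20 gzip -v {} \;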

From my point of view, I suspect that something is wrong with nbd.ko or with
rbd-nbd (excluding the rbd cache functionality) - therefore I do not think
that this is very promising.

Regards
Marc


Re: [ceph-users] reproducable rbd-nbd crashes

2019-07-23 Thread Jason Dillaman
On Tue, Jul 23, 2019 at 6:58 AM Marc Schöchlin  wrote:
>
>
> On 23.07.19 at 07:28, Marc Schöchlin wrote:
> >
> > Okay, I already experimented with high timeouts (e.g. 600 seconds). As far as
> > I remember this led to a pretty unusable system if I put high amounts of I/O
> > on the EC volume.
> > This system also runs a krbd volume which saturates the system with
> > ~30-60% iowait - that volume never had a problem.
> >
> > A commenter in https://tracker.ceph.com/issues/40822#change-141205
> > suggested that I reduce the rbd cache.
> > What do you think about that?
>
> The test with a reduced rbd cache still failed, therefore I made tests with
> the rbd cache disabled:
>
> - I disabled the rbd cache with "rbd cache = false"
> - unmounted and unmapped the image
> - mapped and mounted the image
> - re-executed my test:
>   find /srv_ec -type f -name "*.sql" -exec gzip -v {} \;
>
>
> It took several hours, but in the end I got the same error situation.
>

Can you please test a consistent Ceph release w/ a known working
kernel release? It sounds like you have changed two variables, so it's
hard to know which one is broken. We need *you* to isolate what
specific Ceph or kernel release causes the break.

We really haven't made many changes to rbd-nbd, but the kernel has had
major changes to the nbd driver. As Mike pointed out on the tracker
ticket, one of those major changes effectively capped the number of
devices at 256. Can you repeat this with a single device? Can you
repeat this on Ceph rbd-nbd 12.2.11 with an older kernel?

-- 
Jason


Re: [ceph-users] reproducable rbd-nbd crashes

2019-07-23 Thread Marc Schöchlin


On 23.07.19 at 07:28, Marc Schöchlin wrote:
>
> Okay, I already experimented with high timeouts (e.g. 600 seconds). As far as I
> remember this led to a pretty unusable system if I put high amounts of I/O on
> the EC volume.
> This system also runs a krbd volume which saturates the system with ~30-60%
> iowait - that volume never had a problem.
>
> A commenter in https://tracker.ceph.com/issues/40822#change-141205
> suggested that I reduce the rbd cache.
> What do you think about that?

The test with a reduced rbd cache still failed, therefore I made tests with the
rbd cache disabled (see the sketch below):

- I disabled the rbd cache with "rbd cache = false"
- unmounted and unmapped the image
- mapped and mounted the image
- re-executed my test:
   find /srv_ec -type f -name "*.sql" -exec gzip -v {} \;


It took several hours, but in the end I got the same error situation.
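
A sketch of that sequence (assuming the setting lives in the [client] section
of ceph.conf on this host, and using the image, device and mount names from the
environment shown above in this thread):

# ceph.conf on the client:
#   [client]
#   rbd cache = false
umount /srv_ec
rbd-nbd unmap /dev/nbd0
rbd-nbd map rbd_hdd/archiv-001-srv_ec
mount /dev/nbd0 /srv_ec
find /srv_ec -type f -name "*.sql" -exec gzip -v {} \;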



Re: [ceph-users] reproducable rbd-nbd crashes

2019-07-22 Thread Marc Schöchlin
Hi Mike,

On 22.07.19 at 16:48, Mike Christie wrote:
> On 07/22/2019 06:00 AM, Marc Schöchlin wrote:
>>> With older kernels no timeout would be set for each command by default,
>>> so if you were not running that tool then you would not see the nbd
>>> disconnect+io_errors+xfs issue. You would just see slow IOs.
>>>
>>> With newer kernels, like 4.15, nbd.ko always sets a per command timeout
>>> even if you do not set it via a nbd ioctl/netlink command. By default
>>> the timeout is 30 seconds. After the timeout period then the kernel does
>>> that disconnect+IO_errors error handling which causes xfs to get errors.
>>>
>> Did I get you correctly: setting an unlimited timeout should prevent crashes
>> on kernel 4.15?
> It looks like with newer kernels there is no way to turn it off.
>
> You can set it really high. There is no max check and so it depends on
> various calculations and what some C types can hold and how your kernel
> is compiled. You should be able to set the timer to an hour.

Okay, I already experimented with high timeouts (e.g. 600 seconds). As far as I
remember this led to a pretty unusable system if I put high amounts of I/O on
the EC volume.
This system also runs a krbd volume which saturates the system with ~30-60%
iowait - that volume never had a problem.

A commenter in https://tracker.ceph.com/issues/40822#change-141205
suggested that I reduce the rbd cache.
What do you think about that?

>
>> For testing purposes I set the timeout to unlimited ("nbd_set_ioctl
>> /dev/nbd0 0", on an already mounted device).
>> I re-executed the problem procedure and discovered that the
>> compression procedure does not crash at the same file, but 30 seconds
>> later, with the same crash behavior.
>>
> 0 will cause the default timeout of 30 secs to be used.

Okay, then the usage description of
https://github.com/OnApp/nbd-kernel_mod/blob/master/nbd_set_timeout.c does not
seem to be correct :-)

Regards
Marc



Re: [ceph-users] reproducable rbd-nbd crashes

2019-07-22 Thread Marc Schöchlin
Hi Mike,

On 22.07.19 at 17:01, Mike Christie wrote:
> On 07/19/2019 02:42 AM, Marc Schöchlin wrote:
>> We have ~500 heavy-load rbd-nbd devices in our Xen cluster (rbd-nbd 12.2.5,
>> kernel 4.4.0+10, CentOS clone) and ~20 high-load krbd devices (kernel
>> 4.15.0-45, Ubuntu 16.04) - we never experienced problems like this.
> For this setup, do you have 257 or more rbd-nbd devices running on a
> single system?
No, these rbd-nbd instances are distributed over more than a dozen Xen dom0
systems on our XenServers.
> If so then you are hitting another bug where newer kernels only support
> 256 devices. It looks like a regression was added when mq and netlink
> support was added upstream. You can create more than 256 devices, but
> some devices will not be able to execute any I/O. Commands sent to the
> rbd-nbd device are always going to time out and you will see the errors
> in your log.
>
> I am testing some patches for that right now.

From my point of view there is no limitation besides I/O from the Ceph cluster
perspective.

Regards
Marc



Re: [ceph-users] reproducable rbd-nbd crashes

2019-07-22 Thread Mike Christie
On 07/19/2019 02:42 AM, Marc Schöchlin wrote:
> We have ~500 heavy-load rbd-nbd devices in our Xen cluster (rbd-nbd 12.2.5,
> kernel 4.4.0+10, CentOS clone) and ~20 high-load krbd devices (kernel
> 4.15.0-45, Ubuntu 16.04) - we never experienced problems like this.

For this setup, do you have 257 or more rbd-nbd devices running on a
single system?

If so then you are hitting another bug where newer kernels only support
256 devices. It looks like a regression was added when mq and netlink
support was added upstream. You can create more than 256 devices, but
some devices will not be able to execute any I/O. Commands sent to the
rbd-nbd device are always going to time out and you will see the errors
in your log.

I am testing some patches for that right now.
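
For reference, a quick way to check how close a host is to that limit, based on
the "rbd nbd ls" output already shown in this thread (the tail skips the header
line):

# count the rbd-nbd devices mapped on this host
rbd nbd ls | tail -n +2 | wc -l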


Re: [ceph-users] reproducable rbd-nbd crashes

2019-07-22 Thread Mike Christie
On 07/22/2019 06:00 AM, Marc Schöchlin wrote:
>> With older kernels no timeout would be set for each command by default,
>> so if you were not running that tool then you would not see the nbd
>> disconnect+io_errors+xfs issue. You would just see slow IOs.
>>
>> With newer kernels, like 4.15, nbd.ko always sets a per command timeout
>> even if you do not set it via a nbd ioctl/netlink command. By default
>> the timeout is 30 seconds. After the timeout period then the kernel does
>> that disconnect+IO_errors error handling which causes xfs to get errors.
>>
> Did I get you correctly: setting an unlimited timeout should prevent crashes
> on kernel 4.15?

It looks like with newer kernels there is no way to turn it off.

You can set it really high. There is no max check and so it depends on
various calculations and what some C types can hold and how your kernel
is compiled. You should be able to set the timer to an hour.

> 
> For testing purposes I set the timeout to unlimited ("nbd_set_ioctl /dev/nbd0
> 0", on an already mounted device).
> I re-executed the problem procedure and discovered that the
> compression procedure does not crash at the same file, but 30 seconds
> later, with the same crash behavior.
> 

0 will cause the default timeout of 30 secs to be used.



Re: [ceph-users] reproducable rbd-nbd crashes

2019-07-22 Thread Marc Schöchlin
Hello Mike,

I have added my comments inline.

On 19.07.19 at 22:20, Mike Christie wrote:
>
>> We have ~500 heavy-load rbd-nbd devices in our Xen cluster (rbd-nbd 12.2.5,
>> kernel 4.4.0+10, CentOS clone) and ~20 high-load krbd devices (kernel
>> 4.15.0-45, Ubuntu 16.04) - we never experienced problems like this.
>> We only experience problems like this with rbd-nbd > 12.2.5 on Ubuntu 16.04
>> (kernel 4.15) or Ubuntu 18.04 (kernel 4.15), with or without erasure coding.
>>
> Are you only using the nbd_set_timeout tool for this newer kernel combo
> to try and workaround the disconnect+io_errors problem in newer kernels,
> or did you use that tool to set a timeout with older kernels? I am just
> trying to clarify the problem, because the kernel changed behavior and I
> am not sure if your issue is the very slow IO or that the kernel now
> escalates its error handler by default.
I only use nbd_set_timeout with the 4.15 kernels on Ubuntu 16.04 and 18.04
because we experienced problems with "fstrim" activities a few weeks ago.
Adding timeouts of 60 seconds seemed to help, but did not solve the problem
completely.

The problem situation described in my request is a different situation but
seems to stem from the same root cause.

Not using the nbd_set_timeout tool results in the same, but more prominent,
problem situations :-)
(tested by unloading the nbd module and re-executing the test)
>
> With older kernels no timeout would be set for each command by default,
> so if you were not running that tool then you would not see the nbd
> disconnect+io_errors+xfs issue. You would just see slow IOs.
>
> With newer kernels, like 4.15, nbd.ko always sets a per command timeout
> even if you do not set it via a nbd ioctl/netlink command. By default
> the timeout is 30 seconds. After the timeout period then the kernel does
> that disconnect+IO_errors error handling which causes xfs to get errors.
>
Did I get you correctly: setting an unlimited timeout should prevent crashes on
kernel 4.15?

For testing purposes I set the timeout to unlimited ("nbd_set_ioctl /dev/nbd0
0", on an already mounted device).
I re-executed the problem procedure and discovered that the
compression procedure does not crash at the same file, but 30 seconds
later, with the same crash behavior.

Regards
Marc




Re: [ceph-users] reproducable rbd-nbd crashes

2019-07-19 Thread Mike Christie
On 07/19/2019 02:42 AM, Marc Schöchlin wrote:
> Hello Jason,
> 
> On 18.07.19 at 20:10, Jason Dillaman wrote:
>> On Thu, Jul 18, 2019 at 1:47 PM Marc Schöchlin  wrote:
>>> Hello cephers,
>>>
>>> rbd-nbd crashes in a reproducible way here.
>> I don't see a crash report in the log below. Is it really crashing or
>> is it shutting down? If it is crashing and it's reproducible, can you
>> install the debuginfo packages, attach gdb, and get a full backtrace
>> of the crash?
> 
> I do not get a crash report of rbd-nbd.
> It seems that "rbd-nbd" just terminates, which crashes the XFS filesystem
> because the nbd device is not available anymore.
> ("rbd nbd ls" shows no mapped device anymore)
> 
>>
>> It seems like your cluster cannot keep up w/ the load and the nbd
>> kernel driver is timing out the IO and shutting down. There is a
>> "--timeout" option on "rbd-nbd" that you can use to increase the
>> kernel IO timeout for nbd.
>>
> I also have a 36 TB XFS (non-EC) volume on this virtual system, mapped via
> krbd, which is under really heavy read/write usage.
> I never experienced problems like this on this system with similar usage
> patterns.
>
> The volume which is involved in the problem only handles a really low load,
> and I was able to create the error situation by using the simple "find .
> -type f -name "*.sql" -exec ionice -c3 nice -n 20 gzip -v {} \;" command.
> I copied and read ~1.5 TB of data to this volume without a problem - it seems
> that the gzip command provokes an I/O pattern which leads to the error
> situation.
> 
> As described, I use a Luminous "12.2.11" client which does not support that
> "--timeout" option (btw, a backport would be nice).
> Our Ceph system runs with a heavy write load, therefore we already set a
> 60-second timeout using the following code:
> (https://github.com/OnApp/nbd-kernel_mod/blob/master/nbd_set_timeout.c)
> 
> We have ~500 heavy-load rbd-nbd devices in our Xen cluster (rbd-nbd 12.2.5,
> kernel 4.4.0+10, CentOS clone) and ~20 high-load krbd devices (kernel
> 4.15.0-45, Ubuntu 16.04) - we never experienced problems like this.
> We only experience problems like this with rbd-nbd > 12.2.5 on Ubuntu 16.04
> (kernel 4.15) or Ubuntu 18.04 (kernel 4.15), with or without erasure coding.
>

Are you only using the nbd_set_timeout tool for this newer kernel combo
to try and workaround the disconnect+io_errors problem in newer kernels,
or did you use that tool to set a timeout with older kernels? I am just
trying to clarify the problem, because the kernel changed behavior and I
am not sure if your issue is the very slow IO or that the kernel now
escalates its error handler by default.

With older kernels no timeout would be set for each command by default,
so if you were not running that tool then you would not see the nbd
disconnect+io_errors+xfs issue. You would just see slow IOs.

With newer kernels, like 4.15, nbd.ko always sets a per command timeout
even if you do not set it via a nbd ioctl/netlink command. By default
the timeout is 30 seconds. After the timeout period then the kernel does
that disconnect+IO_errors error handling which causes xfs to get errors.


Re: [ceph-users] reproducable rbd-nbd crashes

2019-07-19 Thread Marc Schöchlin
Hello Jason,

On 18.07.19 at 20:10, Jason Dillaman wrote:
> On Thu, Jul 18, 2019 at 1:47 PM Marc Schöchlin  wrote:
>> Hello cephers,
>>
>> rbd-nbd crashes in a reproducible way here.
> I don't see a crash report in the log below. Is it really crashing or
> is it shutting down? If it is crashing and it's reproducible, can you
> install the debuginfo packages, attach gdb, and get a full backtrace
> of the crash?

I do not get a crash report of rbd-nbd.
It seems that "rbd-nbd" just terminates, which crashes the XFS filesystem because
the nbd device is not available anymore.
("rbd nbd ls" shows no mapped device anymore)

>
> It seems like your cluster cannot keep up w/ the load and the nbd
> kernel driver is timing out the IO and shutting down. There is a
> "--timeout" option on "rbd-nbd" that you can use to increase the
> kernel IO timeout for nbd.
>
I also have a 36 TB XFS (non-EC) volume on this virtual system, mapped via krbd,
which is under really heavy read/write usage.
I never experienced problems like this on this system with similar usage
patterns.

The volume which is involved in the problem only handles a really low load, and
I was able to create the error situation by using the simple "find . -type f
-name "*.sql" -exec ionice -c3 nice -n 20 gzip -v {} \;" command.
I copied and read ~1.5 TB of data to this volume without a problem - it seems
that the gzip command provokes an I/O pattern which leads to the error situation.

As described, I use a Luminous "12.2.11" client which does not support that
"--timeout" option (btw, a backport would be nice).
Our Ceph system runs with a heavy write load, therefore we already set a
60-second timeout using the following code:
(https://github.com/OnApp/nbd-kernel_mod/blob/master/nbd_set_timeout.c)

We have ~500 heavy-load rbd-nbd devices in our Xen cluster (rbd-nbd 12.2.5,
kernel 4.4.0+10, CentOS clone) and ~20 high-load krbd devices (kernel
4.15.0-45, Ubuntu 16.04) - we never experienced problems like this.
We only experience problems like this with rbd-nbd > 12.2.5 on Ubuntu 16.04
(kernel 4.15) or Ubuntu 18.04 (kernel 4.15), with or without erasure coding.

Regards
Marc




Re: [ceph-users] reproducable rbd-nbd crashes

2019-07-18 Thread Jason Dillaman
On Thu, Jul 18, 2019 at 1:47 PM Marc Schöchlin  wrote:
>
> Hello cephers,
>
> rbd-nbd crashes in a reproducible way here.

I don't see a crash report in the log below. Is it really crashing or
is it shutting down? If it is crashing and it's reproducible, can you
install the debuginfo packages, attach gdb, and get a full backtrace
of the crash?
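
A sketch of collecting such a backtrace (assuming a single rbd-nbd process on
the host; the debuginfo/dbg package names differ per distribution):

# attach to the running rbd-nbd process and dump all thread backtraces
gdb -batch -ex "thread apply all bt" -p "$(pidof rbd-nbd)"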

It seems like your cluster cannot keep up w/ the load and the nbd
kernel driver is timing out the IO and shutting down. There is a
"--timeout" option on "rbd-nbd" that you can use to increase the
kernel IO timeout for nbd.
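
For example (a sketch; the value is in seconds, it requires an rbd-nbd build
that ships the --timeout option, and the pool/image names are taken from the
environment shown earlier in this thread):

# map with a 10-minute kernel IO timeout instead of the default 30 seconds
rbd-nbd map --timeout 600 rbd_hdd/archiv-001-srv_ec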

> I created the following bug report: https://tracker.ceph.com/issues/40822
>
> Do you also experience this problem?
> Do you have suggestions for in depth debug data collection?
>
> I invoke the following command on a freshly mapped rbd and rbd-nbd crashes:
>
> # find . -type f -name "*.sql" -exec ionice -c3 nice -n 20 gzip -v {} \;
> gzip: ./deprecated_data/data_archive.done/entry_search_201232.sql.gz already 
> exists; do you wish to overwrite (y or n)? y
> ./deprecated_data/data_archive.done/entry_search_201232.sql: 84.1% -- 
> replaced with ./deprecated_data/data_archive.done/entry_search_201232.sql.gz
> ./deprecated_data/data_archive.done/entry_search_201233.sql:
> gzip: ./deprecated_data/data_archive.done/entry_search_201233.sql: 
> Input/output error
> gzip: ./deprecated_data/data_archive.done/entry_search_201234.sql: 
> Input/output error
> gzip: ./deprecated_data/data_archive.done/entry_search_201235.sql: 
> Input/output error
> gzip: ./deprecated_data/data_archive.done/entry_search_201236.sql: 
> Input/output error
> 
>
> dmesg output:
>
> [579763.020890] block nbd0: Connection timed out
> [579763.020926] block nbd0: shutting down sockets
> [579763.020943] print_req_error: I/O error, dev nbd0, sector 3221296950
> [579763.020946] block nbd0: Receive data failed (result -32)
> [579763.020952] print_req_error: I/O error, dev nbd0, sector 4523172248
> [579763.021001] XFS (nbd0): metadata I/O error: block 0xc0011736 
> ("xlog_iodone") error 5 numblks 512
> [579763.021031] XFS (nbd0): xfs_do_force_shutdown(0x2) called from line 1261 
> of file /build/linux-hwe-xJVMkx/linux-hwe-4.15.0/fs/xfs/xfs_log.c.  Return 
> address = 0x918af758
> [579763.021046] print_req_error: I/O error, dev nbd0, sector 4523172248
> [579763.021161] XFS (nbd0): Log I/O Error Detected.  Shutting down filesystem
> [579763.021176] XFS (nbd0): Please umount the filesystem and rectify the 
> problem(s)
> [579763.176834] print_req_error: I/O error, dev nbd0, sector 3221296969
> [579763.176856] print_req_error: I/O error, dev nbd0, sector 2195113096
> [579763.176869] XFS (nbd0): metadata I/O error: block 0xc0011749 
> ("xlog_iodone") error 5 numblks 512
> [579763.176884] XFS (nbd0): xfs_do_force_shutdown(0x2) called from line 1261 
> of file /build/linux-hwe-xJVMkx/linux-hwe-4.15.0/fs/xfs/xfs_log.c.  Return 
> address = 0x918af758
> [579763.252836] print_req_error: I/O error, dev nbd0, sector 2195113352
> [579763.252859] print_req_error: I/O error, dev nbd0, sector 2195113608
> [579763.252869] print_req_error: I/O error, dev nbd0, sector 2195113864
> [579763.356841] print_req_error: I/O error, dev nbd0, sector 2195114120
> [579763.356885] print_req_error: I/O error, dev nbd0, sector 2195114376
> [579763.358040] XFS (nbd0): writeback error on sector 2195119688
> [579763.916813] block nbd0: Connection timed out
> [579768.140839] block nbd0: Connection timed out
> [579768.140859] print_req_error: 21 callbacks suppressed
> [579768.140860] print_req_error: I/O error, dev nbd0, sector 2195112840
> [579768.141101] XFS (nbd0): writeback error on sector 2195115592
>
> /var/log/ceph/ceph-client.archiv.log
>
> 2019-07-18 14:52:55.387815 7fffcf7fe700  1 -- 10.23.27.200:0/3920476044 --> 
> 10.23.27.151:6806/2322641 -- osd_op(unknown.0.0:1853 34.132 
> 34:4cb446f4:::rbd_header.6c73776b8b4567:head [watch unwatch cookie 
> 140736414969824] snapc 0=[] ondisk+write+known_if_redirected e256219) v8 -- 
> 0x7fffc803a340 con 0
> 2019-07-18 14:52:55.388656 7fffe913b700  1 -- 10.23.27.200:0/3920476044 <== 
> osd.17 10.23.27.151:6806/2322641 90  watch-notify(notify (1) cookie 
> 140736414969824 notify 1100452225614816 ret 0) v3  68+0+0 (1852866777 0 
> 0) 0x7fffe187b4c0 con 0x7fffc00054d0
> 2019-07-18 14:52:55.388738 7fffe913b700  1 -- 10.23.27.200:0/3920476044 <== 
> osd.17 10.23.27.151:6806/2322641 91  osd_op_reply(1852 
> rbd_header.6c73776b8b4567 [notify cookie 140736550101040] v0'0 uv2102967 
> ondisk = 0) v8  169+0+8 (3077247585 0 3199212159) 0x7fffe0002ef0 con 
> 0x7fffc00054d0
> 2019-07-18 14:52:55.388815 7fffc700  5 librbd::Watcher: 0x7fffc0001010 
> notifications_blocked: blocked=1
> 2019-07-18 14:52:55.388904 7fffc700  1 -- 10.23.27.200:0/3920476044 --> 
> 10.23.27.151:6806/2322641 -- osd_op(unknown.0.0:1854 34.132 
> 34:4cb446f4:::rbd_header.6c73776b8b4567:head [notify-ack cookie 0] snapc 0=[] 
> ondisk+read+known_if_redirected