... looks like this watch "timeout" was introduced in the Kraken
release [1], so if you don't see this issue with a Jewel cluster, I
suspect that's the cause.

[1] https://github.com/ceph/ceph/pull/11378
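
For reference, the effective OSD-side setting can be checked on the OSD
host itself -- a rough sketch (the osd.0 id and the admin socket access
are illustrative):

sudo ceph daemon osd.0 config get osd_client_watch_timeout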

On Wed, Dec 20, 2017 at 12:53 PM, Jason Dillaman <jdill...@redhat.com> wrote:
> The OSDs will optionally take a "timeout" parameter on the watch
> request [1][2]. However, the kernel doesn't have this timeout field in
> its watch op [3] so perhaps it's defaulting to a random value.
>
> Ilya?
>
> [1] https://github.com/ceph/ceph/blob/v12.2.2/src/osd/PrimaryLogPG.cc#L6034
> [2] https://github.com/ceph/ceph/blob/v12.2.2/src/include/rados.h#L577
> [3] 
> https://github.com/torvalds/linux/blob/v4.14/include/linux/ceph/rados.h#L487
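>
> One way to confirm what the kernel client's watch actually ends up with
> would be to poll the watcher list with timestamps until it drops -- a
> rough sketch, pool/image names taken from the report further down:
>
> # map the image, hard-reset the client VM, then on a cluster node:
> while rbd status raw-volume --pool kubernetes | grep -q 'watcher='; do
>     date; sleep 30
> done; date   # the last two timestamps bracket the actual expiry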
>
>
> On Wed, Dec 20, 2017 at 12:20 PM, Serguei Bezverkhi (sbezverk)
> <sbezv...@cisco.com> wrote:
>> It took 30 minutes for the watcher to time out after an ungraceful
>> restart. Is there a way to limit it to something a bit more reasonable,
>> like 1-3 minutes?
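>>
>> Also, would blacklisting the stale client be a reasonable interim
>> workaround? Something along these lines, with the address taken from
>> "rbd status" (assuming blacklisting actually evicts the watch):
>>
>> sudo ceph osd blacklist add 192.168.80.235:0/3789045165
>> # and once the node is back and healthy:
>> sudo ceph osd blacklist rm 192.168.80.235:0/3789045165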
>>
>> On 2017-12-20, 12:01 PM, "Serguei Bezverkhi (sbezverk)" <sbezv...@cisco.com> 
>> wrote:
>>
>>     OK, here is what I found out. If I gracefully kill a pod then the
>>     watcher gets properly cleared, but if it is done ungracefully, without
>>     “rbd unmap”, then even after a node reboot the watcher stays up for a
>>     long time. It has been more than 20 minutes and it is still active,
>>     even though no kubernetes services are running.
>>
>>     I was wondering if you would accept the following solution: in
>>     rbdStatus, instead of checking just for a watcher, we also check for
>>     the existence of /dev/rbd/{pool}/{image}. If it is not there, it would
>>     mean the watcher is stale and it is safe to map the image (rough
>>     sketch below). I would appreciate your thoughts here.
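>>
>>     Something along these lines -- a sketch only, assuming the
>>     udev-created /dev/rbd/<pool>/<image> symlink exists exactly when the
>>     image is mapped on this host:
>>
>>     POOL=kubernetes; IMAGE=raw-volume
>>     if rbd status "$IMAGE" --pool "$POOL" | grep -q 'watcher=' \
>>            && [ ! -e "/dev/rbd/$POOL/$IMAGE" ]; then
>>         # a watcher is reported but nothing is mapped locally,
>>         # so treat it as stale and proceed with rbd map
>>         echo "stale watcher on $POOL/$IMAGE"
>>     fi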
>>
>>     Thank you
>>     Serguei
>>
>>     On 2017-12-20, 11:32 AM, "Serguei Bezverkhi (sbezverk)" 
>> <sbezv...@cisco.com> wrote:
>>
>>
>>
>>         On 2017-12-20, 11:17 AM, "Jason Dillaman" <jdill...@redhat.com> 
>> wrote:
>>
>>             On Wed, Dec 20, 2017 at 11:01 AM, Serguei Bezverkhi (sbezverk)
>>             <sbezv...@cisco.com> wrote:
>>             > Hello Jason, thank you for your prompt reply.
>>             >
>>             > My setup is very simple: I have one CentOS 7.4 VM acting as
>>             > a storage node, running the latest 12.2.2 Luminous, and a
>>             > second VM, Ubuntu 16.04.3 (192.168.80.235), where I run a
>>             > local kubernetes cluster built from master.
>>             >
>>             > On the client side I have ceph-common installed, and I
>>             > copied the config and keyrings from the storage node to
>>             > /etc/ceph.
>>             >
>>             > While running my PR I noticed that rbd map was failing on a
>>             > just-rebooted VM because rbdStatus was finding an active
>>             > watcher. Even adding a 30-second delay did not help, as the
>>             > watcher was not timing out at all, even with no image mapped.
>>
>>             OK -- but how did you get two different watchers listed? That
>>             implies the first one timed out at some point in time. Does
>>             the watcher eventually go away if you shut down all k8s
>>             processes on
>>
>>         I cannot answer why there are two different watchers; I was just
>>         capturing info and was not aware of it until you pointed it out.
>>         I just checked the VM and the watcher finally timed out. I cannot
>>         say how long it took, but I will run another set of tests to find
>>         out.
>>
>>             192.168.80.235?  Are you overriding the
>>             "osd_client_watch_timeout" configuration setting somewhere on
>>             the OSD host?
>>
>>         No, no changes to default values were done.
>>
>>             > As per your format 1 comment: I tried using format v2 and
>>             > it was failing to map due to differences in capabilities;
>>             > following rootfs' suggestion I switched back to v1. Once the
>>             > watcher issue is resolved I can switch back to v2 to show
>>             > the exact issue I hit.
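>>             >
>>             > For reference, when I retry v2 I plan to create the image
>>             > with only the layering feature, which the kernel client
>>             > supports -- a sketch, exact feature list still to be
>>             > confirmed:
>>             >
>>             > rbd create raw-volume --pool kubernetes --size 10240 --image-format 2 --image-feature layering
>>             >
>>             > or to strip the extra features from an existing v2 image:
>>             >
>>             > rbd feature disable kubernetes/raw-volume exclusive-lock object-map fast-diff deep-flatten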
>>             >
>>             > Please let me know if you need any additional info.
>>             >
>>             > Thank you
>>             > Serguei
>>             >
>>             > On 2017-12-20, 10:39 AM, "Jason Dillaman" 
>> <jdill...@redhat.com> wrote:
>>             >
>>             >     Can you please provide steps to repeat this scenario?
>>             >     What is/was the client running on the host at
>>             >     192.168.80.235 and how did you shut down that client?
>>             >     In your PR [1], it showed a different client as a
>>             >     watcher ("192.168.80.235:0/34739158 client.64354
>>             >     cookie=1"), so how did the previous entry get cleaned
>>             >     up?
>>             >
>>             >     BTW -- unrelated, but k8s should be creating RBD image
>>             >     format 2 images [2]. Was that image created using an
>>             >     older version of k8s or did you override your settings
>>             >     to pick the deprecated v1 format?
>>             >
>>             >     [1] https://github.com/kubernetes/kubernetes/pull/56651#issuecomment-352850884
>>             >     [2] https://github.com/kubernetes/kubernetes/pull/51574
>>             >
>>             >     On Wed, Dec 20, 2017 at 10:24 AM, Serguei Bezverkhi
>>             >     (sbezverk) <sbezv...@cisco.com> wrote:
>>             >     > Hello,
>>             >     >
>>             >     > I hit an issue with the latest Luminous where a
>>             >     > watcher does not time out even though the image is not
>>             >     > mapped. It seems something similar was reported in
>>             >     > 2016; here is the link:
>>             >     > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-August/012140.html
>>             >     > Has it been fixed? I would appreciate some help here.
>>             >     > Thank you
>>             >     > Serguei
>>             >     >
>>             >     > date; sudo rbd status raw-volume --pool kubernetes
>>             >     > Wed Dec 20 10:04:19 EST 2017
>>             >     > Watchers:
>>             >     >         watcher=192.168.80.235:0/3789045165 client.64439 
>> cookie=1
>>             >     > date; sudo rbd status raw-volume --pool kubernetes
>>             >     > Wed Dec 20 10:04:51 EST 2017
>>             >     > Watchers:
>>             >     >         watcher=192.168.80.235:0/3789045165 client.64439 
>> cookie=1
>>             >     > date; sudo rbd status raw-volume --pool kubernetes
>>             >     > Wed Dec 20 10:05:14 EST 2017
>>             >     > Watchers:
>>             >     >         watcher=192.168.80.235:0/3789045165 client.64439 
>> cookie=1
>>             >     >
>>             >     > date; sudo rbd status raw-volume --pool kubernetes
>>             >     > Wed Dec 20 10:07:24 EST 2017
>>             >     > Watchers:
>>             >     >         watcher=192.168.80.235:0/3789045165 client.64439 
>> cookie=1
>>             >     >
>>             >     > sudo ls /dev/rbd*
>>             >     > ls: cannot access '/dev/rbd*': No such file or directory
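>>             >     >
>>             >     > # rbd showmapped should confirm the same, i.e. no
>>             >     > # kernel mappings on this host:
>>             >     > sudo rbd showmapped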
>>             >     >
>>             >     > sudo rbd info raw-volume --pool kubernetes
>>             >     > rbd image 'raw-volume':
>>             >     >         size 10240 MB in 2560 objects
>>             >     >         order 22 (4096 kB objects)
>>             >     >         block_name_prefix: rb.0.fafa.625558ec
>>             >     >         format: 1
>>             >     >
>>             >     >
>>             >     >
>>             >
>>             >
>>             >
>>             >     --
>>             >     Jason
>>             >
>>             >
>>
>>
>>
>>             --
>>             Jason
>>
>>
>>
>>
>>
>>
>
>
>
> --
> Jason



-- 
Jason
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
