That's really good info, thanks for tracking that down. Do you expect this to be a common configuration going forward in Ceph deployments?
Joe

> On Sep 28, 2015, at 3:43 AM, Somnath Roy <somnath....@sandisk.com> wrote:
>
> Xiaoxi,
> Thanks for giving me some pointers.
> Now, with the help of strace, I am able to figure out why it is taking
> so long in my setup to complete the blkid* calls.
> In my case, the partitions are showing up properly even when the disk
> is connected to a JBOD controller.
>
> root@emsnode10:~/wip-write-path-optimization/src/os# strace -t -o /root/strace_blkid.txt blkid
> /dev/sda1: UUID="d2060642-1af4-424f-9957-6a8dc77ff301" TYPE="ext4"
> /dev/sda5: UUID="2a987cc0-e3cd-43d4-99cd-b8d8e58617e7" TYPE="swap"
> /dev/sdy2: UUID="0ebd1631-52e7-4dc2-8bff-07102b877bfc" TYPE="xfs"
> /dev/sdw2: UUID="29f1203b-6f44-45e3-8f6a-8ad1d392a208" TYPE="xfs"
> /dev/sdt2: UUID="94f6bb55-ac61-499c-8552-600581e13dfa" TYPE="xfs"
> /dev/sdr2: UUID="b629710e-915d-4c56-b6a5-4782e6d6215d" TYPE="xfs"
> /dev/sdv2: UUID="69623b7f-9036-4a35-8298-dc7f5cecdb21" TYPE="xfs"
> /dev/sds2: UUID="75d941c5-a85c-4c37-b409-02de34483314" TYPE="xfs"
> /dev/sdx: UUID="cc84bc66-208b-4387-8470-071ec71532f2" TYPE="xfs"
> /dev/sdu2: UUID="c9817831-8362-48a9-9a6c-920e0f04d029" TYPE="xfs"
>
> But it is taking time on the drives that are not reserved for this
> host. Basically, I am using two heads in front of a JBOF, and I am
> using sg_persist to split the drives between the two hosts.
> Here is the strace output of blkid:
>
> http://pastebin.com/qz2Z7Phj
>
> You can see a lot of input/output errors when it accesses the drives
> that are not reserved for this host.
>
> This looks like an inefficiency in the blkid* calls(?), since calls
> like fdisk/lsscsi are not taking this long.
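>
> For reference, a quick way to see which drives the other head holds
> (just a sketch; the device name is only an example) is to query the
> persistent reservation state directly:
>
>   sg_persist --in -k /dev/sdc   # list registered keys
>   sg_persist --in -r /dev/sdc   # show the current reservation, if any
>
> The drives blkid stalls on should be exactly the ones reserved by the
> other host.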
>
> Regards
> Somnath
>
> -----Original Message-----
> From: Chen, Xiaoxi [mailto:xiaoxi.c...@intel.com]
> Sent: Monday, September 28, 2015 1:02 AM
> To: Somnath Roy; Podoski, Igor
> Cc: Samuel Just; Samuel Just (sam.j...@inktank.com); ceph-devel; Sage Weil; Handzik, Joe
> Subject: RE: Very slow recovery/peering with latest master
>
> FWIW, blkid works well with both GPT (created by parted) and MSDOS
> (created by fdisk) partition tables in my environment.
>
> But blkid doesn't show any information for the disks in the external
> bay (which are connected through a JBOD controller) in my setup.
>
> See below: sdb and sdh are SSDs attached to the front panel, but the
> rest of the OSD disks (0-9) are from an external bay.
>
> /dev/sdc  976285652 294887592 681398060 31% /var/lib/ceph/mnt/osd-device-0-data
> /dev/sdd  976285652 269840116 706445536 28% /var/lib/ceph/mnt/osd-device-1-data
> /dev/sde  976285652 257610832 718674820 27% /var/lib/ceph/mnt/osd-device-2-data
> /dev/sdf  976285652 293460620 682825032 31% /var/lib/ceph/mnt/osd-device-3-data
> /dev/sdg  976285652 294444100 681841552 31% /var/lib/ceph/mnt/osd-device-4-data
> /dev/sdi  976285652 288416840 687868812 30% /var/lib/ceph/mnt/osd-device-5-data
> /dev/sdj  976285652 273090960 703194692 28% /var/lib/ceph/mnt/osd-device-6-data
> /dev/sdk  976285652 302720828 673564824 32% /var/lib/ceph/mnt/osd-device-7-data
> /dev/sdl  976285652 268207968 708077684 28% /var/lib/ceph/mnt/osd-device-8-data
> /dev/sdm  976285652 293316752 682968900 31% /var/lib/ceph/mnt/osd-device-9-data
> /dev/sdb1 292824376  10629024 282195352  4% /var/lib/ceph/mnt/osd-device-40-data
> /dev/sdh1 292824376  11413956 281410420  4% /var/lib/ceph/mnt/osd-device-41-data
>
> root@osd1:~# blkid
> /dev/sdb1: UUID="907806fe-1d29-4ef7-ad11-5a933a11601e" TYPE="xfs"
> /dev/sdh1: UUID="9dfe68ac-f297-4a02-8d21-50c194af4ff2" TYPE="xfs"
> /dev/sda1: UUID="cdf945ce-a345-4766-b89e-cecc33689016" TYPE="ext4"
> /dev/sda2: UUID="7a565029-deb9-4e68-835c-f097c2b1514e" TYPE="ext4"
> /dev/sda5: UUID="e61bfc35-932d-442f-a5ca-795897f62744" TYPE="swap"
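>
> One quick check here (untested sketch; sdc is just an example device)
> would be to bypass the cache and probe a superblock directly, to tell
> whether the probe itself or stale cache data is what loses the
> external-bay disks:
>
>   blkid -p /dev/sdc    # low-level superblock probe, skips the cache
>   blkid -c /dev/null   # full scan starting from an empty cache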
>
>> -----Original Message-----
>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Somnath Roy
>> Sent: Friday, September 25, 2015 12:09 AM
>> To: Podoski, Igor
>> Cc: Samuel Just; Samuel Just (sam.j...@inktank.com); ceph-devel; Sage Weil; Handzik, Joe
>> Subject: RE: Very slow recovery/peering with latest master
>>
>> Yeah, Igor, maybe..
>> Meanwhile, I was able to get a gdb trace of the hang..
>>
>> (gdb) bt
>> #0  0x00007f6f6bf043bd in read () at ../sysdeps/unix/syscall-template.S:81
>> #1  0x00007f6f6af3b066 in ?? () from /lib/x86_64-linux-gnu/libblkid.so.1
>> #2  0x00007f6f6af43ae2 in ?? () from /lib/x86_64-linux-gnu/libblkid.so.1
>> #3  0x00007f6f6af42788 in ?? () from /lib/x86_64-linux-gnu/libblkid.so.1
>> #4  0x00007f6f6af42a53 in ?? () from /lib/x86_64-linux-gnu/libblkid.so.1
>> #5  0x00007f6f6af3c17b in blkid_do_safeprobe () from /lib/x86_64-linux-gnu/libblkid.so.1
>> #6  0x00007f6f6af3e0c4 in blkid_verify () from /lib/x86_64-linux-gnu/libblkid.so.1
>> #7  0x00007f6f6af387fb in blkid_get_dev () from /lib/x86_64-linux-gnu/libblkid.so.1
>> #8  0x00007f6f6af38acb in ?? () from /lib/x86_64-linux-gnu/libblkid.so.1
>> #9  0x00007f6f6af3946d in ?? () from /lib/x86_64-linux-gnu/libblkid.so.1
>> #10 0x00007f6f6af39892 in blkid_probe_all_new () from /lib/x86_64-linux-gnu/libblkid.so.1
>> #11 0x00007f6f6af3dc10 in blkid_find_dev_with_tag () from /lib/x86_64-linux-gnu/libblkid.so.1
>> #12 0x00007f6f6d3bf923 in get_device_by_uuid (dev_uuid=..., label=label@entry=0x7f6f6d535fe5 "PARTUUID", partition=partition@entry=0x7f6f347eb5a0 "", device=device@entry=0x7f6f347ec5a0 "") at common/blkdev.cc:193
>> #13 0x00007f6f6d147de5 in FileStore::collect_metadata (this=0x7f6f68893000, pm=0x7f6f21419598) at os/FileStore.cc:660
>> #14 0x00007f6f6cebfa9a in OSD::_collect_metadata (this=this@entry=0x7f6f6894f000, pm=pm@entry=0x7f6f21419598) at osd/OSD.cc:4586
>> #15 0x00007f6f6cec0614 in OSD::_send_boot (this=this@entry=0x7f6f6894f000) at osd/OSD.cc:4568
>> #16 0x00007f6f6cec203a in OSD::_maybe_boot (this=0x7f6f6894f000, oldest=1, newest=100) at osd/OSD.cc:4463
>> #17 0x00007f6f6cefc5e1 in Context::complete (this=0x7f6f3d3864e0, r=<optimized out>) at ./include/Context.h:64
>> #18 0x00007f6f6d2eed08 in Finisher::finisher_thread_entry (this=0x7ffee7272d70) at common/Finisher.cc:65
>> #19 0x00007f6f6befd182 in start_thread (arg=0x7f6f347ee700) at pthread_create.c:312
>> #20 0x00007f6f6a24347d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
>>
>> strace was not much help, since the other threads are not blocked and
>> keep printing futex traces..
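>>
>> To narrow it down outside the OSD, something like this (rough sketch)
>> times a cache-bypassing probe of each device separately, which should
>> make the slow ones obvious:
>>
>>   for d in /dev/sd?; do
>>       echo "== $d"
>>       time blkid -p "$d"
>>   done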
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: Podoski, Igor [mailto:igor.podo...@ts.fujitsu.com]
>> Sent: Wednesday, September 23, 2015 11:33 PM
>> To: Somnath Roy
>> Cc: Samuel Just; Samuel Just (sam.j...@inktank.com); ceph-devel; Sage Weil; Handzik, Joe
>> Subject: RE: Very slow recovery/peering with latest master
>>
>>> -----Original Message-----
>>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
>>> Sent: Thursday, September 24, 2015 3:32 AM
>>> To: Handzik, Joe
>>> Cc: Somnath Roy; Samuel Just; Samuel Just (sam.j...@inktank.com); ceph-devel
>>> Subject: Re: Very slow recovery/peering with latest master
>>>
>>>> On Wed, 23 Sep 2015, Handzik, Joe wrote:
>>>> Ok. When configuring with ceph-disk, it does something nifty and
>>>> actually gives the OSD the uuid of the disk's partition as its fsid.
>>>> I bootstrap off that to get an argument to pass into the function
>>>> you have identified as the bottleneck. I ran it by Sage and we both
>>>> realized there would be cases where it wouldn't work... I'm sure
>>>> neither of us realized the failure would take three minutes, though.
>>>>
>>>> In the short term, it makes sense to create an option to disable or
>>>> short-circuit the blkid code. I would prefer that the default be
>>>> left with the code enabled, but I'm open to default-disabled if
>>>> others think this will be a widespread problem. You could also make
>>>> sure your OSD fsids are set to match your disk partition uuids for
>>>> now, if that's a faster workaround for you (it'll get rid of the
>>>> failure).
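>>>>
>>>> Comparing the two is something like this (a sketch; the paths are
>>>> just the ceph-disk defaults and a newer blkid, so adjust for your
>>>> layout):
>>>>
>>>>   blkid -s PARTUUID -o value /dev/sdb1   # the partition's uuid
>>>>   cat /var/lib/ceph/osd/ceph-0/fsid      # the OSD's fsid
>>>>
>>>> On a ceph-disk-provisioned OSD those two should be identical.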
>>>
>>> I think we should try to figure out where it is hanging. Can you
>>> strace the blkid process to see what it is up to?
>>>
>>> I opened http://tracker.ceph.com/issues/13219
>>>
>>> I think as long as it behaves reliably with ceph-disk OSDs then we
>>> can have it on by default.
>>>
>>> sage
>>>
>>>>
>>>> Joe
>>>>
>>>>> On Sep 23, 2015, at 6:26 PM, Somnath Roy <somnath....@sandisk.com> wrote:
>>>>>
>>>>> <<inline
>>>>>
>>>>> -----Original Message-----
>>>>> From: Handzik, Joe [mailto:joseph.t.hand...@hpe.com]
>>>>> Sent: Wednesday, September 23, 2015 4:20 PM
>>>>> To: Samuel Just
>>>>> Cc: Somnath Roy; Samuel Just (sam.j...@inktank.com); Sage Weil (s...@newdream.net); ceph-devel
>>>>> Subject: Re: Very slow recovery/peering with latest master
>>>>>
>>>>> I added that; there is code up the stack in calamari that consumes
>>>>> the path provided, which is intended in the future to facilitate
>>>>> disk monitoring and management.
>>>>>
>>>>> [Somnath] Ok
>>>>>
>>>>> Somnath, what does your disk configuration look like (filesystem,
>>>>> SSD/HDD, anything else you think could be relevant)? Did you
>>>>> configure your disks with ceph-disk, or by hand? I never saw this
>>>>> while testing my code; has anyone else heard of this behavior on
>>>>> master? The code has been in master for 2-3 months now, I believe.
>>>>>
>>>>> [Somnath] All SSD. I use mkcephfs to create the cluster; I
>>>>> partitioned the disks with fdisk beforehand. I am using XFS. Are
>>>>> you trying with the Ubuntu 3.16.* kernel? It could be Linux
>>>>> distribution/kernel specific.
>>
>> Somnath, maybe it is GPT related -- what partition table do you have?
>> I think parted and gdisk can create GPT partitions, but not fdisk
>> (definitely not in the version that I use).
>>
>> You could back up and clear the blkid cache /etc/blkid/blkid.tab;
>> maybe there is a mess in it.
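>>
>> Something like this (sketch):
>>
>>   cp /etc/blkid/blkid.tab /root/blkid.tab.bak   # keep a backup
>>   rm /etc/blkid/blkid.tab                       # next blkid run rebuilds it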
>>
>> Regards,
>> Igor.
>>
>>>>>
>>>>> It would be nice to not need to disable this, but if this behavior
>>>>> exists and can't be explained by a misconfiguration or something
>>>>> else, I'll need to figure out a different implementation.
>>>>>
>>>>> Joe
>>>>>
>>>>>> On Sep 23, 2015, at 6:07 PM, Samuel Just <sj...@redhat.com> wrote:
>>>>>>
>>>>>> Wow. Why would that take so long? I think you are correct that
>>>>>> it's only used for metadata; we could just add a config value to
>>>>>> disable it.
>>>>>> -Sam
>>>>>>
>>>>>>> On Wed, Sep 23, 2015 at 3:48 PM, Somnath Roy <somnath....@sandisk.com> wrote:
>>>>>>> Sam/Sage,
>>>>>>> I debugged it down and found out that the get_device_by_uuid ->
>>>>>>> blkid_find_dev_with_tag() call within
>>>>>>> FileStore::collect_metadata() is hanging for ~3 mins before
>>>>>>> returning EINVAL. I saw this portion was newly added after
>>>>>>> hammer.
>>>>>>> Commenting it out resolves the issue. BTW, I saw this value is
>>>>>>> stored as metadata but not used anywhere; am I missing anything?
>>>>>>> Here are my Linux details..
>>>>>>>
>>>>>>> root@emsnode5:~/wip-write-path-optimization/src# uname -a
>>>>>>> Linux emsnode5 3.16.0-38-generic #52~14.04.1-Ubuntu SMP Fri May 8 09:43:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>>>>>>>
>>>>>>> root@emsnode5:~/wip-write-path-optimization/src# lsb_release -a
>>>>>>> No LSB modules are available.
>>>>>>> Distributor ID: Ubuntu
>>>>>>> Description:    Ubuntu 14.04.2 LTS
>>>>>>> Release:        14.04
>>>>>>> Codename:       trusty
>>>>>>>
>>>>>>> Thanks & Regards
>>>>>>> Somnath
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Somnath Roy
>>>>>>> Sent: Wednesday, September 16, 2015 2:20 PM
>>>>>>> To: 'Gregory Farnum'
>>>>>>> Cc: 'ceph-devel'
>>>>>>> Subject: RE: Very slow recovery/peering with latest master
>>>>>>>
>>>>>>> Sage/Greg,
>>>>>>>
>>>>>>> Yeah, as we expected, it is probably not happening because of the
>>>>>>> recovery settings. I reverted them in my ceph.conf, but I am
>>>>>>> still seeing this problem.
>>>>>>>
>>>>>>> Some observations:
>>>>>>> ----------------------
>>>>>>>
>>>>>>> 1. First of all, I don't think it is related to my environment. I
>>>>>>> recreated the cluster with Hammer and the problem is not there.
>>>>>>>
>>>>>>> 2. I have enabled the messenger/monclient log (couldn't attach it
>>>>>>> here) on one of the OSDs and found the monitor is taking a long
>>>>>>> time to detect the up OSDs. In the log, I started the OSD at
>>>>>>> 2015-09-16 16:13:07.042463, but there is no communication (only
>>>>>>> KEEP_ALIVE) until 2015-09-16 16:16:07.180482 -- so, 3 mins!!
>>>>>>>
>>>>>>> 3. During this period, I saw the monclient trying to communicate
>>>>>>> with the monitor but probably not getting through. It is sending
>>>>>>> osd_boot only at 2015-09-16 16:16:07.180482..
>>>>>>>
>>>>>>> 2015-09-16 16:16:07.180450 7f65377fe700 10 monclient: _send_mon_message to mon.a at 10.60.194.10:6789/0
>>>>>>> 2015-09-16 16:16:07.180482 7f65377fe700  1 -- 10.60.194.10:6820/20102 --> 10.60.194.10:6789/0 -- osd_boot(osd.10 booted 0 features 72057594037927935 v45) v6 -- ?+0 0x7f6523c19100 con 0x7f6542045680
>>>>>>> 2015-09-16 16:16:07.180496 7f65377fe700 20 -- 10.60.194.10:6820/20102 submit_message osd_boot(osd.10 booted 0 features 72057594037927935 v45) v6 remote, 10.60.194.10:6789/0, have pipe.
>>>>>>>
>>>>>>> 4. BTW, the osd-down scenario is detected very quickly (ceph -w
>>>>>>> output); the problem is during coming up, I guess.
>>>>>>>
>>>>>>> So, is something related to mon communication getting slower?
>>>>>>> Let me know if more verbose logging is required and how I should
>>>>>>> share the log..
>>>>>>>
>>>>>>> Thanks & Regards
>>>>>>> Somnath
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Gregory Farnum [mailto:gfar...@redhat.com]
>>>>>>> Sent: Wednesday, September 16, 2015 11:35 AM
>>>>>>> To: Somnath Roy
>>>>>>> Cc: ceph-devel
>>>>>>> Subject: Re: Very slow recovery/peering with latest master
>>>>>>>
>>>>>>>> On Tue, Sep 15, 2015 at 8:04 PM, Somnath Roy <somnath....@sandisk.com> wrote:
>>>>>>>> Hi,
>>>>>>>> I am seeing very slow recovery when I am adding OSDs with the
>>>>>>>> latest master.
>>>>>>>> Also, if I just restart all the OSDs (no IO is going on in the
>>>>>>>> cluster), the cluster takes a significant amount of time to
>>>>>>>> reach the active+clean state (and even to detect all the up
>>>>>>>> OSDs).
>>>>>>>>
>>>>>>>> I saw the recovery/backfill default parameters have now been
>>>>>>>> changed (to lower values); this probably explains the recovery
>>>>>>>> scenario, but will it affect the peering time during OSD startup
>>>>>>>> as well?
>>>>>>>
>>>>>>> I don't think these values should impact peering time, but you
>>>>>>> could configure them back to the old defaults and see if it
>>>>>>> changes.
>>>>>>> -Greg
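>>>>>>>
>>>>>>> (Going from memory here, so double-check against the hammer
>>>>>>> source, but the old values were something like:
>>>>>>>
>>>>>>>   [osd]
>>>>>>>   osd max backfills = 10
>>>>>>>   osd recovery max active = 15
>>>>>>>
>>>>>>> versus 1 and 3 on current master.)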