Hi,

the problem still exists. For me it happens on SSD OSDs only - I recreated all of them running 12.2.8.

This is what I got even on newly created OSDs, after some time and a few crashes:

ceph-bluestore-tool fsck -l /root/fsck-osd.0.log --log-level=20 --path /var/lib/ceph/osd/ceph-0 --deep on

2018-09-05 10:15:42.784873 7f609a311ec0 -1 bluestore(/var/lib/ceph/osd/ceph-137) fsck error: found stray shared blob data for sbid 0x34dbe4
2018-09-05 10:15:42.818239 7f609a311ec0 -1 bluestore(/var/lib/ceph/osd/ceph-137) fsck error: found stray shared blob data for sbid 0x376ccf
2018-09-05 10:15:42.863419 7f609a311ec0 -1 bluestore(/var/lib/ceph/osd/ceph-137) fsck error: found stray shared blob data for sbid 0x3a4e58
2018-09-05 10:15:42.887404 7f609a311ec0 -1 bluestore(/var/lib/ceph/osd/ceph-137) fsck error: found stray shared blob data for sbid 0x3b7f29
2018-09-05 10:15:42.958417 7f609a311ec0 -1 bluestore(/var/lib/ceph/osd/ceph-137) fsck error: found stray shared blob data for sbid 0x3df760
2018-09-05 10:15:42.961275 7f609a311ec0 -1 bluestore(/var/lib/ceph/osd/ceph-137) fsck error: found stray shared blob data for sbid 0x3e076f
2018-09-05 10:15:43.038658 7f609a311ec0 -1 bluestore(/var/lib/ceph/osd/ceph-137) fsck error: found stray shared blob data for sbid 0x3ff156
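
For reference, this is roughly the procedure I use for the deep fsck, so that
the OSD is stopped while fsck runs and the cluster doesn't start rebalancing
in the meantime (osd.0 here just as an example - adjust the id and path):

ceph osd set noout
systemctl stop ceph-osd@0
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-0 --deep on -l /root/fsck-osd.0.log --log-level=20
systemctl start ceph-osd@0
ceph osd unset noout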

I don't know whether these errors are the cause of the OSD crashes or a
result of them.
Currently I'm trying to capture some verbose logs.
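
To capture them I'm raising the debug levels mentioned in Radoslaw's reply
below - e.g. like this on a running OSD (again osd.0 just as an example):

ceph tell osd.0 injectargs '--debug_bluestore 20 --debug_bdev 20'

and in ceph.conf, so the levels survive the automatic restart after a crash:

[osd]
debug bluestore = 20
debug bdev = 20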

See also Radoslaw's reply below:

>This looks quite similar to #25001 [1]. The corruption *might* be caused by
>the racy SharedBlob::put() [2] that was fixed in 12.2.6. However, more logs
>(debug_bluestore=20, debug_bdev=20) would be useful. Also, you might
>want to use fsck carefully -- please take a look at Igor's (CCed) post [3]
>and Troy's response.
>
>Best regards,
>Radoslaw Zarzynski
>
>[1] http://tracker.ceph.com/issues/25001
>[2] http://tracker.ceph.com/issues/24211
>[3] http://tracker.ceph.com/issues/25001#note-6
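
Once an OSD crashes again with those debug levels, I'll pull the backtrace
and the surrounding context from its log, roughly like this (osd.0 as an
example):

grep -B 5 -A 30 'Caught signal' /var/log/ceph/ceph-osd.0.log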

I'll keep you updated.
br wolfgang



On 2018-09-06 09:27, Caspar Smit wrote:
> Hi,
>
> These reports are kind of worrying, since we have a 12.2.5 cluster too
> that is waiting to upgrade. Did you have any luck with upgrading to
> 12.2.8, or is it still the same behavior?
> Is there a bug tracker issue for this?
>
> Kind regards,
> Caspar
>
> On Tue, Sep 4, 2018 at 09:59, Wolfgang Lendl
> <wolfgang.le...@meduniwien.ac.at> wrote:
>
>     Is downgrading from 12.2.7 to 12.2.5 an option? I'm still suffering
>     from highly frequent OSD crashes.
>     My hopes are with 12.2.9 - but hope wasn't always my best strategy.
>
>     br
>     wolfgang
>
>     On 2018-08-30 19:18, Alfredo Deza wrote:
>     > On Thu, Aug 30, 2018 at 5:24 AM, Wolfgang Lendl
>     > <wolfgang.le...@meduniwien.ac.at> wrote:
>     >> Hi Alfredo,
>     >>
>     >>
>     >> caught some logs:
>     >> https://pastebin.com/b3URiA7p
>     > That looks like there is an issue with bluestore. Radoslaw or
>     > Adam might know a bit more.
>     >
>     >
>     >> br
>     >> wolfgang
>     >>
>     >> On 2018-08-29 15:51, Alfredo Deza wrote:
>     >>> On Wed, Aug 29, 2018 at 2:06 AM, Wolfgang Lendl
>     >>> <wolfgang.le...@meduniwien.ac.at> wrote:
>     >>>> Hi,
>     >>>>
>     >>>> after upgrading my Ceph clusters from 12.2.5 to 12.2.7 I'm
>     >>>> experiencing random crashes of SSD OSDs (BlueStore) - it seems
>     >>>> that HDD OSDs are not affected.
>     >>>> I destroyed and recreated some of the SSD OSDs, which seemed
>     >>>> to help.
>     >>>>
>     >>>> This happens on CentOS 7.5 (different kernels tested).
>     >>>>
>     >>>> /var/log/messages:
>     >>>> Aug 29 10:24:08  ceph-osd: *** Caught signal (Segmentation fault) **
>     >>>> Aug 29 10:24:08  ceph-osd: in thread 7f8a8e69e700 thread_name:bstore_kv_final
>     >>>> Aug 29 10:24:08  kernel: traps: bstore_kv_final[187470] general protection ip:7f8a997cf42b sp:7f8a8e69abc0 error:0 in libtcmalloc.so.4.4.5[7f8a997a8000+46000]
>     >>>> Aug 29 10:24:08  systemd: ceph-osd@2.service: main process exited, code=killed, status=11/SEGV
>     >>>> Aug 29 10:24:08  systemd: Unit ceph-osd@2.service entered failed state.
>     >>>> Aug 29 10:24:08  systemd: ceph-osd@2.service failed.
>     >>>> Aug 29 10:24:28  systemd: ceph-osd@2.service holdoff time over, scheduling restart.
>     >>>> Aug 29 10:24:28  systemd: Starting Ceph object storage daemon osd.2...
>     >>>> Aug 29 10:24:28  systemd: Started Ceph object storage daemon osd.2.
>     >>>> Aug 29 10:24:28  ceph-osd: starting osd.2 at - osd_data /var/lib/ceph/osd/ceph-2 /var/lib/ceph/osd/ceph-2/journal
>     >>>> Aug 29 10:24:35  ceph-osd: *** Caught signal (Segmentation fault) **
>     >>>> Aug 29 10:24:35  ceph-osd: in thread 7f5f1e790700 thread_name:tp_osd_tp
>     >>>> Aug 29 10:24:35  kernel: traps: tp_osd_tp[186933] general protection ip:7f5f43103e63 sp:7f5f1e78a1c8 error:0 in libtcmalloc.so.4.4.5[7f5f430cd000+46000]
>     >>>> Aug 29 10:24:35  systemd: ceph-osd@0.service: main process exited, code=killed, status=11/SEGV
>     >>>> Aug 29 10:24:35  systemd: Unit ceph-osd@0.service entered failed state.
>     >>>> Aug 29 10:24:35  systemd: ceph-osd@0.service failed
>     >>> These systemd messages aren't usually helpful; try poking around
>     >>> /var/log/ceph/ for the output of that one OSD.
>     >>>
>     >>> If those logs aren't useful either, try bumping up the verbosity (see
>     >>> http://docs.ceph.com/docs/master/rados/troubleshooting/log-and-debug/#boot-time)
>     >>>> Did I hit a known issue?
>     >>>> Any suggestions are highly appreciated.
>     >>>>
>     >>>>
>     >>>> br
>     >>>> wolfgang

-- 
Wolfgang Lendl
IT Systems & Communications
Medizinische Universität Wien
Spitalgasse 23 / BT 88 /Ebene 00
A-1090 Wien
Tel: +43 1 40160-21231
Fax: +43 1 40160-921200

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
