Re: [ceph-users] SSD OSDs crashing after upgrade to 12.2.7

2018-09-07 Thread Wolfgang Lendl
Hello,

got new logs - if this snippet is not sufficient, I can provide the full log

https://pastebin.com/dKBzL9AW

br+thx wolfgang


On 2018-09-05 01:55, Radoslaw Zarzynski wrote:
> In the log following trace can be found:
>
>  0> 2018-08-30 13:11:01.014708 7ff2dd344700 -1 *** Caught signal
> (Segmentation fault) **
>  in thread 7ff2dd344700 thread_name:osd_srv_agent
>
>  ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5)
> luminous (stable)
>  1: (()+0xa48ec1) [0x5652900ffec1]
>  2: (()+0xf6d0) [0x7ff2f7c206d0]
>  3: (BlueStore::_wctx_finish(BlueStore::TransContext*,
> boost::intrusive_ptr&,
> boost::intrusive_ptr, BlueStore::WriteContext*,
> std::set,
> std::allocator >*)+0xb4) [0x56528ffe3954]
>  4: (BlueStore::_do_truncate(BlueStore::TransContext*,
> boost::intrusive_ptr&,
> boost::intrusive_ptr, unsigned long,
> std::set,
> std::allocator >*)+0x2c2) [0x56528fffd642]
>  5: (BlueStore::_do_remove(BlueStore::TransContext*,
> boost::intrusive_ptr&,
> boost::intrusive_ptr)+0xc6) [0x56528fffdf86]
>  6: (BlueStore::_remove(BlueStore::TransContext*,
> boost::intrusive_ptr&,
> boost::intrusive_ptr&)+0x94) [0x565289f4]
>  7: (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> ObjectStore::Transaction*)+0x15af) [0x56529001280f]
>  8: ...
>
> This looks quite similar to #25001 [1]. The corruption *might* be caused by
> the racy SharedBlob::put() [2] that was fixed in 12.2.6. However, more logs
> (debug_bluestore=20, debug_bdev=20) would be useful. Also you might
> want to use fsck carefully -- please take a look at Igor's (CCed) post [3]
> and Troy's response.
>
> Best regards,
> Radoslaw Zarzynski
>
> [1] http://tracker.ceph.com/issues/25001
> [2] http://tracker.ceph.com/issues/24211
> [3] http://tracker.ceph.com/issues/25001#note-6
>
> On Tue, Sep 4, 2018 at 12:54 PM, Alfredo Deza  wrote:
>> On Tue, Sep 4, 2018 at 3:59 AM, Wolfgang Lendl
>>  wrote:
>>> is downgrading from 12.2.7 to 12.2.5 an option? - I'm still suffering
>>> from highly frequent OSD crashes.
>>> my hopes are with 12.2.9 - but hope wasn't always my best strategy
>> 12.2.8 just went out. I think that Adam or Radoslaw might have some
>> time to check those logs now
>>
>>> br
>>> wolfgang
>>>
>>> On 2018-08-30 19:18, Alfredo Deza wrote:
 On Thu, Aug 30, 2018 at 5:24 AM, Wolfgang Lendl
  wrote:
> Hi Alfredo,
>
>
> caught some logs:
> https://pastebin.com/b3URiA7p
 That looks like there is an issue with bluestore. Maybe Radoslaw or
 Adam might know a bit more.


> br
> wolfgang
>
> On 2018-08-29 15:51, Alfredo Deza wrote:
>> On Wed, Aug 29, 2018 at 2:06 AM, Wolfgang Lendl
>>  wrote:
>>> Hi,
>>>
>>> after upgrading my ceph clusters from 12.2.5 to 12.2.7 I'm
>>> experiencing random crashes from SSD OSDs (bluestore) - it seems that
>>> HDD OSDs are not affected.
>>> I destroyed and recreated some of the SSD OSDs which seemed to help.
>>>
>>> this happens on centos 7.5 (different kernels tested)
>>>
>>> /var/log/messages:
>>> Aug 29 10:24:08  ceph-osd: *** Caught signal (Segmentation fault) **
>>> Aug 29 10:24:08  ceph-osd: in thread 7f8a8e69e700 
>>> thread_name:bstore_kv_final
>>> Aug 29 10:24:08  kernel: traps: bstore_kv_final[187470] general 
>>> protection ip:7f8a997cf42b sp:7f8a8e69abc0 error:0 in 
>>> libtcmalloc.so.4.4.5[7f8a997a8000+46000]
>>> Aug 29 10:24:08  systemd: ceph-osd@2.service: main process exited, 
>>> code=killed, status=11/SEGV
>>> Aug 29 10:24:08  systemd: Unit ceph-osd@2.service entered failed state.
>>> Aug 29 10:24:08  systemd: ceph-osd@2.service failed.
>>> Aug 29 10:24:28  systemd: ceph-osd@2.service holdoff time over, 
>>> scheduling restart.
>>> Aug 29 10:24:28  systemd: Starting Ceph object storage daemon osd.2...
>>> Aug 29 10:24:28  systemd: Started Ceph object storage daemon osd.2.
>>> Aug 29 10:24:28  ceph-osd: starting osd.2 at - osd_data 
>>> /var/lib/ceph/osd/ceph-2 /var/lib/ceph/osd/ceph-2/journal
>>> Aug 29 10:24:35  ceph-osd: *** Caught signal (Segmentation fault) **
>>> Aug 29 10:24:35  ceph-osd: in thread 7f5f1e790700 thread_name:tp_osd_tp
>>> Aug 29 10:24:35  kernel: traps: tp_osd_tp[186933] general protection 
>>> ip:7f5f43103e63 sp:7f5f1e78a1c8 error:0 in 
>>> libtcmalloc.so.4.4.5[7f5f430cd000+46000]
>>> Aug 29 10:24:35  systemd: ceph-osd@0.service: main process exited, 
>>> code=killed, status=11/SEGV
>>> Aug 29 10:24:35  systemd: Unit ceph-osd@0.service entered failed state.
>>> Aug 29 10:24:35  systemd: ceph-osd@0.service failed
>> These systemd messages aren't usually helpful, try poking around
>> /var/log/ceph/ for the output on that one OSD.
>>
>> If those logs aren't useful either, try bumping up the verbosity (see
>> http://docs.ceph.com/docs/master/rados/troubleshooting/log-and-debug/#boot-time
>> )
>>> did I hit a known issue?
>>> any suggestions are highly appreciated

Re: [ceph-users] SSD OSDs crashing after upgrade to 12.2.7

2018-09-07 Thread Wolfgang Lendl
Hi,

the problem still exists
for me, this happens to SSD OSDs only - I recreated all of them running 12.2.8

this is what I got even on newly created OSDs after some time and crashes:

ceph-bluestore-tool fsck -l /root/fsck-osd.0.log --log-level=20 --path /var/lib/ceph/osd/ceph-0 --deep on

2018-09-05 10:15:42.784873 7f609a311ec0 -1 bluestore(/var/lib/ceph/osd/ceph-137) fsck error: found stray shared blob data for sbid 0x34dbe4
2018-09-05 10:15:42.818239 7f609a311ec0 -1 bluestore(/var/lib/ceph/osd/ceph-137) fsck error: found stray shared blob data for sbid 0x376ccf
2018-09-05 10:15:42.863419 7f609a311ec0 -1 bluestore(/var/lib/ceph/osd/ceph-137) fsck error: found stray shared blob data for sbid 0x3a4e58
2018-09-05 10:15:42.887404 7f609a311ec0 -1 bluestore(/var/lib/ceph/osd/ceph-137) fsck error: found stray shared blob data for sbid 0x3b7f29
2018-09-05 10:15:42.958417 7f609a311ec0 -1 bluestore(/var/lib/ceph/osd/ceph-137) fsck error: found stray shared blob data for sbid 0x3df760
2018-09-05 10:15:42.961275 7f609a311ec0 -1 bluestore(/var/lib/ceph/osd/ceph-137) fsck error: found stray shared blob data for sbid 0x3e076f
2018-09-05 10:15:43.038658 7f609a311ec0 -1 bluestore(/var/lib/ceph/osd/ceph-137) fsck error: found stray shared blob data for sbid 0x3ff156
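When comparing fsck runs across OSDs, the affected sbids can be pulled out of such a log with a few lines of Python (a minimal sketch based only on the message format shown above; the sample lines are from this output):

```python
import re

# Matches the fsck error lines above, e.g.
# "... fsck error: found stray shared blob data for sbid 0x34dbe4"
SBID_RE = re.compile(r"fsck error: found stray shared blob data for sbid (0x[0-9a-f]+)")

def stray_sbids(log_lines):
    """Collect the shared-blob ids reported as stray in a fsck log."""
    sbids = set()
    for line in log_lines:
        m = SBID_RE.search(line)
        if m:
            sbids.add(m.group(1))
    return sbids

sample = [
    "2018-09-05 10:15:42.784873 7f609a311ec0 -1 bluestore(/var/lib/ceph/osd/ceph-137) "
    "fsck error: found stray shared blob data for sbid 0x34dbe4",
    "2018-09-05 10:15:42.818239 7f609a311ec0 -1 bluestore(/var/lib/ceph/osd/ceph-137) "
    "fsck error: found stray shared blob data for sbid 0x376ccf",
]
print(sorted(stray_sbids(sample)))  # ['0x34dbe4', '0x376ccf']
```

Diffing the resulting sets between runs would show whether the same shared blobs keep turning up or new ones appear over time.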

I don't know whether these errors are the cause of the OSD crashes or a result of them.
currently I'm trying to capture some verbose logs
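A sketch of the ceph.conf fragment for the debug levels Radoslaw suggested (these are only the two subsystems he named; the levels are very verbose, so revert them once a crash has been captured):

```ini
[osd]
# Verbose BlueStore and block-device logging for catching the crash in detail.
debug bluestore = 20
debug bdev = 20
```

The same values can also be injected into running OSDs without a restart, e.g. via `ceph tell osd.* injectargs`.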

see also Radoslaw's reply below

>This looks quite similar to #25001 [1]. The corruption *might* be caused by
>the racy SharedBlob::put() [2] that was fixed in 12.2.6. However, more logs
>(debug_bluestore=20, debug_bdev=20) would be useful. Also you might
>want to use fsck carefully -- please take a look at Igor's (CCed) post [3]
>and Troy's response.
>
>Best regards,
>Radoslaw Zarzynski
>
>[1] http://tracker.ceph.com/issues/25001
>[2] http://tracker.ceph.com/issues/24211
>[3] http://tracker.ceph.com/issues/25001#note-6

I'll keep you updated
br wolfgang



On 2018-09-06 09:27, Caspar Smit wrote:
> Hi,
>
> These reports are kind of worrying since we have a 12.2.5 cluster too
> waiting to upgrade. Did you have any luck with upgrading to 12.2.8, or is
> the behavior still the same?
> Is there a bug tracker entry for this issue?
>
> Kind regards,
> Caspar
>
> Op di 4 sep. 2018 om 09:59 schreef Wolfgang Lendl
>  >:
>
> is downgrading from 12.2.7 to 12.2.5 an option? - I'm still suffering
> from highly frequent OSD crashes.
> my hopes are with 12.2.9 - but hope wasn't always my best strategy
>
> br
> wolfgang
>
> On 2018-08-30 19:18, Alfredo Deza wrote:
> > On Thu, Aug 30, 2018 at 5:24 AM, Wolfgang Lendl
> >  > wrote:
> >> Hi Alfredo,
> >>
> >>
> >> caught some logs:
> >> https://pastebin.com/b3URiA7p
> > That looks like there is an issue with bluestore. Maybe Radoslaw or
> > Adam might know a bit more.
> >
> >
> >> br
> >> wolfgang
> >>
> >> On 2018-08-29 15:51, Alfredo Deza wrote:
> >>> On Wed, Aug 29, 2018 at 2:06 AM, Wolfgang Lendl
> >>>  > wrote:
>  Hi,
> 
>  after upgrading my ceph clusters from 12.2.5 to 12.2.7  I'm
> experiencing random crashes from SSD OSDs (bluestore) - it seems
> that HDD OSDs are not affected.
>  I destroyed and recreated some of the SSD OSDs which seemed
> to help.
> 
>  this happens on centos 7.5 (different kernels tested)
> 
>  /var/log/messages:
>  Aug 29 10:24:08  ceph-osd: *** Caught signal (Segmentation
> fault) **
>  Aug 29 10:24:08  ceph-osd: in thread 7f8a8e69e700
> thread_name:bstore_kv_final
>  Aug 29 10:24:08  kernel: traps: bstore_kv_final[187470]
> general protection ip:7f8a997cf42b sp:7f8a8e69abc0 error:0 in
> libtcmalloc.so.4.4.5[7f8a997a8000+46000]
>  Aug 29 10:24:08  systemd: ceph-osd@2.service: main process
> exited, code=killed, status=11/SEGV
>  Aug 29 10:24:08  systemd: Unit ceph-osd@2.service entered
> failed state.
>  Aug 29 10:24:08  systemd: ceph-osd@2.service failed.
>  Aug 29 10:24:28  systemd: ceph-osd@2.service holdoff time
> over, scheduling restart.
>  Aug 29 10:24:28  systemd: Starting Ceph object storage daemon
> osd.2...
>  Aug 29 10:24:28  systemd: Started Ceph object storage daemon
> osd.2.
>  Aug 29 10:24:28  ceph-osd: starting osd.2 at - osd_data
> /var/lib/ceph/osd/ceph-2 /var/lib/ceph/osd/ceph-2/journal
>  Aug 29 10:24:35  ceph-osd: *** Caught signal (Segmentation
> fault) **
>  Aug 29 10:24:35  ceph-osd: in thread 7f5f1e790700
> thread_name:tp_osd_tp
>  Aug 29 10:24:35  kernel: traps: tp_osd_tp[186933] general
> protection ip:7f5f43103e63 sp:7f5f1e78a1c8 error:0 in
> libtcmalloc.so.4.4.5[7f5f430cd000+46000]

Re: [ceph-users] SSD OSDs crashing after upgrade to 12.2.7

2018-09-06 Thread Caspar Smit
Hi,

These reports are kind of worrying since we have a 12.2.5 cluster too
waiting to upgrade. Did you have any luck with upgrading to 12.2.8, or is
the behavior still the same?
Is there a bug tracker entry for this issue?

Kind regards,
Caspar

Op di 4 sep. 2018 om 09:59 schreef Wolfgang Lendl <
wolfgang.le...@meduniwien.ac.at>:

> is downgrading from 12.2.7 to 12.2.5 an option? - I'm still suffering
> from highly frequent OSD crashes.
> my hopes are with 12.2.9 - but hope wasn't always my best strategy
>
> br
> wolfgang
>
> On 2018-08-30 19:18, Alfredo Deza wrote:
> > On Thu, Aug 30, 2018 at 5:24 AM, Wolfgang Lendl
> >  wrote:
> >> Hi Alfredo,
> >>
> >>
> >> caught some logs:
> >> https://pastebin.com/b3URiA7p
> > That looks like there is an issue with bluestore. Maybe Radoslaw or
> > Adam might know a bit more.
> >
> >
> >> br
> >> wolfgang
> >>
> >> On 2018-08-29 15:51, Alfredo Deza wrote:
> >>> On Wed, Aug 29, 2018 at 2:06 AM, Wolfgang Lendl
> >>>  wrote:
>  Hi,
> 
>  after upgrading my ceph clusters from 12.2.5 to 12.2.7  I'm
> experiencing random crashes from SSD OSDs (bluestore) - it seems that HDD
> OSDs are not affected.
>  I destroyed and recreated some of the SSD OSDs which seemed to help.
> 
>  this happens on centos 7.5 (different kernels tested)
> 
>  /var/log/messages:
>  Aug 29 10:24:08  ceph-osd: *** Caught signal (Segmentation fault) **
>  Aug 29 10:24:08  ceph-osd: in thread 7f8a8e69e700
> thread_name:bstore_kv_final
>  Aug 29 10:24:08  kernel: traps: bstore_kv_final[187470] general
> protection ip:7f8a997cf42b sp:7f8a8e69abc0 error:0 in
> libtcmalloc.so.4.4.5[7f8a997a8000+46000]
>  Aug 29 10:24:08  systemd: ceph-osd@2.service: main process exited,
> code=killed, status=11/SEGV
>  Aug 29 10:24:08  systemd: Unit ceph-osd@2.service entered failed
> state.
>  Aug 29 10:24:08  systemd: ceph-osd@2.service failed.
>  Aug 29 10:24:28  systemd: ceph-osd@2.service holdoff time over,
> scheduling restart.
>  Aug 29 10:24:28  systemd: Starting Ceph object storage daemon osd.2...
>  Aug 29 10:24:28  systemd: Started Ceph object storage daemon osd.2.
>  Aug 29 10:24:28  ceph-osd: starting osd.2 at - osd_data
> /var/lib/ceph/osd/ceph-2 /var/lib/ceph/osd/ceph-2/journal
>  Aug 29 10:24:35  ceph-osd: *** Caught signal (Segmentation fault) **
>  Aug 29 10:24:35  ceph-osd: in thread 7f5f1e790700
> thread_name:tp_osd_tp
>  Aug 29 10:24:35  kernel: traps: tp_osd_tp[186933] general protection
> ip:7f5f43103e63 sp:7f5f1e78a1c8 error:0 in
> libtcmalloc.so.4.4.5[7f5f430cd000+46000]
>  Aug 29 10:24:35  systemd: ceph-osd@0.service: main process exited,
> code=killed, status=11/SEGV
>  Aug 29 10:24:35  systemd: Unit ceph-osd@0.service entered failed
> state.
>  Aug 29 10:24:35  systemd: ceph-osd@0.service failed
> >>> These systemd messages aren't usually helpful, try poking around
> >>> /var/log/ceph/ for the output on that one OSD.
> >>>
> >>> If those logs aren't useful either, try bumping up the verbosity (see
> >>>
> http://docs.ceph.com/docs/master/rados/troubleshooting/log-and-debug/#boot-time
> >>> )
>  did I hit a known issue?
>  any suggestions are highly appreciated
> 
> 
>  br
>  wolfgang
> 
> 
> 
>  ___
>  ceph-users mailing list
>  ceph-users@lists.ceph.com
>  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> >> --
> >> Wolfgang Lendl
> >> IT Systems & Communications
> >> Medizinische Universität Wien
> >> Spitalgasse 23 / BT 88 /Ebene 00
> >> A-1090 Wien
> >> Tel: +43 1 40160-21231
> >> Fax: +43 1 40160-921200
> >>
> >>
>
> --
> Wolfgang Lendl
> IT Systems & Communications
> Medizinische Universität Wien
> Spitalgasse 23 / BT 88 /Ebene 00
> A-1090 Wien
> Tel: +43 1 40160-21231
> Fax: +43 1 40160-921200
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD OSDs crashing after upgrade to 12.2.7

2018-09-04 Thread Alfredo Deza
On Tue, Sep 4, 2018 at 3:59 AM, Wolfgang Lendl
 wrote:
> is downgrading from 12.2.7 to 12.2.5 an option? - I'm still suffering
> from highly frequent OSD crashes.
> my hopes are with 12.2.9 - but hope wasn't always my best strategy

12.2.8 just went out. I think that Adam or Radoslaw might have some
time to check those logs now.

>
> br
> wolfgang
>
> On 2018-08-30 19:18, Alfredo Deza wrote:
>> On Thu, Aug 30, 2018 at 5:24 AM, Wolfgang Lendl
>>  wrote:
>>> Hi Alfredo,
>>>
>>>
>>> caught some logs:
>>> https://pastebin.com/b3URiA7p
>> That looks like there is an issue with bluestore. Maybe Radoslaw or
>> Adam might know a bit more.
>>
>>
>>> br
>>> wolfgang
>>>
>>> On 2018-08-29 15:51, Alfredo Deza wrote:
 On Wed, Aug 29, 2018 at 2:06 AM, Wolfgang Lendl
  wrote:
> Hi,
>
> after upgrading my ceph clusters from 12.2.5 to 12.2.7  I'm experiencing 
> random crashes from SSD OSDs (bluestore) - it seems that HDD OSDs are not 
> affected.
> I destroyed and recreated some of the SSD OSDs which seemed to help.
>
> this happens on centos 7.5 (different kernels tested)
>
> /var/log/messages:
> Aug 29 10:24:08  ceph-osd: *** Caught signal (Segmentation fault) **
> Aug 29 10:24:08  ceph-osd: in thread 7f8a8e69e700 
> thread_name:bstore_kv_final
> Aug 29 10:24:08  kernel: traps: bstore_kv_final[187470] general 
> protection ip:7f8a997cf42b sp:7f8a8e69abc0 error:0 in 
> libtcmalloc.so.4.4.5[7f8a997a8000+46000]
> Aug 29 10:24:08  systemd: ceph-osd@2.service: main process exited, 
> code=killed, status=11/SEGV
> Aug 29 10:24:08  systemd: Unit ceph-osd@2.service entered failed state.
> Aug 29 10:24:08  systemd: ceph-osd@2.service failed.
> Aug 29 10:24:28  systemd: ceph-osd@2.service holdoff time over, 
> scheduling restart.
> Aug 29 10:24:28  systemd: Starting Ceph object storage daemon osd.2...
> Aug 29 10:24:28  systemd: Started Ceph object storage daemon osd.2.
> Aug 29 10:24:28  ceph-osd: starting osd.2 at - osd_data 
> /var/lib/ceph/osd/ceph-2 /var/lib/ceph/osd/ceph-2/journal
> Aug 29 10:24:35  ceph-osd: *** Caught signal (Segmentation fault) **
> Aug 29 10:24:35  ceph-osd: in thread 7f5f1e790700 thread_name:tp_osd_tp
> Aug 29 10:24:35  kernel: traps: tp_osd_tp[186933] general protection 
> ip:7f5f43103e63 sp:7f5f1e78a1c8 error:0 in 
> libtcmalloc.so.4.4.5[7f5f430cd000+46000]
> Aug 29 10:24:35  systemd: ceph-osd@0.service: main process exited, 
> code=killed, status=11/SEGV
> Aug 29 10:24:35  systemd: Unit ceph-osd@0.service entered failed state.
> Aug 29 10:24:35  systemd: ceph-osd@0.service failed
 These systemd messages aren't usually helpful, try poking around
 /var/log/ceph/ for the output on that one OSD.

 If those logs aren't useful either, try bumping up the verbosity (see
 http://docs.ceph.com/docs/master/rados/troubleshooting/log-and-debug/#boot-time
 )
> did I hit a known issue?
> any suggestions are highly appreciated
>
>
> br
> wolfgang
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>>> --
>>> Wolfgang Lendl
>>> IT Systems & Communications
>>> Medizinische Universität Wien
>>> Spitalgasse 23 / BT 88 /Ebene 00
>>> A-1090 Wien
>>> Tel: +43 1 40160-21231
>>> Fax: +43 1 40160-921200
>>>
>>>
>
> --
> Wolfgang Lendl
> IT Systems & Communications
> Medizinische Universität Wien
> Spitalgasse 23 / BT 88 /Ebene 00
> A-1090 Wien
> Tel: +43 1 40160-21231
> Fax: +43 1 40160-921200
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD OSDs crashing after upgrade to 12.2.7

2018-09-04 Thread Wolfgang Lendl
is downgrading from 12.2.7 to 12.2.5 an option? - I'm still suffering
from highly frequent OSD crashes.
my hopes are with 12.2.9 - but hope wasn't always my best strategy

br
wolfgang

On 2018-08-30 19:18, Alfredo Deza wrote:
> On Thu, Aug 30, 2018 at 5:24 AM, Wolfgang Lendl
>  wrote:
>> Hi Alfredo,
>>
>>
>> caught some logs:
>> https://pastebin.com/b3URiA7p
> That looks like there is an issue with bluestore. Maybe Radoslaw or
> Adam might know a bit more.
>
>
>> br
>> wolfgang
>>
>> On 2018-08-29 15:51, Alfredo Deza wrote:
>>> On Wed, Aug 29, 2018 at 2:06 AM, Wolfgang Lendl
>>>  wrote:
 Hi,

 after upgrading my ceph clusters from 12.2.5 to 12.2.7  I'm experiencing 
 random crashes from SSD OSDs (bluestore) - it seems that HDD OSDs are not 
 affected.
 I destroyed and recreated some of the SSD OSDs which seemed to help.

 this happens on centos 7.5 (different kernels tested)

 /var/log/messages:
 Aug 29 10:24:08  ceph-osd: *** Caught signal (Segmentation fault) **
 Aug 29 10:24:08  ceph-osd: in thread 7f8a8e69e700 
 thread_name:bstore_kv_final
 Aug 29 10:24:08  kernel: traps: bstore_kv_final[187470] general protection 
 ip:7f8a997cf42b sp:7f8a8e69abc0 error:0 in 
 libtcmalloc.so.4.4.5[7f8a997a8000+46000]
 Aug 29 10:24:08  systemd: ceph-osd@2.service: main process exited, 
 code=killed, status=11/SEGV
 Aug 29 10:24:08  systemd: Unit ceph-osd@2.service entered failed state.
 Aug 29 10:24:08  systemd: ceph-osd@2.service failed.
 Aug 29 10:24:28  systemd: ceph-osd@2.service holdoff time over, scheduling 
 restart.
 Aug 29 10:24:28  systemd: Starting Ceph object storage daemon osd.2...
 Aug 29 10:24:28  systemd: Started Ceph object storage daemon osd.2.
 Aug 29 10:24:28  ceph-osd: starting osd.2 at - osd_data 
 /var/lib/ceph/osd/ceph-2 /var/lib/ceph/osd/ceph-2/journal
 Aug 29 10:24:35  ceph-osd: *** Caught signal (Segmentation fault) **
 Aug 29 10:24:35  ceph-osd: in thread 7f5f1e790700 thread_name:tp_osd_tp
 Aug 29 10:24:35  kernel: traps: tp_osd_tp[186933] general protection 
 ip:7f5f43103e63 sp:7f5f1e78a1c8 error:0 in 
 libtcmalloc.so.4.4.5[7f5f430cd000+46000]
 Aug 29 10:24:35  systemd: ceph-osd@0.service: main process exited, 
 code=killed, status=11/SEGV
 Aug 29 10:24:35  systemd: Unit ceph-osd@0.service entered failed state.
 Aug 29 10:24:35  systemd: ceph-osd@0.service failed
>>> These systemd messages aren't usually helpful, try poking around
>>> /var/log/ceph/ for the output on that one OSD.
>>>
>>> If those logs aren't useful either, try bumping up the verbosity (see
>>> http://docs.ceph.com/docs/master/rados/troubleshooting/log-and-debug/#boot-time
>>> )
 did I hit a known issue?
 any suggestions are highly appreciated


 br
 wolfgang



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>> --
>> Wolfgang Lendl
>> IT Systems & Communications
>> Medizinische Universität Wien
>> Spitalgasse 23 / BT 88 /Ebene 00
>> A-1090 Wien
>> Tel: +43 1 40160-21231
>> Fax: +43 1 40160-921200
>>
>>

-- 
Wolfgang Lendl
IT Systems & Communications
Medizinische Universität Wien
Spitalgasse 23 / BT 88 /Ebene 00
A-1090 Wien
Tel: +43 1 40160-21231
Fax: +43 1 40160-921200




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD OSDs crashing after upgrade to 12.2.7

2018-08-30 Thread Alfredo Deza
On Thu, Aug 30, 2018 at 5:24 AM, Wolfgang Lendl
 wrote:
> Hi Alfredo,
>
>
> caught some logs:
> https://pastebin.com/b3URiA7p

That looks like there is an issue with bluestore. Maybe Radoslaw or
Adam might know a bit more.


>
> br
> wolfgang
>
> On 2018-08-29 15:51, Alfredo Deza wrote:
>> On Wed, Aug 29, 2018 at 2:06 AM, Wolfgang Lendl
>>  wrote:
>>> Hi,
>>>
>>> after upgrading my ceph clusters from 12.2.5 to 12.2.7  I'm experiencing 
>>> random crashes from SSD OSDs (bluestore) - it seems that HDD OSDs are not 
>>> affected.
>>> I destroyed and recreated some of the SSD OSDs which seemed to help.
>>>
>>> this happens on centos 7.5 (different kernels tested)
>>>
>>> /var/log/messages:
>>> Aug 29 10:24:08  ceph-osd: *** Caught signal (Segmentation fault) **
>>> Aug 29 10:24:08  ceph-osd: in thread 7f8a8e69e700 
>>> thread_name:bstore_kv_final
>>> Aug 29 10:24:08  kernel: traps: bstore_kv_final[187470] general protection 
>>> ip:7f8a997cf42b sp:7f8a8e69abc0 error:0 in 
>>> libtcmalloc.so.4.4.5[7f8a997a8000+46000]
>>> Aug 29 10:24:08  systemd: ceph-osd@2.service: main process exited, 
>>> code=killed, status=11/SEGV
>>> Aug 29 10:24:08  systemd: Unit ceph-osd@2.service entered failed state.
>>> Aug 29 10:24:08  systemd: ceph-osd@2.service failed.
>>> Aug 29 10:24:28  systemd: ceph-osd@2.service holdoff time over, scheduling 
>>> restart.
>>> Aug 29 10:24:28  systemd: Starting Ceph object storage daemon osd.2...
>>> Aug 29 10:24:28  systemd: Started Ceph object storage daemon osd.2.
>>> Aug 29 10:24:28  ceph-osd: starting osd.2 at - osd_data 
>>> /var/lib/ceph/osd/ceph-2 /var/lib/ceph/osd/ceph-2/journal
>>> Aug 29 10:24:35  ceph-osd: *** Caught signal (Segmentation fault) **
>>> Aug 29 10:24:35  ceph-osd: in thread 7f5f1e790700 thread_name:tp_osd_tp
>>> Aug 29 10:24:35  kernel: traps: tp_osd_tp[186933] general protection 
>>> ip:7f5f43103e63 sp:7f5f1e78a1c8 error:0 in 
>>> libtcmalloc.so.4.4.5[7f5f430cd000+46000]
>>> Aug 29 10:24:35  systemd: ceph-osd@0.service: main process exited, 
>>> code=killed, status=11/SEGV
>>> Aug 29 10:24:35  systemd: Unit ceph-osd@0.service entered failed state.
>>> Aug 29 10:24:35  systemd: ceph-osd@0.service failed
>> These systemd messages aren't usually helpful, try poking around
>> /var/log/ceph/ for the output on that one OSD.
>>
>> If those logs aren't useful either, try bumping up the verbosity (see
>> http://docs.ceph.com/docs/master/rados/troubleshooting/log-and-debug/#boot-time
>> )
>>> did I hit a known issue?
>>> any suggestions are highly appreciated
>>>
>>>
>>> br
>>> wolfgang
>>>
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>
> --
> Wolfgang Lendl
> IT Systems & Communications
> Medizinische Universität Wien
> Spitalgasse 23 / BT 88 /Ebene 00
> A-1090 Wien
> Tel: +43 1 40160-21231
> Fax: +43 1 40160-921200
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD OSDs crashing after upgrade to 12.2.7

2018-08-30 Thread Wolfgang Lendl
Hi Alfredo,


caught some logs:
https://pastebin.com/b3URiA7p

br
wolfgang

On 2018-08-29 15:51, Alfredo Deza wrote:
> On Wed, Aug 29, 2018 at 2:06 AM, Wolfgang Lendl
>  wrote:
>> Hi,
>>
>> after upgrading my ceph clusters from 12.2.5 to 12.2.7  I'm experiencing 
>> random crashes from SSD OSDs (bluestore) - it seems that HDD OSDs are not 
>> affected.
>> I destroyed and recreated some of the SSD OSDs which seemed to help.
>>
>> this happens on centos 7.5 (different kernels tested)
>>
>> /var/log/messages:
>> Aug 29 10:24:08  ceph-osd: *** Caught signal (Segmentation fault) **
>> Aug 29 10:24:08  ceph-osd: in thread 7f8a8e69e700 thread_name:bstore_kv_final
>> Aug 29 10:24:08  kernel: traps: bstore_kv_final[187470] general protection 
>> ip:7f8a997cf42b sp:7f8a8e69abc0 error:0 in 
>> libtcmalloc.so.4.4.5[7f8a997a8000+46000]
>> Aug 29 10:24:08  systemd: ceph-osd@2.service: main process exited, 
>> code=killed, status=11/SEGV
>> Aug 29 10:24:08  systemd: Unit ceph-osd@2.service entered failed state.
>> Aug 29 10:24:08  systemd: ceph-osd@2.service failed.
>> Aug 29 10:24:28  systemd: ceph-osd@2.service holdoff time over, scheduling 
>> restart.
>> Aug 29 10:24:28  systemd: Starting Ceph object storage daemon osd.2...
>> Aug 29 10:24:28  systemd: Started Ceph object storage daemon osd.2.
>> Aug 29 10:24:28  ceph-osd: starting osd.2 at - osd_data 
>> /var/lib/ceph/osd/ceph-2 /var/lib/ceph/osd/ceph-2/journal
>> Aug 29 10:24:35  ceph-osd: *** Caught signal (Segmentation fault) **
>> Aug 29 10:24:35  ceph-osd: in thread 7f5f1e790700 thread_name:tp_osd_tp
>> Aug 29 10:24:35  kernel: traps: tp_osd_tp[186933] general protection 
>> ip:7f5f43103e63 sp:7f5f1e78a1c8 error:0 in 
>> libtcmalloc.so.4.4.5[7f5f430cd000+46000]
>> Aug 29 10:24:35  systemd: ceph-osd@0.service: main process exited, 
>> code=killed, status=11/SEGV
>> Aug 29 10:24:35  systemd: Unit ceph-osd@0.service entered failed state.
>> Aug 29 10:24:35  systemd: ceph-osd@0.service failed
> These systemd messages aren't usually helpful, try poking around
> /var/log/ceph/ for the output on that one OSD.
>
> If those logs aren't useful either, try bumping up the verbosity (see
> http://docs.ceph.com/docs/master/rados/troubleshooting/log-and-debug/#boot-time
> )
>> did I hit a known issue?
>> any suggestions are highly appreciated
>>
>>
>> br
>> wolfgang
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>

-- 
Wolfgang Lendl
IT Systems & Communications
Medizinische Universität Wien
Spitalgasse 23 / BT 88 /Ebene 00
A-1090 Wien
Tel: +43 1 40160-21231
Fax: +43 1 40160-921200




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD OSDs crashing after upgrade to 12.2.7

2018-08-29 Thread Alfredo Deza
On Wed, Aug 29, 2018 at 2:06 AM, Wolfgang Lendl
 wrote:
> Hi,
>
> after upgrading my ceph clusters from 12.2.5 to 12.2.7 I'm experiencing
> random crashes from SSD OSDs (bluestore) - it seems that HDD OSDs are not
> affected.
> I destroyed and recreated some of the SSD OSDs which seemed to help.
>
> this happens on centos 7.5 (different kernels tested)
>
> /var/log/messages:
> Aug 29 10:24:08  ceph-osd: *** Caught signal (Segmentation fault) **
> Aug 29 10:24:08  ceph-osd: in thread 7f8a8e69e700 thread_name:bstore_kv_final
> Aug 29 10:24:08  kernel: traps: bstore_kv_final[187470] general protection 
> ip:7f8a997cf42b sp:7f8a8e69abc0 error:0 in 
> libtcmalloc.so.4.4.5[7f8a997a8000+46000]
> Aug 29 10:24:08  systemd: ceph-osd@2.service: main process exited, 
> code=killed, status=11/SEGV
> Aug 29 10:24:08  systemd: Unit ceph-osd@2.service entered failed state.
> Aug 29 10:24:08  systemd: ceph-osd@2.service failed.
> Aug 29 10:24:28  systemd: ceph-osd@2.service holdoff time over, scheduling 
> restart.
> Aug 29 10:24:28  systemd: Starting Ceph object storage daemon osd.2...
> Aug 29 10:24:28  systemd: Started Ceph object storage daemon osd.2.
> Aug 29 10:24:28  ceph-osd: starting osd.2 at - osd_data 
> /var/lib/ceph/osd/ceph-2 /var/lib/ceph/osd/ceph-2/journal
> Aug 29 10:24:35  ceph-osd: *** Caught signal (Segmentation fault) **
> Aug 29 10:24:35  ceph-osd: in thread 7f5f1e790700 thread_name:tp_osd_tp
> Aug 29 10:24:35  kernel: traps: tp_osd_tp[186933] general protection 
> ip:7f5f43103e63 sp:7f5f1e78a1c8 error:0 in 
> libtcmalloc.so.4.4.5[7f5f430cd000+46000]
> Aug 29 10:24:35  systemd: ceph-osd@0.service: main process exited, 
> code=killed, status=11/SEGV
> Aug 29 10:24:35  systemd: Unit ceph-osd@0.service entered failed state.
> Aug 29 10:24:35  systemd: ceph-osd@0.service failed

These systemd messages aren't usually helpful; try poking around
/var/log/ceph/ for the output of that one OSD.

If those logs aren't useful either, try bumping up the verbosity (see
http://docs.ceph.com/docs/master/rados/troubleshooting/log-and-debug/#boot-time
)
>
> did I hit a known issue?
> any suggestions are highly appreciated
>
>
> br
> wolfgang
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com