[ceph-users] Re: Possible data corruption with 14.2.3 and 14.2.4

2019-12-02 Thread Paul Emmerich
On Mon, Dec 2, 2019 at 4:55 PM Simon Ironside  wrote:
>
> Any word on 14.2.5? Nervously waiting here . . .

Real soon; the release is 99% done (check the corresponding thread on the devel mailing list).



Paul

>
> Thanks,
> Simon.
>
> On 18/11/2019 11:29, Simon Ironside wrote:
>
> > I will sit tight and wait for 14.2.5.
> >
> > Thanks again,
> > Simon.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Possible data corruption with 14.2.3 and 14.2.4

2019-12-02 Thread Simon Ironside

Any word on 14.2.5? Nervously waiting here . . .

Thanks,
Simon.

On 18/11/2019 11:29, Simon Ironside wrote:


I will sit tight and wait for 14.2.5.

Thanks again,
Simon.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Possible data corruption with 14.2.3 and 14.2.4

2019-11-18 Thread Simon Ironside

Hi Igor,

Thanks very much for providing all this detail.

On 18/11/2019 10:43, Igor Fedotov wrote:


- Check how full their DB devices are?
For your case it makes sense to check this, and then safely wait for 
14.2.5 if it's not full.


bluefs.db_used_bytes / bluefs.db_total_bytes is only around 1-2% (I am 
almost exclusively RBD and using a 64GB DB/WAL partition) and 
bluefs.slow_used_bytes is 0 on them all, so it would seem I have little 
to worry about here, with an essentially zero chance of corruption so far.
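
(For anyone wanting to check the same thing, a minimal sketch, assuming you 
can reach the OSD admin socket on its host and have jq installed:

ceph daemon osd.0 perf dump | jq '.bluefs | {db_used_bytes, db_total_bytes, slow_used_bytes}'

run once per OSD id. The counter names are the ones perf dump exposes on 
Nautilus and may differ slightly on other releases.)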


I will sit tight and wait for 14.2.5.

Thanks again,
Simon.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Possible data corruption with 14.2.3 and 14.2.4

2019-11-18 Thread Igor Fedotov

Hi Simon,

On 11/15/2019 6:02 PM, Simon Ironside wrote:

Hi Igor,

On 15/11/2019 14:22, Igor Fedotov wrote:

By SSD DB/WAL, do you mean both a standalone DB and(!!) a standalone WAL 
device/partition?


No, 1x combined DB/WAL partition on an SSD and 1x data partition on an 
HDD per OSD. I.e. created like:


ceph-deploy osd create --data /dev/sda --block-db ssd0/ceph-db-disk0
ceph-deploy osd create --data /dev/sdb --block-db ssd0/ceph-db-disk1
ceph-deploy osd create --data /dev/sdc --block-db ssd0/ceph-db-disk2

--block-wal wasn't used.

If so, then BlueFS might eventually overwrite some data at your DB 
volume with BlueFS log content, which most probably makes the OSD crash 
and become unable to restart one day. This is a quite random and not very 
frequent event which depends to some degree on cluster load. And the 
period between the actual data corruption and any evidence of it is 
non-zero most of the time; we tend to see it mostly while RocksDB is 
performing compaction.


So this, if I've understood you correctly, is for those with 3 
separate (DB + WAL + Data) devices per OSD. Not my setup.



Right.
The other OSD configuration which might suffer from the issue is main 
device + WAL device.


The failure probability is much lower for the main + DB layout. It 
requires an almost full DB to have any chance of appearing.


This sounds like my setup: 2 separate (DB/WAL combined + Data) devices 
per OSD.

Yep.


Main-only device configurations aren't under threat as far as I can tell.


And this is for all-in-one devices that aren't at risk. Understood.

While we're waiting for 14.2.5 to be released, what should 14.2.3/4 
users with an at-risk setup do in the meantime, if anything?


- Check how full their DB devices are?
For your case it makes sense to check this, and then safely wait for 
14.2.5 if it's not full.

- Avoid adding new data/load to the cluster?
This is probably a last resort for when you already start seeing this 
issue and are absolutely uncomfortable with the probability of data loss. 
It's not a panacea anyway, though, as one can already have broken, but 
still undiscovered, data corruption at multiple OSDs.

- Would deep scrubbing detect any undiscovered corruption?


Maybe. We tend to see it during DB compaction (mostly triggered by DB 
write access), but IMO it can be detected during scrubbing and/or a store 
fsck as well.
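
(As a sketch of the fsck route, assuming the OSD has been stopped first and 
uses the default data path, something like:

systemctl stop ceph-osd@0
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-0

should report inconsistencies in the BlueStore metadata; a deep fsck, which 
also reads and verifies object data, takes correspondingly longer.)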




- Get backups ready to restore? I mean, how bad is this?


As per multiple reports, there is some chance of losing OSD data. E.g. 
we've got reports of reproducing 1-2 OSD failures per day under some 
stress(!!!) load testing. That's probably not the general case, and 
production clusters might suffer from this much less frequently. E.g. 
across our multiple QA activities we've observed the issue just once 
since it was introduced.


Anyway, it's possible to lose multiple OSDs simultaneously. The 
probability is not that large, but it's definitely non-zero.


But as the fix is almost ready, I'd recommend waiting for it and applying it ASAP.



Thanks,
Simon.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Possible data corruption with 14.2.3 and 14.2.4

2019-11-15 Thread Simon Ironside

Hi Igor,

On 15/11/2019 14:22, Igor Fedotov wrote:

By SSD DB/WAL, do you mean both a standalone DB and(!!) a standalone WAL 
device/partition?


No, 1x combined DB/WAL partition on an SSD and 1x data partition on an 
HDD per OSD. I.e. created like:


ceph-deploy osd create --data /dev/sda --block-db ssd0/ceph-db-disk0
ceph-deploy osd create --data /dev/sdb --block-db ssd0/ceph-db-disk1
ceph-deploy osd create --data /dev/sdc --block-db ssd0/ceph-db-disk2

--block-wal wasn't used.

If so, then BlueFS might eventually overwrite some data at your DB volume 
with BlueFS log content, which most probably makes the OSD crash and 
become unable to restart one day. This is a quite random and not very 
frequent event which depends to some degree on cluster load. And the 
period between the actual data corruption and any evidence of it is 
non-zero most of the time; we tend to see it mostly while RocksDB is 
performing compaction.


So this, if I've understood you correctly, is for those with 3 separate 
(DB + WAL + Data) devices per OSD. Not my setup.


The other OSD configuration which might suffer from the issue is main 
device + WAL device.


The failure probability is much lower for the main + DB layout. It 
requires an almost full DB to have any chance of appearing.


This sounds like my setup: 2 separate (DB/WAL combined + Data) devices 
per OSD.


Main-only device configurations aren't under threat as far as I can tell.


And this is for all-in-one devices that aren't at risk. Understood.

While we're waiting for 14.2.5 to be released, what should 14.2.3/4 
users with an at-risk setup do in the meantime, if anything?


- Check how full their DB devices are?
- Avoid adding new data/load to the cluster?
- Would deep scrubbing detect any undiscovered corruption?
- Get backups ready to restore? I mean, how bad is this?

Thanks,
Simon.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Possible data corruption with 14.2.3 and 14.2.4

2019-11-15 Thread Igor Fedotov

Hi Simon,

By SSD DB/WAL, do you mean both a standalone DB and(!!) a standalone WAL 
device/partition?


If so, then BlueFS might eventually overwrite some data at your DB volume 
with BlueFS log content, which most probably makes the OSD crash and 
become unable to restart one day. This is a quite random and not very 
frequent event which depends to some degree on cluster load. And the 
period between the actual data corruption and any evidence of it is 
non-zero most of the time; we tend to see it mostly while RocksDB is 
performing compaction.


The other OSD configuration which might suffer from the issue is main 
device + WAL device.


The failure probability is much lower for the main + DB layout. It 
requires an almost full DB to have any chance of appearing.


Main-only device configurations aren't under threat as far as I can tell.
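
(If you're unsure which layout a given OSD actually uses, one way to check, 
assuming a Nautilus cluster, is the OSD metadata, e.g.:

ceph osd metadata 0 | grep bluefs_dedicated

where bluefs_dedicated_db / bluefs_dedicated_wal report whether separate 
DB / WAL devices are in use; the exact field names may vary between releases.)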



Thanks,

Igor


On 11/15/2019 12:40 PM, Simon Ironside wrote:

Hi,

I have two new-ish 14.2.4 clusters that began life on 14.2.0, all 
with HDD OSDs with SSD DB/WALs, but neither has experienced obvious 
problems yet.


What's the impact of this? Does possible data corruption mean possible 
silent data corruption?
Or does the corruption cause the OSD failures mentioned on the tracker, 
meaning you're basically OK if you either haven't had a failure yet or if 
you keep on top of failures the way you would if they were normal disk 
failures?


Thanks,
Simon

On 14/11/2019 16:10, Sage Weil wrote:

Hi everyone,

We've identified a data corruption bug[1], first introduced[2] (by yours
truly) in 14.2.3 and affecting both 14.2.3 and 14.2.4. The corruption
appears as a rocksdb checksum error or assertion that looks like

os/bluestore/fastbmap_allocator_impl.h: 750: FAILED ceph_assert(available >= allocated)

or in some cases a rocksdb checksum error.  It only affects BlueStore OSDs 
that have a separate 'db' or 'wal' device.

We have a fix[3] that is working its way through testing, and will
expedite the next Nautilus point release (14.2.5) once it is ready.

If you are running 14.2.2 or 14.2.1 and use BlueStore OSDs with
separate 'db' volumes, you should consider waiting to upgrade
until 14.2.5 is released.

A big thank you to Igor Fedotov and several *extremely* helpful users who 
managed to reproduce and track down this problem!

sage


[1] https://tracker.ceph.com/issues/42223
[2] 
https://github.com/ceph/ceph/commit/096033b9d931312c0688c2eea7e14626bfde0ad7#diff-618db1d3389289a9d25840a4500ef0b0

[3] https://github.com/ceph/ceph/pull/31621

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Possible data corruption with 14.2.3 and 14.2.4

2019-11-15 Thread Simon Ironside

Hi,

I have two new-ish 14.2.4 clusters that began life on 14.2.0, all with 
HDD OSDs with SSD DB/WALs, but neither has experienced obvious problems yet.


What's the impact of this? Does possible data corruption mean possible 
silent data corruption?
Or does the corruption cause the OSD failures mentioned on the tracker, 
meaning you're basically OK if you either haven't had a failure yet or if 
you keep on top of failures the way you would if they were normal disk failures?


Thanks,
Simon

On 14/11/2019 16:10, Sage Weil wrote:

Hi everyone,

We've identified a data corruption bug[1], first introduced[2] (by yours
truly) in 14.2.3 and affecting both 14.2.3 and 14.2.4. The corruption
appears as a rocksdb checksum error or assertion that looks like

os/bluestore/fastbmap_allocator_impl.h: 750: FAILED ceph_assert(available >= allocated)

or in some cases a rocksdb checksum error.  It only affects BlueStore OSDs
that have a separate 'db' or 'wal' device.

We have a fix[3] that is working its way through testing, and will
expedite the next Nautilus point release (14.2.5) once it is ready.

If you are running 14.2.2 or 14.2.1 and use BlueStore OSDs with
separate 'db' volumes, you should consider waiting to upgrade
until 14.2.5 is released.

A big thank you to Igor Fedotov and several *extremely* helpful users who
managed to reproduce and track down this problem!

sage


[1] https://tracker.ceph.com/issues/42223
[2] 
https://github.com/ceph/ceph/commit/096033b9d931312c0688c2eea7e14626bfde0ad7#diff-618db1d3389289a9d25840a4500ef0b0
[3] https://github.com/ceph/ceph/pull/31621

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Possible data corruption with 14.2.3 and 14.2.4

2019-11-14 Thread Mark Nelson

Great job to everyone involved in tracking this down!


Mark


On 11/14/19 10:10 AM, Sage Weil wrote:

Hi everyone,

We've identified a data corruption bug[1], first introduced[2] (by yours
truly) in 14.2.3 and affecting both 14.2.3 and 14.2.4. The corruption
appears as a rocksdb checksum error or assertion that looks like

os/bluestore/fastbmap_allocator_impl.h: 750: FAILED ceph_assert(available >= allocated)

or in some cases a rocksdb checksum error.  It only affects BlueStore OSDs
that have a separate 'db' or 'wal' device.

We have a fix[3] that is working its way through testing, and will
expedite the next Nautilus point release (14.2.5) once it is ready.

If you are running 14.2.2 or 14.2.1 and use BlueStore OSDs with
separate 'db' volumes, you should consider waiting to upgrade
until 14.2.5 is released.
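
(A quick way to confirm which release your daemons are actually running 
before deciding whether to hold off, assuming a Luminous-or-later cluster, is:

ceph versions

which summarizes the running version per daemon type.)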

A big thank you to Igor Fedotov and several *extremely* helpful users who
managed to reproduce and track down this problem!

sage


[1] https://tracker.ceph.com/issues/42223
[2] 
https://github.com/ceph/ceph/commit/096033b9d931312c0688c2eea7e14626bfde0ad7#diff-618db1d3389289a9d25840a4500ef0b0
[3] https://github.com/ceph/ceph/pull/31621


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io