[ceph-users] Re: Possible data corruption with 14.2.3 and 14.2.4
On Mon, Dec 2, 2019 at 4:55 PM Simon Ironside wrote:
> Any word on 14.2.5? Nervously waiting here . . .

Real soon - the release is 99% done (check the corresponding thread on
the devel mailing list).

Paul

> Thanks,
> Simon.
>
> On 18/11/2019 11:29, Simon Ironside wrote:
> > I will sit tight and wait for 14.2.5.
> >
> > Thanks again,
> > Simon.
[ceph-users] Re: Possible data corruption with 14.2.3 and 14.2.4
Any word on 14.2.5? Nervously waiting here . . .

Thanks,
Simon.

On 18/11/2019 11:29, Simon Ironside wrote:
> I will sit tight and wait for 14.2.5.
>
> Thanks again,
> Simon.
[ceph-users] Re: Possible data corruption with 14.2.3 and 14.2.4
Hi Igor,

Thanks very much for providing all this detail.

On 18/11/2019 10:43, Igor Fedotov wrote:
> > - Check how full their DB devices are?
>
> For your case it makes sense to check this. And then safely wait for
> 14.2.5 if it's not full.

bluefs.db_used_bytes / bluefs.db_total_bytes is only around 1-2% (I am
almost exclusively RBD and using a 64GB DB/WAL partition) and
bluefs.slow_used_bytes is 0 on them all, so it would seem I have little
to worry about here, with an essentially zero chance of corruption so
far. I will sit tight and wait for 14.2.5.

Thanks again,
Simon.
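For anyone wanting to run the same check on their own OSDs, these
counters come from the OSD's perf dump. A minimal sketch using the
admin socket, assuming osd.0 on the local host and jq installed (the
counter names are from the bluefs section of perf dump):

    # Dump the DB usage counters for osd.0; run on the OSD's host.
    ceph daemon osd.0 perf dump | \
        jq '.bluefs | {db_used_bytes, db_total_bytes, slow_used_bytes}'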
[ceph-users] Re: Possible data corruption with 14.2.3 and 14.2.4
Hi Simon,

On 11/15/2019 6:02 PM, Simon Ironside wrote:
> Hi Igor,
>
> On 15/11/2019 14:22, Igor Fedotov wrote:
> > Do you mean both standalone DB and(!!) standalone WAL
> > devices/partitions by having SSD DB/WAL?
>
> No, 1x combined DB/WAL partition on an SSD and 1x data partition on
> an HDD per OSD. I.e. created like:
>
> ceph-deploy osd create --data /dev/sda --block-db ssd0/ceph-db-disk0
> ceph-deploy osd create --data /dev/sdb --block-db ssd0/ceph-db-disk1
> ceph-deploy osd create --data /dev/sdc --block-db ssd0/ceph-db-disk2
>
> --block-wal wasn't used.
>
> > If so then BlueFS might eventually overwrite some data at your DB
> > volume with BlueFS log content, which most probably makes the OSD
> > crash and become unable to restart one day. This is a quite random
> > and infrequent event that depends to some degree on cluster load,
> > and the period between the actual data corruption and any evidence
> > of it is non-zero most of the time - we tend to see it mostly when
> > RocksDB is performing compaction.
>
> So this, if I've understood you correctly, is for those with 3
> separate (DB + WAL + Data) devices per OSD. Not my setup.

Right.

> > Other OSD configurations which might suffer from the issue are main
> > device + WAL device. Much less failure probability exists for the
> > main + DB layout - it requires an almost full DB to have any chance
> > of appearing.
>
> This sounds like my setup: 2 separate (DB/WAL combined + Data)
> devices per OSD.

Yep.

> > Main-only device configurations aren't under threat as far as I can
> > tell.
>
> And this is for all-in-one devices that aren't at risk. Understood.
>
> While we're waiting for 14.2.5 to be released, what should 14.2.3/4
> users with an at risk setup do in the meantime, if anything?
>
> - Check how full their DB devices are?

For your case it makes sense to check this. And then safely wait for
14.2.5 if it's not full.

> - Avoid adding new data/load to the cluster?

This is probably the last resort, for when you already start seeing
the issue and are absolutely uncomfortable with the probability of
data loss. It's not a panacea anyway, as one can already have broken
but still undiscovered data at multiple OSDs.

> - Would deep scrubbing detect any undiscovered corruption?

Maybe. We tend to see it during DB compaction (mostly triggered by DB
write access) but IMO it can be detected during scrubbing and/or a
store fsck as well.

> - Get backups ready to restore? I mean, how bad is this?

As per multiple reports there are some chances of losing OSD data.
E.g. we've had reports of 1-2 OSD failures per day reproducing under
some stress(!!!) load testing. That's probably not the general case,
and production clusters might suffer from this much less frequently -
e.g. across our multiple QA activities we've observed the issue just
once since it was introduced. Anyway, it's possible to lose multiple
OSDs simultaneously. The probability is rather small but definitely
non-zero. But as the fix is almost ready I'd recommend waiting for it
and applying it ASAP.

> Thanks,
> Simon.
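For anyone wanting to look for existing damage proactively, the store
fsck mentioned above can be run offline with ceph-bluestore-tool, and
a deep scrub can be triggered online. A sketch, assuming osd.0 and the
default data path:

    # Offline check: stop the OSD, fsck its store, start it again.
    systemctl stop ceph-osd@0
    ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-0
    systemctl start ceph-osd@0

    # Online: deep scrub the PGs whose primary is osd.0.
    ceph osd deep-scrub 0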
[ceph-users] Re: Possible data corruption with 14.2.3 and 14.2.4
Hi Igor,

On 15/11/2019 14:22, Igor Fedotov wrote:
> Do you mean both standalone DB and(!!) standalone WAL
> devices/partitions by having SSD DB/WAL?

No, 1x combined DB/WAL partition on an SSD and 1x data partition on an
HDD per OSD. I.e. created like:

ceph-deploy osd create --data /dev/sda --block-db ssd0/ceph-db-disk0
ceph-deploy osd create --data /dev/sdb --block-db ssd0/ceph-db-disk1
ceph-deploy osd create --data /dev/sdc --block-db ssd0/ceph-db-disk2

--block-wal wasn't used.

> If so then BlueFS might eventually overwrite some data at your DB
> volume with BlueFS log content, which most probably makes the OSD
> crash and become unable to restart one day. This is a quite random
> and infrequent event that depends to some degree on cluster load,
> and the period between the actual data corruption and any evidence
> of it is non-zero most of the time - we tend to see it mostly when
> RocksDB is performing compaction.

So this, if I've understood you correctly, is for those with 3 separate
(DB + WAL + Data) devices per OSD. Not my setup.

> Other OSD configurations which might suffer from the issue are main
> device + WAL device. Much less failure probability exists for the
> main + DB layout - it requires an almost full DB to have any chance
> of appearing.

This sounds like my setup: 2 separate (DB/WAL combined + Data) devices
per OSD.

> Main-only device configurations aren't under threat as far as I can
> tell.

And this is for all-in-one devices that aren't at risk. Understood.

While we're waiting for 14.2.5 to be released, what should 14.2.3/4
users with an at risk setup do in the meantime, if anything?

- Check how full their DB devices are?
- Avoid adding new data/load to the cluster?
- Would deep scrubbing detect any undiscovered corruption?
- Get backups ready to restore? I mean, how bad is this?

Thanks,
Simon.
[ceph-users] Re: Possible data corruption with 14.2.3 and 14.2.4
Hi Simon,

Do you mean both standalone DB and(!!) standalone WAL
devices/partitions by having SSD DB/WAL?

If so then BlueFS might eventually overwrite some data at your DB
volume with BlueFS log content, which most probably makes the OSD
crash and become unable to restart one day. This is a quite random and
infrequent event that depends to some degree on cluster load, and the
period between the actual data corruption and any evidence of it is
non-zero most of the time - we tend to see it mostly when RocksDB is
performing compaction.

Other OSD configurations which might suffer from the issue are main
device + WAL device. Much less failure probability exists for the main
+ DB layout - it requires an almost full DB to have any chance of
appearing.

Main-only device configurations aren't under threat as far as I can
tell.

Thanks,
Igor

On 11/15/2019 12:40 PM, Simon Ironside wrote:
> Hi,
>
> I have two new-ish 14.2.4 clusters that began life on 14.2.0, all
> with HDD OSDs with SSD DB/WALs, but neither has experienced obvious
> problems yet.
>
> What's the impact of this? Does possible data corruption mean
> possible silent data corruption? Or does the corruption cause the
> OSD failures mentioned on the tracker, and you're basically ok if
> you either haven't had a failure or if you keep on top of failures
> the way you would if they were normal disk failures?
>
> Thanks,
> Simon
>
> On 14/11/2019 16:10, Sage Weil wrote:
> > Hi everyone,
> >
> > We've identified a data corruption bug [1], first introduced [2]
> > (by yours truly) in 14.2.3 and affecting both 14.2.3 and 14.2.4.
> > The corruption appears as an assertion that looks like
> >
> > os/bluestore/fastbmap_allocator_impl.h: 750: FAILED
> > ceph_assert(available >= allocated)
> >
> > or, in some cases, a rocksdb checksum error. It only affects
> > BlueStore OSDs that have a separate 'db' or 'wal' device.
> >
> > We have a fix [3] that is working its way through testing, and we
> > will expedite the next Nautilus point release (14.2.5) once it is
> > ready. If you are running 14.2.2 or 14.2.1 and use BlueStore OSDs
> > with separate 'db' volumes, you should consider waiting to upgrade
> > until 14.2.5 is released.
> >
> > A big thank you to Igor Fedotov and several *extremely* helpful
> > users who managed to reproduce and track down this problem!
> >
> > sage
> >
> > [1] https://tracker.ceph.com/issues/42223
> > [2] https://github.com/ceph/ceph/commit/096033b9d931312c0688c2eea7e14626bfde0ad7#diff-618db1d3389289a9d25840a4500ef0b0
> > [3] https://github.com/ceph/ceph/pull/31621
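To tell which of these layouts a given OSD actually has, the OSD
metadata reports whether dedicated DB and WAL devices are in use. A
sketch, assuming osd.0 and jq on an admin node (field names as
reported by ceph osd metadata):

    # "1" means a dedicated device/partition is configured for that role.
    ceph osd metadata 0 | \
        jq '{db: .bluefs_dedicated_db, wal: .bluefs_dedicated_wal}'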
[ceph-users] Re: Possible data corruption with 14.2.3 and 14.2.4
Hi,

I have two new-ish 14.2.4 clusters that began life on 14.2.0, all with
HDD OSDs with SSD DB/WALs, but neither has experienced obvious
problems yet.

What's the impact of this? Does possible data corruption mean possible
silent data corruption? Or does the corruption cause the OSD failures
mentioned on the tracker, and you're basically ok if you either
haven't had a failure or if you keep on top of failures the way you
would if they were normal disk failures?

Thanks,
Simon

On 14/11/2019 16:10, Sage Weil wrote:
> Hi everyone,
>
> We've identified a data corruption bug [1], first introduced [2] (by
> yours truly) in 14.2.3 and affecting both 14.2.3 and 14.2.4. The
> corruption appears as an assertion that looks like
>
> os/bluestore/fastbmap_allocator_impl.h: 750: FAILED
> ceph_assert(available >= allocated)
>
> or, in some cases, a rocksdb checksum error. It only affects
> BlueStore OSDs that have a separate 'db' or 'wal' device.
>
> We have a fix [3] that is working its way through testing, and we
> will expedite the next Nautilus point release (14.2.5) once it is
> ready. If you are running 14.2.2 or 14.2.1 and use BlueStore OSDs
> with separate 'db' volumes, you should consider waiting to upgrade
> until 14.2.5 is released.
>
> A big thank you to Igor Fedotov and several *extremely* helpful
> users who managed to reproduce and track down this problem!
>
> sage
>
> [1] https://tracker.ceph.com/issues/42223
> [2] https://github.com/ceph/ceph/commit/096033b9d931312c0688c2eea7e14626bfde0ad7#diff-618db1d3389289a9d25840a4500ef0b0
> [3] https://github.com/ceph/ceph/pull/31621
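For anyone unsure whether their daemons are on an affected release,
the running versions can be listed cluster-wide with standard
commands. A sketch:

    # Show which Ceph versions each daemon type is running.
    ceph versions

    # Flag any daemons on the affected 14.2.3/14.2.4 releases.
    ceph versions | grep -E '14\.2\.[34]'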
[ceph-users] Re: Possible data corruption with 14.2.3 and 14.2.4
Great job tracking this down to everyone involved!

Mark

On 11/14/19 10:10 AM, Sage Weil wrote:
> Hi everyone,
>
> We've identified a data corruption bug [1], first introduced [2] (by
> yours truly) in 14.2.3 and affecting both 14.2.3 and 14.2.4. The
> corruption appears as an assertion that looks like
>
> os/bluestore/fastbmap_allocator_impl.h: 750: FAILED
> ceph_assert(available >= allocated)
>
> or, in some cases, a rocksdb checksum error. It only affects
> BlueStore OSDs that have a separate 'db' or 'wal' device.
>
> We have a fix [3] that is working its way through testing, and we
> will expedite the next Nautilus point release (14.2.5) once it is
> ready. If you are running 14.2.2 or 14.2.1 and use BlueStore OSDs
> with separate 'db' volumes, you should consider waiting to upgrade
> until 14.2.5 is released.
>
> A big thank you to Igor Fedotov and several *extremely* helpful
> users who managed to reproduce and track down this problem!
>
> sage
>
> [1] https://tracker.ceph.com/issues/42223
> [2] https://github.com/ceph/ceph/commit/096033b9d931312c0688c2eea7e14626bfde0ad7#diff-618db1d3389289a9d25840a4500ef0b0
> [3] https://github.com/ceph/ceph/pull/31621