Hi Frank,

I'd be interested to read that paper, if you can find it again. I
don't understand why volatile cache + fsync would be dangerous due
to buggy firmware, yet we should trust that the same firmware
respects FUA when the volatile cache is disabled.
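
To make sure we're talking about the same thing, here is roughly what
the two contracts look like; a minimal sketch with made-up file names,
not BlueStore's actual code:

/* Illustrative sketch only: both paths depend on the drive firmware
 * telling the truth, just at different points. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char buf[] = "committed record\n";

    /* Volatile cache enabled: data may sit in drive RAM after write();
     * fdatasync() triggers a cache flush, and durability depends on the
     * firmware honouring that flush. */
    int fd = open("flush-path.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0 || write(fd, buf, strlen(buf)) < 0 || fdatasync(fd) < 0)
        perror("flush path");
    if (fd >= 0)
        close(fd);

    /* Volatile cache disabled (or O_DSYNC writes, which the kernel may
     * issue with FUA): a completed write is supposed to already be on
     * stable media, which again depends on the firmware. */
    fd = open("fua-path.dat", O_WRONLY | O_CREAT | O_TRUNC | O_DSYNC, 0644);
    if (fd < 0 || write(fd, buf, strlen(buf)) < 0)
        perror("write-through path");
    if (fd >= 0)
        close(fd);
    return 0;
}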

In https://github.com/ceph/ceph/pull/43848 we're documenting the
implications of WCE -- but in the context of performance, not safety.
If write through / volatile cache off is required for safety too, then
we should take a different approach (e.g. ceph could disable the write
cache itself).

Cheers, dan



On Tue, Nov 30, 2021 at 9:36 AM Frank Schilder <fr...@dtu.dk> wrote:
>
> Hi Dan.
>
> > ...however it is not unsafe to leave the cache enabled -- ceph uses
> > fsync appropriately to make the writes durable.
>
> Actually it is. You are relying on the drive's firmware to implement this 
> correctly, and that is, unfortunately, less than a given. Within the last 
> one or two years somebody posted a link to a very interesting research 
> paper on this list, where drives were tested under real conditions. It 
> turns out that "fsync to make writes persistent" is very vulnerable to 
> power loss if the volatile write cache is enabled. If I remember correctly, 
> about 1-2% of drives ended up with data loss every time. In other words, 
> for every drive with the volatile write cache enabled, every 100 power-loss 
> events will give you 1-2 data-loss events (in certain situations, the drive 
> replies with an ack before the volatile cache is actually flushed). I think 
> even PLP did not prevent data loss in all cases.
>
> It's all down to bugs in firmware that fail to catch all corner cases and 
> internal race conditions in ops scheduling. Vendors very often prioritise 
> performance over fixing a rare race condition, and I will not take, nor 
> recommend taking, chances.
>
> I think this kind of advice should really not be given in a ceph context 
> without also stating the prerequisite: perfect firmware. Ceph is a 
> scale-out system, and any reasonably large cluster has enough drives to 
> see low-probability events on a regular basis. At the very least, recommend 
> testing this thoroughly, that is, performing power-loss tests under load, 
> and I mean many power-loss events per drive, at randomised intervals, under 
> different load patterns.
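>
> Something along the following lines is enough for the write side of such a 
> test (a hypothetical sketch, not an existing tool): append records, fsync 
> each one, and only then report the sequence number, piping the output to a 
> second machine. After the power cut, every reported number must still be 
> readable from the file; if it is not, the drive acknowledged a flush it 
> then lost.
>
> #include <fcntl.h>
> #include <stdint.h>
> #include <stdio.h>
> #include <unistd.h>
>
> int main(int argc, char **argv)
> {
>     if (argc < 2) {
>         fprintf(stderr, "usage: %s <file on device under test>\n", argv[0]);
>         return 1;
>     }
>     int fd = open(argv[1], O_WRONLY | O_CREAT | O_APPEND, 0644);
>     if (fd < 0) { perror("open"); return 1; }
>     for (uint64_t seq = 0; ; seq++) {
>         if (write(fd, &seq, sizeof(seq)) != (ssize_t)sizeof(seq)) {
>             perror("write"); return 1;
>         }
>         /* fsync returning 0 is the drive's claim that the record is durable */
>         if (fsync(fd) != 0) { perror("fsync"); return 1; }
>         /* report the acknowledged sequence number off-box, e.g. via ssh */
>         printf("%llu\n", (unsigned long long)seq);
>         fflush(stdout);
>     }
> }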
>
> The same applies to disk controllers with cache. Nobody recommends using 
> the controller cache because of firmware bugs that seem to be present in 
> all models; we have enough cases on this list of data loss after power loss 
> where the controller cache was the issue. The recommendation is to enable 
> HBA mode and write-through. Do the same with your disks' write cache and 
> get better sleep and better performance in one go.
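>
> For the disks themselves that is a one-liner per device, for example 
> something like "hdparm -W 0 /dev/sdX" for SATA or "sdparm --clear=WCE 
> /dev/sdX" for SAS (illustrative device names; check the exact option 
> syntax for your tool versions). Note that on many drives the setting does 
> not survive a power cycle, so it has to be reapplied at boot.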
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Dan van der Ster <d...@vanderster.com>
> Sent: 29 November 2021 09:24:29
> To: Frank Schilder
> Cc: huxia...@horebdata.cn; YiteGu; ceph-users
> Subject: Re: [ceph-users] Re: Rocksdb: Corruption: missing start of 
> fragmented record(1)
>
> Hi Frank,
>
> That's true from the performance perspective, however it is not unsafe
> to leave the cache enabled -- ceph uses fsync appropriately to make
> the writes durable.
>
> This issue looks rather to be related to concurrent hardware failure.
>
> Cheers, Dan
>
> On Mon, Nov 29, 2021 at 9:21 AM Frank Schilder <fr...@dtu.dk> wrote:
> >
> > This may sound counter-intuitive, but you need to disable the write cache 
> > so that only the PLP-protected cache is used. SSDs with PLP usually have 
> > two types of cache, volatile and non-volatile. The volatile cache loses 
> > its data on power loss, and it is the volatile cache that gets disabled 
> > when you issue the hd-/sdparm/smartctl command to switch it off. In many 
> > cases this can increase the non-volatile cache and also performance.
> >
> > It is the non-volatile cache you want your writes to go to directly.
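> >
> > You can check what a drive currently reports with, for example, something 
> > like "smartctl -g wcache /dev/sdX" or "hdparm -W /dev/sdX" (illustrative 
> > device name; the exact options depend on the tool version and on whether 
> > the drive is SATA or SAS).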
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: huxia...@horebdata.cn <huxia...@horebdata.cn>
> > Sent: 26 November 2021 22:41:10
> > To: YiteGu; ceph-users
> > Subject: [ceph-users] Re: Rocksdb: Corruption: missing start of fragmented 
> > record(1)
> >
> > wal/db are on Intel S4610 960GB SSDs, with PLP and write-back cache on
> >
> >
> >
> > huxia...@horebdata.cn
> >
> > From: YiteGu
> > Date: 2021-11-26 11:32
> > To: huxia...@horebdata.cn; ceph-users
> > Subject: Re:[ceph-users] Rocksdb: Corruption: missing start of fragmented 
> > record(1)
> > It looks like your wal/db device lost data.
> > Please check whether your wal/db device has a volatile write-back cache; 
> > a power loss can then cause data loss, and the log replay fails when 
> > RocksDB restarts.
> >
> >
> >
> > YiteGu
> > ess_...@qq.com
> >
> >
> >
> > ------------------ Original ------------------
> > From: "huxia...@horebdata.cn" <huxia...@horebdata.cn>;
> > Date: Fri, Nov 26, 2021 06:02 PM
> > To: "ceph-users"<ceph-users@ceph.io>;
> > Subject: [ceph-users] Rocksdb: Corruption: missing start of fragmented 
> > record(1)
> >
> > Dear Cephers,
> >
> > I just had one Ceph OSD node (Luminous 12.2.13) lose power unexpectedly, 
> > and after restarting that node, two OSDs out of 10 cannot be started, 
> > issuing the following errors (see the image below); in particular, I see
> >
> > Rocksdb: Corruption: missing start of fragmented record(1)
> > Bluestore(/var/lib/ceph/osd/osd-21) _open_db erroring opening db:
> > ...
> > **ERROR: OSD init failed: (5)  Input/output error
> >
> > I checked the db/wal SSDs, and they are working fine. So I am wondering 
> > the following:
> > 1) Is there a method to restore the OSDs?
> > 2) What could be the potential causes of the corrupted db/wal? The db/wal 
> > SSDs have PLP and were not damaged during the power loss.
> >
> > Your help would be highly appreciated.
> >
> > best regards,
> >
> > samuel
> >
> >
> >
> >
> > huxia...@horebdata.cn
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
