Thanks Oliver and all others, this was really helpful.

Regards,
Muhammad Junaid
On Thu, Aug 1, 2019 at 5:25 PM Oliver Freyermuth <freyerm...@physik.uni-bonn.de> wrote:
> Hi together,
>
> On 01.08.19 at 08:45, Janne Johansson wrote:
> > On Thu, 1 Aug 2019 at 07:31, Muhammad Junaid <junaid.fsd...@gmail.com> wrote:
> > > Your email has cleared up many things for me. Let me repeat my understanding. Writes of critical data (like Oracle or any other DB) are done with sync/fsync flags, meaning they are only confirmed to the DB/application after they have actually been written to the hard drives/OSDs. Any other application can do this as well.
> > > All other writes, like OS logs etc., are confirmed to the app/user immediately but written later, passing through the kernel, the RBD cache, the physical drive cache (if any) and then to the disks. These are susceptible to loss on power failure, but overall they are recoverable/non-critical.
> >
> > That last part is probably a bit simplified. Between a program in a guest sending its data to the virtualised device, running under KVM on top of an OS that uses remote storage over the network, through to a storage server with its own OS and drive controller chip, and finally the physical drive(s) that store the write, I suspect there are something like ~10 possible layers of write caching, of which the RBD cache you were asking about is just one.
> >
> > It is just located very conveniently before the I/O has to leave the KVM host and go back and forth over the network, so it is the last place where you can see huge gains in the guests' I/O response time. At the same time it can be shared between lots of guests on the KVM host, which should have tons of RAM available compared to any single guest, so it is a nice way to get a large cache for outgoing writes.
> >
> > Also, to answer your first part: yes, all critical software that depends heavily on write ordering and integrity is hopefully already doing its write operations that way, using sync(), fsync(), fdatasync() and similar calls, but I can't produce a list of all programs that do. Since there are already many layers of delayed, cached writes even without virtualisation and/or Ceph, important applications have mostly learned their lessons by now, so chances are very high that all your important databases and similar programs are doing the right thing.
>
> Just to add to this: one piece of software for which people have cared a lot about this is, of course, the file system itself. BTRFS is notably very sensitive to broken flush / FUA (https://en.wikipedia.org/wiki/Disk_buffer#Force_Unit_Access_(FUA)) implementations at any layer of the I/O path, due to its rather complicated metadata structure.
> While for in-kernel and other open source software (such as librbd) there are usually a lot of people checking the code for a correct implementation and testing things, there is also broken hardware (or rather, firmware) in the wild.
>
> But there are even software issues around, if you think more generally and strive for data correctness (since corruption can also happen at any layer): I was hit by an in-kernel issue in the past (a network driver writing network statistics via DMA to the wrong memory location, "sometimes") which corrupted two BTRFS partitions of mine and caused random crashes in browsers and mail clients. Only in kernel 5.2 has BTRFS been hardened to check the metadata tree before flushing it to disk.
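To make the sync()/fsync() point in the quoted discussion concrete, here is a minimal sketch in C of the difference between a "fire and forget" write and one that is only acknowledged once the data has reached stable storage; the path and payload are made up purely for illustration:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char buf[] = "critical record\n";

    /* Plain buffered write: the data may still sit in the guest page cache,
     * the RBD cache, or a drive cache, and can be lost on power failure. */
    int fd = open("/var/tmp/example.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (write(fd, buf, strlen(buf)) < 0) { perror("write"); return 1; }

    /* Durable write: fsync() returns only after the kernel has flushed the
     * file's data and metadata down through the caching layers; this is the
     * kind of call a database issues before acknowledging a commit. */
    if (fsync(fd) < 0) { perror("fsync"); return 1; }

    close(fd);
    return 0;
}

An application that can tolerate losing the last few seconds of data simply skips the fsync() (or uses fdatasync() where only the file contents matter), which is exactly the trade-off described above.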
> If you are curious about known hardware issues, check out this lengthy but very insightful mail on the linux-btrfs list:
> https://lore.kernel.org/linux-btrfs/20190623204523.gc11...@hungrycats.org/
> As you can learn there, there are many drive and firmware combinations out there which do not implement flush / FUA correctly, and your BTRFS may be corrupted after a power failure. The very same thing can happen to Ceph, but replication across several OSDs and the lower probability of having broken disks in all hosts make this issue less likely.
>
> For what it is worth, we also use writeback caching for our virtualization cluster and are very happy with it - we also tried pulling power plugs on hypervisors, MONs and OSDs at random times during writes, and ext4 could always recover easily with an fsck making use of the journal.
>
> Cheers and HTH,
> Oliver
>
> > But if the guest is instead running a mail filter that does antivirus checks, spam checks and so on, operating on files that live on the machine for something like one second and then either get dropped or sent on to the destination mailbox somewhere else, then having aggressive write caches would be very useful. The effect of a crash would still mostly be "the emails that were in the queue were lost, were not acked by the final mail server and will probably be resent by the previous SMTP server". For such a guest VM, forcing sync writes would only be a net loss; it would gain much from having large RAM write caches.
> >
> > --
> > May the most significant bit of your life be positive.
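For anyone wanting to try the writeback caching Oliver describes, here is a minimal sketch of the client-side settings involved, assuming a Nautilus-era librbd with QEMU/libvirt; the values shown are illustrative (they match the usual defaults) and are worth checking against the documentation for your release:

# /etc/ceph/ceph.conf on the hypervisor: the [client] section seen by librbd/QEMU
[client]
# enable the librbd cache (this is the default in recent releases)
rbd cache = true
# stay in writethrough mode until the guest issues its first flush,
# so guests that never flush are not silently exposed to writeback semantics
rbd cache writethrough until flush = true
# 32 MiB of cache per image; raise this if the host has RAM to spare
rbd cache size = 33554432
# start writing back once 24 MiB of the cache is dirty
rbd cache max dirty = 25165824

# In the libvirt domain XML, the guest disk should advertise a writeback cache:
#   <driver name='qemu' type='raw' cache='writeback'/>

With cache='writeback', QEMU still passes guest flush requests through to librbd, so the sync/fsync writes discussed earlier in the thread are only acknowledged once they have reached the OSDs.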
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com