Thanks Oliver and all others, this was really helpful.

Regards,
Muhammad Junaid
On Thu, Aug 1, 2019 at 5:25 PM Oliver Freyermuth <freyerm...@physik.uni-bonn.de> wrote:
> Hi together,
>
> On 01.08.19 at 08:45, Janne Johansson wrote:
> > On Thu, 1 Aug 2019 at 07:31, Muhammad Junaid <junaid.fsd...@gmail.com> wrote:
> > > Your email has cleared up many things for me. Let me repeat my understanding. Writes of critical data (like Oracle or any other DB) are done with sync/fsync flags, meaning they are only confirmed to the DB/application after they have actually been written to the hard drives/OSDs. Any other application can do this as well.
> > > All other writes, like OS logs etc., are confirmed to the app/user immediately but written later, passing through the kernel, the RBD cache, the physical drive cache (if any) and then to the disks. These are susceptible to loss on power failure, but overall they are recoverable/non-critical.
> >
> > That last part is probably a bit simplified. Between a program in a guest sending its data to the virtualised device, running under KVM on top of an OS that uses remote storage over the network, through to a storage server with its own OS and drive controller chip, and finally the physical drive(s) that store the write, I suspect there are something like ~10 possible layers of write caching, of which the RBD cache you were asking about is just one.
> >
> > It is just located very conveniently before the I/O has to leave the KVM host and go back and forth over the network, so it is the last place where you can see huge gains in the guests' I/O response time. At the same time it can be shared between lots of guests on the KVM host, which should have tons of RAM available compared to any single guest, so it is a nice way to get a large cache for outgoing writes.
> >
> > Also, to answer your first part: yes, all critical software that depends heavily on write ordering and integrity is hopefully already doing its write operations that way, using sync(), fsync(), fdatasync() and similar calls, but I can't produce a list of all programs that do. Since there are already many layers of delayed, cached writes even without virtualisation and/or Ceph, important applications have mostly learned their lessons by now, so chances are very high that all your important databases and similar programs are doing the right thing.
>
> Just to add to this: one piece of software for which people have cared a lot about this is, of course, the file system itself. BTRFS is notably very sensitive to broken flush / FUA (https://en.wikipedia.org/wiki/Disk_buffer#Force_Unit_Access_(FUA)) implementations at any layer of the I/O path, due to its rather complicated metadata structure.
> While for in-kernel and other open source software (such as librbd) there are usually a lot of people checking the code for a correct implementation and testing things, there is also broken hardware (or rather, firmware) in the wild.
>
> But there are even software issues around, if you think more generally and strive for data correctness (since corruption can also happen at any layer): I was hit by an in-kernel issue in the past (a network driver writing network statistics via DMA to the wrong memory location, "sometimes") which corrupted two BTRFS partitions of mine and caused random crashes in browsers and mail clients. Only in kernel 5.2 has BTRFS been hardened to check the metadata tree before flushing it to disk.
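To make the sync()/fsync() point in the quoted discussion concrete, here is a minimal sketch in C of the difference between a "fire and forget" write and one that is only acknowledged once the data has reached stable storage; the path and payload are made up purely for illustration:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char buf[] = "critical record\n";

    /* Plain buffered write: the data may still sit in the guest page cache,
     * the RBD cache, or a drive cache, and can be lost on power failure. */
    int fd = open("/var/tmp/example.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (write(fd, buf, strlen(buf)) < 0) { perror("write"); return 1; }

    /* Durable write: fsync() returns only after the kernel has flushed the
     * file's data and metadata down through the caching layers; this is the
     * kind of call a database issues before acknowledging a commit. */
    if (fsync(fd) < 0) { perror("fsync"); return 1; }

    close(fd);
    return 0;
}

An application that can tolerate losing the last few seconds of data simply skips the fsync() (or uses fdatasync() where only the file contents matter), which is exactly the trade-off described above.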
> If you are curious about known hardware issues, check out this lengthy but very insightful mail on the linux-btrfs list:
> https://lore.kernel.org/linux-btrfs/20190623204523.gc11...@hungrycats.org/
> As you can learn there, there are many drive and firmware combinations out there which do not implement flush / FUA correctly, and your BTRFS may be corrupted after a power failure. The very same thing can happen to Ceph, but replication across several OSDs and the lower probability of having broken disks in all hosts make this issue less likely.
>
> For what it is worth, we also use writeback caching for our virtualization cluster and are very happy with it - we also tried pulling power plugs on hypervisors, MONs and OSDs at random times during writes, and ext4 could always recover easily with an fsck making use of the journal.
>
> Cheers and HTH,
> Oliver
>
> > But if the guest is instead running a mail filter that does antivirus checks, spam checks and so on, operating on files that live on the machine for something like one second and then either get dropped or sent on to the destination mailbox somewhere else, then having aggressive write caches would be very useful. The effect of a crash would still mostly be "the emails that were in the queue were lost, were not acked by the final mail server and will probably be resent by the previous SMTP server". For such a guest VM, forcing sync writes would only be a net loss; it would gain much from having large RAM write caches.
> >
> > --
> > May the most significant bit of your life be positive.
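For anyone wanting to try the writeback caching Oliver describes, here is a minimal sketch of the client-side settings involved, assuming a Nautilus-era librbd with QEMU/libvirt; the values shown are illustrative (they match the usual defaults) and are worth checking against the documentation for your release:

# /etc/ceph/ceph.conf on the hypervisor: the [client] section seen by librbd/QEMU
[client]
# enable the librbd cache (this is the default in recent releases)
rbd cache = true
# stay in writethrough mode until the guest issues its first flush,
# so guests that never flush are not silently exposed to writeback semantics
rbd cache writethrough until flush = true
# 32 MiB of cache per image; raise this if the host has RAM to spare
rbd cache size = 33554432
# start writing back once 24 MiB of the cache is dirty
rbd cache max dirty = 25165824

# In the libvirt domain XML, the guest disk should advertise a writeback cache:
#   <driver name='qemu' type='raw' cache='writeback'/>

With cache='writeback', QEMU still passes guest flush requests through to librbd, so the sync/fsync writes discussed earlier in the thread are only acknowledged once they have reached the OSDs.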
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com