On Mon, Nov 23, 2015 at 4:07 PM, Jan Schermer <j...@schermer.cz> wrote:
> So I assume we _are_ talking about bit-rot?
>
>> On 23 Nov 2015, at 18:37, Jose Tavares <j...@terra.com.br> wrote:
>>
>> Yes, but with SW-RAID, when we have a block that was read and does not
>> match its checksum, the device falls out of the array, and the data is
>> read again from the other devices in the array.
>
> That's not true. SW-RAID reads data from one drive only. Comparison of
> the data on different drives only happens when a check is executed, and
> that doesn't help with bit-rot one bit :-) (The same goes for various
> SANs and arrays, but those usually employ additional CRC for the data,
> so their effective BER is orders of magnitude better.)

SW-RAID does read data from only one drive at a time, but the drive
itself checksums its data in hardware: "In daily business, your hard disk
does write a checksum and some ECC information for every sector being
written, and verifies this data during a read operation."

http://serverfault.com/questions/645862/when-does-a-raid-restore-redundancy-after-a-broken-sector-is-flagged-as-defectiv

"If the disk is out of replacement sectors..." This is the most common
scenario we see these days, so the OS must deal with these bad blocks.

>> The problem is that in SW-RAID1 we don't have the bad blocks isolated.
>> The disks can be synchronized again, as the write operation is not
>> tested. The problem (the device falling out of the array) will happen
>> again if we try to read any other data written over the bad block.
>
> Not true either. Bit-rot happens not (only) when the data gets written
> wrong, but when it is read. If you read one block long enough you will
> get wrong data once every $BER_bits. Rewriting the data doesn't help.
> (It's a bit different with some SSDs that don't refresh blocks, so
> rewriting/refreshing them might help.)
>
>> My new question regarding Ceph is whether it isolates the bad sectors
>> where it found bad data when scrubbing, or whether there will always
>> be a replica of something over a known bad block.
>>
>> I also saw that Ceph uses some metrics when capturing data from disks.
>> When a disk is resetting or has problems, its metrics will be bad and
>> the cluster will rank that OSD as bad. But I didn't see any way of
>> sending alerts or anything like that. SW-RAID has its mdadm monitor
>> that alerts when things go bad. Do I have to be watching the Ceph logs
>> all the time to see when things go wrong?
>
> You should graph every drive and look for anomalies. Ceph only detects
> a problem when the drive is already very unusable (the ceph-osd process
> itself typically blocks for tens of seconds).
> Ceph is not really good when it comes to latency SLAs, no matter how
> much you try, but that's usually sufficient.
>
>> Thanks.
>> Jose Tavares
>>
>> On Mon, Nov 23, 2015 at 3:19 PM, Robert LeBlanc <rob...@leblancnet.us> wrote:
>>
>>> Most people run their clusters with no RAID for the data disks (some
>>> will run RAID for the journals, but we don't). We use the scrub
>>> mechanism to find data inconsistency and we use three copies to do
>>> RAID over hosts/racks, etc. Unless you have a specific need, it is
>>> best to forgo Linux SW RAID, or even HW RAID, with Ceph.
>>> ----------------
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>>>
>>> On Mon, Nov 23, 2015 at 10:09 AM, Jose Tavares wrote:
>>>
>>>> Hi guys...
>>>>
>>>> Is there any advantage in running Ceph over Linux SW-RAID to avoid
>>>> data corruption due to disk bad blocks?
>>>>
>>>> Can we just rely on the scrubbing feature of Ceph? Can we live
>>>> without an underlying layer that keeps hardware problems from being
>>>> passed up to Ceph?
>>>>
>>>> I have a setup with one OSD per node, each node on a 2-disk RAID-1.
>>>> Is that a good option, or would it be better to have 2 OSDs, one on
>>>> each disk? With one OSD per disk, I would have to increase the
>>>> number of replicas to guarantee enough replicas if one node goes
>>>> down.
>>>>
>>>> Thanks a lot.
>>>> Jose Tavares
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
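Jan's point about getting wrong data "once every $BER_bits" can be made concrete with a little arithmetic. The sketch below is not from the thread: it assumes the commonly quoted consumer-drive unrecoverable-read-error rate of one bad bit per 1e14 bits read, and it models bit errors as independent, which is a simplification.

```python
def ure_probability(bytes_read, ber=1e-14):
    """Probability of hitting at least one unrecoverable read error
    while reading `bytes_read` bytes, treating each bit as an
    independent Bernoulli trial with error rate `ber`."""
    bits = bytes_read * 8
    return 1 - (1 - ber) ** bits

# Reading a 4 TB drive end to end at BER 1e-14:
print(round(ure_probability(4e12), 2))
```

Under those assumptions, a single full read of a 4 TB drive already has roughly a one-in-four chance of tripping over an unreadable sector, which is why rewriting data does not make the problem go away.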
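The per-sector checksum behaviour quoted from serverfault can be illustrated with a toy model. This is hypothetical code, not how any real drive firmware works: it just shows why bit-rot on the platter usually surfaces as a read error rather than as silently wrong data, because the stored CRC no longer matches.

```python
import zlib

SECTOR = 512

def write_sector(data):
    """Store the sector together with a CRC, standing in for the
    per-sector ECC the drive writes alongside the data."""
    assert len(data) == SECTOR
    return data, zlib.crc32(data)

def read_sector(stored):
    """Verify the CRC on read; a mismatch becomes a read error
    instead of silently returned rotten data."""
    data, crc = stored
    if zlib.crc32(data) != crc:
        raise IOError("unrecoverable read error")
    return data

sector = write_sector(b"\x00" * SECTOR)
read_sector(sector)                              # reads back fine
rotten = (b"\x01" + sector[0][1:], sector[1])    # simulate a flipped bit on the medium
# read_sector(rotten) would now raise IOError
```

This is also why the OS (or Ceph) sees bad blocks as I/O errors it must handle, rather than being handed corrupt bytes.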
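Robert's "scrub plus three copies" answer can be sketched as follows. This is a toy model, not Ceph's actual implementation: a deep scrub conceptually checksums every replica of an object and flags any copy whose digest disagrees with the others, after which the bad copy can be repaired from a good one.

```python
import hashlib
from collections import Counter

def deep_scrub(replicas):
    """Return indices of replicas whose digest disagrees with the
    majority digest across all copies."""
    digests = [hashlib.sha256(r).hexdigest() for r in replicas]
    majority, _ = Counter(digests).most_common(1)[0]
    return [i for i, d in enumerate(digests) if d != majority]

good = b"rbd object payload"
copies = [good, bytes([good[0] ^ 1]) + good[1:], good]  # middle copy bit-rotted
print(deep_scrub(copies))  # -> [1]
```

With three copies a single rotten replica is simply outvoted; with only two copies a scrub can detect a mismatch but cannot tell which side is correct, which is one argument for size=3 over RAID-1 underneath a single OSD.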
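On the alerting question: Ceph has no built-in mdadm-style mailer, but a cron job polling `ceph health` gets close. A minimal sketch, assuming the `ceph` CLI is on the path; the injectable `run` hook is there purely so the example can be exercised with canned output instead of a live cluster, and the print is a stand-in for whatever mail or paging hook you actually use.

```python
import subprocess

def check_health(run=None):
    """Alert (here: print) whenever the cluster is not HEALTH_OK."""
    if run is None:
        # Real deployment: ask the cluster directly.
        run = lambda: subprocess.check_output(["ceph", "health"], text=True)
    status = run().strip()
    if not status.startswith("HEALTH_OK"):
        print(f"ALERT: {status}")  # swap in sendmail/SNMP/pager here
        return False
    return True

# Exercised with canned output instead of a live cluster:
check_health(run=lambda: "HEALTH_WARN 1 osds are down")
```

Run it every few minutes from cron and you get roughly the behaviour of `mdadm --monitor`, without having to tail the Ceph logs by hand.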