On Mon, Nov 23, 2015 at 4:07 PM, Jan Schermer <j...@schermer.cz> wrote:
> So I assume we _are_ talking about bit-rot?
>
>> On 23 Nov 2015, at 18:37, Jose Tavares <j...@terra.com.br> wrote:
>>
>> Yes, but with SW-RAID, when we have a block that was read and does not
>> match its checksum, the device falls out of the array, and the data is
>> read again from the other devices in the array.
>
> That's not true. SW-RAID reads data from one drive only. Comparison of
> the data on different drives only happens when a check is executed, and
> that doesn't help with bit-rot one bit :-) (The same goes for various
> SANs and arrays, but those usually employ additional CRC for the data,
> so their effective BER is orders of magnitude better.)

SW-RAID does read data from only one drive at a time, but the drive
itself checksums its data in hardware: "In daily business, your hard disk
does write a checksum and some ECC information for every sector being
written, and verifies this data during a read operation."

http://serverfault.com/questions/645862/when-does-a-raid-restore-redundancy-after-a-broken-sector-is-flagged-as-defectiv

"If the disk is out of replacement sectors..." This is the most common
scenario we see these days, so the OS must deal with these bad blocks.

>> The problem is that in SW-RAID1 we don't have the bad blocks isolated.
>> The disks can be synchronized again, as the write operation is not
>> tested. The problem (the device falling out of the array) will happen
>> again if we try to read any other data written over the bad block.
>
> Not true either. Bit-rot happens not (only) when the data gets written
> wrong, but when it is read. If you read one block long enough you will
> get wrong data once every $BER_bits. Rewriting the data doesn't help.
> (It's a bit different with some SSDs that don't refresh blocks, so
> rewriting/refreshing them might help.)
>
>> My new question regarding Ceph is whether it isolates the bad sectors
>> where it found bad data when scrubbing, or whether there will always
>> be a replica of something over a known bad block.
>>
>> I also saw that Ceph uses some metrics when capturing data from disks.
>> When a disk is resetting or has problems, its metrics will be bad and
>> the cluster will rank that OSD as bad. But I didn't see any way of
>> sending alerts or anything like that. SW-RAID has its mdadm monitor
>> that alerts when things go bad. Do I have to be watching the Ceph logs
>> all the time to see when things go wrong?
>
> You should graph every drive and look for anomalies. Ceph only detects
> a problem when the drive is already very unusable (the ceph-osd process
> itself typically blocks for tens of seconds).
> Ceph is not really good when it comes to latency SLAs, no matter how
> much you try, but that's usually sufficient.
>
>> Thanks.
>> Jose Tavares
>>
>> On Mon, Nov 23, 2015 at 3:19 PM, Robert LeBlanc <rob...@leblancnet.us> wrote:
>>
>>> Most people run their clusters with no RAID for the data disks (some
>>> will run RAID for the journals, but we don't). We use the scrub
>>> mechanism to find data inconsistency and we use three copies to do
>>> RAID over hosts/racks, etc. Unless you have a specific need, it is
>>> best to forgo Linux SW RAID, or even HW RAID, with Ceph.
>>> ----------------
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>>>
>>> On Mon, Nov 23, 2015 at 10:09 AM, Jose Tavares wrote:
>>>
>>>> Hi guys...
>>>>
>>>> Is there any advantage in running Ceph over Linux SW-RAID to avoid
>>>> data corruption due to disk bad blocks?
>>>>
>>>> Can we just rely on the scrubbing feature of Ceph? Can we live
>>>> without an underlying layer that keeps hardware problems from being
>>>> passed up to Ceph?
>>>>
>>>> I have a setup with one OSD per node, each node on a 2-disk RAID-1.
>>>> Is that a good option, or would it be better to have 2 OSDs, one on
>>>> each disk? With one OSD per disk, I would have to increase the
>>>> number of replicas to guarantee enough replicas if one node goes
>>>> down.
>>>>
>>>> Thanks a lot.
>>>> Jose Tavares
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
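Jan's point about getting wrong data "once every $BER_bits" can be made concrete with a little arithmetic. The sketch below is not from the thread: it assumes the commonly quoted consumer-drive unrecoverable-read-error rate of one bad bit per 1e14 bits read, and it models bit errors as independent, which is a simplification.

```python
def ure_probability(bytes_read, ber=1e-14):
    """Probability of hitting at least one unrecoverable read error
    while reading `bytes_read` bytes, treating each bit as an
    independent Bernoulli trial with error rate `ber`."""
    bits = bytes_read * 8
    return 1 - (1 - ber) ** bits

# Reading a 4 TB drive end to end at BER 1e-14:
print(round(ure_probability(4e12), 2))
```

Under those assumptions, a single full read of a 4 TB drive already has roughly a one-in-four chance of tripping over an unreadable sector, which is why rewriting data does not make the problem go away.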
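The per-sector checksum behaviour quoted from serverfault can be illustrated with a toy model. This is hypothetical code, not how any real drive firmware works: it just shows why bit-rot on the platter usually surfaces as a read error rather than as silently wrong data, because the stored CRC no longer matches.

```python
import zlib

SECTOR = 512

def write_sector(data):
    """Store the sector together with a CRC, standing in for the
    per-sector ECC the drive writes alongside the data."""
    assert len(data) == SECTOR
    return data, zlib.crc32(data)

def read_sector(stored):
    """Verify the CRC on read; a mismatch becomes a read error
    instead of silently returned rotten data."""
    data, crc = stored
    if zlib.crc32(data) != crc:
        raise IOError("unrecoverable read error")
    return data

sector = write_sector(b"\x00" * SECTOR)
read_sector(sector)                              # reads back fine
rotten = (b"\x01" + sector[0][1:], sector[1])    # simulate a flipped bit on the medium
# read_sector(rotten) would now raise IOError
```

This is also why the OS (or Ceph) sees bad blocks as I/O errors it must handle, rather than being handed corrupt bytes.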
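Robert's "scrub plus three copies" answer can be sketched as follows. This is a toy model, not Ceph's actual implementation: a deep scrub conceptually checksums every replica of an object and flags any copy whose digest disagrees with the others, after which the bad copy can be repaired from a good one.

```python
import hashlib
from collections import Counter

def deep_scrub(replicas):
    """Return indices of replicas whose digest disagrees with the
    majority digest across all copies."""
    digests = [hashlib.sha256(r).hexdigest() for r in replicas]
    majority, _ = Counter(digests).most_common(1)[0]
    return [i for i, d in enumerate(digests) if d != majority]

good = b"rbd object payload"
copies = [good, bytes([good[0] ^ 1]) + good[1:], good]  # middle copy bit-rotted
print(deep_scrub(copies))  # -> [1]
```

With three copies a single rotten replica is simply outvoted; with only two copies a scrub can detect a mismatch but cannot tell which side is correct, which is one argument for size=3 over RAID-1 underneath a single OSD.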
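On the alerting question: Ceph has no built-in mdadm-style mailer, but a cron job polling `ceph health` gets close. A minimal sketch, assuming the `ceph` CLI is on the path; the injectable `run` hook is there purely so the example can be exercised with canned output instead of a live cluster, and the print is a stand-in for whatever mail or paging hook you actually use.

```python
import subprocess

def check_health(run=None):
    """Alert (here: print) whenever the cluster is not HEALTH_OK."""
    if run is None:
        # Real deployment: ask the cluster directly.
        run = lambda: subprocess.check_output(["ceph", "health"], text=True)
    status = run().strip()
    if not status.startswith("HEALTH_OK"):
        print(f"ALERT: {status}")  # swap in sendmail/SNMP/pager here
        return False
    return True

# Exercised with canned output instead of a live cluster:
check_health(run=lambda: "HEALTH_WARN 1 osds are down")
```

Run it every few minutes from cron and you get roughly the behaviour of `mdadm --monitor`, without having to tail the Ceph logs by hand.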