Re: softraid(4) RAID1 tools or experimental patches for consistency checking

2020-01-13 Thread Karel Gardas



A few additional notes to my email below:

- my performance testing, results and conclusions were obtained only on 
mechanical drives (Hitachi 7K500 and WD RE 500) and only with a 
metadata-intensive workload. Basically: tar -xf src.tar; unmount; then 
rm -rf src; unmount, where src.tar was a tarball of the -stable src tree 
at that time.


- at the same time WAPBL was submitted to tech@, and IIRC it increased 
performance a lot, since I was also using limited caching of checksum 
blocks (not in the patch, never submitted) and since the WAPBL log lives 
in a fixed place on disk, which made RAID1c/s much happier.


- at the time I did not have any SSD/NVMe drives for testing. The 
situation may be different with those, especially once someone decides 
what is and is not tolerable w.r.t. speed these days.







Re: softraid(4) RAID1 tools or experimental patches for consistency checking

2020-01-12 Thread Karel Gardas



I tried something like that in the past: 
https://marc.info/?l=openbsd-tech&m=144217941801350&w=2


It worked kind of OK, except for the performance. The problem is that the 
data layout turns one read op. into 2x read ops. and one write op. into a 
read op. + 2x write ops., which is not a speed winner. Caching the checksum 
blocks helped a lot in some cases, but was not submitted, since you would 
also ideally need readahead, and that was not done at all. The other 
performance issue is that putting this slow virtual drive implementation 
under the already slow ffs is a recipe for disappointment from the 
performance point of view. Certainly no speed demon, and certainly in a 
completely different league than the checksumming-capable filesystems from 
the open-source world (ZFS, btrfs, bcachefs; no, HAMMER2 is not among them, 
since it checksums only metadata, not user data, and can't self-heal).
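
To illustrate where those extra ops come from, here is a minimal, 
self-contained sketch. It is NOT the actual patch: the interleaved layout 
(one checksum block ahead of each group of data blocks), the 512-byte block 
size, the toy checksum and all names are made up, and it models a single 
chunk only (a real RAID1 write repeats the write part on every chunk).

/*
 * Sketch only, not softraid code: one checksum block leads each group
 * of CSUMS_PER_BLK data blocks, and the "disk" is a memory array.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLKSZ           512
#define CSUMS_PER_BLK   (BLKSZ / (int)sizeof(uint32_t))  /* 128 */
#define NGROUPS         8

static uint8_t disk[NGROUPS * (CSUMS_PER_BLK + 1)][BLKSZ];
static unsigned nreads, nwrites;

static void blk_read(uint64_t p, void *b)        { memcpy(b, disk[p], BLKSZ); nreads++; }
static void blk_write(uint64_t p, const void *b) { memcpy(disk[p], b, BLKSZ); nwrites++; }

static uint32_t
csum32(const void *b)                   /* toy checksum, not the real one */
{
        const uint8_t *p = b;
        uint32_t s = 0;

        for (int i = 0; i < BLKSZ; i++)
                s = s * 31 + p[i];
        return s;
}

/* map a logical data block to its physical block and its checksum block */
static void
map_blk(uint64_t lblk, uint64_t *pblk, uint64_t *cblk, int *slot)
{
        uint64_t g = lblk / CSUMS_PER_BLK;

        *cblk = g * (CSUMS_PER_BLK + 1);        /* checksum block leads the group */
        *slot = lblk % CSUMS_PER_BLK;
        *pblk = *cblk + 1 + *slot;              /* data blocks follow it */
}

/* logical read -> 2 physical reads: the data block plus its checksum block */
static int
raid1c_read(uint64_t lblk, void *buf)
{
        uint64_t pblk, cblk;
        uint32_t csums[CSUMS_PER_BLK];
        int slot;

        map_blk(lblk, &pblk, &cblk, &slot);
        blk_read(pblk, buf);
        blk_read(cblk, csums);
        return csum32(buf) == csums[slot] ? 0 : -1;  /* mismatch: try the mirror */
}

/* logical write -> 1 read + 2 writes: read-modify-write of the checksum block */
static void
raid1c_write(uint64_t lblk, const void *buf)
{
        uint64_t pblk, cblk;
        uint32_t csums[CSUMS_PER_BLK];
        int slot;

        map_blk(lblk, &pblk, &cblk, &slot);
        blk_read(cblk, csums);
        csums[slot] = csum32(buf);
        blk_write(pblk, buf);
        blk_write(cblk, csums);
}

int
main(void)
{
        uint8_t buf[BLKSZ] = { 0 };

        for (uint64_t l = 0; l < 64; l++)
                raid1c_write(l, buf);
        for (uint64_t l = 0; l < 64; l++)
                (void)raid1c_read(l, buf);
        printf("64 writes + 64 reads -> %u physical reads, %u physical writes\n",
            nreads, nwrites);
        return 0;
}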


Yes, you are right that ideally the drive would be filesystem-aware to 
optimize the rebuild, but this may be worked around by a more clever layout 
that also marks used blocks. Anyway, that (and the above) are IMHO the 
reasons why development effort goes into checksumming filesystems instead of 
checksumming software RAIDs. I read a paper somewhere about Linux's mdadm 
hacked to do checksums, and the result was pretty much the same (IIRC!), 
i.e. performance disappointment. If you are curious, google for it.
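
A rough sketch of that used-block idea; everything here is hypothetical (no 
such bitmap exists in softraid, and a real one would have to be persisted in 
the volume metadata and kept crash-consistent): track every data block that 
has ever been written, so a rebuild only copies those blocks instead of the 
whole chunk.

#include <stdint.h>
#include <stdio.h>

#define VOL_BLOCKS      (1u << 20)              /* example volume size in blocks */

static uint8_t used_map[VOL_BLOCKS / 8];        /* 1 bit per data block */

static void
mark_used(uint64_t blk)                         /* would be called on every write */
{
        used_map[blk / 8] |= 1 << (blk % 8);
}

static int
is_used(uint64_t blk)                           /* consulted by the rebuild loop */
{
        return used_map[blk / 8] & (1 << (blk % 8));
}

int
main(void)
{
        uint64_t copied = 0;

        /* pretend the filesystem only ever touched the first 4096 blocks */
        for (uint64_t blk = 0; blk < 4096; blk++)
                mark_used(blk);

        /* a rebuild would copy only the marked blocks to the new chunk */
        for (uint64_t blk = 0; blk < VOL_BLOCKS; blk++)
                if (is_used(blk))
                        copied++;
        printf("rebuild copies %llu of %u blocks\n",
            (unsigned long long)copied, VOL_BLOCKS);
        return 0;
}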


So, work on it if you can tolerate the speed...






softraid(4) RAID1 tools or experimental patches for consistency checking

2020-01-12 Thread Constantine A. Murenin
Dear misc@,

I'm curious if anyone has any sort of tools / patches to verify the consistency 
of softraid(4) RAID1 volumes?


If one adds a new disc (i.e. chunk) to a volume with the RAID1 discipline, the 
resilvering process of softraid(4) will read data from one of the existing 
discs, and write it back to all the discs, ridding you of the artefacts that 
could potentially be used to reconstruct the flipped bits correctly.

Additionally, this resilvering process is also really slow.  Per my notes from 
a few years ago, softraid has a fixed block size of 64KB (MAXPHYS); if we're 
talking about spindle-based HDDs, they only support like 80 random IOPS at 7,2k 
RPM, half of which we gotta use for reads, half for writes; this means it'll 
take (1TB/64KB/(80/s/2)) = 4,5 days to resilver each 1TB of an average 7,2k RPM 
HDD; compare this with sequential resilvering, which will take (1TB/120MB/s) = 
2,3 hours; the reality may vary from these imprecise calculations, but these 
numbers do seem representative of the experience.
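
Spelling the arithmetic out with the same inputs (64KB blocks, roughly 80 
random IOPS split evenly between reads and writes, roughly 120MB/s 
sequential):

\[
t_{\mathrm{random}} \approx \frac{1\,\mathrm{TB} / 64\,\mathrm{KB}}{(80/2)\ \mathrm{IOPS}}
  \approx \frac{1.6 \times 10^{7}\ \mathrm{blocks}}{40\ \mathrm{writes/s}}
  \approx 3.9 \times 10^{5}\ \mathrm{s} \approx 4.5\ \mathrm{days},
\qquad
t_{\mathrm{seq}} \approx \frac{1\,\mathrm{TB}}{120\,\mathrm{MB/s}}
  \approx 8.3 \times 10^{3}\ \mathrm{s} \approx 2.3\ \mathrm{hours}.
\]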

The above behaviour is defined here:

http://bxr.su/o/sys/dev/softraid_raid1.c#sr_raid1_rw

} else {
        /* writes go on all working disks */
        chunk = i;
        scp = sd->sd_vol.sv_chunks[chunk];
        switch (scp->src_meta.scm_status) {
        case BIOC_SDONLINE:
        case BIOC_SDSCRUB:
        case BIOC_SDREBUILD:
                break;

        case BIOC_SDHOTSPARE: /* should never happen */
        case BIOC_SDOFFLINE:
                continue;

        default:
                goto bad;
        }
}


What we could do is something like the following: pretend that any online 
chunk is not available for writes when the wu (work unit) we're handling is 
part of the rebuild process from http://bxr.su/o/sys/dev/softraid.c#sr_rebuild, 
mimicking the BIOC_SDOFFLINE behaviour for BIOC_SDONLINE chunks (discs) when 
the SR_WUF_REBUILD flag is set for the work unit:

        switch (scp->src_meta.scm_status) {
        case BIOC_SDONLINE:
+               if (wu->swu_flags & SR_WUF_REBUILD)
+                       continue;       /* must be same as BIOC_SDOFFLINE case */
+               /* FALLTHROUGH */
        case BIOC_SDSCRUB:
        case BIOC_SDREBUILD:


Obviously, there are both pros and cons to such an approach; I've tested a 
variation of the above in production (not a fan of weeks-long 
random-read/write rebuilds); but use this at your own risk, obviously.

...

But back to the original problem: this consistency check would have to be 
filesystem-specific, because we gotta know which blocks of softraid have and 
have not been used by the filesystem, as softraid itself is 
filesystem-agnostic.  I'd imagine it'll be somewhat similar in concept to the 
fstrim(8) utility on GNU/Linux -- 
http://man7.org/linux/man-pages/man8/fstrim.8.html -- and would also open the 
door for cron-based TRIM support (the tool would have to know the softraid 
format itself, too).  Any pointers or hints on where to get started, or on 
whether anyone has worked on this in the past?
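
For anyone who wants to experiment, here is a deliberately naive sketch of 
such a checker -- it is hypothetical, not an existing tool: it assumes a 
two-chunk RAID1, takes the raw chunk partitions on the command line, and 
takes the byte offset of the data area (past the softraid metadata) as an 
argument instead of parsing the metadata; the 64KB block size matches the 
MAXPHYS-sized blocks mentioned above.  A real tool would parse the softraid 
metadata and walk the FFS cylinder-group bitmaps so it only compares blocks 
the filesystem actually uses.

/*
 * raid1chk -- naive sketch: compare two RAID1 chunk partitions block by
 * block and report spans that differ.  Not softraid- or filesystem-aware.
 */
#include <sys/types.h>

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ   (64 * 1024)     /* 64KB, the MAXPHYS-sized block */

int
main(int argc, char *argv[])
{
        static char a[BLKSZ], b[BLKSZ];
        long long mismatches = 0;
        ssize_t na, nb;
        off_t off;
        int fd0, fd1;

        if (argc != 4)
                errx(1, "usage: raid1chk chunk0 chunk1 data-offset-bytes");
        if ((fd0 = open(argv[1], O_RDONLY)) == -1 ||
            (fd1 = open(argv[2], O_RDONLY)) == -1)
                err(1, "open");
        off = strtoll(argv[3], NULL, 0);        /* skip the softraid metadata area */

        for (;; off += BLKSZ) {
                na = pread(fd0, a, BLKSZ, off);
                nb = pread(fd1, b, BLKSZ, off);
                if (na <= 0 || nb <= 0)
                        break;                  /* end of the shorter chunk */
                if (na != nb || memcmp(a, b, na) != 0) {
                        printf("mismatch at byte offset %lld\n", (long long)off);
                        mismatches++;
                }
        }
        printf("%lld mismatching %dKB blocks\n", mismatches, BLKSZ / 1024);
        return mismatches != 0;
}

The raw character devices of the chunks (e.g. /dev/rsd0d, /dev/rsd1d) would 
be the natural arguments; running it while the volume is mounted read-write 
will of course report false mismatches for writes still in flight.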


Cheers,
Constantine.
http://cm.su/