Re: SR RAID5 rebuild/stability issue.
On Wed, Sep 23, 2015 at 5:27 PM, Joel Sing wrote:
> RAID5 should work (ignore RAID6 - it is still incomplete) and rebuilding
> should be functional:
>
> http://undeadly.org/cgi?action=article&sid=20150413071009
>
> When I reenabled RAID5, I tested it as reasonably as I could, but it
> still needs to be put through its paces. How are you offlining the
> drive? If you're doing it via bioctl then it will potentially behave
> differently to a hardware failure (top down through the
> bio(4)/softraid(4) driver, instead of bottom up via the I/O path). If
> you can dependably reproduce the issue then I would certainly be
> interested in tracking down the cause.

I'm indeed offlining via bioctl. And yes, the issue is easily reproducible, but it takes time. I'll send you my scripts off-list. Thanks for the note about RAID6.
Re: SR RAID5 rebuild/stability issue.
On Tuesday 22 September 2015 09:58:57 Karel Gardas wrote:
> On Tue, Sep 22, 2015 at 3:20 AM, Chris Cappuccio wrote:
> > Karel Gardas [gard...@gmail.com] wrote:
> >> Let me ask, should SR RAID5 survive such testing, or is, for example,
> >> rebuilding with an offlined drive considered an unsupported feature?
> >
> > It's new, considered experimental and not well tested.
>
> OK, so I'll omit this from my testing.
>
> > Are you working with someone to bring your RAID1 changes in tree? The
> > complete, understood improvements should be individually labeled
> > and committed, one by one.
>
> So far on tech@ I have merely been ignored, but this is probably due to
> the fact that I posted patches [1][2][3] clearly marked as
> work-in-progress. Once the patch is complete I will offer my view on how
> it may be divided, and perhaps discussion will start...

It has not been ignored; you've just not yet received a reply :)

> [1] https://www.mail-archive.com/tech@openbsd.org/msg25388.html
> [2] https://www.mail-archive.com/tech@openbsd.org/msg25419.html
> [3] https://www.mail-archive.com/tech@openbsd.org/msg25716.html
Re: SR RAID5 rebuild/stability issue.
On Monday 21 September 2015 23:02:39 Karel Gardas wrote:
> Hello,
>
> due to work on SR RAID1 checksumming support, where I've touched SR
> RAID internals (workunit scheduling), I'd like to test SR RAID5/6
> functionality on a snapshot and in my tree, to see that I've not broken
> anything while hacking on it. My current problem is that I'm not able
> to come up with a test that does not break RAID5 (I'm starting with it)
> after several hours of execution on a snapshot. My test is basically:
>
> - on one console, in a loop:
>   - mount the RAID to /raid
>   - rsync /usr/src/ to /raid
>   - compute sha1 sums of all files in /raid
>   - umount /raid
>   - mount /raid
>   - check the sha1 sums -- on failure, fail the test; if not, repeat
>
> - on another console, in a loop:
>   - offline a random drive
>   - wait a random time (up to a minute)
>   - rebuild the RAID with the offlined drive
>   - wait a random time (up to 2 minutes)
>   - repeat
>
> Now, the issue with this is that I get sha1 errors from time to time.
> Usually in such a case the problematic source file contains some
> garbage. Since I do not yet have a machine dedicated to this testing,
> I'm using a ThinkPad T500 with one drive. I just created 4 RAID slices
> in the OpenBSD partition. Last week I was using vndX devices (and
> files), but that way I even got a kernel panic (on a snapshot) like
> this one:
> http://openbsd-archive.7691.n7.nabble.com/panic-ffs-valloc-dup-alloc-td254738.html
> -- so this weekend I started testing with slices, and so far no panic,
> but still the data corruption issue. The last snapshot I'm using for
> testing is from last Sunday.
>
> Let me ask, should SR RAID5 survive such testing, or is, for example,
> rebuilding with an offlined drive considered an unsupported feature?

RAID5 should work (ignore RAID6 - it is still incomplete) and rebuilding should be functional:

http://undeadly.org/cgi?action=article&sid=20150413071009

When I reenabled RAID5, I tested it as reasonably as I could, but it still needs to be put through its paces.

How are you offlining the drive? If you're doing it via bioctl then it will potentially behave differently to a hardware failure (top down through the bio(4)/softraid(4) driver, instead of bottom up via the I/O path). If you can dependably reproduce the issue then I would certainly be interested in tracking down the cause.
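For reference, the top-down path being contrasted here can be sketched as follows. The device names (chunk sd0a, volume sd4) are placeholders and not from the thread, and the DRY_RUN guard keeps the sketch from touching real hardware:

```shell
#!/bin/sh
# Top-down offlining via bioctl(8), as opposed to a bottom-up hardware
# failure arriving through the I/O path. Device names (chunk sd0a,
# volume sd4) are placeholders; clear DRY_RUN to actually execute.
DRY_RUN=1

run() {
    if [ -n "$DRY_RUN" ]; then echo "would run: $*"; else "$@"; fi
}

# The offline request enters at the top, through bio(4)/softraid(4):
run bioctl -O /dev/sd0a sd4

# A real hardware fault instead surfaces from the bottom, as an I/O
# error reported by the disk -- bioctl cannot simulate that path.

# Rebuild onto the chunk afterwards:
run bioctl -R /dev/sd0a sd4
```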
Re: SR RAID5 rebuild/stability issue.
On Tue, Sep 22, 2015 at 3:20 AM, Chris Cappuccio wrote:
> Karel Gardas [gard...@gmail.com] wrote:
>> Let me ask, should SR RAID5 survive such testing, or is, for example,
>> rebuilding with an offlined drive considered an unsupported feature?
>
> It's new, considered experimental and not well tested.

OK, so I'll omit this from my testing.

> Are you working with someone to bring your RAID1 changes in tree? The
> complete, understood improvements should be individually labeled
> and committed, one by one.

So far on tech@ I have merely been ignored, but this is probably due to the fact that I posted patches [1][2][3] clearly marked as work-in-progress. Once the patch is complete I will offer my view on how it may be divided, and perhaps discussion will start...

[1] https://www.mail-archive.com/tech@openbsd.org/msg25388.html
[2] https://www.mail-archive.com/tech@openbsd.org/msg25419.html
[3] https://www.mail-archive.com/tech@openbsd.org/msg25716.html
Re: SR RAID5 rebuild/stability issue.
Karel Gardas [gard...@gmail.com] wrote:
> Let me ask, should SR RAID5 survive such testing, or is, for example,
> rebuilding with an offlined drive considered an unsupported feature?

It's new, considered experimental and not well tested. In my initial testing with RAID5, it was so slow as to be unusable: the IOPS were too low and the latency too high compared to softraid RAID1, a single drive, or hardware RAID5. I didn't seriously consider using it. Now your testing shows a more significant problem.

Are you working with someone to bring your RAID1 changes in tree? The complete, understood improvements should be individually labeled and committed, one by one.

Chris
SR RAID5 rebuild/stability issue.
Hello,

due to work on SR RAID1 checksumming support, where I've touched SR RAID internals (workunit scheduling), I'd like to test SR RAID5/6 functionality on a snapshot and in my tree, to see that I've not broken anything while hacking on it. My current problem is that I'm not able to come up with a test that does not break RAID5 (I'm starting with it) after several hours of execution on a snapshot. My test is basically:

- on one console, in a loop:
  - mount the RAID to /raid
  - rsync /usr/src/ to /raid
  - compute sha1 sums of all files in /raid
  - umount /raid
  - mount /raid
  - check the sha1 sums -- on failure, fail the test; if not, repeat

- on another console, in a loop:
  - offline a random drive
  - wait a random time (up to a minute)
  - rebuild the RAID with the offlined drive
  - wait a random time (up to 2 minutes)
  - repeat

Now, the issue with this is that I get sha1 errors from time to time. Usually in such a case the problematic source file contains some garbage. Since I do not yet have a machine dedicated to this testing, I'm using a ThinkPad T500 with one drive. I just created 4 RAID slices in the OpenBSD partition. Last week I was using vndX devices (and files), but that way I even got a kernel panic (on a snapshot) like this one: http://openbsd-archive.7691.n7.nabble.com/panic-ffs-valloc-dup-alloc-td254738.html -- so this weekend I started testing with slices, and so far no panic, but still the data corruption issue. The last snapshot I'm using for testing is from last Sunday.

Let me ask, should SR RAID5 survive such testing, or is, for example, rebuilding with an offlined drive considered an unsupported feature?

Thanks!
Karel
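The two loops above can be sketched roughly as below. This is a minimal sketch, not the actual scripts: the device names (sd4 as the softraid volume, sd0a..sd3a as its chunks) are placeholder assumptions, and the sha1/bioctl invocations assume an OpenBSD system:

```shell
#!/bin/sh
# Rough sketch of the described stress test. Device names are
# placeholders: sd4 = softraid RAID5 volume, sd0a..sd3a = its chunks.

RAID=/raid
VOL=sd4
CHUNKS="sd0a sd1a sd2a sd3a"

# Compare two checksum manifests; any difference across the
# umount/mount cycle means the array returned corrupted data.
check_sums() {
    if cmp -s "$1" "$2"; then
        echo OK
    else
        echo MISMATCH
        return 1
    fi
}

# Console 1: write data, checksum it, remount, verify.
writer_loop() {
    while :; do
        mount /dev/${VOL}a "$RAID"
        rsync -a /usr/src/ "$RAID"/
        (cd "$RAID" && find . -type f -exec sha1 {} + | sort) > /tmp/sums.before
        umount "$RAID"
        mount /dev/${VOL}a "$RAID"
        (cd "$RAID" && find . -type f -exec sha1 {} + | sort) > /tmp/sums.after
        check_sums /tmp/sums.before /tmp/sums.after || exit 1
        umount "$RAID"
    done
}

# Console 2: offline a random chunk, wait, rebuild onto it, wait, repeat.
failer_loop() {
    while :; do
        set -- $CHUNKS
        shift $((RANDOM % $#))
        chunk=$1
        bioctl -O /dev/"$chunk" "$VOL"
        sleep $((RANDOM % 60))
        bioctl -R /dev/"$chunk" "$VOL"
        sleep $((RANDOM % 120))
    done
}
```

Each loop would run as root in its own shell: writer_loop on one console, failer_loop on the other, until check_sums reports MISMATCH.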