Re: SR RAID5 rebuild/stability issue.

2015-09-23 Thread Joel Sing
On Tuesday 22 September 2015 09:58:57 Karel Gardas wrote:
> On Tue, Sep 22, 2015 at 3:20 AM, Chris Cappuccio wrote:
> > Karel Gardas [gard...@gmail.com] wrote:
> >> Let me ask, should SR RAID5 survive such testing or is, for example,
> >> rebuilding with an offlined drive considered an unsupported feature?
> > 
> > It's new, considered experimental and not well tested.
> 
> OK so I'll omit this from my testing.
> 
> > Are you working with someone to bring your RAID1 changes in tree? The
> > complete, understood improvements should be individually labeled
> > and committed, one by one.
> 
> So far on tech@ I've simply been ignored, but this is probably due to the
> fact that I posted patches[1][2][3] clearly marked as
> work-in-progress. Once the patch is complete I will offer my view on how
> it may be divided and perhaps discussion will start...

It has not been ignored; but you've not yet received a reply :)

> [1] https://www.mail-archive.com/tech@openbsd.org/msg25388.html
> [2] https://www.mail-archive.com/tech@openbsd.org/msg25419.html
> [3] https://www.mail-archive.com/tech@openbsd.org/msg25716.html



Re: SR RAID5 rebuild/stability issue.

2015-09-23 Thread Joel Sing
On Monday 21 September 2015 23:02:39 Karel Gardas wrote:
> Hello,
> 
> due to work on SR RAID1 checksumming support, where I've touched SR
> RAID internals (workunit scheduling), I'd like to test SR RAID5/6
> functionality on a snapshot and on my tree to see that I've not broken
> anything while hacking on it. My current problem is that I'm not able to
> come up with a test which would not break RAID5 (I'm starting with
> it) after several hours of execution on the snapshot. My test is
> basically:
> - on one console in loop
>   mount raid to /raid
>   rsync /usr/src/ to /raid
>   compute sha1 sums of all files in /raid
>   umount /raid
>   mount /raid
>   check sha1 -- if failure, fail the test, if not, just repeat
> - on another console in loop
>   - offline a random drive
>   - wait a random time (up to a minute)
>   - rebuild raid with the offlined drive
>   - wait random time (up to 2 minutes)
>   - repeat
> 
> Now, the issue with this is that I get sha1 errors from time to time.
> Usually in such a case the problematic source file contains some garbage.
> Since I do not yet have a machine dedicated to this testing, I'm using
> a ThinkPad T500 with one drive for this. I just created 4 RAID slices in
> the OpenBSD partition. Last week I was using vndX devices (and files),
> but this way I even got a kernel panic (on snapshot) like this one:
> http://openbsd-archive.7691.n7.nabble.com/panic-ffs-valloc-dup-alloc-td254738.html
> -- so this weekend I started testing with slices and so far no
> panic, but still a data corruption issue. The last snapshot I'm using for
> testing is from last Sunday.
> 
> Let me ask, should SR RAID5 survive such testing or is, for example,
> rebuilding with an offlined drive considered an unsupported feature?

RAID5 should work (ignore RAID6 - it is still incomplete) and rebuilding 
should be functional:

 http://undeadly.org/cgi?action=article&sid=20150413071009

When I re-enabled RAID5, I tested it as well as I reasonably could, but it still 
needs to be put through its paces. How are you offlining the drive? If you're 
doing it via bioctl then it will potentially behave differently to a hardware 
failure (top down through the bio(4)/softraid(4) driver, instead of bottom up 
via the I/O path). If you can dependably reproduce the issue then I would 
certainly be interested in tracking down the cause.
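
For readers following along: the general RAID5 principle behind a rebuild is that a missing chunk in a stripe equals the XOR of the surviving chunks (data plus parity). A minimal Python sketch of that principle follows. It is only an illustration, not the softraid(4) code, and the function names `xor_chunks` and `rebuild_missing` are hypothetical.

```python
from functools import reduce


def xor_chunks(chunks):
    """XOR equal-length byte strings together, byte by byte."""
    if not chunks:
        raise ValueError("need at least one chunk")
    length = len(chunks[0])
    if any(len(c) != length for c in chunks):
        raise ValueError("chunks must have equal length")
    # zip(*chunks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


def rebuild_missing(surviving):
    """Reconstruct the one missing chunk of a stripe from all the others.

    'surviving' must hold every remaining chunk of the stripe, i.e. the
    remaining data chunks plus the parity chunk (or only data chunks, if
    the parity chunk itself is the one being rebuilt).
    """
    return xor_chunks(surviving)


# Example stripe: three data chunks and the parity computed from them.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_chunks(data)
```

If a rebuild writes a chunk that was computed from stale or partially updated siblings, the result is garbage in exactly one chunk of the stripe, which is at least consistent with corruption showing up inside individual files.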



Re: SR RAID5 rebuild/stability issue.

2015-09-23 Thread Karel Gardas
On Wed, Sep 23, 2015 at 5:27 PM, Joel Sing wrote:
> RAID5 should work (ignore RAID6 - it is still incomplete) and rebuilding
> should be functional:
>
>  http://undeadly.org/cgi?action=article&sid=20150413071009
>
> When I re-enabled RAID5, I tested it as well as I reasonably could, but it still
> needs to be put through its paces. How are you offlining the drive? If you're
> doing it via bioctl then it will potentially behave differently to a hardware
> failure (top down through the bio(4)/softraid(4) driver, instead of bottom up
> via the I/O path). If you can dependably reproduce the issue then I would
> certainly be interested in tracking down the cause.

I am indeed offlining via bioctl. And yes, the issue is easily
reproducible, but it takes time. I'll send you my scripts off-list.
Thanks for the note about RAID6.



Re: SR RAID5 rebuild/stability issue.

2015-09-22 Thread Karel Gardas
On Tue, Sep 22, 2015 at 3:20 AM, Chris Cappuccio wrote:
> Karel Gardas [gard...@gmail.com] wrote:
>>
>> Let me ask, should SR RAID5 survive such testing or is, for example,
>> rebuilding with an offlined drive considered an unsupported feature?
>>
>
> It's new, considered experimental and not well tested.

OK so I'll omit this from my testing.

> Are you working with someone to bring your RAID1 changes in tree? The
> complete, understood improvements should be individually labeled
> and committed, one by one.

So far on tech@ I've simply been ignored, but this is probably due to the
fact that I posted patches[1][2][3] clearly marked as
work-in-progress. Once the patch is complete I will offer my view on how
it may be divided and perhaps discussion will start...

[1] https://www.mail-archive.com/tech@openbsd.org/msg25388.html
[2] https://www.mail-archive.com/tech@openbsd.org/msg25419.html
[3] https://www.mail-archive.com/tech@openbsd.org/msg25716.html



Re: SR RAID5 rebuild/stability issue.

2015-09-21 Thread Chris Cappuccio
Karel Gardas [gard...@gmail.com] wrote:
> 
> Let me ask, should SR RAID5 survive such testing or is, for example,
> rebuilding with an offlined drive considered an unsupported feature?
> 

It's new, considered experimental and not well tested.

In my initial testing with RAID5, it was so slow as to be unusable. The IOPS
were too low and the latency too high compared to soft RAID1, a single drive,
or hw RAID5. I didn't consider using it seriously. Now your testing shows a
more significant problem.

Are you working with someone to bring your RAID1 changes in tree? The
complete, understood improvements should be individually labeled
and committed, one by one.

Chris



SR RAID5 rebuild/stability issue.

2015-09-21 Thread Karel Gardas
Hello,

due to work on SR RAID1 checksumming support, where I've touched SR
RAID internals (workunit scheduling), I'd like to test SR RAID5/6
functionality on a snapshot and on my tree to see that I've not broken
anything while hacking on it. My current problem is that I'm not able to
come up with a test which would not break RAID5 (I'm starting with
it) after several hours of execution on the snapshot. My test is
basically:
- on one console in loop
  mount raid to /raid
  rsync /usr/src/ to /raid
  compute sha1 sums of all files in /raid
  umount /raid
  mount /raid
  check sha1 -- if failure, fail the test, if not, just repeat
- on another console in loop
  - offline a random drive
  - wait a random time (up to a minute)
  - rebuild raid with the offlined drive
  - wait random time (up to 2 minutes)
  - repeat
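
The checksum half of the loop above (compute sha1 sums, remount, check again) could be sketched in Python roughly as follows. This is a minimal sketch only, not the actual test script: the function names `sha1_of_file`, `sha1_manifest` and `diff_manifests` are made up for the example, and the mount, umount and rsync steps are left out.

```python
import hashlib
import os


def sha1_of_file(path, bufsize=1 << 16):
    """Return the hex SHA-1 digest of one file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(bufsize), b""):
            digest.update(block)
    return digest.hexdigest()


def sha1_manifest(root):
    """Map each regular file under root (by relative path) to its SHA-1."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.isfile(full) and not os.path.islink(full):
                manifest[os.path.relpath(full, root)] = sha1_of_file(full)
    return manifest


def diff_manifests(before, after):
    """Return the sorted relative paths that are missing or changed in 'after'."""
    return sorted(p for p, h in before.items() if after.get(p) != h)
```

In the loop, one would take `sha1_manifest("/raid")` after the rsync, remount, take it again, and fail the test if `diff_manifests` returns a non-empty list; the returned paths point straight at the files worth inspecting for garbage.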

Now, the issue with this is that I get sha1 errors from time to time.
Usually in such a case the problematic source file contains some garbage.
Since I do not yet have a machine dedicated to this testing, I'm using
a ThinkPad T500 with one drive for this. I just created 4 RAID slices in
the OpenBSD partition. Last week I was using vndX devices (and files),
but this way I even got a kernel panic (on snapshot) like this one:
http://openbsd-archive.7691.n7.nabble.com/panic-ffs-valloc-dup-alloc-td254738.html
-- so this weekend I started testing with slices and so far no
panic, but still a data corruption issue. The last snapshot I'm using for
testing is from last Sunday.

Let me ask, should SR RAID5 survive such testing or is, for example,
rebuilding with an offlined drive considered an unsupported feature?

Thanks!
Karel