Re: RAID5/ext3 trouble...

2007-03-18 Thread Dan Bar Dov

That's a good question to which I have no answer.
I don't know how Google does it. I can think of:
1. special file system
2. some kind of scrubber - a daemon scanning for FS changes and copying
whatever changed
3. use a sync tool (rsync?) on a daily (hourly?) basis - rough sketch below

I doubt Google has 1. This is some good startup material.
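
For (3), just to make it concrete: assuming a second box called backuphost
and a /data tree to mirror (both names made up), a single root crontab
entry along these lines would keep a nightly copy:

  # 02:00 every night: mirror /data to backuphost over ssh;
  # -a preserves permissions/ownership/times, --delete drops files
  # removed at the source
  0 2 * * *  rsync -a --delete /data/ backuphost:/backup/data/

It's not three copies and it's not synchronous, but it's the poor man's
version of the same idea.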

Dan


On 3/18/07, Amos Shapira [EMAIL PROTECTED] wrote:

On 07/03/07, Dan Bar Dov [EMAIL PROTECTED] wrote:
 I suggest you read the latest summary on the reports on storage at
 http://storagemojo.com/?p=383
 http://storagemojo.com/?p=378

 The bottom line conclusions I made out of those are
1. Do not use RAID5. Stick to RAID 1 (0+1).
2. Use cheap SATA storage - even build it yourself. The big bucks don't
buy you real reliability.
3. If at all possible - implement a 3 copies/file scheme rather than
RAID1/5/6. In the long run it is much simpler and more reliable.

 I'd be glad to discuss this more, but man, keep your posts shorter :-)

Thanks for the pointers.

Now - how would you suggest implementing your point 3, short of having
access to an implementation of Google FS?

I'm in a position to re-design a system which requires reliable,
fault-tolerant access to files over LAN for processing (i.e. read files,
calculate stuff on their input, write other files and write to a database) and
archiving.

I'd rather build the solution on proven tools instead of re-inventing a
home-grown solution from scratch, but so far I haven't found something I can
bet the company's fate on.

The current system is built on Windows and though the business owner is open
to thinking about a move to Linux, it will have to be gradual.

Thanks,

--Amos



Re: RAID5/ext3 trouble...

2007-03-18 Thread Amos Shapira

On 18/03/07, Dan Bar Dov [EMAIL PROTECTED] wrote:


That's a good question to which I have no answer.
I don't know how Google does it. I can think of:
1. special file system
2. some kind of scrubber - a daemon scanning for FS changes and copying
whatever changed
3. use a sync tool (rsync?) on a daily (hourly?) basis

I doubt Google has 1. This is some good startup material.



Actually they have (1): [1] and they rely on it a lot. But apart from
describing it in this paper (and mentioning it in other publications, e.g.
[2] is the latest I read, among tons of others) they don't provide the
actual code, which I suppose I understand since it could be viewed as part
of their core business (organizing the world's information).

Another simple idea I just had overnight - cross-mirror disks among a couple
of nodes (preferably on different racks, when we reach such a stage) either
as database replication or through DRBD-style disk replication (essentially
RAID 1 over the net), then if one of the nodes goes down its partner can take
over handling of its queue of jobs.

If the rest of the system is architected to use local data as much as
possible (i.e. all the stages which process the same piece of info run on
the same node) this might be enough to achieve both redundancy and
reliability, while still minimizing network traffic (second conclusion in
[2]), since the mirroring over the net can be done asynchronously - losing a
transaction during a node's crash should have negligible effect.
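
To make that a bit more concrete, here is roughly what I have in mind for
the DRBD option. This assumes a DRBD 8-style configuration, and the
hostnames, disks and addresses (nodeA/nodeB, /dev/sdb1, 10.0.0.x) are
placeholders, not anything from a real setup:

  # /etc/drbd.conf (fragment) - one mirrored resource
  resource r0 {
      # protocol A = asynchronous: a local write completes before the
      # peer has acknowledged it
      protocol A;
      on nodeA {
          device    /dev/drbd0;
          disk      /dev/sdb1;
          address   10.0.0.1:7788;
          meta-disk internal;
      }
      on nodeB {
          device    /dev/drbd0;
          disk      /dev/sdb1;
          address   10.0.0.2:7788;
          meta-disk internal;
      }
  }

Then drbdadm create-md r0 and drbdadm up r0 on both nodes, and drbdadm
primary r0 on whichever node carries the live filesystem; if that node
dies, its partner gets promoted and picks up the job queue.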

[1] http://labs.google.com/papers/gfs.html
[2] http://labs.google.com/papers/mapreduce.html

--Amos


Re: RAID5/ext3 trouble...

2007-03-17 Thread Amos Shapira

On 07/03/07, Dan Bar Dov [EMAIL PROTECTED] wrote:


I suggest you read the latest summary on the reports on storage at
http://storagemojo.com/?p=383
http://storagemojo.com/?p=378

The bottom line conclusions I made out of those are
1. Do not use RAID5. Stick to RAID 1 (0+1).
2. Use cheap SATA storage - even build it yourself. The big bucks don't
buy you real reliability.
3. If at all possible - implement a 3 copies/file scheme rather than
RAID1/5/6. In the long run it is much simpler and more reliable.

I'd be glad to discuss this more, but man, keep your posts shorter :-)



Thanks for the pointers.

Now - how would you suggest implementing your point 3, short of having
access to an implementation of Google FS?

I'm in a position to re-design a system which requires reliable,
fault-tolerant access to files over LAN for processing (i.e. read files,
calculate stuff on their input, write other files and write to a database) and
archiving.

I'd rather build the solution on proven tools instead of re-inventing a
home-grown solution from scratch, but so far I haven't found something I can
bet the company's fate on.

The current system is built on Windows and though the business owner is open
to thinking about a move to Linux, it will have to be gradual.

Thanks,

--Amos


Re: RAID5/ext3 trouble...

2007-03-06 Thread Dan Bar Dov

I suggest you read the latest summary on the reports on storage at
http://storagemojo.com/?p=383
http://storagemojo.com/?p=378

The bottom line conclusions I made out of those are
1. Do not use RAID5. Stick to RAID 1 (0+1).
2. Use cheap SATA storage - even build it yourself. The big bucks don't
buy you real reliability.
3. If at all possible - implement a 3 copies/file scheme rather than
RAID1/5/6. In the long run it is much simpler and more reliable.

I'd be glad to discuss this more, but man, keep your posts shorter :-)

Dan

On 3/5/07, Gunny Smith [EMAIL PROTECTED] wrote:

Hey all

I formerly had a Gentoo box with a 4x250GB software RAID5, no LVM,
ext3 filesystem, using disks sdc, sdd, sde and sdf. sda and sdb were the
OS disks.
I was about to move the RAID over to a new machine running Debian. On top
of that, my backup of the data turned out to be only partial.

The new disks in the system appeared as sda, sdb, sdc and sdd.

I've then done a series of damaging things with them:

1. I've added 3 of them as spares rather than as drives with existing data.
2. When trying to assemble the array, this started rebuilding
(effectively, wiping) the fourth disk. I did this several times (and I
suspect I've clobbered data on more than one drive) before realizing I
can sidestep this by assembling three drives at a time - which raises
the array as degraded and doesn't kick off anything that can be
destructive.
3. I've run --create
4. I've cleared the persistent superblocks off the drives and
attempted to recreate them.

Roughly at this point I decided this was unsalvageable and went to the
backup, only to find out it failed to back up parts of the data. A
memorable moment in my life.

I've then cobbled together a bunch of other hard drives to assemble a
~800GB volume, plonked an LVM on it, shared it over NFS, and made dd
if=/dev/sdX ... copies in files of all raw RAID drives over LAN. At
least I have a snapshot of things as they are right now before I start
fixing things again.

I attempted assembling a, b and c (d is the one I suspect is most
damaged) - RAID5 came up degraded with 1 drive missing - and asked mke2fs
where the superblocks should lie.
Half of these locations have superblocks fsck is willing to go on; the
other half don't. I just ran fsck -y on the md device; it's been fixing
things for 6 hours now and I suspect whatever is left afterwards will
not be of much use. I think the RAID mechanism is getting the stripes
wrong, resulting in a total mish-mash of filesystem internals for
e2fsck.

I can attempt doing that with all the valid superblocks I found (about
8) and I can attempt assembling other combinations of the four
physical drives. My most serious concern is a lot of photos that lived
on this box, so mild filesystem corruption affecting some percentage of
the small files would be more than tolerable.

My questions:
1. Short of going to a specialist shop, is there any recommended course
of action to restore data off the array? (Technical suggestions are
more than welcome; recommendations for sanely-priced commercial
software are also welcome.)
My current plan is to try and salvage some stuff by attempting fsck using
the various valid superblocks, copying off any files that look half-sane,
then trying again with a different subset of the four drives. This is
very time-consuming (a restoration of the 3 drives takes ~10hrs over Gig
ethernet) but seems like the first thing to try.

2. Is there some minimally-destructive and readily available way to
pool all four drives together and, rather than reconstruct one entire
drive from the data on the other three, attempt reconstructing every
stripe separately (based on the assumption that drive X can help
reconstruct a stripe where drive Y got clobbered, and vice versa)?

3. If you think I should be flogged for not verifying the backup was
A-OK before diving in, kindly take a number and stand in line. I shall
be distributing low-differential SCSI cables for flogging shortly.

Thank you!


RAID5/ext3 trouble...

2007-03-04 Thread Gunny Smith

Hey all

I formerly had a Gentoo box with a 4x250GB software RAID5, no LVM,
ext3 filesystem, using disks sdc, sdd, sde and sdf. sda and sdb were the
OS disks.
I was about to move the RAID over to a new machine running Debian. On top
of that, my backup of the data turned out to be only partial.

The new disks in the system appeared as sda, sdb, sdc and sdd.

I've then done a series of damaging things with them:

1. I've added 3 of them as spares rather than as drives with existing data.
2. When trying to assemble the array, this started rebuilding
(effectively, wiping) the fourth disk. I did this several times (and I
suspect I've clobbered data on more than one drive) before realizing I
can sidestep this by assembling three drives at a time - which raises
the array as degraded and doesn't kick off anything that can be
destructive.
3. I've run --create
4. I've cleared the persistent superblocks off the drives and
attempted to recreate them.
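
For the record, the non-destructive check I should have started with is
simply dumping whatever md superblock each drive still carries, e.g.:

  # read-only: prints the md superblock (array UUID, device role, update
  # time) of each drive, if one survives
  for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do mdadm --examine $d; done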

Roughly at this point I decided this was unsalvageable and went to the
backup, only to find out it failed to back up parts of the data. A
memorable moment in my life.

I've then cobbled together a bunch of other hard drives to assemble a
~800GB volume, plonked an LVM on it, shared it over NFS, and made dd
if=/dev/sdX ... copies in files of all raw RAID drives over LAN. At
least I have a snapshot of things as they are right now before I start
fixing things again.
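
One nice side effect of having the raw images as plain files: further
experiments can run against loop devices instead of the real drives, so
nothing else gets clobbered. Roughly (paths are made up):

  # attach each raw image to a loop device, then experiment on /dev/loopN
  # rather than on the physical disks
  losetup /dev/loop0 /mnt/big/sda.img
  losetup /dev/loop1 /mnt/big/sdb.img
  losetup /dev/loop2 /mnt/big/sdc.img
  losetup /dev/loop3 /mnt/big/sdd.img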

I attempted assembling a, b and c (d is the one I suspect is most
damaged) - RAID5 came up degraded with 1 drive missing - and asked mke2fs
where the superblocks should lie.
Half of these locations have superblocks fsck is willing to go on; the
other half don't. I just ran fsck -y on the md device; it's been fixing
things for 6 hours now and I suspect whatever is left afterwards will
not be of much use. I think the RAID mechanism is getting the stripes
wrong, resulting in a total mish-mash of filesystem internals for
e2fsck.
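
For anyone who wants to replicate the superblock hunt: mke2fs -n only
reports where the superblock copies would go, without writing anything,
and e2fsck can then be pointed at one of the backups. The block size and
offset below are just examples:

  # dry run: report what mke2fs *would* do, including backup superblock
  # locations - nothing is written
  mke2fs -n /dev/md0
  # try fsck starting from one of the reported backup superblocks,
  # assuming a 4k block size
  e2fsck -b 32768 -B 4096 /dev/md0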

I can attempt doing that with all the valid superblocks I found (about
8) and I can attempt assembling other combinations of the four
physical drives. My most serious concern is a lot of photos that lived
on this box, so mild filesystem corruption affecting some percentage of
the small files would be more than tolerable.

My questions:
1. Short of going to a specialist shop, is there any recommended course
of action to restore data off the array? (Technical suggestions are
more than welcome; recommendations for sanely-priced commercial
software are also welcome.)
My current plan is to try and salvage some stuff by attempting fsck using
the various valid superblocks, copying off any files that look half-sane,
then trying again with a different subset of the four drives (a rough
sketch of the kind of attempt I mean follows after the questions). This is
very time-consuming (a restoration of the 3 drives takes ~10hrs over Gig
ethernet) but seems like the first thing to try.

2. Is there some minimally-destructive and readily available way to
pool all four drives together and, rather than reconstruct one entire
drive from the data on the other three, attempt reconstructing every
stripe separately (based on the assumption that drive X can help
reconstruct a stripe where drive Y got clobbered, and vice versa)?

3. If you think I should be flogged for not verifying the backup was
A-OK before diving in, kindly take a number and stand in line. I shall
be distributing low-differential SCSI cables for flogging shortly.
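
To frame question 1 a bit better, the kind of per-permutation attempt I
have in mind (against the loop-mounted copies, never the real drives)
looks roughly like this - chunk size and device order are guesses that
have to match whatever the original array used:

  # recreate the array over 3 of the 4 copies, leaving the most suspect
  # one out; --assume-clean avoids a resync and "missing" keeps the array
  # degraded so nothing gets rebuilt (64k chunks were the old default)
  mdadm --create /dev/md0 --level=5 --raid-devices=4 --chunk=64 \
        --assume-clean /dev/loop0 /dev/loop1 /dev/loop2 missing
  fsck.ext3 -n /dev/md0     # read-only check: does this ordering look sane?
  mdadm --stop /dev/md0     # tear it down before trying the next permutation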

Thank you!
