On Oct 5,  6:12pm, Marc Mutz wrote:
} Subject: Re: networked RAID-1

Good morning to everyone following this thread.  I hope that this note
finds your day going well.

> That will do _nothing_ for you, because:
> 
> 1.) you can only mount it r/w on exactly one machine.
> 2.) even if 1) is ok for you, you cannot even mount the array ro on the
> other machines, because of Linux' disk caching.
> 
> Maybe raid1 over NBD (see linux/Documentation/nbd.txt) is what you want,
> but I don't know if that works.

... [ Description of NBD raid strategy deleted ] ...

> If this works, you can also add a third machine and make a threefold
> raid1 for added HA. Curious myself if this would work. Unfortunately
> cannot test this myself.

This strategy for doing HA has interested us as well.  Just a few
comments:

First of all the current NBD implementation, at least the pieces of it
that we have been able to find, is not sufficiently robust to
implement this strategy in a production environment.  The problems
that we have encountered up to this point with experimentation have
revolved around performance problems with the implementation.

The NBD server side is a userland program that is basically aimed at a
unit of storage of fixed size.  This can be done by either creating a
file of a fixed size with dd or using a partition and specifying the
size of the partition on the command line.  The client side of NBD is
another userland program which takes as arguments an NBD device node
and the hostname and port of the server program.  The client side
program essentially hooks up or binds the device node to the userland
server program.

The NBD driver essentially works by encapsulating or reducing I/O
requests into a structure (nbd_request, ref. <linux/nbd.h>) which gets
flung across a TCP/IP connection to the server.  The nuts and bolts of
the structure, from a server perspective, are a 64-bit offset into the
storage and a 32-bit quantity to be transferred.  The server side of
the equation is a reasonably simple program which simply seeks to the
desired offset, reads the requested quantity of data and returns it to
the client with a structure (nbd_reply, ref. <linux/nbd.h>) nailed
onto the front of it.

The server side of the program seems to have some trouble dealing with
storage sizes larger than 2 gigabytes passed on the command line.
Since 32-bit machines cannot create a file larger than 2 gigabytes, it
is necessary to use a raw partition and specify the storage size if
there is a desire to use NBD to access volumes of production size.

We have had success setting up NBD devices which have then been used
as the block device components for RAID0 stripes using the md driver.
Success basically consists of creating the RAID stripe, building a
filesystem and doing some basic I/O.  Attempts to go beyond this point
run into what seem to be serious performance problems with NBD.

I have spoken with both Pavel and sct about this issue and it would
seem that the performance problems are secondary to the NBD
implementation not handling out of order I/O requests.  This causes
abysmal performance with any kind of significant I/O.  It is fairly
easy to demonstrate this by running Bonnie with file sizes of around
200 megabytes or larger on either a raw NBD device or a RAID0 volume
constructed on top of NBD devices.

I had inquired of Pavel about the issues surrounding handling out of
order I/O requests but haven't heard back from him.  I have been
meaning to bounce this question off sct as well but haven't had the
time yet.  Since the I/O requests for RAID1 need to be committed to
both mirror volumes to ensure data integrity, poor performance of the
network mirror would essentially kill a production implementation,
since the process doing I/O hangs in D state until the I/O completes.

The majority of the work that we have done with NBD has been from a
RAID0 perspective in a Beowulf array.  We have been investigating the
feasibility of using NBD to harness parallel I/O performance for
scratch volumes used in quantum mechanical orbital calculations
(molecular energy simulations).

We have been successful with smaller calculations which write about
120-140 megabytes of 2-electron integral data to the NBD backed RAID0
stripe.  Going to larger basis sets which upped the size of the 2-e
data files to around 300 megabytes results in dismal performance with
greater than an order of magnitude increases in time all secondary to
I/O delays.

This discussion shouldn't be taken as an indictment of any of the work
that has been done but rather a summary analysis of the issues
involved in taking this technology to the next level.  This strategy
would offer some very intriguing possibilities with respect to doing
HA, but there are issues that do need to be addressed.

> Marc

Have a pleasant remainder of the week everyone.

Greg

}-- End of excerpt from Marc Mutz

As always,
Dr. G.W. Wettstein           Enjellic Systems Development - Specializing
4206 N. 19th Ave.            in information infra-structure solutions.
Fargo, ND  58102             WWW: http://www.enjellic.com
Phone: 701-281-1686          EMAIL: [EMAIL PROTECTED]
------------------------------------------------------------------------------
"We trained hard......but every time we were beginning to form up into
teams, we would be reorganised. I was to learn later in life that we
tend to meet any new situations by reorganising.......  and a
wonderful process it can be for creating the illusion of progress,
while producing inefficiency and demoralisation."
                                -- Petronius (6 AD)