[RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-12 Thread Al Boldi
Lars Ellenberg wrote:
> meanwhile, please, anyone interested,
> the drbd paper for LinuxConf Eu 2007 is finalized.
> http://www.drbd.org/fileadmin/drbd/publications/
> drbd8.linux-conf.eu.2007.pdf
>
> it does not give too much implementation detail (would be inappropriate
> for conference proceedings, imo; some paper commenting on the source
> code should follow).
>
> but it does give a good overview about what DRBD actually is,
> what exact problems it tries to solve,
> and what developments to expect in the near future.
>
> so you can make up your mind about
>  "Do we need it?", and
>  "Why DRBD? Why not NBD + MD-RAID?"

Ok, conceptually your driver sounds really interesting, but when I read the 
pdf I got completely turned off.  The problem is that the concepts are not 
clearly implemented, when in fact they are really simple:

  Allow shared access to remote block storage with fault tolerance.

The first thing to tackle here would be write serialization.  Then start 
thinking about fault tolerance.

Now, shared remote block access should theoretically be handled by a block 
layer driver, as DRBD does, but realistically it may be more appropriate 
to let it be handled by the combining end user, like OCFS or GFS.

The idea here is to simplify lower-layer implementations while removing any 
preconceived dependencies, and to give upper layers free rein without 
incurring redundant overhead.

Look at ZFS; it illegally violates layering by combining md/dm/lvm with the 
fs, but it does this based on a realistic understanding of the problems 
involved, which enables it to improve performance, flexibility, and 
functionality specific to its use case.

This implies that there are two distinct forces at work here:

  1. Layer components
  2. Use-Case composers

Layer components should technically not implement any use case (other than 
providing a plumbing framework), as that would incur unnecessary 
dependencies, which could reduce their generality and thus their reusability.

Use-Case composers can now leverage layer components from across the layering 
hierarchy, to yield a specific use case implementation.

DRBD is such a Use-Case composer, as are md / dm / lvm and any fs in general, 
whereas aoe / nbd / loop and the VFS / FUSE are examples of layer 
components.

It follows that Use-Case composers, like DRBD, share common functionality that 
should be factored out into layer components, which they can then recompose to 
implement a specific use case.


Thanks!

--
Al



Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-12 Thread Jan Engelhardt

On Aug 12 2007 13:35, Al Boldi wrote:
>Lars Ellenberg wrote:
>> meanwhile, please, anyone interested,
>> the drbd paper for LinuxConf Eu 2007 is finalized.
>> http://www.drbd.org/fileadmin/drbd/publications/
>> drbd8.linux-conf.eu.2007.pdf
>>
>> but it does give a good overview about what DRBD actually is,
>> what exact problems it tries to solve,
>> and what developments to expect in the near future.
>>
>> so you can make up your mind about
>>  "Do we need it?", and
>>  "Why DRBD? Why not NBD + MD-RAID?"

I may have made a mistake when asking for how it compares to NBD+MD.
Let me retry: what's the functional difference between
GFS2 on DRBD vs. GFS2 on a DAS SAN?

>Now, shared remote block access should theoretically be handled by a block
>layer driver, as DRBD does, but realistically it may be more appropriate
>to let it be handled by the combining end user, like OCFS or GFS.


Jan


Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-12 Thread Evgeniy Polyakov
On Sun, Aug 12, 2007 at 01:35:17PM +0300, Al Boldi ([EMAIL PROTECTED]) wrote:
> Lars Ellenberg wrote:
> > meanwhile, please, anyone interested,
> > the drbd paper for LinuxConf Eu 2007 is finalized.
> > http://www.drbd.org/fileadmin/drbd/publications/
> > drbd8.linux-conf.eu.2007.pdf
> >
> > it does not give too much implementation detail (would be inappropriate
> > for conference proceedings, imo; some paper commenting on the source
> > code should follow).
> >
> > but it does give a good overview about what DRBD actually is,
> > what exact problems it tries to solve,
> > and what developments to expect in the near future.
> >
> > so you can make up your mind about
> >  "Do we need it?", and
> >  "Why DRBD? Why not NBD + MD-RAID?"
> 
> Ok, conceptually your driver sounds really interesting, but when I read the 
> pdf I got completely turned off.  The problem is that the concepts are not 
> clearly implemented, when in fact they are really simple:
> 
>   Allow shared access to remote block storage with fault tolerance.
> 
> The first thing to tackle here would be write serialization.  Then start 
> thinking about fault tolerance.
> 
> Now, shared remote block access should theoretically be handled by a block 
> layer driver, as DRBD does, but realistically it may be more appropriate 
> to let it be handled by the combining end user, like OCFS or GFS.
> 
> The idea here is to simplify lower-layer implementations while removing any 
> preconceived dependencies, and to give upper layers free rein without 
> incurring redundant overhead.
> 
> Look at ZFS; it illegally violates layering by combining md/dm/lvm with the 
> fs, but it does this based on a realistic understanding of the problems 
> involved, which enables it to improve performance, flexibility, and 
> functionality specific to its use case.
> 
> This implies that there are two distinct forces at work here:
> 
>   1. Layer components
>   2. Use-Case composers
> 
> Layer components should technically not implement any use case (other than 
> providing a plumbing framework), as that would incur unnecessary 
> dependencies, which could reduce their generality and thus their reusability.
> 
> Use-Case composers can now leverage layer components from across the layering 
> hierarchy, to yield a specific use case implementation.
> 
> DRBD is such a Use-Case composer, as are md / dm / lvm and any fs in general, 
> whereas aoe / nbd / loop and the VFS / FUSE are examples of layer 
> components.
> 
> It follows that Use-Case composers, like DRBD, share common functionality that 
> should be factored out into layer components, which they can then recompose to 
> implement a specific use case.

Out of curiosity, did you try nbd+dm+raid1 compared to drbd and/or zfs
on top of distributed storage (which is a surprise to me, that holy zfs
supports that)?
 
> Thanks!
> 
> --
> Al
> 

-- 
Evgeniy Polyakov


Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-12 Thread Al Boldi
Evgeniy Polyakov wrote:
> Al Boldi ([EMAIL PROTECTED]) wrote:
> > Look at ZFS; it illegally violates layering by combining md/dm/lvm with
> > the fs, but it does this based on a realistic understanding of the
> > problems involved, which enables it to improve performance, flexibility,
> > and functionality specific to its use case.
> >
> > This implies that there are two distinct forces at work here:
> >
> >   1. Layer components
> >   2. Use-Case composers
> >
> > Layer components should technically not implement any use case (other
> > than providing a plumbing framework), as that would incur unnecessary
> > dependencies, which could reduce their generality and thus their reusability.
> >
> > Use-Case composers can now leverage layer components from across the
> > layering hierarchy, to yield a specific use case implementation.
> >
> > DRBD is such a Use-Case composer, as are md / dm / lvm and any fs in
> > general, whereas aoe / nbd / loop and the VFS / FUSE are examples of
> > layer components.
> >
> > It follows that Use-Case composers, like DRBD, share common functionality
> > that should be factored out into layer components, which they can then
> > recompose to implement a specific use case.
>
> Out of curiosity, did you try nbd+dm+raid1 compared to drbd and/or zfs
> on top of distributed storage (which is a surprise to me, that holy zfs
> supports that)?

Actually, I may not have been very clear in my Use-Case composer description: 
I mean internal, in-kernel Use-Case composers as opposed to external Userland 
Use-Case composers.

So, nbd+dm+raid1 would be an external Userland Use-Case composition, which 
obviously could have some drastic performance issues.
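
Just to illustrate what I mean by an external composition, it would look 
roughly like this, using md raid1 for the mirroring (a dm mirror target 
would be analogous; host, port and device names are only placeholders, 
untested):

  # remote node: export a disk over the network
  nbd-server 2000 /dev/sdb1
  # local node: import it and mirror it with a local disk
  nbd-client remote-host 2000 /dev/nbd0
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/nbd0

Each piece is wired up from userland and simply stacked on the next, with 
no shared knowledge between the layers.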

DRBD and ZFS are examples of internal in-kernel Use-Case composers, which 
obviously could show some drastic performance improvements.  

Although you could allow in-kernel Use-Case composers to be run on top of 
Userland Use-Case composers, that wouldn't be the preferred mode of 
operation.  Instead, you would for example recompose ZFS to incorporate an 
in-kernel distributed storage layer component, like nbd.

All this boils down to refactoring Use-Case composers to produce layer 
components with both in-kernel and userland interfaces.  Once we have that, 
it becomes a matter of plug-and-play to produce something awesome like ZFS.


Thanks!

--
Al



Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-12 Thread david

On Sun, 12 Aug 2007, Jan Engelhardt wrote:


> On Aug 12 2007 13:35, Al Boldi wrote:
>> Lars Ellenberg wrote:
>>> meanwhile, please, anyone interested,
>>> the drbd paper for LinuxConf Eu 2007 is finalized.
>>> http://www.drbd.org/fileadmin/drbd/publications/
>>> drbd8.linux-conf.eu.2007.pdf
>>>
>>> but it does give a good overview about what DRBD actually is,
>>> what exact problems it tries to solve,
>>> and what developments to expect in the near future.
>>>
>>> so you can make up your mind about
>>>  "Do we need it?", and
>>>  "Why DRBD? Why not NBD + MD-RAID?"
>
> I may have made a mistake when asking for how it compares to NBD+MD.
> Let me retry: what's the functional difference between
> GFS2 on DRBD vs. GFS2 on a DAS SAN?

GFS is a distributed filesystem, DRBD is a replicated block device. you 
wouldn't do GFS on top of DRBD, you would do ext2/3, XFS, etc.

DRBD is much closer to the NBD+MD option.

now, I am not an expert on either option, but there are a couple of things 
that I would question about the NBD+MD option:

1. when the remote machine is down, how does MD deal with it for reads and 
writes?

2. MD over local drive will alternate reads between mirrors (or so I've 
been told), doing so over the network is wrong.

3. when writing, will MD wait for the network I/O to get the data saved on 
the backup before returning from the syscall? or can it sync the data out 
lazily

>> Now, shared remote block access should theoretically be handled by a
>> block layer driver, as DRBD does, but realistically it may be more
>> appropriate to let it be handled by the combining end user, like OCFS
>> or GFS.

there are times when you want to replicate at the block layer, and there 
are times when you want to have a filesystem do the work. don't force a 
filesystem on use-cases where a block device is the right answer.


David Lang


Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-12 Thread Jan Engelhardt

On Aug 12 2007 09:39, [EMAIL PROTECTED] wrote:
>
> now, I am not an expert on either option, but there are a couple of things
> that I would question about the NBD+MD option
>
> 1. when the remote machine is down, how does MD deal with it for reads and
> writes?

I suppose it kicks the drive and you'd have to re-add it by hand unless done by
a cronjob.
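
For the cron job, something as dumb as this would probably do (device 
names made up, and you may need to --remove the faulty leg first):

  mdadm /dev/md0 --re-add /dev/nbd0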

> 2. MD over local drive will alternate reads between mirrors (or so I've been
> told), doing so over the network is wrong.

Certainly. In which case you set "write_mostly" (or even write_only, not sure
of its name) on the raid component that is nbd.
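
Something along these lines, I would expect (untested; device names are 
just examples):

  mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        /dev/sda1 --write-mostly /dev/nbd0

so that reads are normally served from the local disk and the nbd leg is 
only read from when it has to be.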

> 3. when writing, will MD wait for the network I/O to get the data saved on the
> backup before returning from the syscall? or can it sync the data out lazily

Can't answer this one - ask Neil :)




Jan


Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-12 Thread Iustin Pop
On Sun, Aug 12, 2007 at 07:03:44PM +0200, Jan Engelhardt wrote:
> 
> On Aug 12 2007 09:39, [EMAIL PROTECTED] wrote:
> >
> > now, I am not an expert on either option, but there are a couple of
> > things that I would question about the NBD+MD option
> >
> > 1. when the remote machine is down, how does MD deal with it for reads and
> > writes?
> 
> I suppose it kicks the drive and you'd have to re-add it by hand unless
> done by a cronjob.

From my tests, since NBD doesn't have a timeout option, MD hangs in the
write to that mirror indefinitely, somewhat like when dealing with a
broken IDE driver/chipset/disk.

> > 2. MD over local drive will alternate reads between mirrors (or so I've been
> > told), doing so over the network is wrong.
> 
> Certainly. In which case you set "write_mostly" (or even write_only, not sure
> of its name) on the raid component that is nbd.
> 
> > 3. when writing, will MD wait for the network I/O to get the data saved
> > on the backup before returning from the syscall? or can it sync the data
> > out lazily
> 
> Can't answer this one - ask Neil :)

MD has the write-mostly/write-behind options - which help in this case
but only up to a certain amount.


In my experience DRBD wins hands-down over MD+NBD because MD doesn't
know about (or handle) a component that never returns from a write, which
is quite different from returning with an error. Furthermore, DRBD was
designed to handle transient errors in the connection to its peer, thanks
to its network-oriented design, whereas MD is mostly designed for local or
at least highly reliable disks (where a disk can be SAN, SCSI, etc.) and a
failure is not considered normal. Hence the need for manual reconnection
in the MD case and the automated handling of reconnects in the DRBD case.

I'm just a happy user of both MD over local disks and DRBD for networked
raid.

regards,
iustin


Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-12 Thread Paul Clements

Iustin Pop wrote:

> On Sun, Aug 12, 2007 at 07:03:44PM +0200, Jan Engelhardt wrote:
>> On Aug 12 2007 09:39, [EMAIL PROTECTED] wrote:
>>> now, I am not an expert on either option, but there are a couple of
>>> things that I would question about the NBD+MD option
>>>
>>> 1. when the remote machine is down, how does MD deal with it for reads
>>> and writes?
>> I suppose it kicks the drive and you'd have to re-add it by hand unless
>> done by a cronjob.


Yes, and with a bitmap configured on the raid1, you just resync the 
blocks that have been written while the connection was down.
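
Roughly like this (sketch only, device names are examples):

  # give the raid1 a write-intent bitmap (can also be done at --create time)
  mdadm --grow /dev/md0 --bitmap=internal
  # when the nbd leg comes back, re-add it; only the regions marked dirty
  # in the bitmap are resynced
  mdadm /dev/md0 --re-add /dev/nbd0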




> From my tests, since NBD doesn't have a timeout option, MD hangs in the
> write to that mirror indefinitely, somewhat like when dealing with a
> broken IDE driver/chipset/disk.


Well, if people would like to see a timeout option, I actually coded up 
a patch a couple of years ago to do just that, but I never got it into 
mainline because you can do almost as well by doing a check at 
user-level (I basically ping the nbd connection periodically and if it 
fails, I kill -9 the nbd-client).
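
The check itself is trivial, something in the spirit of (not my exact 
script; host name and interval are made up):

  #!/bin/sh
  # if the nbd server stops answering, kill nbd-client so that md sees a
  # failed mirror instead of hanging forever
  while sleep 10; do
      ping -c 1 -w 5 nbd-server-host > /dev/null 2>&1 || killall -9 nbd-client
  done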




>>> 2. MD over local drive will alternate reads between mirrors (or so I've
>>> been told), doing so over the network is wrong.
>>
>> Certainly. In which case you set "write_mostly" (or even write_only, not
>> sure of its name) on the raid component that is nbd.
>>
>>> 3. when writing, will MD wait for the network I/O to get the data saved
>>> on the backup before returning from the syscall? or can it sync the data
>>> out lazily
>>
>> Can't answer this one - ask Neil :)
>
> MD has the write-mostly/write-behind options - which help in this case
> but only up to a certain amount.


You can configure write_behind (aka, asynchronous writes) to buffer as 
much data as you have RAM to hold. At a certain point, presumably, you'd 
want to just break the mirror and take the hit of doing a resync once 
your network leg falls too far behind.
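
For example (numbers picked arbitrarily; write-behind needs a bitmap and 
only applies to write-mostly components):

  mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        --bitmap=internal --write-behind=4096 \
        /dev/sda1 --write-mostly /dev/nbd0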


--
Paul


Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-12 Thread david
per the message below, MD (or DM) would need to be modified to work 
reasonably well with one of the disk components being over an unreliable 
link (like a network link).


are the MD/DM maintainers interested in extending their code in this 
direction? or would they prefer to keep it simpler by being able to 
continue to assume that the raid components are connected over a highly 
reliable connection?


if they are interested in adding (and maintaining) this functionality then 
there is a real possibility that NBD+MD/DM could eliminate the need for 
DRBD. however, if they are not interested in adding all the code to deal 
with the network-type issues, then the argument that DRBD should not be 
merged because you can do the same thing with MD/DM + NBD is invalid and 
can be dropped/ignored.


David Lang

On Sun, 12 Aug 2007, Paul Clements wrote:


> Iustin Pop wrote:
>> On Sun, Aug 12, 2007 at 07:03:44PM +0200, Jan Engelhardt wrote:
>>> On Aug 12 2007 09:39, [EMAIL PROTECTED] wrote:
>>>> now, I am not an expert on either option, but there are a couple of
>>>> things that I would question about the NBD+MD option
>>>>
>>>> 1. when the remote machine is down, how does MD deal with it for reads
>>>> and writes?
>>> I suppose it kicks the drive and you'd have to re-add it by hand unless
>>> done by a cronjob.
>
> Yes, and with a bitmap configured on the raid1, you just resync the
> blocks that have been written while the connection was down.
>
>> From my tests, since NBD doesn't have a timeout option, MD hangs in the
>> write to that mirror indefinitely, somewhat like when dealing with a
>> broken IDE driver/chipset/disk.
>
> Well, if people would like to see a timeout option, I actually coded up a
> patch a couple of years ago to do just that, but I never got it into
> mainline because you can do almost as well by doing a check at user-level
> (I basically ping the nbd connection periodically and if it fails, I
> kill -9 the nbd-client).
>
>>>> 2. MD over local drive will alternate reads between mirrors (or so I've
>>>> been told), doing so over the network is wrong.
>>> Certainly. In which case you set "write_mostly" (or even write_only, not
>>> sure of its name) on the raid component that is nbd.
>>>
>>>> 3. when writing, will MD wait for the network I/O to get the data saved
>>>> on the backup before returning from the syscall? or can it sync the data
>>>> out lazily
>>> Can't answer this one - ask Neil :)
>>
>> MD has the write-mostly/write-behind options - which help in this case
>> but only up to a certain amount.
>
> You can configure write_behind (aka, asynchronous writes) to buffer as
> much data as you have RAM to hold. At a certain point, presumably, you'd
> want to just break the mirror and take the hit of doing a resync once
> your network leg falls too far behind.
>
> --
> Paul

