[RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])
Lars Ellenberg wrote:
> meanwhile, please, anyone interested,
> the drbd paper for LinuxConf Eu 2007 is finalized.
> http://www.drbd.org/fileadmin/drbd/publications/
> drbd8.linux-conf.eu.2007.pdf
>
> it does not give too much implementation detail (would be inappropriate
> for conference proceedings, imo; some paper commenting on the source
> code should follow).
>
> but it does give a good overview about what DRBD actually is,
> what exact problems it tries to solve,
> and what developments to expect in the near future.
>
> so you can make up your mind about
> "Do we need it?", and
> "Why DRBD? Why not NBD + MD-RAID?"

Ok, conceptually your driver sounds really interesting, but when I read the
pdf I got completely turned off. The problem is that the concepts are not
clearly implemented, when in fact the concepts are really simple:

  Allow shared access to remote block storage with fault tolerance.

The first thing to tackle here would be write serialization. Then start
thinking about fault tolerance.

Now, shared remote block access should theoretically be handled, as does
DRBD, by a block layer driver, but realistically it may be more appropriate
to let it be handled by the combining end user, like OCFS or GFS.

The idea here is to simplify lower layer implementations while removing any
preconceived dependencies, and let upper layers reign free without incurring
redundant overhead.

Look at ZFS; it illegally violates layering by combining md/dm/lvm with the
fs, but it does this based on a realistic understanding of the problems
involved, which enables it to improve performance, flexibility, and
functionality specific to its use case.

This implies that there are two distinct forces at work here:

  1. Layer components
  2. Use-Case composers

Layer components should technically not implement any use case (other than
providing a plumbing framework), as that would incur unnecessary
dependencies, which could reduce their generality and thus reusability.

Use-Case composers can now leverage layer components from across the layering
hierarchy, to yield a specific use case implementation.

DRBD is such a Use-Case composer, as is md / dm / lvm and any fs in general,
whereas aoe / nbd / loop and the VFS / FUSE are examples of layer components.

It follows that Use-Case composers, like DRBD, need common functionality that
should be factored out into layer components, and then recomposed to
implement a specific use case.

Thanks!

--
Al

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
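
A minimal sketch of the layer component / Use-Case composer split described
in the message above. It is only an illustration in Python, not kernel code:
BlockDevice, MemoryDisk and Mirror are hypothetical names, and the in-memory
disk merely stands in for a local disk or an nbd/aoe-style remote one.

```python
from abc import ABC, abstractmethod


class BlockDevice(ABC):
    """Hypothetical layer component: plain block plumbing, no use case."""

    @abstractmethod
    def read(self, block):
        ...

    @abstractmethod
    def write(self, block, data):
        ...


class MemoryDisk(BlockDevice):
    """Stands in for a local disk, or for a remote one behind nbd/aoe."""

    def __init__(self):
        self.blocks = {}  # block number -> bytes

    def read(self, block):
        return self.blocks.get(block, b"")

    def write(self, block, data):
        self.blocks[block] = data


class Mirror(BlockDevice):
    """Use-case composer: builds replication out of layer components."""

    def __init__(self, *legs):
        self.legs = legs

    def read(self, block):
        # Read from the first leg only; a real mirror would pick more carefully.
        return self.legs[0].read(block)

    def write(self, block, data):
        # Replicate every write to all legs.
        for leg in self.legs:
            leg.write(block, data)


local, remote = MemoryDisk(), MemoryDisk()
mirror = Mirror(local, remote)
mirror.write(7, b"hello")
```

Because Mirror is itself a BlockDevice, it can be stacked under another
composer, which is the kind of recomposition the message argues for.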
Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])
On Aug 12 2007 13:35, Al Boldi wrote:
>Lars Ellenberg wrote:
>> meanwhile, please, anyone interested,
>> the drbd paper for LinuxConf Eu 2007 is finalized.
>> http://www.drbd.org/fileadmin/drbd/publications/
>> drbd8.linux-conf.eu.2007.pdf
>>
>> but it does give a good overview about what DRBD actually is,
>> what exact problems it tries to solve,
>> and what developments to expect in the near future.
>>
>> so you can make up your mind about
>> "Do we need it?", and
>> "Why DRBD? Why not NBD + MD-RAID?"

I may have made a mistake when asking how it compares to NBD+MD. Let me
retry: what's the functional difference between GFS2 on a DRBD vs. GFS2 on
a DAS SAN?

>Now, shared remote block access should theoretically be handled, as does
>DRBD, by a block layer driver, but realistically it may be more appropriate
>to let it be handled by the combining end user, like OCFS or GFS.

	Jan
--
Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])
On Sun, Aug 12, 2007 at 01:35:17PM +0300, Al Boldi ([EMAIL PROTECTED]) wrote:
> Lars Ellenberg wrote:
> > meanwhile, please, anyone interested,
> > the drbd paper for LinuxConf Eu 2007 is finalized.
> > http://www.drbd.org/fileadmin/drbd/publications/
> > drbd8.linux-conf.eu.2007.pdf
> >
> > it does not give too much implementation detail (would be inappropriate
> > for conference proceedings, imo; some paper commenting on the source
> > code should follow).
> >
> > but it does give a good overview about what DRBD actually is,
> > what exact problems it tries to solve,
> > and what developments to expect in the near future.
> >
> > so you can make up your mind about
> > "Do we need it?", and
> > "Why DRBD? Why not NBD + MD-RAID?"
>
> Ok, conceptually your driver sounds really interesting, but when I read the
> pdf I got completely turned off. The problem is that the concepts are not
> clearly implemented, when in fact the concepts are really simple:
>
>   Allow shared access to remote block storage with fault tolerance.
>
> The first thing to tackle here would be write serialization. Then start
> thinking about fault tolerance.
>
> Now, shared remote block access should theoretically be handled, as does
> DRBD, by a block layer driver, but realistically it may be more appropriate
> to let it be handled by the combining end user, like OCFS or GFS.
>
> The idea here is to simplify lower layer implementations while removing any
> preconceived dependencies, and let upper layers reign free without incurring
> redundant overhead.
>
> Look at ZFS; it illegally violates layering by combining md/dm/lvm with the
> fs, but it does this based on a realistic understanding of the problems
> involved, which enables it to improve performance, flexibility, and
> functionality specific to its use case.
>
> This implies that there are two distinct forces at work here:
>
>   1. Layer components
>   2. Use-Case composers
>
> Layer components should technically not implement any use case (other than
> providing a plumbing framework), as that would incur unnecessary
> dependencies, which could reduce their generality and thus reusability.
>
> Use-Case composers can now leverage layer components from across the layering
> hierarchy, to yield a specific use case implementation.
>
> DRBD is such a Use-Case composer, as is md / dm / lvm and any fs in general,
> whereas aoe / nbd / loop and the VFS / FUSE are examples of layer
> components.
>
> It follows that Use-Case composers, like DRBD, need common functionality that
> should be factored out into layer components, and then recomposed to
> implement a specific use case.

Out of curiosity, did you try nbd+dm+raid1 compared to drbd and/or zfs on
top of distributed storage (which is a surprise to me, that holy zfs
supports that)?

> Thanks!
>
> --
> Al

--
	Evgeniy Polyakov
Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])
Evgeniy Polyakov wrote:
> Al Boldi ([EMAIL PROTECTED]) wrote:
> > Look at ZFS; it illegally violates layering by combining md/dm/lvm with
> > the fs, but it does this based on a realistic understanding of the
> > problems involved, which enables it to improve performance, flexibility,
> > and functionality specific to its use case.
> >
> > This implies that there are two distinct forces at work here:
> >
> >   1. Layer components
> >   2. Use-Case composers
> >
> > Layer components should technically not implement any use case (other
> > than providing a plumbing framework), as that would incur unnecessary
> > dependencies, which could reduce their generality and thus reusability.
> >
> > Use-Case composers can now leverage layer components from across the
> > layering hierarchy, to yield a specific use case implementation.
> >
> > DRBD is such a Use-Case composer, as is md / dm / lvm and any fs in
> > general, whereas aoe / nbd / loop and the VFS / FUSE are examples of
> > layer components.
> >
> > It follows that Use-Case composers, like DRBD, need common functionality
> > that should be factored out into layer components, and then recomposed to
> > implement a specific use case.
>
> Out of curiosity, did you try nbd+dm+raid1 compared to drbd and/or zfs
> on top of distributed storage (which is a surprise to me, that holy zfs
> supports that)?

Actually, I may not have been very clear that my Use-Case composer
description meant an internal in-kernel Use-Case composer, as opposed to an
external userland Use-Case composer.

So, nbd+dm+raid1 would be an external userland Use-Case composition, which
obviously could have some drastic performance issues.

DRBD and ZFS are examples of internal in-kernel Use-Case composers, which
obviously could show some drastic performance improvements.

Although you could allow in-kernel Use-Case composers to be run on top of
userland Use-Case composers, that wouldn't be the preferred mode of
operation. Instead, you would for example recompose ZFS to incorporate an
in-kernel distributed storage layer component, like nbd.

All this boils down to refactoring Use-Case composers to produce layer
components with both in-kernel and userland interfaces. Once we have that,
it becomes a matter of plug-and-play to produce something awesome like ZFS.

Thanks!

--
Al
Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])
On Sun, 12 Aug 2007, Jan Engelhardt wrote:

> On Aug 12 2007 13:35, Al Boldi wrote:
>> Lars Ellenberg wrote:
>>> meanwhile, please, anyone interested,
>>> the drbd paper for LinuxConf Eu 2007 is finalized.
>>> http://www.drbd.org/fileadmin/drbd/publications/
>>> drbd8.linux-conf.eu.2007.pdf
>>>
>>> but it does give a good overview about what DRBD actually is,
>>> what exact problems it tries to solve,
>>> and what developments to expect in the near future.
>>>
>>> so you can make up your mind about
>>> "Do we need it?", and
>>> "Why DRBD? Why not NBD + MD-RAID?"
>
> I may have made a mistake when asking how it compares to NBD+MD. Let me
> retry: what's the functional difference between GFS2 on a DRBD vs. GFS2 on
> a DAS SAN?

GFS is a distributed filesystem, DRBD is a replicated block device. you
wouldn't do GFS on top of DRBD, you would do ext2/3, XFS, etc.

DRBD is much closer to the NBD+MD option.

now, I am not an expert on either option, but there are a couple of things
that I would question about the DRBD+MD option

1. when the remote machine is down, how does MD deal with it for reads and
   writes?

2. MD over local drive will alternate reads between mirrors (or so I've been
   told), doing so over the network is wrong.

3. when writing, will MD wait for the network I/O to get the data saved on
   the backup before returning from the syscall? or can it sync the data out
   lazily

>> Now, shared remote block access should theoretically be handled, as does
>> DRBD, by a block layer driver, but realistically it may be more appropriate
>> to let it be handled by the combining end user, like OCFS or GFS.

there are times when you want to replicate at the block layer, and there are
times when you want to have a filesystem do the work. don't force a
filesystem on use-cases where a block device is the right answer.

David Lang
Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])
On Aug 12 2007 09:39, [EMAIL PROTECTED] wrote:
>
> now, I am not an expert on either option, but there are a couple of things
> that I would question about the DRBD+MD option
>
> 1. when the remote machine is down, how does MD deal with it for reads and
> writes?

I suppose it kicks the drive and you'd have to re-add it by hand unless done
by a cronjob.

> 2. MD over local drive will alternate reads between mirrors (or so I've been
> told), doing so over the network is wrong.

Certainly. In which case you set "write_mostly" (or even write_only, not sure
of its name) on the raid component that is nbd.

> 3. when writing, will MD wait for the network I/O to get the data saved on the
> backup before returning from the syscall? or can it sync the data out lazily

Can't answer this one - ask Neil :)

	Jan
--
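
A rough sketch of what marking the nbd component write-mostly buys for point
2: reads stay on the other legs for as long as one of them is usable. This is
a toy model in Python, not md's actual read balancing; Leg and pick_read_leg
are hypothetical names, and the device names are only examples.

```python
from dataclasses import dataclass


@dataclass
class Leg:
    name: str
    write_mostly: bool = False
    healthy: bool = True


def pick_read_leg(legs):
    """Prefer healthy legs that are not write-mostly; else any healthy leg."""
    healthy = [leg for leg in legs if leg.healthy]
    if not healthy:
        raise IOError("no healthy mirror leg left")
    preferred = [leg for leg in healthy if not leg.write_mostly]
    # Fall back to a write-mostly leg only when nothing else is available.
    return (preferred or healthy)[0]


local = Leg("sda1")
remote = Leg("nbd0", write_mostly=True)
first_choice = pick_read_leg([local, remote]).name
```

With both legs healthy the local disk serves the reads; the network leg is
only read once the local one is marked unhealthy.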
Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])
On Sun, Aug 12, 2007 at 07:03:44PM +0200, Jan Engelhardt wrote:
>
> On Aug 12 2007 09:39, [EMAIL PROTECTED] wrote:
> >
> > now, I am not an expert on either option, but there are a couple of things
> > that I would question about the DRBD+MD option
> >
> > 1. when the remote machine is down, how does MD deal with it for reads and
> > writes?
>
> I suppose it kicks the drive and you'd have to re-add it by hand unless done
> by a cronjob.

From my tests, since NBD doesn't have a timeout option, MD hangs in the
write to that mirror indefinitely, somewhat like when dealing with a
broken IDE driver/chipset/disk.

> > 2. MD over local drive will alternate reads between mirrors (or so I've been
> > told), doing so over the network is wrong.
>
> Certainly. In which case you set "write_mostly" (or even write_only, not sure
> of its name) on the raid component that is nbd.
>
> > 3. when writing, will MD wait for the network I/O to get the data saved on
> > the backup before returning from the syscall? or can it sync the data out
> > lazily
>
> Can't answer this one - ask Neil :)

MD has the write-mostly/write-behind options - which help in this case
but only up to a certain amount.

In my experience DRBD wins hands-down over MD+NBD because MD doesn't know
about (or handle) a component that never returns from a write, which is
quite different from returning with an error. Furthermore, DRBD was designed
to handle transient errors in the connection to the peer due to its
network-oriented design, whereas MD is mostly designed with local or at
least high-reliability disks in mind (where disk can be SAN, SCSI, etc.),
and a failure is not normal for MD. Thus the need for manual reconnect in
the MD case and the automated handling of reconnects in the case of DRBD.

I'm just a happy user of both MD over local disks and DRBD for networked
raid.

regards,
iustin
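
The difference between a write that never returns and one that returns an
error can be illustrated with a small userland sketch. It assumes nothing
about MD, NBD or DRBD internals: write_with_deadline is a hypothetical helper
that puts a deadline on a leg's write, so that a stalled peer is reported as
a failure instead of blocking the caller forever.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def write_with_deadline(pool, write_fn, data, timeout_s):
    """Return True if write_fn(data) finished in time, False otherwise."""
    future = pool.submit(write_fn, data)
    try:
        future.result(timeout=timeout_s)
        return True
    except FutureTimeout:
        return False  # stalled leg: the caller can now mark it faulty
    except OSError:
        return False  # the leg reported an ordinary I/O error


stall = threading.Event()


def stalled_write(data):
    stall.wait()  # simulates a peer that never answers


def good_write(data):
    return None  # simulates a write that completes immediately


pool = ThreadPoolExecutor(max_workers=2)
results = (
    write_with_deadline(pool, good_write, b"x", 0.5),
    write_with_deadline(pool, stalled_write, b"x", 0.1),
)
stall.set()      # release the simulated stalled worker
pool.shutdown()  # so the pool can shut down cleanly
```

Without the deadline, the second call would block exactly the way the hung
mirror write is described above.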
Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])
Iustin Pop wrote:
> On Sun, Aug 12, 2007 at 07:03:44PM +0200, Jan Engelhardt wrote:
>> On Aug 12 2007 09:39, [EMAIL PROTECTED] wrote:
>>> now, I am not an expert on either option, but there are a couple of
>>> things that I would question about the DRBD+MD option
>>>
>>> 1. when the remote machine is down, how does MD deal with it for reads
>>> and writes?
>>
>> I suppose it kicks the drive and you'd have to re-add it by hand unless
>> done by a cronjob.

Yes, and with a bitmap configured on the raid1, you just resync the blocks
that have been written while the connection was down.

> From my tests, since NBD doesn't have a timeout option, MD hangs in the
> write to that mirror indefinitely, somewhat like when dealing with a
> broken IDE driver/chipset/disk.

Well, if people would like to see a timeout option, I actually coded up a
patch a couple of years ago to do just that, but I never got it into
mainline because you can do almost as well by doing a check at user-level
(I basically ping the nbd connection periodically and if it fails, I kill -9
the nbd-client).

>>> 2. MD over local drive will alternate reads between mirrors (or so I've
>>> been told), doing so over the network is wrong.
>>
>> Certainly. In which case you set "write_mostly" (or even write_only, not
>> sure of its name) on the raid component that is nbd.
>>
>>> 3. when writing, will MD wait for the network I/O to get the data saved
>>> on the backup before returning from the syscall? or can it sync the data
>>> out lazily
>>
>> Can't answer this one - ask Neil :)
>
> MD has the write-mostly/write-behind options - which help in this case
> but only up to a certain amount.

You can configure write_behind (aka, asynchronous writes) to buffer as much
data as you have RAM to hold. At a certain point, presumably, you'd want to
just break the mirror and take the hit of doing a resync once your network
leg falls too far behind.

--
Paul
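
The bitmap-based resync mentioned at the top of this message can be pictured
with a toy model. This is a simplification in Python, not how md implements
its bitmap (md tracks chunks rather than single blocks, among other
differences); BitmapMirror and its fields are hypothetical names.

```python
class BitmapMirror:
    """Toy two-leg mirror that remembers what changed while one leg was away."""

    def __init__(self):
        self.local = {}
        self.remote = {}
        self.remote_up = True
        self.dirty = set()  # blocks written while the remote leg was down

    def write(self, block, data):
        self.local[block] = data
        if self.remote_up:
            self.remote[block] = data
        else:
            self.dirty.add(block)  # remember it for the later resync

    def resync(self):
        """Copy only the dirty blocks to the remote leg; return the count."""
        copied = 0
        for block in sorted(self.dirty):
            self.remote[block] = self.local[block]
            copied += 1
        self.dirty.clear()
        self.remote_up = True
        return copied


m = BitmapMirror()
for b in range(100):
    m.write(b, b"old")   # both legs in sync
m.remote_up = False      # connection drops
m.write(3, b"new")
m.write(42, b"new")
copied = m.resync()      # connection is back
```

Only the two blocks written during the outage are copied, not all 100, which
is the saving a bitmap gives over a full resync.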
Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])
per the message below MD (or DM) would need to be modified to work
reasonably well with one of the disk components being over an unreliable
link (like a network link)

are the MD/DM maintainers interested in extending their code in this
direction? or would they prefer to keep it simpler by being able to continue
to assume that the raid components are connected over a highly reliable
connection?

if they are interested in adding (and maintaining) this functionality then
there is a real possibility that NBD+MD/DM could eliminate the need for
DRBD. however if they are not interested in adding all the code to deal with
the network type issues, then the argument that DRBD should not be merged
because you can do the same thing with MD/DM + NBD is invalid and can be
dropped/ignored

David Lang

On Sun, 12 Aug 2007, Paul Clements wrote:

> Iustin Pop wrote:
>> On Sun, Aug 12, 2007 at 07:03:44PM +0200, Jan Engelhardt wrote:
>> > On Aug 12 2007 09:39, [EMAIL PROTECTED] wrote:
>> > > now, I am not an expert on either option, but there are a couple of
>> > > things that I would question about the DRBD+MD option
>> > >
>> > > 1. when the remote machine is down, how does MD deal with it for
>> > > reads and writes?
>> >
>> > I suppose it kicks the drive and you'd have to re-add it by hand unless
>> > done by a cronjob.
>
> Yes, and with a bitmap configured on the raid1, you just resync the blocks
> that have been written while the connection was down.
>
>> From my tests, since NBD doesn't have a timeout option, MD hangs in the
>> write to that mirror indefinitely, somewhat like when dealing with a
>> broken IDE driver/chipset/disk.
>
> Well, if people would like to see a timeout option, I actually coded up a
> patch a couple of years ago to do just that, but I never got it into
> mainline because you can do almost as well by doing a check at user-level
> (I basically ping the nbd connection periodically and if it fails, I kill
> -9 the nbd-client).
>
>> > > 2. MD over local drive will alternate reads between mirrors (or so
>> > > I've been told), doing so over the network is wrong.
>> >
>> > Certainly. In which case you set "write_mostly" (or even write_only,
>> > not sure of its name) on the raid component that is nbd.
>> >
>> > > 3. when writing, will MD wait for the network I/O to get the data
>> > > saved on the backup before returning from the syscall? or can it sync
>> > > the data out lazily
>> >
>> > Can't answer this one - ask Neil :)
>>
>> MD has the write-mostly/write-behind options - which help in this case
>> but only up to a certain amount.
>
> You can configure write_behind (aka, asynchronous writes) to buffer as much
> data as you have RAM to hold. At a certain point, presumably, you'd want to
> just break the mirror and take the hit of doing a resync once your network
> leg falls too far behind.
>
> --
> Paul
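
The write_behind trade-off in the quoted text (buffer writes for the slow
leg, but break the mirror once it falls too far behind) might look roughly
like the following. This is a hedged toy model, not md's write-behind code:
WriteBehindLeg, max_pending and drain are hypothetical names, and the limit
of 2 is chosen only to keep the example short.

```python
from collections import deque


class WriteBehindLeg:
    """Slow mirror leg with a bounded backlog of not-yet-sent writes."""

    def __init__(self, max_pending):
        self.max_pending = max_pending
        self.pending = deque()
        self.in_sync = True

    def submit(self, block, data):
        """Queue a write for the slow leg; drop the leg if the backlog is full."""
        if not self.in_sync:
            return False
        if len(self.pending) >= self.max_pending:
            self.pending.clear()
            self.in_sync = False  # break the mirror; a resync is needed later
            return False
        self.pending.append((block, data))
        return True

    def drain(self, store, count):
        """Simulate the network catching up on up to `count` queued writes."""
        for _ in range(min(count, len(self.pending))):
            block, data = self.pending.popleft()
            store[block] = data


leg = WriteBehindLeg(max_pending=2)
remote = {}
accepted = [leg.submit(i, b"d") for i in range(2)]  # backlog fills up
leg.drain(remote, 1)                                # one write reaches the peer
accepted.append(leg.submit(2, b"d"))                # fits again
accepted.append(leg.submit(3, b"d"))                # backlog full: leg dropped
```

Once the leg has been dropped, catching it up again is a resync problem
rather than a buffering problem.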