Re: [PATCH 00/37] Permit filesystem local caching
On Tuesday 26 February 2008 06:33, David Howells wrote:
> > Suppose one were to take a mundane approach to the persistent cache
> > problem instead of layering filesystems. What you would do then is
> > change NFS's ->write_page and variants to fiddle the persistent cache
>
> It is a requirement laid down by the Linux NFS fs maintainers that the
> writes to the cache be asynchronous, even if the writes to NFS aren't.

As it happens, I will be hanging out for the next few days with said NFS
maintainers; it would help to be as informed as possible about your patch
set.

> Note further that NFS's write_page() != writing to the cache. Writing to
> the cache is typically done by NFS's readpages().

Yes, of course. But also by ->write_page, no?

> > Which I could eventually find out by reading all the patches but
> > asking you is so much more fun :-)
>
> And a waste of my time. I've provided documentation in the main FS-Cache
> patch, both as text files and in comments in header files that answer
> your questions. Please read them first.

37 patches, none of which has "Documentation" in the subject line, and you
did not provide a diffstat in patch 0 for the patch set as a whole. If I
had known it was there of course I would have read it. It is great to see
this level of documentation. But I do not think it is fair to blame your
(one) reader for missing it. See the smiley above?

The _real_ reason I am asking you is that I do not think anybody
understands your patch set, in spite of your considerable efforts to
address that. Discussion in public, right or wrong, is the only way to fix
that. It is counterproductive to drive readers away from the discussion
for fear that they may miss some point obvious to the original author, or
perhaps already discussed earlier on lkml, and get flamed for it.
Obviously, the patch set is not going to be perfect when it goes in, and
it would be a silly abuse of the open source process to require that, but
the parts where it touches the rest of the system have to be really well
understood, and it is clear from the level of participation in the thread
that they are not.

One bit that already came out of this, which you have alluded to several
times yourself but somehow seem to keep glossing over, is that you need a
->direct_bio file operations method. So does loopback mount. It might be
worth putting some effort into seeing how ->direct_IO can be refactored to
make that happen. You can get it in separately on the basis of helping
loopback, and it will make your patches nicer.

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/37] Permit filesystem local caching
I need to respond to this in pieces... first the bit that is bugging me:

> * two new page flags
>
> I need to keep track of two bits of per-cached-page information:
>
> (1) This page is known by the cache, and that the cache must be informed
>     if the page is going to go away.

I still do not understand the life cycle of this bit. What does the cache
do when it learns the page has gone away? How is it informed? Who owns the
page cache in which such a page lives, the nfs client? The filesystem that
hosts the page? A third page cache owned by the cache itself? (See my
basic confusion about how many page cache levels you have, below.)

Suppose one were to take a mundane approach to the persistent cache
problem instead of layering filesystems. What you would do then is change
NFS's ->write_page and variants to fiddle the persistent cache as well as
the network, instead of just the network as now. This fiddling could even
consist of ->write calls to another filesystem, though working directly
with the bio interface would yield the fastest, and therefore to my mind,
best result.

In any case, you find out how to write the page to backing store by asking
the filesystem, which in the naive approach would be nfs augmented with
caching library calls. The filesystem keeps its own metadata around to
know how to map the page to disk. So again naively, this metadata could
tell the nfs client that the page is not mapped to disk at all. So I do
not see what your per-page bit is for, obviously because I do not fully
understand your caching scheme. Which I could eventually find out by
reading all the patches, but asking you is so much more fun :-)

By the way, how many levels of page caching for the same data are there?
Is it:

  1) nfs client
  2) cache layer's own page cache
  3) filesystem hosting the cache

or just:

  1) nfs client page cache
  2) filesystem hosting the cache

I think it is the second, but that is already double caching, which has
got to hurt.
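Roughly what I have in mind, as a user-space toy with the network and the
cache standing in as flat buffers. All names here are made up, and the
real thing would of course be asynchronous and work on struct page; the
point is only that the filesystem's own metadata, not a per-page flag,
answers "is this page in the cache?":

```c
#include <string.h>

#define PAGE_SIZE 4096

/* toy backing stores standing in for the network and the local cache */
static char network_store[PAGE_SIZE];
static char cache_store[PAGE_SIZE];
static int  cache_mapped;   /* fs metadata: "does the cache hold this page?" */

/* stand-in for nfs_writepage(): push the page to the server and, when a
 * persistent cache is attached, mirror it there in the same call */
int nfs_writepage(const char *page)
{
	memcpy(network_store, page, PAGE_SIZE);  /* write to the server */
	memcpy(cache_store, page, PAGE_SIZE);    /* fiddle the persistent cache */
	cache_mapped = 1;                        /* metadata now maps the page */
	return 0;
}

/* read side: ask the filesystem's own metadata whether the page can be
 * served from the local cache; returns 1 on a cache hit, 0 on a miss */
int nfs_readpage(char *page)
{
	if (cache_mapped) {
		memcpy(page, cache_store, PAGE_SIZE);  /* hit: no network */
		return 1;
	}
	memcpy(page, network_store, PAGE_SIZE);    /* miss: go to the server */
	return 0;
}
```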
Regards,

Daniel
Re: [PATCH 00/37] Permit filesystem local caching
On Monday 25 February 2008 15:19, David Howells wrote:
> So I guess there's a problem in cachefiles's efficiency - possibly due
> to the fact that it tries to be fully asynchronous.

OK, not just my imagination, and it makes me feel better about the patch
set because efficiency bugs are fixable while fundamental limitations are
not.

How much of a hurry are you in to merge this feature? You have bits like
this:

  "Add a function to install a monitor on the page lock waitqueue for a
  particular page, thus allowing the page being unlocked to be detected.
  This is used by CacheFiles to detect read completion on a page in the
  backing filesystem so that it can then copy the data to the waiting
  netfs page."

We already have that hook, it is called bio_endio. My strong intuition is
that your whole mechanism should sit directly on the block device, no
matter how attractive it seems to be able to piggyback on the namespace
and layout management code of existing filesystems.

I see your current effort as the moral equivalent of FUSE: you are able to
demonstrate certain desirable behavioral properties, but you are unable to
reach full theoretical efficiency because there are layers and layers of
interface gunk interposed between the netfs user and the cache device.

That said, I also see you have put a huge amount of work into this over
the years, it is nicely broken out, you are responsive and easy to work
with, all arguments for an early merge. Against that, you invade core
kernel for reasons that are not necessarily justified:

  * two new page flags
  * a new fileops method
  * many changes to LSM including new object class and new hooks
  * separate fs*id from task struct
  * new page-private destructor hook
  * probably other bits I missed

Would it be correct to say that some of these changes are to support
disconnected operation?
If so, you really have two patch sets:

  1) Persistent netfs cache
  2) Disconnected netfs operation

You have some short snappers that look generally useful:

  * add_wait_queue_tail (cool)
  * write to a file without a struct file (includes ->mapping cleanup,
    probably good)
  * export fsync_super

Why not hunt around for existing in-kernel users that would benefit so
these can be submitted as standalone patches, shortening the remaining
patch set and partially overcoming objections due to core kernel changes?

One thing I don't see is users coming on to lkml and saying "please merge
this, it works great for me". Since you probably have such users, why not
give them a poke?

Your cachefilesd is going to need anti-deadlock medicine like ddsnap has.
Since you don't seem at all worried about that right now, I suspect you
have not hammered this code really heavily, correct? Without preventative
measures, any memory-using daemon sitting in the block IO path will
deadlock if you hit it hard enough.

A couple of years ago you explained the purpose of the new page flags to
me, and there is no way I can find that email again. Could you explain it
again please? Meanwhile I am doing my duty and reading your OLS slides
etc.

Regards,

Daniel
Re: [PATCH 00/37] Permit filesystem local caching
On Friday 22 February 2008 04:48, David Howells wrote:
> > But looking up the object in the cache should be nearly free - much
> > less than a microsecond per block.
>
> The problem is that you have to do a database lookup of some sort,
> possibly involving several synchronous disk operations.

Right, so the obvious optimization strategy for this corner of it is to
decimate the synchronous disk ops for the average case, for which there
are a variety of options, one of which you already suggested.

> CacheFiles does a disk lookup by taking the key given to it by NFS,
> turning it into a set of file or directory names, and doing a short
> pathwalk to the target cache file. Throwing in extra indices won't
> necessarily help. What matters is how quick the backing filesystem is
> at doing lookups. As it turns out, Ext3 is a fair bit better than BTRFS
> when the disk cache is cold.

All understood. I am eventually going to suggest cutting the backing
filesystem entirely out of the picture, with a view to improving both
efficiency and transparency, hopefully with a code size reduction as
well. But you are up and running with the filesystem approach, enough to
tackle the basic algorithm questions, which is worth a lot.

I really do not like the idea of force fitting this cache into a generic
vfs model. Sun was collectively smoking some serious crack when they
cooked that one up. But there is also the ageless principle "isness is
more important than niceness".

> > > The metadata problem is quite a tricky one since it increases with
> > > the number of files you're dealing with. As things stand in my
> > > patches, when NFS, for example, wants to access a new inode, it
> > > first has to go to the server to lookup the NFS file handle, and
> > > only then can it go to the cache to find out if there's a matching
> > > object in the cache.
> >
> > So without the persistent cache it can omit the LOOKUP and just send
> > the filehandle as part of the READ?
>
> What 'it'?
> Note that to get the filehandle, you have to do a LOOKUP op. With the
> cache, we could actually cache the results of lookups that we've done,
> however, we don't know that the results are still valid without going
> to the server :-/

What I was trying to say. It => the cache logic.

> AFS has a way around that - it versions its vnode (inode) IDs.

Which would require a change to NFS, not an option because you hope to
work with standard servers? Of course with years to think about this, the
required protocol changes were put into v4. Not.

/me hopes for an NFS hack to show up and explain the thinking there

Actually, there are many situations where changing both the client (you
must do that anyway) and the server is logistically practical. In fact
that is true for all actual use cases I know of for this cache model. So
elaborating the protocol is not an option to reject out of hand. A hack
along those lines could (should?) be provided as an opportunistic option.

Have you completely exhausted optimization ideas for the file handle
lookup?

> > > The reason my client going to my server is so quick is that the
> > > server has the dcache and the pagecache preloaded, so that
> > > across-network lookup operations are really, really quick, as
> > > compared to the synchronous slogging of the local disk to find the
> > > cache object.
> >
> > Doesn't that just mean you have to preload the lookup table for the
> > persistent cache so you can determine whether you are caching the
> > data for a filehandle without going to disk?
>
> Where "lookup table" == "dcache". That would be good, yes. cachefilesd
> prescans all the files in the cache, which ought to do just that, but
> it doesn't seem to be very effective. I'm not sure why.

RCU? Anyway, it is something to be tracked down and put right.

> > Your big can't-get-there-from-here is the round trip to the server to
> > determine whether you should read from the local cache. Got any
> > ideas?
>
> I'm not sure what you mean.
> Your statement should probably read "... to determine _what_ you should
> read from the local cache".

What I tried to say. So still... got any ideas? That extra synchronous
network round trip is a killer. Can it be made streaming/async to keep
throughput healthy?

> > And where is the Trond-meister in all of this?
>
> Keeping quiet as far as I can tell.

/me does the Trond summoning dance

Daniel
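To make concrete the kind of key-to-pathname rendering discussed above: as
I understand it, CacheFiles turns the netfs key into a short directory
fan-out so the backing filesystem's own directory lookup does the index
work. A toy of the idea, where the two-level fan-out and the
"cache/xx/yy/" shape are my invention, not the actual CacheFiles layout:

```c
#include <stdio.h>
#include <string.h>

/* Toy rendering of a netfs key into a cache pathname: two fan-out
 * directories taken from the first key bytes, then the full key in hex
 * as the leaf name.  Assumes keylen >= 2.  Shape is illustrative only. */
void key_to_path(const unsigned char *key, size_t keylen,
                 char *out, size_t outlen)
{
	size_t n = snprintf(out, outlen, "cache/%02x/%02x/", key[0], key[1]);

	/* append the whole key in hex as the leaf filename */
	for (size_t i = 0; i < keylen && n + 3 <= outlen; i++)
		n += snprintf(out + n, outlen - n, "%02x", key[i]);
}
```

The pathwalk cost is then two directory lookups plus one leaf lookup,
which is exactly why the backing filesystem's cold-cache lookup speed
dominates.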
Re: [PATCH 00/37] Permit filesystem local caching
On Thursday 21 February 2008 16:07, David Howells wrote:
> The way the client works is like this:

Thanks for the excellent ascii art, that cleared up the confusion right
away.

> What are you trying to do exactly? Are you actually playing with it, or
> just looking at the numbers I've produced?

Trying to see if you are offering enough of a win to justify testing it,
and if that works out, then going shopping for a bin of rotten vegetables
to throw at your design, which I hope you will perceive as useful. In
short, I am looking for a reason to throw engineering effort at it.

From the numbers you have posted I think you are missing some basic
efficiencies that could take this design from the sorta-ok zone to wow! I
think you may already be in the wow zone for taking load off a server,
and I know of applications where an NFS server gets hammered so badly
that having the client suck a little in the unloaded case is a price
worth paying. But the whole idea would be much more attractive if the
regressions were smaller.

> > Who is supposed to win big? Is this mainly about reducing the load on
> > the server, or is the client supposed to win even with a lightly
> > loaded server?
>
> These are difficult questions to answer. The obvious answer to both is
> "it depends", and the real answer to both is "it's a compromise".
>
> Inserting a cache adds overhead: you have to look in the cache to see
> if your objects are mirrored there, and then you have to look in the
> cache to see if the data you want is stored there; and then you might
> have to go to the server anyway and then schedule a copy to be stored
> in the cache.

But looking up the object in the cache should be nearly free - much less
than a microsecond per block. If not then there are design issues. I
suspect that you are doing yourself a disservice by going all the way
through the vfs to do this cache lookup, but this needs to be proved.
> The characteristics of this type of cache depend on a number of things:
> the filesystem backing it being the most obvious variable, but also how
> fragmented it is and the properties of the disk drive or drives it is
> on.

Double caching, and the vm's unawareness of it, has to hurt.

> The metadata problem is quite a tricky one since it increases with the
> number of files you're dealing with. As things stand in my patches,
> when NFS, for example, wants to access a new inode, it first has to go
> to the server to lookup the NFS file handle, and only then can it go to
> the cache to find out if there's a matching object in the cache.

So without the persistent cache it can omit the LOOKUP and just send the
filehandle as part of the READ?

> Worse, the cache must then perform several synchronous disk bound
> metadata operations before it can be possible to read from the cache.
> Worse still, this means that a read on the network file cannot proceed
> until (a) we've been to the server *plus* (b) we've been to the disk.
>
> The reason my client going to my server is so quick is that the server
> has the dcache and the pagecache preloaded, so that across-network
> lookup operations are really, really quick, as compared to the
> synchronous slogging of the local disk to find the cache object.

Doesn't that just mean you have to preload the lookup table for the
persistent cache so you can determine whether you are caching the data
for a filehandle without going to disk?

> I can probably improve this a little by pre-loading the subindex
> directories (hash tables) that I use to reduce the directory size in
> the cache, but I don't know by how much.

Ah, I should have read ahead. I think the correct answer is "a lot".

Your big can't-get-there-from-here is the round trip to the server to
determine whether you should read from the local cache. Got any ideas?

And where is the Trond-meister in all of this?
Regards,

Daniel
Re: [PATCH 00/37] Permit filesystem local caching
Hi David,

I am trying to spot the numbers that show the sweet spot for this
optimization, without much success so far. Who is supposed to win big? Is
this mainly about reducing the load on the server, or is the client
supposed to win even with a lightly loaded server?

When you say Ext3 cache vs NFS cache, is the first on the server and the
second on the client?

Regards,

Daniel
Re: [PATCH 00/37] Permit filesystem local caching
Hi David,

On Wednesday 20 February 2008 08:05, David Howells wrote:
> These patches add local caching for network filesystems such as NFS.

Have you got before/after benchmark results?

Regards,

Daniel
Re: [RFC] ext3 freeze feature
On Friday 25 January 2008 05:33, Theodore Tso wrote:
> and then detect the deadlock case where the process holding the file
> descriptor used to freeze the filesystem gets frozen because it
> attempted to write to the filesystem --- at which point it gets some
> kind of signal (which defaults to killing the process), and the
> filesystem is unfrozen and as part of the unfreeze you wake up all of
> the processes that were put to sleep for touching the frozen
> filesystem.

Hi Ted,

There are a few holes:

  * The process may try to handle the signal and end up blocking on the
    filesystem again.

  * The process might pass the fd to another process by forking or fd
    passing.

  * The process holding the fd might be trying to take a lock held by
    another process that is blocked on the filesystem, and infinite
    variations on that theme.

Remembering the task that did the ioctl might work out better than
remembering the fd. Or just not try to be so fancy and rely on the
application to take appropriate measures to ensure it will not access the
filesystem, such as memlocking and not execing.

The freezer also needs to run in PF_MEMALLOC mode or similar, unless it
can be sure it will not cause pageout to the frozen filesystem under low
memory conditions.

Regards,

Daniel
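The second hole is easy to demonstrate from user space: after fork() the
child holds the very same open file description, so any privilege keyed to
"the process holding the fd" quietly extends to children. A standalone
demo, nothing freeze-specific about it (the test file path is arbitrary):

```c
#include <fcntl.h>
#include <unistd.h>
#include <sys/wait.h>

/* Open a file, fork, and let the child read through the inherited fd.
 * Returns 0 if the child could use the fd, proving that tracking "the
 * fd holder" does not pin down a single process. */
int fd_shared_after_fork(const char *path)
{
	int fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	pid_t pid = fork();
	if (pid == 0) {
		/* child: the inherited fd is live and usable */
		char c;
		ssize_t n = read(fd, &c, 1);
		_exit(n == 1 ? 0 : 1);
	}

	int status;
	waitpid(pid, &status, 0);
	close(fd);
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```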
Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
On Jan 17, 2008 7:29 AM, Szabolcs Szakacsits <[EMAIL PROTECTED]> wrote:
> Similarly to ZFS, Windows Server 2008 also has self-healing NTFS:

I guess that is enough votes to justify going ahead and trying an
implementation of the reverse mapping ideas I posted. But of course more
votes for this are better. If online incremental fsck is something people
want, then please speak up here and that will very definitely help make
it happen.

On the walk-before-run principle, it would initially just be filesystem
checking, not repair. But even this would help, by setting per-group
checked flags that offline fsck could use to do a much quicker repair
pass. And it will let you know when a volume needs to be taken offline,
without having to build in planned downtime just in case, which already
eats a bunch of nines.

Regards,

Daniel
Re: [Btrfs-devel] [ANNOUNCE] Btrfs v0.10 (online growing/shrinking, ext3 conversion, and more)
On Jan 17, 2008 1:25 PM, Chris Mason <[EMAIL PROTECTED]> wrote:
> So, I've put v0.11 out there. It fixes those two problems and will also
> compile on older (2.6.18) enterprise kernels.
>
> v0.11 does not have any disk format changes.

Hi Chris,

First, massive congratulations for bringing this to fruition in such a
short time. Now back to the regular carping: why even support older
kernels?

Regards,

Daniel
Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
Hi Pavel,

Along with this effort, could you let me know if the world actually cares
about online fsck? I think we now know how to do it, but is it worth the
effort?

Regards,

Daniel
Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
On Jan 15, 2008 7:15 PM, Alan Cox <[EMAIL PROTECTED]> wrote:
> > Writeback cache on disk in itself is not bad, it only gets bad if the
> > disk is not engineered to save all its dirty cache on power loss,
> > using the disk motor as a generator or alternatively a small battery.
> > It would be awfully nice to know which brands fail here, if any,
> > because writeback cache is a big performance booster.
>
> AFAIK no drive saves the cache. The worst case cache flush for drives
> is several seconds with no retries and a couple of minutes if something
> really bad happens.
>
> This is why the kernel has some knowledge of barriers and uses them to
> issue flushes when needed.

Indeed, you are right, which is supported by actual measurements:

  http://sr5tech.com/write_back_cache_experiments.htm

Sorry for implying that anybody has engineered a drive that can do such a
nice thing with writeback cache.

The "disk motor as a generator" tale may not be purely folklore. When an
IDE drive is not in writeback mode, something special needs to be done to
ensure the last write to media is not a scribble.

A small UPS can make writeback mode actually reliable, provided the
system is smart enough to take the drives out of writeback mode when the
line power is off.

Regards,

Daniel
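For completeness, the only handle an application has on those barriers
from user space is to ask for the flush explicitly: fsync() is what ends
up triggering the cache-flush the kernel sends down to the drive. A plain
sketch, with only the obvious error paths:

```c
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

/* Write a buffer and force it to stable storage before returning.
 * The fsync() is the step that makes the kernel issue the barrier /
 * cache flush to the drive; without it the data may sit in the drive's
 * writeback cache across a power loss. */
int write_durably(const char *path, const char *buf, size_t len)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}
```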
Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
On Jan 15, 2008 6:07 PM, Pavel Machek <[EMAIL PROTECTED]> wrote:
> I had write cache enabled on my main computer. Oops. I guess that means
> we do need better documentation.

Writeback cache on disk in itself is not bad, it only gets bad if the
disk is not engineered to save all its dirty cache on power loss, using
the disk motor as a generator or alternatively a small battery. It would
be awfully nice to know which brands fail here, if any, because writeback
cache is a big performance booster.

Regards,

Daniel
Re: [RFD] Incremental fsck
Hi Ted,

On Saturday 12 January 2008 06:51, Theodore Tso wrote:
> What is very hard to check is whether or not the link count on the
> inode is correct. Suppose the link count is 1, but there are actually
> two directory entries pointing at it. Now when someone unlinks the file
> through one of the directory hard entries, the link count will go to
> zero, and the blocks will start to get reused, even though the inode is
> still accessible via another pathname. Oops. Data Loss.
>
> This is why doing incremental, on-line fsck'ing is *hard*. You're not
> going to find this while doing each directory one at a time, and if the
> filesystem is changing out from under you, it gets worse. And it's not
> just the hard link count. There is a similar issue with the block
> allocation bitmap. Detecting the case where two files simultaneously
> claim the same block can't be done if you are doing it incrementally,
> and if the filesystem is changing out from under you, it's impossible,
> unless you also have the filesystem telling you every single change
> while it is happening, and you keep an insane amount of bookkeeping.

In this case I am listening to Chicken Little carefully and really do
believe the sky will fall if we fail to come up with an incremental
online fsck some time in the next few years. I realize the challenge
verges on insane, but I have been slowly chewing away at this question
for some time.

Val proposes to simplify the problem by restricting the scope of block
pointers and hard links. Best of luck with that, the concept of fault
isolation domains has a nice ring to it. I prefer to stick close to tried
and true Ext3 and not change the basic algorithms. Rather than
restricting pointers, I propose to add a small amount of new metadata to
accelerate global checking.

The idea is to be able to build per-group reverse maps very quickly, to
support mapping physical blocks back to inodes that own them, and mapping
inodes back to the directories that reference them.
I see on-the-fly filesystem reverse mapping as useful for more than just
online fsck. For example, it would be nice to be able to work backwards
efficiently from a list of changed blocks such as ddsnap produces to a
list of file level changes.

The amount of metadata required to support efficient on-the-fly reverse
mapping is surprisingly small: 2K per block group per terabyte, in a
fixed location at the base of each group. This is consistent with my goal
of producing code that is mergeable for Ext4 and backportable to Ext3.

Building a block reverse map for a given group is easy and efficient. The
first pass walks across the inode table and already maps most of the
physical blocks for typical usage patterns, because most files only have
direct pointers. Index blocks discovered in the first pass go onto a list
to be processed by subsequent passes, which may discover additional index
blocks. Just keep pushing the index blocks back onto the list, and the
algorithm terminates when the list is empty. This builds a reverse map
for the group, including references to external groups.

Note that the recent metadata clustering patch from Abhishek Rai will
speed up this group mapping algorithm significantly, because (almost) all
the index blocks can be picked up in one linear read. This should only
take a few milliseconds. One more reason why I think his patch is an
Important Patch[tm].

A data block may be up to four groups removed from its home group,
therefore the reverse mapping process must follow pointers across groups
and map each file entirely to be sure that all pointers to the group
being checked have been discovered. It is possible to construct a case
where a group contains a lot of inodes of big files that are mostly
stored in other groups. Mapping such a group could possibly require
examining all the index blocks on the entire volume. That would be about
2**18 index blocks per terabyte, which is still within the realm of
practicality.
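The group mapping pass above is just a worklist algorithm. A toy version
with a single level of index blocks, where all the structures are invented
for illustration (real Ext3 has three levels of indirection and an
on-disk inode table, not these arrays):

```c
#include <string.h>

#define NBLOCKS 64
#define NDIRECT 4
#define NPTRS   8

struct toy_inode {
	int direct[NDIRECT];  /* direct block pointers, -1 = unused */
	int index;            /* one index block, -1 = none (toy: one level) */
};

int index_blocks[NBLOCKS][NPTRS]; /* contents of each index block, -1 = unused */
int owner[NBLOCKS];               /* reverse map: block -> owning inode, -1 = free */

/* Pass 1 walks the inode table, mapping direct pointers and pushing
 * index blocks onto a work list.  The drain loop then maps the pointers
 * found inside each index block; in a multi-level version, further index
 * blocks discovered there would be pushed back onto the list, and the
 * algorithm terminates when the list is empty. */
void build_reverse_map(struct toy_inode *inodes, int ninodes)
{
	int worklist[NBLOCKS], workinode[NBLOCKS], nwork = 0;

	memset(owner, -1, sizeof owner);
	for (int i = 0; i < ninodes; i++) {
		for (int d = 0; d < NDIRECT; d++)
			if (inodes[i].direct[d] >= 0)
				owner[inodes[i].direct[d]] = i;
		if (inodes[i].index >= 0) {
			owner[inodes[i].index] = i;  /* the index block itself */
			worklist[nwork] = inodes[i].index;
			workinode[nwork++] = i;
		}
	}
	while (nwork > 0) {  /* drain the work list */
		int blk = worklist[--nwork], ino = workinode[nwork];
		for (int p = 0; p < NPTRS; p++)
			if (index_blocks[blk][p] >= 0)
				owner[index_blocks[blk][p]] = ino;
	}
}
```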
To generate the inode reverse map for a group, walk each directory in the
group, decoding the index blocks by hand. Strictly speaking, directories
ought to pass block level checking before being reverse mapped, but there
could be many directories in the same group spilling over into a lot of
external groups, so getting all the directory inodes to pass block level
checks at the same time could be difficult with filesystem writing going
on between fsck episodes. Instead, just go ahead and assume a directory
file is ok, and if this is not the case the directory walk will fail or a
block level check will eventually pick up the problem.

The worst case for directory mapping is much worse than the worst case
for block mapping. A single directory could fill an entire volume. For
such a large directory, reverse mapping is not possible without keeping
the filesystem suspended for an unreasonable time. Either make the
reverse map incremental and maintained on the fly
Re: [RFD] Incremental fsck
On Wednesday 09 January 2008 01:16, Andreas Dilger wrote:
> While an _incremental_ fsck isn't so easy for existing filesystem
> types, what is pretty easy to automate is making a read-only snapshot
> of a filesystem via LVM/DM and then running e2fsck against that. The
> kernel and filesystem have hooks to flush the changes from cache and
> make the on-disk state consistent.
>
> You can then set the ext[234] superblock mount count and last check
> time via tune2fs if all is well, or schedule an outage if there are
> inconsistencies found.
>
> There is a copy of this script at:
> http://osdir.com/ml/linux.lvm.devel/2003-04/msg1.html
>
> Note that it might need some tweaks to run with DM/LVM2
> commands/output, but is mostly what is needed.

You can do this now with ddsnap (an out-of-tree device mapper target),
either by checking a local snapshot or a replicated snapshot on a
different machine, see:

  http://zumastor.org/

Doing the check on a remote machine seems attractive because the fsck
does not create a load on the server.

Regards,

Daniel
Re: [PATCH][RFC] fast file mapping for loop
Hi Jens,

This looks really useful.

On Wednesday 09 January 2008 00:52, Jens Axboe wrote:
> Disadvantages:
>
> - The file block mappings must not change while loop is using the file.
>   This means that we have to ensure exclusive access to the file and
>   this is the bit that is currently missing in the implementation. It
>   would be nice if we could just do this via open(), ideas welcome...

Get_block methods are pretty fast, and you have caching in the level
above you, so you might be able to get away with no cache of physical
addresses at all, in which case you just need i_mutex and i_alloc_sem at
get_block time. This would save a pile of code and still have the main
benefit of avoiding double caching.

If you use ->get_block instead of bmap, it will fill in file holes for
you, but of course get_block is not exposed, and Al is likely to bark at
anyone who exposes it. Instead of exposing get_block, you could expose an
aops method like ->bio_transfer that would hide the use of *_get_block in
a library routine, just as __blockdev_direct_IO does. Chances are, there
are other users besides loop that would be interested in a generic way of
performing bio transfers to files.

I presume you would fall back to the existing approach for any filesystem
without get_block. You could handle this transparently with a default
library method that does read/write.

Regards,

Daniel
Re: [PATCH 1/3] VFS: apply coding standards to fs/ioctl.c
On 10/28/07, Christoph Hellwig <[EMAIL PROTECTED]> wrote:
> While you're at it, it's probably worth splitting this out into a small
> helper function.

Why? Is the same pattern called from more than one place?

Regards,

Daniel
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Friday 31 August 2007 14:41, Alasdair G Kergon wrote: > On Thu, Aug 30, 2007 at 04:20:35PM -0700, Daniel Phillips wrote: > > Resubmitting a bio or submitting a dependent bio from > > inside a block driver does not need to be throttled because all > > resources required to guarantee completion must have been obtained > > _before_ the bio was allowed to proceed into the block layer. > > I'm toying with the idea of keeping track of the maximum device stack > depth for each stacked device, and only permitting it to increase in > controlled circumstances. Hi Alasdair, What kind of circumstances did you have in mind? Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Wednesday 29 August 2007 01:53, Evgeniy Polyakov wrote: > Then, if of course you will want, which I doubt, you can reread > previous mails and find that it was pointed to that race and > possibilities to solve it way too long ago. What still bothers me about your response is that, while you know the race exists and do not disagree with my example, you don't seem to see that that race can eventually lock up the block device by repeatedly losing throttle counts which are never recovered. What prevents that?

> > --- 2.6.22.clean/block/ll_rw_blk.c	2007-07-08 16:32:17.000000000 -0700
> > +++ 2.6.22/block/ll_rw_blk.c	2007-08-24 12:07:16.000000000 -0700
> > @@ -3237,6 +3237,15 @@ end_io:
> >  	 */
> >  void generic_make_request(struct bio *bio)
> >  {
> > +	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> > +
> > +	if (q && q->metric) {
> > +		int need = bio->bi_reserved = q->metric(bio);
> > +		bio->queue = q;
>
> In case you have stacked device, this entry will be rewritten and you
> will lost all your account data.

It is a weakness all right. Well,

-	if (q && q->metric) {
+	if (q && q->metric && !bio->queue) {

which fixes that problem. Maybe there is a better fix possible. Thanks for the catch! The original conception was that this block throttling would apply only to the highest level submission of the bio, the one that crosses the boundary between filesystem (or direct block device application) and block layer. Resubmitting a bio or submitting a dependent bio from inside a block driver does not need to be throttled because all resources required to guarantee completion must have been obtained _before_ the bio was allowed to proceed into the block layer. The other principle we are trying to satisfy is that the throttling should not be released until bio->endio, which I am not completely sure about with the patch as modified above.
Your earlier idea of having the throttle protection only cover the actual bio submission is interesting and may be effective in some cases, in fact it may cover the specific case of ddsnap. But we don't have to look any further than ddraid (distributed raid) to find a case it doesn't cover - the additional memory allocated to hold parity data has to be reserved until parity data is deallocated, long after the submission completes. So while you manage to avoid some logistical difficulties, it also looks like you didn't solve the general problem. Hopefully I will be able to report on whether my patch actually works soon, when I get back from vacation. The mechanism in ddsnap this is supposed to replace is effective, it is just ugly and tricky to verify. Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Tuesday 28 August 2007 10:54, Evgeniy Polyakov wrote: > On Tue, Aug 28, 2007 at 10:27:59AM -0700, Daniel Phillips ([EMAIL PROTECTED]) > wrote: > > > We do not care about one cpu being able to increase its counter > > > higher than the limit, such inaccuracy (maximum bios in flight > > > thus can be more than limit, difference is equal to the number of > > > CPUs - 1) is a price for removing atomic operation. I thought I > > > pointed it in the original description, but might forget, that if > > > it will be an issue, that atomic operations can be introduced > > > there. Any uber-precise measurements in the case when we are > > > close to the edge will not give us any benefit at all, since were > > > are already in the grey area. > > > > This is not just inaccurate, it is suicide. Keep leaking throttle > > counts and eventually all of them will be gone. No more IO > > on that block device! > > First, because number of increased and decreased operations are the > same, so it will dance around limit in both directions. No. Please go and read the description of the race again. A count gets irretrievably lost because the write operation of the first decrement is overwritten by the second. Data gets lost. Atomic operations exist to prevent that sort of thing. You either need to use them or have a deep understanding of SMP read and write ordering in order to preserve data integrity by some equivalent algorithm. > Let's solve problems in order of their appearence. If bio structure > will be allowed to grow, then the whole patches can be done better. How about the patch below? This throttles any block driver by implementing a throttle metric method so that each block driver can keep track of its own resource consumption in units of its choosing. As an (important) example, it implements a simple metric for device mapper devices. Other block devices will work as before, because they do not define any metric.
Short, sweet and untested, which is why I have not posted it until now. This patch originally kept its accounting info in backing_dev_info, however that structure seems to be in some flux and it is just a part of struct queue anyway, so I lifted the throttle accounting up into struct queue. We should be able to report on the efficacy of this patch in terms of deadlock prevention pretty soon.

--- 2.6.22.clean/block/ll_rw_blk.c	2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/block/ll_rw_blk.c	2007-08-24 12:07:16.000000000 -0700
@@ -3237,6 +3237,15 @@ end_io:
 	 */
 void generic_make_request(struct bio *bio)
 {
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+
+	if (q && q->metric) {
+		int need = bio->bi_reserved = q->metric(bio);
+		bio->queue = q;
+		wait_event_interruptible(q->throttle_wait,
+			atomic_read(&q->available) >= need);
+		atomic_sub(need, &q->available);
+	}
+
 	if (current->bio_tail) {
 		/* make_request is active */
 		*(current->bio_tail) = bio;
--- 2.6.22.clean/drivers/md/dm.c	2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/drivers/md/dm.c	2007-08-24 12:14:23.000000000 -0700
@@ -880,6 +880,11 @@ static int dm_any_congested(void *conges
 	return r;
 }
 
+static unsigned dm_metric(struct bio *bio)
+{
+	return bio->bi_vcnt;
+}
+
 /*-----------------------------------------------------------------
  * An IDR is used to keep track of allocated minor numbers.
  *---------------------------------------------------------------*/
@@ -997,6 +1002,10 @@ static struct mapped_device *alloc_dev(i
 		goto bad1_free_minor;
 
 	md->queue->queuedata = md;
+	md->queue->metric = dm_metric;
+	atomic_set(&md->queue->available, md->queue->capacity = 1000);
+	init_waitqueue_head(&md->queue->throttle_wait);
+
 	md->queue->backing_dev_info.congested_fn = dm_any_congested;
 	md->queue->backing_dev_info.congested_data = md;
 	blk_queue_make_request(md->queue, dm_request);
--- 2.6.22.clean/fs/bio.c	2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/fs/bio.c	2007-08-24 12:10:41.000000000 -0700
@@ -1025,7 +1025,12 @@ void bio_endio(struct bio *bio, unsigned
 		bytes_done = bio->bi_size;
 	}
 
-	bio->bi_size -= bytes_done;
+	if (!(bio->bi_size -= bytes_done) && bio->bi_reserved) {
+		struct request_queue *q = bio->queue;
+		atomic_add(bio->bi_reserved, &q->available);
+		bio->bi_reserved = 0;	/* just in case */
+		wake_up(&q->throttle_wait);
+	}
 	bio->bi_sector += (bytes_done >> 9);
 
 	if (bio->bi_end_io)
--- 2.6.22
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Tuesday 28 August 2007 02:35, Evgeniy Polyakov wrote: > On Mon, Aug 27, 2007 at 02:57:37PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: > > Say Evgeniy, something I was curious about but forgot to ask you > > earlier... > > > > On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote: > > > ...All oerations are not atomic, since we do not care about > > > precise number of bios, but a fact, that we are close or close > > > enough to the limit. > > > ... in bio->endio > > > + q->bio_queued--; > > > > In your proposed patch, what prevents the race:

> > 	cpu1				cpu2
> >
> > 	read q->bio_queued
> > 					q->bio_queued--
> > 	write q->bio_queued - 1

> > Whoops! We leaked a throttle count. > > We do not care about one cpu being able to increase its counter > higher than the limit, such inaccuracy (maximum bios in flight thus > can be more than limit, difference is equal to the number of CPUs - > 1) is a price for removing atomic operation. I thought I pointed it > in the original description, but might forget, that if it will be an > issue, that atomic operations can be introduced there. Any > uber-precise measurements in the case when we are close to the edge > will not give us any benefit at all, since were are already in the > grey area. This is not just inaccurate, it is suicide. Keep leaking throttle counts and eventually all of them will be gone. No more IO on that block device! > Another possibility is to create a queue/device pointer in the bio > structure to hold original device and then in its backing dev > structure add a callback to recalculate the limit, but it increases > the size of the bio. Do we need this? Different issue. Yes, I think we need a nice simple approach like that, and prove it is stable before worrying about the size cost. Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [1/1] Block device throttling [Re: Distributed storage.]
Say Evgeniy, something I was curious about but forgot to ask you earlier... On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote: > ...All oerations are not atomic, since we do not care about precise > number of bios, but a fact, that we are close or close enough to the > limit. > ... in bio->endio > + q->bio_queued--; In your proposed patch, what prevents the race:

	cpu1				cpu2

	read q->bio_queued
					q->bio_queued--
	write q->bio_queued - 1

Whoops! We leaked a throttle count. Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Tuesday 14 August 2007 05:46, Evgeniy Polyakov wrote: > > The throttling of the virtual device must begin in > > generic_make_request and last to ->endio. You release the throttle > > of the virtual device at the point you remap the bio to an > > underlying device, which you have convinced yourself is ok, but it > > is not. You seem to miss the fact that whatever resources the > > virtual device has allocated are no longer protected by the > > throttle count *of the virtual device*, or you do not > > Because it is charged to another device. Great. You charged the resource to another device, but you did not limit the amount of resources that the first device can consume. Which misses the whole point. Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Tuesday 14 August 2007 04:50, Evgeniy Polyakov wrote: > On Tue, Aug 14, 2007 at 04:35:43AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: > > On Tuesday 14 August 2007 04:30, Evgeniy Polyakov wrote: > > > > And it will not solve the deadlock problem in general. (Maybe > > > > it works for your virtual device, but I wonder...) If the > > > > virtual device allocates memory during generic_make_request > > > > then the memory needs to be throttled. > > > > > > Daniel, if device process bio by itself, it has a limit and thus > > > it will wait in generic_make_request() > > > > What will make it wait? > > gneric_make_request() for given block device. Not good enough, that only makes one thread wait. Look here: http://lkml.org/lkml/2007/8/13/788 An unlimited number of threads can come in, each consuming resources of the virtual device, and violating the throttling rules. The throttling of the virtual device must begin in generic_make_request and last to ->endio. You release the throttle of the virtual device at the point you remap the bio to an underlying device, which you have convinced yourself is ok, but it is not. You seem to miss the fact that whatever resources the virtual device has allocated are no longer protected by the throttle count *of the virtual device*, or you do not see why that is a bad thing. It is a very bad thing, roughly like leaving some shared data outside a spin_lock/unlock. Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Tuesday 14 August 2007 04:30, Evgeniy Polyakov wrote: > > And it will not solve the deadlock problem in general. (Maybe it > > works for your virtual device, but I wonder...) If the virtual > > device allocates memory during generic_make_request then the memory > > needs to be throttled. > > Daniel, if device process bio by itself, it has a limit and thus it > will wait in generic_make_request() What will make it wait? - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Tuesday 14 August 2007 01:46, Evgeniy Polyakov wrote: > On Mon, Aug 13, 2007 at 06:04:06AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: > > Perhaps you never worried about the resources that the device > > mapper mapping function allocates to handle each bio and so did not > > consider this hole significant. These resources can be > > significant, as is the case with ddsnap. It is essential to close > > that window through which the virtual device's queue limit may be > > violated. Not doing so will allow deadlock. > > This is not a bug, this is special kind of calculation - total limit > is number of physical devices multiplied by theirs limits. It was > done _on purpose_ to allow different device to have different limits > (for example in distributed storage project it is possible to have > both remote and local node in the same device, but local device > should not have _any_ limit at all, but network one should). > > Virtual device essentially has _no_ limit. And that as done on > purpose. And it will not solve the deadlock problem in general. (Maybe it works for your virtual device, but I wonder...) If the virtual device allocates memory during generic_make_request then the memory needs to be throttled. Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Monday 13 August 2007 02:12, Jens Axboe wrote: > > It is a system wide problem. Every block device needs throttling, > > otherwise queues expand without limit. Currently, block devices > > that use the standard request library get a slipshod form of > > throttling for free in the form of limiting in-flight request > > structs. Because the amount of IO carried by a single request can > > vary by two orders of magnitude, the system behavior of this > > approach is far from predictable. > > Is it? Consider just 10 standard sata disks. The next kernel revision > will have sg chaining support, so that allows 32MiB per request. Even > if we disregard reads (not so interesting in this discussion) and > just look at potentially pinned dirty data in a single queue, that > number comes to 4GiB PER disk. Or 40GiB for 10 disks. Auch. > > So I still think that this throttling needs to happen elsewhere, you > cannot rely the block layer throttling globally or for a single > device. It just doesn't make sense. You are right, so long as the unit of throttle accounting remains one request. This is not what we do in ddsnap. Instead we inc/dec the throttle counter by the number of bvecs in each bio, which produces a nice steady data flow to the disk under a wide variety of loads, and provides the memory resource bound we require. One throttle count per bvec will not be the right throttling metric for every driver. To customize this accounting metric for a given driver we already have the backing_dev_info structure, which provides per-device-instance accounting functions and instance data. Perfect! This allows us to factor the throttling mechanism out of the driver, so the only thing the driver has to do is define the throttle accounting if it needs a custom one. 
We can avoid affecting the traditional behavior quite easily: for example, if backing_dev_info->throttle_fn (a new method) is null, we can either not throttle at all (relying on the struct request in-flight limit) or move the in-flight request throttling logic into the core as the default throttling method, simplifying the request library without changing its behavior. > > These deadlocks are first and foremost, block layer deficiencies. > > Even the network becomes part of the problem only because it lies > > in the block IO path. > The block layer has NEVER guaranteed throttling, so it can - by > definition - not be a block layer deficiency. The block layer has always been deficient by not providing accurate throttling, or any throttling at all for some devices. We have practical proof that this causes deadlock and a good theoretical basis for describing exactly how it happens. To be sure, vm and net are co-conspirators, however the block layer really is the main actor in this little drama. Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Monday 13 August 2007 05:18, Evgeniy Polyakov wrote: > > Say you have a device mapper device with some physical device > > sitting underneath, the classic use case for this throttle code. > > Say 8,000 threads each submit an IO in parallel. The device mapper > > mapping function will be called 8,000 times with associated > > resource allocations, regardless of any throttling on the physical > > device queue. > > Each thread will sleep in generic_make_request(), if limit is > specified correctly, then allocated number of bios will be enough to > have a progress. The problem is, the sleep does not occur before the virtual device mapping function is called. Let's consider two devices, a physical device named pdev and a virtual device sitting on top of it called vdev. vdev's throttle limit is just one element, but we will see that in spite of this, two bios can be handled by the vdev's mapping method before any IO completes, which violates the throttling rules. According to your patch it works like this:

	Thread 1			Thread 2

	<vdev->q->bio_queued is zero>
	vdev->q->bio_queued++
	blk_set_bdev(bio, pdev)
	    vdev->bio_queued--
					<vdev->q->bio_queued is zero>
					vdev->q->bio_queued++

Whoops! Our virtual device mapping function has now allocated resources for two in-flight bios in spite of having its throttle limit set to 1. Perhaps you never worried about the resources that the device mapper mapping function allocates to handle each bio and so did not consider this hole significant. These resources can be significant, as is the case with ddsnap. It is essential to close that window through which the virtual device's queue limit may be violated. Not doing so will allow deadlock. Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Monday 13 August 2007 05:04, Evgeniy Polyakov wrote: > On Mon, Aug 13, 2007 at 04:04:26AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: > > On Monday 13 August 2007 01:14, Evgeniy Polyakov wrote: > > > > Oops, and there is also: > > > > > > > > 3) The bio throttle, which is supposed to prevent deadlock, can > > > > itself deadlock. Let me see if I can remember how it goes. > > > > > > > > * generic_make_request puts a bio in flight > > > > * the bio gets past the throttle and initiates network IO > > > > * net calls sk_alloc->alloc_pages->shrink_caches > > > > * shrink_caches submits a bio recursively to our block device > > > > * this bio blocks on the throttle > > > > * net may never get the memory it needs, and we are wedged > > > > > > If system is in such condition, it is already broken - throttle > > > limit must be lowered (next time) not to allow such situation. > > > > Agreed that the system is broken, however lowering the throttle > > limit gives no improvement in this case. > > How is it ever possible? The whole idea of throttling is to remove > such situation, and now you say it can not be solved. It was solved, by not throttling writeout that comes from shrink_caches. Ugly. > If limit is for > 1gb of pending block io, and system has for example 2gbs of ram (or > any other resonable parameters), then there is no way we can deadlock > in allocation, since it will not force page reclaim mechanism. The problem is that sk_alloc (called from our block driver via socket->write) would recurse into shrink_pages, which recursively submits IO to our block driver and blocks on the throttle. Subtle indeed, and yet another demonstration of why vm recursion is a Bad Thing. I will find a traceback for you tomorrow, which makes this deadlock much clearer. Regards - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Monday 13 August 2007 04:03, Evgeniy Polyakov wrote: > On Mon, Aug 13, 2007 at 03:12:33AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: > > > This is not a very good solution, since it requires all users of > > > the bios to know how to free it. > > > > No, only the specific ->endio needs to know that, which is set by > > the bio owner, so this knowledge lies in exactly the right place. > > A small handful of generic endios all with the same destructor are > > used nearly everywhere. > > That is what I meant - there will be no way to just alloc a bio and > put it, helpers for generic bio sets must be exported and each and > every bi_end_io() must be changed to check reference counter and they > must know how they were allocated. There are fewer non-generic bio allocators than you think. > Endio callback is of course quite rare and additional atomic > reading will not kill the system, but why introduce another read? > It is possible to provide a flag for endio callback that it is last, > but it still requires to change every single callback - why do we > want this? We don't. Struct bio does not need to be shrunk. Jens wanted to talk about what fields could be eliminated if we wanted to shrink it. It is about time to let that lie, don't you think? > So, I'm a bit lost... > > You say it is too big Did not say that. > and some parts can be removed or combined True. > and then that size does not matter. Also true, backed up by numbers on real systems. > Last/not-last checks in the code is > not clear design, so I do not see why it is needed at all if not for > size shrinking. Not needed, indeed. Accurate throttling is needed. If the best way to throttle requires expanding struct bio a little then we should not let concerns about the cost of an int or two stand in the way. Like Jens, I am more concerned about the complexity cost, and that is minimized in my opinion by throttling in the generic code rather than with custom code in each specialized block driver. 
Your patch does throttle in the generic code, great. Next thing is to be sure that it completely closes the window for reserve leakage, which is not yet clear. Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Monday 13 August 2007 01:23, Evgeniy Polyakov wrote: > On Sun, Aug 12, 2007 at 10:36:23PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: > > (previous incomplete message sent accidentally) > > > > On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote: > > > On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe wrote: > > > > > > So, what did we decide? To bloat bio a bit (add a queue pointer) > > > or to use physical device limits? The latter requires to replace > > > all occurence of bio->bi_bdev = something_new with > > > blk_set_bdev(bio, somthing_new), where queue limits will be > > > appropriately charged. So far I'm testing second case, but I only > > > changed DST for testing, can change all other users if needed > > > though. > > > > Adding a queue pointer to struct bio and using physical device > > limits as in your posted patch both suffer from the same problem: > > you release the throttling on the previous queue when the bio moves > > to a new one, which is a bug because memory consumption on the > > previous queue then becomes unbounded, or limited only by the > > number of struct requests that can be allocated. In other words, > > it reverts to the same situation we have now as soon as the IO > > stack has more than one queue. (Just a shorter version of my > > previous post.) > > No. Since all requests for virtual device end up in physical devices, > which have limits, this mechanism works. Virtual device will > essentially call either generic_make_request() for new physical > device (and thus will sleep is limit is over), or will process bios > directly, but in that case it will sleep in generic_make_request() > for virutal device. What can happen is, as soon as you unthrottle the previous queue, another thread can come in and put another request on it. 
Sure, that thread will likely block on the physical throttle and so will the rest of the incoming threads, but it still allows the higher level queue to grow past any given limit, with the help of lots of threads. JVM for example? Say you have a device mapper device with some physical device sitting underneath, the classic use case for this throttle code. Say 8,000 threads each submit an IO in parallel. The device mapper mapping function will be called 8,000 times with associated resource allocations, regardless of any throttling on the physical device queue. Anyway, your approach is awfully close to being airtight, there is just a small hole. I would be more than happy to be proved wrong about that, but the more I look, the more I see that hole. > > 1) One throttle count per submitted bio is too crude a measure. A > > bio can carry as few as one page or as many as 256 pages. If you > > take only > > It does not matter - we can count bytes, pages, bio vectors or > whatever we like, its just a matter of counter and can be changed > without problem. Quite true. In some cases the simple inc/dec per bio works just fine. But the general case where finer granularity is required comes up in existing code, so there needs to be a plan. Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Block device throttling [Re: Distributed storage.]
On Monday 13 August 2007 01:14, Evgeniy Polyakov wrote: > > Oops, and there is also: > > > > 3) The bio throttle, which is supposed to prevent deadlock, can > > itself deadlock. Let me see if I can remember how it goes. > > > > * generic_make_request puts a bio in flight > > * the bio gets past the throttle and initiates network IO > > * net calls sk_alloc->alloc_pages->shrink_caches > > * shrink_caches submits a bio recursively to our block device > > * this bio blocks on the throttle > > * net may never get the memory it needs, and we are wedged > > If system is in such condition, it is already broken - throttle limit > must be lowered (next time) not to allow such situation. Agreed that the system is broken, however lowering the throttle limit gives no improvement in this case. This is not theoretical, but a testable, repeatable result. Instructions to reproduce should show up tomorrow. This bug is now solved in a kludgy way. Now, Peter's patch set offers a much cleaner way to fix this little problem, along with at least one other nasty that it already fixed. Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Monday 13 August 2007 03:22, Jens Axboe wrote: > I never compared the bio to struct page, I'd obviously agree that > shrinking struct page was a worthy goal and that it'd be ok to uglify > some code to do that. The same isn't true for struct bio. I thought I just said that. Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Monday 13 August 2007 03:06, Jens Axboe wrote: > On Mon, Aug 13 2007, Daniel Phillips wrote: > > Of course not. Nothing I said stops endio from being called in the > > usual way as well. For this to work, endio just needs to know that > > one call means "end" and the other means "destroy", this is > > trivial. > > Sorry Daniel, but your suggestions would do nothing more than uglify > the code and design. Pretty much exactly what was said about shrinking struct page, ask Bill. The difference was, shrinking struct page actually mattered whereas shrinking struct bio does not, and neither does expanding it by a few bytes. Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Monday 13 August 2007 02:18, Evgeniy Polyakov wrote: > On Mon, Aug 13, 2007 at 02:08:57AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: > > > But that idea fails as well, since reference counts and IO > > > completion are two completely seperate entities. So unless end IO > > > just happens to be the last user holding a reference to the bio, > > > you cannot free it. > > > > That is not a problem. When bio_put hits zero it calls ->endio > > instead of the destructor. The ->endio sees that the count is zero > > and destroys the bio. > > This is not a very good solution, since it requires all users of the > bios to know how to free it. No, only the specific ->endio needs to know that, which is set by the bio owner, so this knowledge lies in exactly the right place. A small handful of generic endios all with the same destructor are used nearly everywhere. > Right now it is hidden. > And adds additional atomic check (although reading is quite fast) in > the end_io. Actual endio happens once in the lifetime of the transfer, this read will be entirely lost in the noise. > And for what purpose? To eat 8 bytes on 64bit platform? > This will not reduce its size noticebly, so the same number of bios > will be in the cache's page, so what is a gain? All this cleanups and > logic complicatins should be performed only if after size shring > increased number of bios can fit into cache's page, will it be done > after such cleanups? Well, exactly. My point from the beginning was that the size of struct bio is not even close to being a problem and adding a few bytes to it in the interest of doing the cleanest fix to a core kernel bug is just not a dominant issue. I suppose that leaving out the word "bloated" and skipping straight to the "doesn't matter" proof would have saved some bandwidth.
Regards, Daniel
Re: Distributed storage.
On Monday 13 August 2007 02:13, Jens Axboe wrote: > On Mon, Aug 13 2007, Daniel Phillips wrote: > > On Monday 13 August 2007 00:45, Jens Axboe wrote: > > > On Mon, Aug 13 2007, Jens Axboe wrote: > > > > > You did not comment on the one about putting the bio > > > > > destructor in the ->endio handler, which looks dead simple. > > > > > The majority of cases just use the default endio handler and > > > > > the default destructor. Of the remaining cases, where a > > > > > specialized destructor is needed, typically a specialized > > > > > endio handler is too, so combining is free. There are few if > > > > > any cases where a new specialized endio handler would need to > > > > > be written. > > > > > > > > We could do that without too much work, I agree. > > > > > > But that idea fails as well, since reference counts and IO > > > completion are two completely seperate entities. So unless end IO > > > just happens to be the last user holding a reference to the bio, > > > you cannot free it. > > > > That is not a problem. When bio_put hits zero it calls ->endio > > instead of the destructor. The ->endio sees that the count is zero > > and destroys the bio. > > You can't be serious? You'd stall end io completion notification > because someone holds a reference to a bio. Of course not. Nothing I said stops endio from being called in the usual way as well. For this to work, endio just needs to know that one call means "end" and the other means "destroy", this is trivial. Regards, Daniel
Re: Distributed storage.
On Monday 13 August 2007 00:45, Jens Axboe wrote: > On Mon, Aug 13 2007, Jens Axboe wrote: > > > You did not comment on the one about putting the bio destructor > > > in the ->endio handler, which looks dead simple. The majority of > > > cases just use the default endio handler and the default > > > destructor. Of the remaining cases, where a specialized > > > destructor is needed, typically a specialized endio handler is > > > too, so combining is free. There are few if any cases where a > > > new specialized endio handler would need to be written. > > > > We could do that without too much work, I agree. > > But that idea fails as well, since reference counts and IO completion > are two completely seperate entities. So unless end IO just happens > to be the last user holding a reference to the bio, you cannot free > it. That is not a problem. When bio_put hits zero it calls ->endio instead of the destructor. The ->endio sees that the count is zero and destroys the bio. Regards, Daniel
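The scheme under discussion can be modeled in a few lines of userspace C. This is only a sketch of the idea as stated, with hypothetical names, not the real bio API: bio_put re-enters ->endio when the reference count reaches zero, and the handler tells a completion call apart from a destroy call by inspecting the count.

```c
#include <assert.h>

/* Userspace model of the proposal (not actual kernel code, names are
 * hypothetical): bio_put() re-enters ->endio when the reference count
 * reaches zero, and the handler distinguishes the two calls by
 * inspecting the count. */
struct model_bio {
    int refcount;
    int completed;                 /* set by the "end" call */
    int destroyed;                 /* set by the "destroy" call */
    void (*endio)(struct model_bio *);
};

static void default_endio(struct model_bio *bio)
{
    if (bio->refcount == 0)
        bio->destroyed = 1;        /* last put: act as the destructor */
    else
        bio->completed = 1;        /* normal completion notification */
}

/* IO completion: notify immediately, even if references remain */
static void model_bio_endio(struct model_bio *bio)
{
    bio->endio(bio);
}

/* Drop a reference; the final put calls ->endio instead of a destructor */
static void model_bio_put(struct model_bio *bio)
{
    if (--bio->refcount == 0)
        bio->endio(bio);
}
```

Note how this answers the "stall end io notification" objection: completion is delivered by model_bio_endio regardless of outstanding references; only the destroy step waits for the last put.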
Re: Distributed storage.
On Monday 13 August 2007 00:28, Jens Axboe wrote: > On Sun, Aug 12 2007, Daniel Phillips wrote: > > Right, that is done by bi_vcnt. I meant bi_max_vecs, which you can > > derive efficiently from BIO_POOL_IDX() provided the bio was > > allocated in the standard way. > > That would only be feasible, if we ruled that any bio in the system > must originate from the standard pools. Not at all. > > This leaves a little bit of clean up to do for bios not allocated > > from a standard pool. > > Please suggest how to do such a cleanup. Easy, use the BIO_POOL bits to know the bi_max_vecs, the same as for a bio from the standard pool. Just put the power of two size in the bits and map that number to the standard pool arrangement with a table lookup. > > On the other hand, vm writeout deadlock ranks smack dab at the top > > of the list, so that is where the patching effort must go for the > > forseeable future. Without bio throttling, the ddsnap load can go > > to 24 MB for struct bio alone. That definitely moves the needle. > > in short, we save 3,200 times more memory by putting decent > > throttling in place than by saving an int in struct bio. > > Then fix the damn vm writeout. I always thought it was silly to > depend on the block layer for any sort of throttling. If it's not a > system wide problem, then throttle the io count in the > make_request_fn handler of that problematic driver. It is a system wide problem. Every block device needs throttling, otherwise queues expand without limit. Currently, block devices that use the standard request library get a slipshod form of throttling for free in the form of limiting in-flight request structs. Because the amount of IO carried by a single request can vary by two orders of magnitude, the system behavior of this approach is far from predictable. > > You did not comment on the one about putting the bio destructor in > > the ->endio handler, which looks dead simple.
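The table-lookup idea can be sketched as follows. The bit layout and pool sizes here mirror the 2.6-era standard bvec pools but are illustrative assumptions, not exact kernel definitions, and all names are hypothetical.

```c
#include <assert.h>

/* Sketch of the suggestion above: recover a bio's bvec capacity from
 * the mempool index kept in high flag bits, via a table matching the
 * standard pool sizes.  Illustrative layout, not the kernel's exact
 * definitions. */
#define MODEL_POOL_BITS   4
#define MODEL_POOL_OFFSET (sizeof(unsigned long) * 8 - MODEL_POOL_BITS)
#define MODEL_POOL_MASK   (((1UL << MODEL_POOL_BITS) - 1) << MODEL_POOL_OFFSET)

/* One entry per standard pool, smallest to largest */
static const unsigned short pool_max_vecs[] = { 1, 4, 16, 64, 128, 256 };

static unsigned model_pool_idx(unsigned long bi_flags)
{
    return (unsigned)((bi_flags & MODEL_POOL_MASK) >> MODEL_POOL_OFFSET);
}

/* bi_max_vecs without storing it in the bio: a pure table lookup */
static unsigned model_max_vecs(unsigned long bi_flags)
{
    return pool_max_vecs[model_pool_idx(bi_flags)];
}
```

A bio allocated outside the standard pools would simply have to encode its capacity class in the same bits, which is the "little bit of clean up" referred to above.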
The majority of > > cases just use the default endio handler and the default > > destructor. Of the remaining cases, where a specialized destructor > > is needed, typically a specialized endio handler is too, so > > combining is free. There are few if any cases where a new > > specialized endio handler would need to be written. > > We could do that without too much work, I agree. OK, we got one and another is close to cracking, enough of that. > > As far as code stability goes, current kernels are horribly > > unstable in a variety of contexts because of memory deadlock and > > slowdowns related to the attempt to fix the problem via dirty > > memory limits. Accurate throttling of bio traffic is one of the > > two key requirements to fix this instability, the other is > > accurate writeout path reserve management, which is only partially > > addressed by BIO_POOL. > > Which, as written above and stated many times over the years on lkml, > is not a block layer issue imho. Whoever stated that was wrong, but this should be no surprise. There have been many wrong things said about this particular bug over the years. The one thing that remains constant is, Linux continues to deadlock under a variety of loads both with and without network involvement, making it effectively useless as a storage platform. These deadlocks are, first and foremost, block layer deficiencies. Even the network becomes part of the problem only because it lies in the block IO path. Regards, Daniel
Re: Block device throttling [Re: Distributed storage.]
On Sunday 12 August 2007 22:36, I wrote: > Note! There are two more issues I forgot to mention earlier. Oops, and there is also: 3) The bio throttle, which is supposed to prevent deadlock, can itself deadlock. Let me see if I can remember how it goes. * generic_make_request puts a bio in flight * the bio gets past the throttle and initiates network IO * net calls sk_alloc->alloc_pages->shrink_caches * shrink_caches submits a bio recursively to our block device * this bio blocks on the throttle * net may never get the memory it needs, and we are wedged I need to review a backtrace to get this precisely right, however you can see the danger. In ddsnap we kludge around this problem by not throttling any bio submitted in PF_MEMALLOC mode, which effectively increases our reserve requirement by the amount of IO that mm will submit to a given block device before deciding the device is congested and should be left alone. This works, but is sloppy and disgusting. The right thing to do is to make sure that the mm knows about our throttle accounting in backing_dev_info so it will not push IO to our device when it knows that the IO will just block on congestion. Instead, shrink_caches will find some other less congested block device or give up, causing alloc_pages to draw from the memalloc reserve to satisfy the sk_alloc request. The mm already uses backing_dev_info this way; we just need to set the right bits in the backing_dev_info state flags. I think Peter posted a patch set that included this feature at some point. Regards, Daniel
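The PF_MEMALLOC kludge described above reduces to a one-line test on the submit path. Here is a userspace model with hypothetical names and a simplified throttle that reports "must wait" instead of sleeping; the flag value matches the kernel's PF_MEMALLOC of that era, but treat it as an assumption.

```c
#include <assert.h>

#define PF_MEMALLOC 0x00000800  /* task is in memory-reclaim context */

/* Simplified per-device throttle: in the kernel the submitter would
 * sleep instead of receiving 0.  Names are hypothetical. */
struct throttle {
    int inflight;
    int limit;
};

/* Returns 1 if the bio may proceed, 0 if the submitter must wait. */
static int throttle_may_submit(struct throttle *t, unsigned task_flags)
{
    if (task_flags & PF_MEMALLOC)
        return 1;              /* reclaim path: never block, draw on reserve */
    if (t->inflight >= t->limit)
        return 0;              /* ordinary submitter waits for completions */
    t->inflight++;
    return 1;
}
```

The cost of the bypass is exactly what the text says: every bio that reclaim can submit unthrottled must be covered by extra reserve, which is why this is a kludge rather than a fix.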
Re: Block device throttling [Re: Distributed storage.]
(previous incomplete message sent accidentally) On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote: > On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe wrote: > > So, what did we decide? To bloat bio a bit (add a queue pointer) or > to use physical device limits? The latter requires to replace all > occurence of bio->bi_bdev = something_new with blk_set_bdev(bio, > somthing_new), where queue limits will be appropriately charged. So > far I'm testing second case, but I only changed DST for testing, can > change all other users if needed though. Adding a queue pointer to struct bio and using physical device limits as in your posted patch both suffer from the same problem: you release the throttling on the previous queue when the bio moves to a new one, which is a bug because memory consumption on the previous queue then becomes unbounded, or limited only by the number of struct requests that can be allocated. In other words, it reverts to the same situation we have now as soon as the IO stack has more than one queue. (Just a shorter version of my previous post.) We can solve this by having the bio only point at the queue to which it was originally submitted, since throttling the top level queue automatically throttles all queues lower down the stack. Alternatively the bio can point at the block_device or straight at the backing_dev_info, which is the per-device structure it actually needs to touch. Note! There are two more issues I forgot to mention earlier. 1) One throttle count per submitted bio is too crude a measure. A bio can carry as few as one page or as many as 256 pages. If you take only one throttle count per bio and that data will be transferred over the network then you have to assume that (a little more than) 256 pages of sk_alloc reserve will be needed for every bio, resulting in a grossly over-provisioned reserve. 
The precise reserve calculation we want to do is per-block device, and you will find hooks like this already living in backing_dev_info. We need to place our own fn+data there to calculate the throttle draw for each bio. Unthrottling gets trickier with variable size throttle draw. In ddsnap, we simply write the amount we drew from the throttle into (the private data of) bio for use later by unthrottle, thus avoiding the issue that the bio fields we used to calculate the draw might have changed during the lifetime of the bio. This would translate into one more per-bio field. 2) Exposing the per-block device throttle limits via sysfs or similar is really not a good long term solution for system administration. Imagine our help text: "just keep trying smaller numbers until your system deadlocks". We really need to figure this out internally and get it correct. I can see putting in a temporary userspace interface just for experimentation, to help determine what really is safe, and what size the numbers should be to approach optimal throughput in a fully loaded memory state. Regards, Daniel
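The variable-size draw with a recorded charge can be sketched as follows. This is a userspace model with hypothetical names; the real version would sleep until units come free rather than fail, and would keep the recorded draw in the bio's private data.

```c
#include <assert.h>

/* Model of variable-size throttle draw: charge one unit per page the
 * bio carries, and remember the charge in the bio so that unthrottle
 * returns exactly what was drawn even if bio fields change in flight.
 * Hypothetical names, not kernel code. */
struct tbio {
    unsigned pages;            /* pages this bio carries (may be rewritten) */
    unsigned drawn;            /* private record of the throttle draw */
};

struct tdev {
    unsigned avail;            /* per-device throttle budget, in pages */
};

static int throttle_charge(struct tdev *d, struct tbio *b)
{
    if (b->pages > d->avail)
        return -1;             /* caller would sleep until units are released */
    d->avail -= b->pages;
    b->drawn = b->pages;       /* remember the draw for unthrottle */
    return 0;
}

static void throttle_release(struct tdev *d, struct tbio *b)
{
    d->avail += b->drawn;      /* use the recorded draw, not current fields */
    b->drawn = 0;
}
```

Charging per page rather than per bio is what keeps the sk_alloc reserve calculation tight: a one-page bio no longer has to be provisioned as if it carried 256 pages.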
Re: Block device throttling [Re: Distributed storage.]
On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote: > On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: > > So, what did we decide? To bloat bio a bit (add a queue pointer) or > to use physical device limits? The latter requires to replace all > occurence of bio->bi_bdev = something_new with blk_set_bdev(bio, > somthing_new), where queue limits will be appropriately charged. So > far I'm testing second case, but I only changed DST for testing, can > change all other users if needed though. Adding a queue pointer to struct bio and using physical device limits as in your posted patch both suffer from the same problem: you release the throttling on the previous queue when the bio moves to a new one, which is a bug because memory consumption on the previous queue then becomes unbounded, or limited only by the number of struct requests that can be allocated. In other words, it reverts to the same situation we have now as soon as the IO stack has more than one queue. (Just a shorter version of my previous post.) We can solve this by having the bio only point at the queue to which it was originally submitted, since throttling the top level queue automatically throttles all queues lower down the stack. Alternatively the bio can point at the block_device or straight at the backing_dev_info, which is the per-device structure it actually needs to touch. Note! There are two more issues I forgot to mention earlier. 1) One throttle count per submitted bio is too crude a measure. A bio can carry as few as one page or as many as 256 pages. If you take only one throttle count per bio and that data will be transferred over the network then you have to assume that (a little more than) 256 pages of sk_alloc reserve will be needed for every bio, resulting in a grossly over-provisioned reserve. The precise reserve calculation we want to do is per-block device, and you will find hooks like this already living in backing_dev_info. 
We need to place our own fn+data there to calculate the throttle draw for each bio. Unthrottling gets trickier with variable size throttle draw. In ddsnap, we simply write the amount we drew from the throttle into (the private data of) bio for use later by unthrottle, thus avoiding the issue that the bio fields we used to calculate the draw might have changed during the lifetime of the bio. This would translate into one more per-bio field. The throttling performs another function: keeping a reasonable amount of IO in flight for the device. The definition of "reasonable" is complex. For a hard disk it depends on the physical distance between sector addresses of the bios in flight. In ddsnap we make a crude but workable approximation of this. Regards, Daniel
Re: Distributed storage.
On Tuesday 07 August 2007 13:55, Jens Axboe wrote: > I don't like structure bloat, but I do like nice design. Overloading > is a necessary evil sometimes, though. Even today, there isn't enough > room to hold bi_rw and bi_flags in the same variable on 32-bit archs, > so that concern can be scratched. If you read bio.h, that much is > obvious. Sixteen bits in bi_rw are consumed by queue priority. Is there a reason this lives in struct bio instead of struct request? > If you check up on the iommu virtual merging, you'll understand the > front and back size members. They may smell dubious to you, but > please take the time to understand why it looks the way it does. Virtual merging is only needed at the physical device, so why do these fields live in struct bio instead of struct request? > Changing the number of bvecs is integral to how bio buildup current > works. Right, that is done by bi_vcnt. I meant bi_max_vecs, which you can derive efficiently from BIO_POOL_IDX() provided the bio was allocated in the standard way. This leaves a little bit of clean up to do for bios not allocated from a standard pool. Incidentally, why does the bvl need to be memset to zero on allocation? bi_vcnt already tells you which bvecs are valid and the only field in a bvec that can reasonably default to zero is the offset, which ought to be set every time a bvec is initialized anyway. > > bi_destructor could be combined. I don't see a lot of users of > > bi_idx, > > bi_idx is integral to partial io completions. Struct request has a remaining submission sector count so what does bi_idx do that is different? > > that looks like a soft target. See what happened to struct page > > when a couple of folks got serious about attacking it, some really > > deep hacks were done to pare off a few bytes here and there. But > > struct bio as a space waster is not nearly in the same ballpark.
> > So show some concrete patches and examples, hand waving and > assumptions is just a waste of everyones time. Average struct bio memory footprint ranks near the bottom of the list of things that suck most about Linux storage. At idle I see 8K in use (reserves); during updatedb it spikes occasionally to 50K; under a heavy load generated by ddsnap on a storage box it sometimes goes to 100K with bio throttling in place. Really not moving the needle. On the other hand, vm writeout deadlock ranks smack dab at the top of the list, so that is where the patching effort must go for the foreseeable future. Without bio throttling, the ddsnap load can go to 24 MB for struct bio alone. That definitely moves the needle. In short, we save 3,200 times more memory by putting decent throttling in place than by saving an int in struct bio. That said, I did a little analysis to get an idea of where the soft targets are in struct bio, and to get to know the bio layer a little better. Maybe these few hints will get somebody interested enough to look further. > > It would be interesting to see if bi_bdev could be made read only. > > Generally, each stage in the block device stack knows what the next > > stage is going to be, so why do we have to write that in the bio? > > For error reporting from interrupt context? Anyway, if Evgeniy > > wants to do the patch, I will happily unload the task of convincing > > you that random fields are/are not needed in struct bio :-) > > It's a trade off, otherwise you'd have to pass the block device > around a lot. Which costs very little, probably less than trashing an extra field's worth of cache. > And it's, again, a design issue. A bio contains > destination information, that means device/offset/size information. > I'm all for shaving structure bytes where it matters, but not for the > sake of sacrificing code stability or design. I consider struct bio > quite lean and have worked hard to keep it that way.
In fact, iirc, > the only addition to struct bio since 2001 is the iommu front/back > size members. And I resisted those for quite a while. You did not comment on the one about putting the bio destructor in the ->endio handler, which looks dead simple. The majority of cases just use the default endio handler and the default destructor. Of the remaining cases, where a specialized destructor is needed, typically a specialized endio handler is too, so combining is free. There are few if any cases where a new specialized endio handler would need to be written. As far as code stability goes, current kernels are horribly unstable in a variety of contexts because of memory deadlock and slowdowns related to the attempt to fix the problem via dirty memory limits. Accurate throttling of bio traffic is one of the two key requirements to fix this instability, the other is accurate writeout path reserve management, which is only partially addressed by BIO_POOL. Nice to see you jumping in Jens. Now it is over to the other side of the thread where Evgeniy has posted a
Re: [1/1] Block device throttling [Re: Distributed storage.]
Hi Evgeniy, Sorry for not getting back to you right away, I was on the road with limited email access. Incidentally, the reason my mails to you keep bouncing is, your MTA is picky about my mailer's IP reversing to a real hostname. I will take care of that pretty soon, but for now my direct mail to you is going to bounce and you will only see the lkml copy. On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote: > This throttling mechanism allows to limit maximum amount of queued > bios per physical device. By default it is turned off and old block > layer behaviour with unlimited number of bios is used. When turned on > (queue limit is set to something different than -1U via > blk_set_queue_limit()), generic_make_request() will sleep until there > is room in the queue. number of bios is increased in > generic_make_request() and reduced either in bio_endio(), when bio is > completely processed (bi_size is zero), and recharged from original > queue when new device is assigned to bio via blk_set_bdev(). All > oerations are not atomic, since we do not care about precise number > of bios, but a fact, that we are close or close enough to the limit. > > Tested on distributed storage device - with limit of 2 bios it works > slow :) it seems to me you need: - if (q) { + if (q && q->bio_limit != -1) { This patch is short and simple, and will throttle more accurately than the current simplistic per-request allocation limit. However, it fails to throttle device mapper devices. This is because no request is allocated by the device mapper queue method, instead the mapping call goes straight through to the mapping function. If the mapping function allocates memory (typically the case) then this resource usage evades throttling and deadlock becomes a risk. 
There are three obvious fixes: 1) Implement bio throttling in each virtual block device 2) Implement bio throttling generically in device mapper 3) Implement bio throttling for all block devices Number 1 is the approach we currently use in ddsnap, but it is ugly and repetitious. Number 2 is a possibility, but I favor number 3 because it is a system-wide solution to a system-wide problem, does not need to be repeated for every block device that lacks a queue, heads in the direction of code subtraction, and allows system-wide reserve accounting. Your patch is close to the truth, but it needs to throttle at the top (virtual) end of each block device stack instead of the bottom (physical) end. It does head in the direction of eliminating your own deadlock risk indeed; however, there are block devices it does not cover. Regards, Daniel
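The difference between throttling at the top and at the bottom of the stack comes down to where the charge is taken and released. A minimal model of the top-end scheme argued for here, with hypothetical names: the charge is taken once against the queue of original submission and survives remapping down the stack, so memory on the upper queues stays bounded.

```c
#include <assert.h>

/* Userspace model: one throttle charge per bio lifetime, taken against
 * the top-level queue.  Remapping to a lower queue neither releases
 * nor recharges, unlike the per-queue scheme under discussion.
 * Hypothetical names, not kernel code. */
struct mqueue {
    int inflight;
};

struct mbio {
    struct mqueue *submitted_to;   /* queue of original submission */
};

static void model_submit(struct mbio *bio, struct mqueue *q)
{
    bio->submitted_to = q;         /* remember the top-level queue */
    q->inflight++;                 /* one charge for the whole lifetime */
}

static void model_remap(struct mbio *bio, struct mqueue *lower)
{
    (void)bio;
    (void)lower;                   /* moving down the stack: no recharge */
}

static void model_endio(struct mbio *bio)
{
    bio->submitted_to->inflight--; /* release against the original queue */
}
```

Throttling the top-level queue automatically throttles everything below it, which is why a single charge suffices.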
Re: Distributed storage.
On Tuesday 07 August 2007 05:05, Jens Axboe wrote: > On Sun, Aug 05 2007, Daniel Phillips wrote: > > A simple way to solve the stable accounting field issue is to add a > > new pointer to struct bio that is owned by the top level submitter > > (normally generic_make_request but not always) and is not affected > > by any recursive resubmission. Then getting rid of that field > > later becomes somebody's summer project, which is not all that > > urgent because struct bio is already bloated up with a bunch of > > dubious fields and is a transient structure anyway. > > Thanks for your insights. Care to detail what bloat and dubious > fields struct bio has? First obvious one I see is bi_rw separate from bi_flags. Front_size and back_size smell dubious. Is max_vecs really necessary? You could reasonably assume bi_vcnt rounded up to a power of two and bury the details of making that work behind wrapper functions to change the number of bvecs, if anybody actually needs that. Bi_endio and bi_destructor could be combined. I don't see a lot of users of bi_idx, that looks like a soft target. See what happened to struct page when a couple of folks got serious about attacking it, some really deep hacks were done to pare off a few bytes here and there. But struct bio as a space waster is not nearly in the same ballpark. It would be interesting to see if bi_bdev could be made read only. Generally, each stage in the block device stack knows what the next stage is going to be, so why do we have to write that in the bio? For error reporting from interrupt context? Anyway, if Evgeniy wants to do the patch, I will happily unload the task of convincing you that random fields are/are not needed in struct bio :-) Regards, Daniel
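The power-of-two idea for dropping bi_max_vecs can be sketched as a pair of wrappers. These helpers are hypothetical illustrations, not the real bio API: capacity is never stored, it is implied by bi_vcnt rounded up to the next power of two.

```c
#include <assert.h>

/* Sketch of the suggestion above: drop the stored capacity field and
 * let wrappers assume the bvec array capacity is bi_vcnt rounded up to
 * a power of two.  Hypothetical helpers, not the real bio API. */
static unsigned bvec_capacity(unsigned bi_vcnt)
{
    unsigned cap = 1;
    while (cap < bi_vcnt)
        cap <<= 1;             /* round up to the next power of two */
    return cap;
}

/* Room for one more bvec without growing?  Growing past the implied
 * capacity would mean reallocating to the next power of two; that
 * detail stays hidden behind the wrapper. */
static int bvec_room_left(unsigned bi_vcnt)
{
    return bvec_capacity(bi_vcnt + 1) == bvec_capacity(bi_vcnt);
}
```

The trade-off is clear from the model: an int per bio is saved at the cost of occasional over-allocation of bvec arrays and a reallocation when a bio grows past a power-of-two boundary.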
Re: Distributed storage.
On Sunday 05 August 2007 08:01, Evgeniy Polyakov wrote: > On Sun, Aug 05, 2007 at 01:06:58AM -0700, Daniel Phillips wrote: > > > DST original code worked as device mapper plugin too, but its two > > > additional allocations (io and clone) per block request ended up > > > for me as a show stopper. > > > > Ah, sorry, I misread. A show stopper in terms of efficiency, or in > > terms of deadlock? > > At least as in terms of efficiency. Device mapper lives in happy > world where memory does not end and allocations are fast. Are you saying that things are different for a network block device because it needs to do GFP_ATOMIC allocations? If so then that is just a misunderstanding. The global page reserve Peter and I use is available in interrupt context just like GFP_ATOMIC. Regards, Daniel
Re: Distributed storage.
On Sunday 05 August 2007 08:08, Evgeniy Polyakov wrote: > If we are sleeping in memory pool, then we already do not have memory > to complete previous requests, so we are in trouble. Not at all. Any requests in flight are guaranteed to get the resources they need to complete. This is guaranteed by the combination of memory reserve management and request queue throttling. In logical terms, reserve management plus queue throttling is necessary and sufficient to prevent these deadlocks. Conversely, the absence of either one allows deadlock. > This can work > for devices which do not require additional allocations (like usual > local storage), but not for network connected ones. It works for network devices too, and also for a fancy device like ddsnap, which is the moral equivalent of a filesystem implemented in user space. > If not in device, then at least it should say to block layer about > its limits. What about new function to register queue... Yes, a new internal API is needed eventually. However, no new api is needed right at the moment because we can just hard code the reserve sizes and queue limits and audit them by hand, which is not any more sloppy than several other kernel subsystems. The thing is, we need to keep any obfuscating detail out of the initial patches because these principles are hard enough to explain already without burying them in hundreds of lines of API fluff. That said, the new improved API should probably not be a new way to register, but a set of function calls you can use after the queue is created, which follows the pattern of the existing queue API. > ...which will get > maximum number of bios in flight and sleep in generic_make_request() > when new bio is going to be submitted and it is about to exceed the > limit? Exactly. This is what ddsnap currently does and it works. 
But we did not change generic_make_request for this driver, instead we throttled the driver from the time it makes a request to its user space server, until the reply comes back. We did it that way because it was easy and was the only segment of the request lifeline that could not be fixed by other means. A proper solution for all block devices will move the throttling up into generic_make_request, as you say below. > By default things will be like they are now, except additional > non-atomic increment and branch in generic_make_request() and > decrement and wake in bio_end_io()? ->endio is called in interrupt context, so the accounting needs to be atomic as far as I can see. We actually account the total number of bio pages in flight, otherwise you would need to assume the largest possible bio and waste a huge amount of reserve memory. A counting semaphore works fine for this purpose, with some slight inefficiency that is nigh on unmeasurable in the block IO path. What the semaphore does is make the patch small and easy to understand, which is important at this point. > I can cook up such a patch if idea worth efforts. It is. There are some messy details... You need a place to store the accounting variable/semaphore and need to be able to find that place again in ->endio. Trickier than it sounds, because of the unstructured way drivers rewrite ->bi_bdev. Peterz has already poked at this in a number of different ways, typically involving backing_dev_info, which seems like a good idea to me. A simple way to solve the stable accounting field issue is to add a new pointer to struct bio that is owned by the top level submitter (normally generic_make_request but not always) and is not affected by any recursive resubmission. Then getting rid of that field later becomes somebody's summer project, which is not all that urgent because struct bio is already bloated up with a bunch of dubious fields and is a transient structure anyway. 
Regards, Daniel
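The atomicity point made above — that ->endio runs in interrupt context, so the in-flight page accounting must be atomic — can be illustrated with C11 atomics standing in for the kernel's atomic_t. A userspace sketch with hypothetical names, not kernel code:

```c
#include <assert.h>
#include <stdatomic.h>

/* Model of the accounting described above: total bio pages in flight,
 * incremented on submit and decremented from ->endio.  C11 atomics
 * stand in for the kernel's atomic_t; a counting semaphore would add
 * the sleep/wake behaviour on top of this. */
static atomic_uint pages_in_flight;

static void account_submit(unsigned pages)
{
    atomic_fetch_add(&pages_in_flight, pages);
}

/* Safe to call from the model's "interrupt context": no locks taken */
static void account_endio(unsigned pages)
{
    atomic_fetch_sub(&pages_in_flight, pages);
}
```

Counting pages rather than bios is what avoids provisioning the reserve for the largest possible bio on every submission, as the text explains.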
Re: Distributed storage.
On Saturday 04 August 2007 09:44, Evgeniy Polyakov wrote: > > On Tuesday 31 July 2007 10:13, Evgeniy Polyakov wrote: > > > * storage can be formed on top of remote nodes and be > > > exported simultaneously (iSCSI is peer-to-peer only, NBD requires > > > device mapper and is synchronous) > > > > In fact, NBD has nothing to do with device mapper. I use it as a > > physical target underneath ddraid (a device mapper plugin) just > > like I would use your DST if it proves out. > > I meant to create a storage on top of several nodes one needs to have > device mapper or something like that on top of NBD itself. To further > export resulted device one needs another userspace NDB application > and so on. DST simplifies that greatly. > > DST original code worked as device mapper plugin too, but its two > additional allocations (io and clone) per block request ended up for > me as a show stopper. Ah, sorry, I misread. A show stopper in terms of efficiency, or in terms of deadlock? Regards, Daniel
Re: Distributed storage.
On Saturday 04 August 2007 09:37, Evgeniy Polyakov wrote: > On Fri, Aug 03, 2007 at 06:19:16PM -0700, I wrote: > > To be sure, I am not very proud of this throttling mechanism for > > various reasons, but the thing is, _any_ throttling mechanism no > > matter how sucky solves the deadlock problem. Over time I want to > > move the > > make_request_fn is always called in process context, Yes, as is submit_bio which calls it. The decision re where it is best to throttle, in submit_bio or in make_request_fn, has more to do with system factoring, that is, is throttling something that _every_ block device should have (yes I think) or is it a delicate, optional thing that needs a tweakable algorithm per block device type (no I think). The big worry I had was that by blocking on congestion in the submit_bio/make_request_fn I might stuff up system-wide mm writeout. But a while ago that part of the mm was tweaked (by Andrew if I recall correctly) to use a pool of writeout threads and understand the concept of one of them blocking on some block device, and not submit more writeout to the same block device until the first thread finishes its submission. Meanwhile, other mm writeout threads carry on with other block devices. > we can wait in it for memory in mempool. Although that means we > already in trouble. Not at all. This whole block writeout path needs to be written to run efficiently even when normal system memory is completely gone. All it means when we wait on a mempool is that the block device queue is as full as we are ever going to let it become, and that means the block device is working as hard as it can (subject to a small caveat: for some loads a device can work more efficiently if it can queue up larger numbers of requests down at the physical elevators). By the way, ddsnap waits on a counting semaphore, not a mempool. That is because we draw our reserve memory from the global memalloc reserve, not from a mempool. 
And that is not only because it takes less code to do so, but mainly because global pools as opposed to lots of little special purpose pools seem like a good idea to me. Though I will admit that with our current scheme we need to allow for the total of the maximum reserve requirements for all memalloc users in the memalloc pool, so it does not actually save any memory vs dedicated pools. We could improve that if we wanted to, by having hard and soft reserve requirements: the global reserve actually only needs to be as big as the total of the hard requirements. With this idea, if by some unlucky accident every single pool user got itself maxed out at the same time, we would still not exceed our share of the global reserve. Under "normal" low memory situations, a block device would typically be free to grab reserve memory up to its soft limit, allowing it to optimize over a wider range of queued transactions. My little idea here is: allocating specific pages to a pool is kind of dumb, all we really want to do is account precisely for the number of pages we are allowed to draw from the global reserve. OK, I kind of digressed, but this all counts as explaining the details of what Peter and I have been up to for the last year (longer for me). At this point, we don't need to do the reserve accounting in the most absolutely perfect way possible, we just need to get something minimal in place to fix the current deadlock problems, then we can iteratively improve it. > I agree, any kind of high-boundary leveling must be implemented in > device itself, since block layer does not know what device is at the > end and what it will need to process given block request. I did not say the throttling has to be implemented in the device, only that we did it there because it was easiest to code that up and try it out (it worked). This throttling really wants to live at a higher level, possibly submit_bio()...bio->endio(). Someone at OLS (James Bottomley?) 
suggested it would be better done at the request queue layer, but I do not immediately see why that should be. I guess this is going to come down to somebody throwing out a patch for interested folks to poke at. But this detail is a fine point. The big point is to have _some_ throttling mechanism in place on the block IO path, always. Device mapper in particular does not have any throttling itself: calling submit_bio on a device mapper device directly calls the device mapper bio dispatcher. Default-initialized block device queues do provide a crude form of throttling based on limiting the number of requests. This is insufficiently precise to do a good job in the long run, but it works for now because the current gaggle of low level block drivers do not have a lot of resource requirements and tend to behave fairly predictably (except for some irritating issues re very slow devices working in parallel with very fast devices, but... worry about that later). Network block driv
Re: Distributed storage.
On Friday 03 August 2007 03:26, Evgeniy Polyakov wrote: > On Thu, Aug 02, 2007 at 02:08:24PM -0700, I wrote: > > I see bits that worry me, e.g.: > > > > + req = mempool_alloc(st->w->req_pool, GFP_NOIO); > > > > which seems to be callable in response to a local request, just the > > case where NBD deadlocks. Your mempool strategy can work reliably > > only if you can prove that the pool allocations of the maximum > > number of requests you can have in flight do not exceed the size of > > the pool. In other words, if you ever take the pool's fallback > > path to normal allocation, you risk deadlock. > > mempool should be allocated to be able to catch up with maximum > in-flight requests, in my tests I was unable to force block layer to > put more than 31 pages in sync, but in one bio. Each request is > essentially delayed bio processing, so this must handle maximum > number of in-flight bios (if they do not cover multiple nodes, if > they do, then each node requires own request). It depends on the characteristics of the physical and virtual block devices involved. Slow block devices can produce surprising effects. Ddsnap still qualifies as "slow" under certain circumstances (big linear write immediately following a new snapshot). Before we added throttling we would see as many as 800,000 bios in flight. Nice to know the system can actually survive this... mostly. But memory deadlock is a clear and present danger under those conditions and we did hit it (not to mention that read latency sucked beyond belief). Anyway, we added a simple counting semaphore to throttle the bio traffic to a reasonable number and behavior became much nicer, but most importantly, this satisfies one of the primary requirements for avoiding block device memory deadlock: a strictly bounded amount of bio traffic in flight.
In fact, we allow some bounded number of non-memalloc bios *plus* however much traffic the mm wants to throw at us in memalloc mode, on the assumption that the mm knows what it is doing and imposes its own bound of in flight bios per device. This needs auditing obviously, but the mm either does that or is buggy. In practice, with this throttling in place we never saw more than 2,000 in flight no matter how hard we hit it, which is about the number we were aiming at. Since we draw our reserve from the main memalloc pool, we can easily handle 2,000 bios in flight, even under extreme conditions. See: http://zumastor.googlecode.com/svn/trunk/ddsnap/kernel/dm-ddsnap.c down(&info->throttle_sem); To be sure, I am not very proud of this throttling mechanism for various reasons, but the thing is, _any_ throttling mechanism no matter how sucky solves the deadlock problem. Over time I want to move the throttling up into bio submission proper, or perhaps incorporate it in device mapper's queue function, not quite as high up the food chain. Only some stupid little logistical issues stopped me from doing it one of those ways right from the start. I think Peter has also tried some things in this area. Anyway, that part is not pressing because the throttling can be done in the virtual device itself as we do it, even if it is not very pretty there. The point is: you have to throttle the bio traffic. The alternative is to die a horrible death under conditions that may be rare, but _will_ hit somebody. Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
Hi Mike, On Thursday 02 August 2007 21:09, Mike Snitzer wrote: > But NBD's synchronous nature is actually an asset when coupled with > MD raid1 as it provides guarantees that the data has _really_ been > mirrored remotely. And bio completion doesn't? Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
Hi Evgeniy, Nit alert: On Tuesday 31 July 2007 10:13, Evgeniy Polyakov wrote: > * storage can be formed on top of remote nodes and be exported > simultaneously (iSCSI is peer-to-peer only, NBD requires device > mapper and is synchronous) In fact, NBD has nothing to do with device mapper. I use it as a physical target underneath ddraid (a device mapper plugin) just like I would use your DST if it proves out. Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Friday 03 August 2007 07:53, Peter Zijlstra wrote: > On Fri, 2007-08-03 at 17:49 +0400, Evgeniy Polyakov wrote: > > On Fri, Aug 03, 2007 at 02:27:52PM +0200, Peter Zijlstra wrote: > > ...my main position is to > > allocate per socket reserve from socket's queue, and copy data > > there from main reserve, all of which are allocated either in > > advance (global one) or per sockoption, so that there would be no > > fairness issues what to mark as special and what to not. > > > > Say we have a page per socket, each socket can assign a reserve for > > itself from own memory, this accounts both tx and rx side. Tx is > > not interesting, it is simple, rx has global reserve (always > > allocated on startup or sometime way before reclaim/oom) where data > > is originally received (including skb, shared info and whatever is > > needed, page is just an example), then it is copied into per-socket > > reserve and reused for the next packet. Having per-socket reserve > > allows to have progress in any situation not only in cases where > > single action must be received/processed, and allows to be > > completely fair for all users, but not only special sockets, thus > > admin for example would be allowed to login, ipsec would work and > > so on... > > Ah, I think I understand now. Yes this is indeed a good idea! > > It would be quite doable to implement this on top of that I already > have. We would need to extend the socket with a sock_opt that would > reserve a specified amount of data for that specific socket. And then > on socket demux check if the socket has a non zero reserve and has > not yet exceeded said reserve. If so, process the packet. > > This would also quite neatly work for -rt where we would not want > incoming packet processing to be delayed by memory allocations. At this point we need "anything that works" in mainline as a starting point.
By erring on the side of simplicity we can make this understandable for folks who haven't spent the last two years wallowing in it. The page per socket approach is about as simple as it gets. I therefore propose we save our premature optimizations for later. It will also help our cause if we keep any new internal APIs to strictly what is needed to make deadlock go away. Not a whole lot more than just the flag to mark a socket as part of the vm writeout path when you get right down to essentials. Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage.
On Friday 03 August 2007 06:49, Evgeniy Polyakov wrote: > ...rx has global reserve (always allocated on > startup or sometime way before reclaim/oom) where data is originally > received (including skb, shared info and whatever is needed, page is > just an example), then it is copied into per-socket reserve and > reused for the next packet. Having per-socket reserve allows to have > progress in any situation not only in cases where single action must > be received/processed, and allows to be completely fair for all > users, but not only special sockets, thus admin for example would be > allowed to login, ipsec would work and so on... And when the global reserve is entirely used up your system goes back to dropping vm writeout acknowledgements, not so good. I like your approach, and specifically the copying idea cuts out considerable complexity. But I believe the per-socket flag to mark a socket as part of the vm writeout path is not optional, and in this case it will be a better world if it is a slightly unfair world in favor of vm writeout traffic. Ssh will still work fine even with vm getting priority access to the pool. During memory crunches, non-vm ssh traffic may get bumped till after the crunch, but vm writeout is never supposed to hog the whole machine. If vm writeout hogs your machine long enough to delay an ssh login then that is a vm bug and should be fixed at that level.
Re: Distributed storage.
On Tuesday 31 July 2007 10:13, Evgeniy Polyakov wrote: > Hi. > > I'm pleased to announce first release of the distributed storage > subsystem, which allows to form a storage on top of remote and local > nodes, which in turn can be exported to another storage as a node to > form tree-like storages. Excellent! This is precisely what the doctor ordered for the OCFS2-based distributed storage system I have been mumbling about for some time. In fact the dd in ddsnap and ddraid stands for "distributed data". The ddsnap/raid devices do not include an actual network transport, that is expected to be provided by a specialized block device, which up till now has been NBD. But NBD has various deficiencies as you note, in addition to its tendency to deadlock when accessed locally. Your new code base may be just the thing we always wanted. We (zumastor et al) will take it for a drive and see if anything breaks. Memory deadlock is a concern of course. From a cursory glance through, it looks like this code is pretty vm-friendly and you have thought quite a lot about it, however I respectfully invite peterz (obsessive/compulsive memory deadlock hunter) to help give it a good going over with me. I see bits that worry me, e.g.: + req = mempool_alloc(st->w->req_pool, GFP_NOIO); which seems to be callable in response to a local request, just the case where NBD deadlocks. Your mempool strategy can work reliably only if you can prove that the pool allocations of the maximum number of requests you can have in flight do not exceed the size of the pool. In other words, if you ever take the pool's fallback path to normal allocation, you risk deadlock. Anyway, if this is as grand as it seems then I would think we ought to factor out a common transfer core that can be used by all of NBD, iSCSI, ATAoE and your own kernel server, in place of the roll-yer-own code those things have now. 
Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC] extent mapped page cache
On Tuesday 10 July 2007 14:03, Chris Mason wrote: > This patch aims to demonstrate one way to replace buffer heads with a > few extent trees... Hi Chris, Quite terse commentary on algorithms and data structures, but I suppose that is not a problem because Jon has a whole week to reverse engineer it for us. What did you have in mind for subpages? Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GFS, what's remaining
On Tuesday 06 September 2005 02:55, Dmitry Torokhov wrote: > On Tuesday 06 September 2005 01:48, Daniel Phillips wrote: > > On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote: > > > do you think it is a bit premature to dismiss something even without > > > ever seeing the code? > > > > You told me you are using a dlm for a single-node application, is there > > anything more I need to know? > > I would still like to know why you consider it a "sin". On OpenVMS it is > fast, provides a way of cleaning up... Is there something hard about handling EPIPE? > and does not introduce single point > of failure as it is the case with a daemon. And if we ever want to spread > the load between 2 boxes we easily can do it. But you said it runs on an aging Alpha, surely you do not intend to expand it to two aging Alphas? And what makes you think that socket-based synchronization keeps you from spreading out the load over multiple boxes? > Why would I not want to use it? It is not the right tool for the job from what you have told me. You want to get a few bytes of information from one task to another? Use a socket, as God intended. Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GFS, what's remaining
On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote: > do you think it is a bit premature to dismiss something even without > ever seeing the code? You told me you are using a dlm for a single-node application, is there anything more I need to know? Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GFS, what's remaining
On Tuesday 06 September 2005 00:07, Dmitry Torokhov wrote: > On Monday 05 September 2005 23:02, Daniel Phillips wrote: > > By the way, you said "alpha server" not "alpha servers", was that just a > > slip? Because if you don't have a cluster then why are you using a dlm? > > No, it is not a slip. The application is running on just one node, so we > do not really use "distributed" part. However we make heavy use of the > rest of lock manager features, especially lock value blocks. Urk, so you imprinted on the clunkiest, most pathetically limited dlm feature without even having the excuse you were forced to use it. Why don't you just have a daemon that sends your values over a socket? That should be all of a day's coding. Anyway, thanks for sticking your head up, and sorry if it sounds aggressive. But you nicely supported my claim that most who think they should be using a dlm, really shouldn't. Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GFS, what's remaining
On Monday 05 September 2005 22:03, Dmitry Torokhov wrote: > On Monday 05 September 2005 19:57, Daniel Phillips wrote: > > On Monday 05 September 2005 12:18, Dmitry Torokhov wrote: > > > On Monday 05 September 2005 10:49, Daniel Phillips wrote: > > > > On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote: > > > > > On 2005-09-03T01:57:31, Daniel Phillips <[EMAIL PROTECTED]> wrote: > > > > > > The only current users of dlms are cluster filesystems. There > > > > > > are zero users of the userspace dlm api. > > > > > > > > > > That is incorrect... > > > > > > > > Application users Lars, sorry if I did not make that clear. The > > > > issue is whether we need to export an all-singing-all-dancing dlm api > > > > from kernel to userspace today, or whether we can afford to take the > > > > necessary time to get it right while application writers take their > > > > time to have a good think about whether they even need it. > > > > > > If Linux fully supported OpenVMS DLM semantics we could start thinking > > > about moving our application onto a Linux box because our alpha server > > > is aging. > > > > > > That's just my user application writer $0.02. > > > > What stops you from trying it with the patch? That kind of feedback > > would be worth way more than $0.02. > > We do not have such plans at the moment and I prefer spending my free > time on tinkering with kernel, not rewriting some in-house application. > Besides, DLM is not the only thing that does not have a drop-in > replacement in Linux. > > You just said you did not know if there are any potential users for the > full DLM and I said there are some. I did not say "potential", I said there are zero dlm applications at the moment. Nobody has picked up the prototype (g)dlm api, used it in an application and said "gee this works great, look what it does". I also claim that most developers who think that using a dlm for application synchronization would be really cool are probably wrong.
Use sockets for synchronization exactly as for a single-node, multi-tasking application and you will end up with less code, more obviously correct code, probably more efficient and... you get an optimal, single-node version for free. And I also claim that there is precious little reason to have a full-featured dlm in-kernel. Being in-kernel has no benefit for a userspace application. But being in-kernel does add kernel bloat, because there will be extra features lathered on that are not needed by the only in-kernel user, the cluster filesystem. In the case of your port, you'd be better off hacking up a userspace library to provide OpenVMS dlm semantics exactly, not almost. By the way, you said "alpha server" not "alpha servers", was that just a slip? Because if you don't have a cluster then why are you using a dlm? Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GFS, what's remaining
On Monday 05 September 2005 12:18, Dmitry Torokhov wrote: > On Monday 05 September 2005 10:49, Daniel Phillips wrote: > > On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote: > > > On 2005-09-03T01:57:31, Daniel Phillips <[EMAIL PROTECTED]> wrote: > > > > The only current users of dlms are cluster filesystems. There are > > > > zero users of the userspace dlm api. > > > > > > That is incorrect... > > > > Application users Lars, sorry if I did not make that clear. The issue is > > whether we need to export an all-singing-all-dancing dlm api from kernel > > to userspace today, or whether we can afford to take the necessary time > > to get it right while application writers take their time to have a good > > think about whether they even need it. > > If Linux fully supported OpenVMS DLM semantics we could start thinking > about moving our application onto a Linux box because our alpha server is > aging. > > That's just my user application writer $0.02. What stops you from trying it with the patch? That kind of feedback would be worth way more than $0.02. Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GFS, what's remaining
On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote: > On 2005-09-03T01:57:31, Daniel Phillips <[EMAIL PROTECTED]> wrote: > > The only current users of dlms are cluster filesystems. There are zero > > users of the userspace dlm api. > > That is incorrect... Application users Lars, sorry if I did not make that clear. The issue is whether we need to export an all-singing-all-dancing dlm api from kernel to userspace today, or whether we can afford to take the necessary time to get it right while application writers take their time to have a good think about whether they even need it. > ...and you're contradicting yourself here: How so? Above talks about dlm, below talks about cluster membership. > > What does have to be resolved is a common API for node management. It is > > not just cluster filesystems and their lock managers that have to > > interface to node management. Below the filesystem layer, cluster block > > devices and cluster volume management need to be coordinated by the same > > system, and above the filesystem layer, applications also need to be > > hooked into it. This work is, in a word, incomplete. Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Linux-cluster] Re: GFS, what's remaining
On Monday 05 September 2005 05:19, Andrew Morton wrote: > David Teigland <[EMAIL PROTECTED]> wrote: > > On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote: > > > David Teigland <[EMAIL PROTECTED]> wrote: > > > > We export our full dlm API through read/write/poll on a misc device. > > > > > > inotify did that for a while, but we ended up going with a straight > > > syscall interface. > > > > > > How fat is the dlm interface? ie: how many syscalls would it take? > > > > Four functions: > > create_lockspace() > > release_lockspace() > > lock() > > unlock() > > Neat. I'd be inclined to make them syscalls then. I don't suppose anyone > is likely to object if we reserve those slots. Better take a look at the actual parameter lists to those calls before jumping to conclusions... Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Linux-cluster] Re: GFS, what's remaining
On Sunday 04 September 2005 03:28, Andrew Morton wrote: > If there is already a richer interface into all this code (such as a > syscall one) and it's feasible to migrate the open() tricksies to that API > in the future if it all comes unstuck then OK. That's why I asked (thus > far unsuccessfully): > >Are you saying that the posix-file lookalike interface provides >access to part of the functionality, but there are other APIs which are >used to access the rest of the functionality? If so, what is that >interface, and why cannot that interface offer access to 100% of the >functionality, thus making the posix-file tricks unnecessary?

There is no such interface at the moment, nor is one needed in the immediate future. Let's look at the arguments for exporting a dlm to userspace:

1) Since we already have a dlm in kernel, why not just export that and save 100K of userspace library? Answer: because we don't want userspace-only dlm features bulking up the kernel. Answer #2: the extra syscalls and interface baggage serve no useful purpose.

2) But we need to take locks in the same lockspaces as the kernel dlm(s)! Answer: only support tools need to do that. A cut-down locking api is entirely appropriate for this.

3) But the kernel dlm is the only one we have! Answer: easily fixed, a simple matter of coding. But please bear in mind that dlm-style synchronization is probably a bad idea for most cluster applications, particularly ones that already do their synchronization via sockets.

In other words, exporting the full dlm api is a red herring. It has nothing to do with getting cluster filesystems up and running. It is really just marketing: it sounds like a great thing for userspace to get a dlm "for free", but it isn't free, it contributes to kernel bloat and it isn't even the most efficient way to do it.

If after considering that, we _still_ want to export a dlm api from kernel, then can we please take the necessary time and get it right?
The full api requires not only syscall-style elements, but asynchronous events as well, similar to aio. I do not think anybody has a good answer to this today, nor do we even need it to begin porting applications to cluster filesystems. Oracle guys: what is the distributed locking API for RAC? Is the RAC team waiting with bated breath to adopt your kernel-based dlm? If not, why not? Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Linux-cluster] Re: GFS, what's remaining
On Sunday 04 September 2005 00:46, Andrew Morton wrote: > Daniel Phillips <[EMAIL PROTECTED]> wrote: > > The model you came up with for dlmfs is beyond cute, it's downright > > clever. > > Actually I think it's rather sick. Taking O_NONBLOCK and making it a > lock-manager trylock because they're kinda-sorta-similar-sounding? Spare > me. O_NONBLOCK means "open this file in nonblocking mode", not "attempt to > acquire a clustered filesystem lock". Not even close. Now, I see the ocfs2 guys are all ready to back down on this one, but I will at least argue weakly in favor. Sick is a nice word for it, but it is actually not that far off. Normally, this fs will acquire a lock whenever the user creates a virtual file and the create will block until the global lock arrives. With O_NONBLOCK, it will return, erm... ETXTBSY (!) immediately. Is that not what O_NONBLOCK is supposed to accomplish? > It would be much better to do something which explicitly and directly > expresses what you're trying to do rather than this strange "lets do this > because the names sound the same" thing. > > What happens when we want to add some new primitive which has no posix-file > analog? > > Way too cute. Oh well, whatever. The explicit way is syscalls or a set of ioctls, which he already has the makings of. If there is going to be a userspace api, I would hope it looks more like the contents of userdlm.c than the traditional Vaxcluster API, which sucks beyond belief. Another explicit way is to do it with a whole set of virtual attributes instead of just a single file trying to capture the whole model. That is really unappealing, but I am afraid that is exactly what a whole lot of sysfs/configfs usage is going to end up looking like. But more to the point: we have no urgent need for a userspace dlm api at the moment. Nothing will break if we just put that issue off for a few months, quite the contrary. If the only user is their tools I would say let it go ahead and be cute, even sickeningly so.
It is not supposed to be a general dlm api, at least that is my understanding. It is just supposed to be an interface for their tools. Of course it would help to know exactly how those tools use it. Too sleepy to find out tonight... Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Linux-cluster] Re: GFS, what's remaining
On Sunday 04 September 2005 01:00, Joel Becker wrote: > On Sun, Sep 04, 2005 at 12:51:10AM -0400, Daniel Phillips wrote: > > Clearly, I ought to have asked why dlmfs can't be done by configfs. It > > is the same paradigm: drive the kernel logic from user-initiated vfs > > methods. You already have nearly all the right methods in nearly all the > > right places. > > configfs, like sysfs, does not support ->open() or ->release() > callbacks.

struct configfs_item_operations {
        void (*release)(struct config_item *);
        ssize_t (*show)(struct config_item *, struct attribute *, char *);
        ssize_t (*store)(struct config_item *, struct attribute *, const char *, size_t);
        int (*allow_link)(struct config_item *src, struct config_item *target);
        int (*drop_link)(struct config_item *src, struct config_item *target);
};

struct configfs_group_operations {
        struct config_item *(*make_item)(struct config_group *group, const char *name);
        struct config_group *(*make_group)(struct config_group *group, const char *name);
        int (*commit_item)(struct config_item *item);
        void (*drop_item)(struct config_group *group, struct config_item *item);
};

You do have ->release and ->make_item/group. If I may hand you a more substantive argument: you don't support user-driven creation of files in configfs, only directories. Dlmfs supports user-created files. But you know, there isn't actually a good reason not to support user-created files in configfs, as dlmfs demonstrates. Anyway, goodnight. Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Linux-cluster] Re: GFS, what's remaining
On Sunday 04 September 2005 00:30, Joel Becker wrote: > You asked why dlmfs can't go into sysfs, and I responded. And you got me! In the heat of the moment I overlooked the fact that you and Greg haven't agreed to the merge yet ;-) Clearly, I ought to have asked why dlmfs can't be done by configfs. It is the same paradigm: drive the kernel logic from user-initiated vfs methods. You already have nearly all the right methods in nearly all the right places. Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Linux-cluster] Re: GFS, what's remaining
On Saturday 03 September 2005 23:06, Joel Becker wrote: > dlmfs is *tiny*. The VFS interface is less than his claimed 500 > lines of savings. It is 640 lines. > The few VFS callbacks do nothing but call DLM > functions. You'd have to replace this VFS glue with sysfs glue, and > probably save very few lines of code. > In addition, sysfs cannot support the dlmfs model. In dlmfs, > mkdir(2) creates a directory representing a DLM domain and mknod(2) > creates the user representation of a lock. sysfs doesn't support > mkdir(2) or mknod(2) at all. I said "configfs" in the email to which you are replying. > More than mkdir() and mknod(), however, dlmfs uses open(2) to > acquire locks from userspace. O_RDONLY acquires a shared read lock (PR > in VMS parlance). O_RDWR gets an exclusive lock (X). O_NONBLOCK is a > trylock. Here, dlmfs is using the VFS for complete lifetiming. A lock > is released via close(2). If a process dies, close(2) happens. In > other words, ->release() handles all the cleanup for normal and abnormal > termination. > > sysfs does not allow hooking into ->open() or ->release(). So > this model, and the inherent lifetiming that comes with it, cannot be > used. Configfs has a per-item release method. Configfs has a group open method. What is it that configfs can't do, or can't be made to do trivially? > If dlmfs was changed to use a less intuitive model that fits > sysfs, all the handling of lifetimes and cleanup would have to be added. The model you came up with for dlmfs is beyond cute, it's downright clever. Why mar that achievement by then failing to capitalize on the framework you already have in configfs? By the way, do you agree that dlmfs is too inefficient to be an effective way of exporting your dlm api to user space, except for slow-path applications like you have here? 
Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Linux-cluster] Re: GFS, what's remaining
On Saturday 03 September 2005 02:46, Wim Coekaerts wrote: > On Sat, Sep 03, 2005 at 02:42:36AM -0400, Daniel Phillips wrote: > > On Friday 02 September 2005 20:16, Mark Fasheh wrote: > > > As far as userspace dlm apis go, dlmfs already abstracts away a large > > > part of the dlm interaction... > > > > Dumb question, why can't you use sysfs for this instead of rolling your > > own? > > because it's totally different. have a look at what it does. You create a dlm domain when a directory is created. You create a lock resource when a file of that name is opened. You lock the resource when the file is opened. You access the lvb by read/writing the file. Why doesn't that fit the configfs-nee-sysfs model? If it does, the payoff will be about 500 lines saved. This little dlm fs is very slick, but grossly inefficient. Maybe efficiency doesn't matter here since it is just your slow-path userspace tools taking these locks. Please do not even think of proposing this as a way to export a kernel-based dlm for general purpose use! Your userdlm.c file has some hidden gold in it. You have factored the dlm calls far more attractively than the bad old bazillion-parameter Vaxcluster legacy. You are almost in system call zone there. (But note my earlier comment on dlms in general: until there are dlm-based applications, merging a general-purpose dlm API is pointless and has nothing to do with getting your filesystem merged.) Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GFS, what's remaining
On Saturday 03 September 2005 06:35, David Teigland wrote: > Just a new version, not a big difference. The ondisk format changed a > little making it incompatible with the previous versions. We'd been > holding out on the format change for a long time and thought now would be > a sensible time to finally do it. What exactly was the format change, and for what purpose? - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GFS, what's remaining
On Friday 02 September 2005 20:16, Mark Fasheh wrote: > As far as userspace dlm apis go, dlmfs already abstracts away a large part > of the dlm interaction... Dumb question, why can't you use sysfs for this instead of rolling your own? Side note: you seem to have deleted all the 2.6.12-rc4 patches. Perhaps you forgot that there are dozens of lkml archives pointing at them? Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GFS, what's remaining
On Friday 02 September 2005 17:17, Andi Kleen wrote: > The only thing that should be probably resolved is a common API > for at least the clustered lock manager. Having multiple > incompatible user space APIs for that would be sad. The only current users of dlms are cluster filesystems. There are zero users of the userspace dlm api. Therefore, the (g)dlm userspace interface actually has nothing to do with the needs of gfs. It should be taken out of the gfs patch and merged later, when or if user space applications emerge that need it. Maybe in the meantime it will be possible to come up with a userspace dlm api that isn't completely repulsive. Also, note that the only reason the two current dlms are in-kernel is because it supposedly cuts down on userspace-kernel communication with the cluster filesystems. Then why should a userspace application bother with an awkward interface to an in-kernel dlm? This is obviously suboptimal. Why not have a userspace dlm for userspace apps, if indeed there are any userspace apps that would need to use dlm-style synchronization instead of more typical socket-based synchronization, or Posix locking, which is already exposed via a standard api? There is actually nothing wrong with having multiple, completely different dlms active at the same time. There is no urgent need to merge them into the one true dlm. It would be a lot better to let them evolve separately and pick the winner a year or two from now. Just think of the dlm as part of the cfs until then. What does have to be resolved is a common API for node management. It is not just cluster filesystems and their lock managers that have to interface to node management. Below the filesystem layer, cluster block devices and cluster volume management need to be coordinated by the same system, and above the filesystem layer, applications also need to be hooked into it. This work is, in a word, incomplete.
Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ia_attr_flags - time to die
On Friday 02 September 2005 15:41, Miklos Szeredi wrote: > Already dead ;) > > 2.6.13-mm1: remove-ia_attr_flags.patch > > Miklos Wow, the pace of Linux development really is picking up. Now patches are applied before I even send them! Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ia_attr_flags - time to die
Struct iattr is not involved any more in such things as NOATIME inode flags. There are no in-tree users of ia_attr_flags.

Signed-off-by: Daniel Phillips <[EMAIL PROTECTED]>

diff -up --recursive 2.6.13-rc5-mm1.clean/fs/hostfs/hostfs.h 2.6.13-rc5-mm1/fs/hostfs/hostfs.h
--- 2.6.13-rc5-mm1.clean/fs/hostfs/hostfs.h	2005-08-09 18:23:11.0 -0400
+++ 2.6.13-rc5-mm1/fs/hostfs/hostfs.h	2005-09-01 17:54:40.0 -0400
@@ -49,7 +49,6 @@ struct hostfs_iattr {
 	struct timespec	ia_atime;
 	struct timespec	ia_mtime;
 	struct timespec	ia_ctime;
-	unsigned int	ia_attr_flags;
 };
 
 extern int stat_file(const char *path, unsigned long long *inode_out,
diff -up --recursive 2.6.13-rc5-mm1.clean/include/linux/fs.h 2.6.13-rc5-mm1/include/linux/fs.h
--- 2.6.13-rc5-mm1.clean/include/linux/fs.h	2005-08-09 18:23:31.0 -0400
+++ 2.6.13-rc5-mm1/include/linux/fs.h	2005-09-01 18:27:42.0 -0400
@@ -282,19 +282,9 @@ struct iattr {
 	struct timespec	ia_atime;
 	struct timespec	ia_mtime;
 	struct timespec	ia_ctime;
-	unsigned int	ia_attr_flags;
 };
 
 /*
- * This is the inode attributes flag definitions
- */
-#define ATTR_FLAG_SYNCRONOUS	1	/* Syncronous write */
-#define ATTR_FLAG_NOATIME	2	/* Don't update atime */
-#define ATTR_FLAG_APPEND	4	/* Append-only file */
-#define ATTR_FLAG_IMMUTABLE	8	/* Immutable file */
-#define ATTR_FLAG_NODIRATIME	16	/* Don't update atime for directory */
-
-/*
  * Includes for diskquotas.
  */
 #include <linux/quota.h>
- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GFS, what's remaining
On Thursday 01 September 2005 06:46, David Teigland wrote: > I'd like to get a list of specific things remaining for merging. Where are the benchmarks and stability analysis? How many hours does it survive Cerberus running on all nodes simultaneously? Where are the testimonials from users? How long has there been a gfs2 filesystem? Note that Reiser4 is still not in mainline a year after it was first offered, why do you think gfs2 should be in mainline after one month? So far, all catches are surface things like bogus spinlocks. Substantive issues have not even begun to be addressed. Patience please, this is going to take a while. Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GFS, what's remaining
On Thursday 01 September 2005 10:49, Alan Cox wrote: > On Iau, 2005-09-01 at 03:59 -0700, Andrew Morton wrote: > > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot > > possibly gain (or vice versa) > > > > - Relative merits of the two offerings > > You missed the important one - people actively use it and have been for > some years. Same reason we have NTFS, HPFS, and all the others. On > that alone it makes sense to include. I thought that gfs2 just appeared last month. Or is it really still just gfs? If there are substantive changes from gfs to gfs2 then obviously they have had practically zero testing, let alone posted benchmarks, testimonials, etc. If it is really still just gfs then the silly-rename should be undone. Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Announcing Journaled File System (JFS) release 1.0.0 available
On Thursday 28 June 2001 16:22, Steve Best wrote: > June 28, 2001: > > IBM is pleased to announce the v 1.0.0 release of the open source > Journaled File System (JFS), a high-performance, and scalable file > system for Linux. > > http://oss.software.ibm.com/jfs Congratulations, and thanks for being so clued in about how to run your project. Example: the way you provide the source - direct links to cvs, tgz and patches, no annoying cgi. You guys get it, unlike some other names I won't mention (Sun ;-). -- Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [PATCH][RFC] inode->u.nfs_i allocated separately
On Thursday 28 June 2001 03:48, Alexander Viro wrote: > On Thu, 28 Jun 2001, Daniel Phillips wrote: > > > Advantages: no extra memory use, no indirection, no memory allocation > > > overhead. > > > > An advantage you overlooked: clean up fs.h so it doesn't have to include > > every filesystem in the known universe. > > > > All of this also applies to struct super_block. > > ... in both schemes. Think for a second - you don't need to include > anything into fs.h to have > void *i_private;/* pointer to fs-private data */ > in struct inode. IOW, that's not an advantage at all - they are not > different in that respect. Yes indeed, both are huge improvements over the union and both require changes to every filesystem. Given that equivalence I'd prefer the one that runs faster and uses less memory. -- Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [PATCH][RFC] inode->u.nfs_i allocated separately
On Thursday 28 June 2001 07:39, Alexander Viro wrote: > BTW, cost of extra dereferncing is trivial - when we access ext2-specific > part of inode we usually > a) do it more than once in a given function > b) access a lot of stuff outside of struct inode. It's not the only cost: - The memory manager overhead is doubled, inode slab fragmentation is doubled - We use an average of half a cache line more per inode, depending on inode size If we choose not to align the inode objects in slab then we waste an extra full cache line (half wasted at the end of the generic inode and half at the beginning of the specific part). -- Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [PATCH][RFC] inode->u.nfs_i allocated separately
On Wednesday 27 June 2001 23:22, Linus Torvalds wrote: > we could _easily_ have the setup > > struct ext2_inode { > struct inode inode; /* Generic fields */ > specific-ext2 struct; /* specific fields */ > }; > > and then when ext2 calls down to the generic VFS layer it just passes > > &ext2_inode->inode > > down, and when it gets a "struct inode *" it uses "inode_to_ext2()" to > convert it to an ext2 inode pointer. > > This is what the "struct list_head" thing does, and it works remarkably > well. It allows for embedding a list (or a hundred) into any object. The > above would take the same approach, and allow embedding an inode (and > maybe several) into any object. > > Advantages: no extra memory use, no indirection, no memory allocation > overhead. An advantage you overlooked: clean up fs.h so it doesn't have to include every filesystem in the known universe. All of this also applies to struct super_block. > Disadvantages: ?? -- Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: inode->i_blksize and inode->i_blocks
On Monday 04 June 2001 22:41, Bryan Henderson wrote: > I'm more confused than ever about the i_blocks (filesize divided by > 512) field. I note that in both the inode and the stat() result, the > filesize in bytes exists, and presumably always has. So why would > anyone ever want to know separately how many 512 byte units of data > are in the file? Files can have holes. > FS code appears to explicitly allow for a > filesystem driver to denominate i_blocks in other units, but any > other unit would appear to break the stat () interface. This can be fixed with a multiply. -- Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: inode->i_blksize and inode->i_blocks
On Monday 04 June 2001 19:02, Bryan Henderson wrote: > >Are there any deeper reasons, > >why > >a) inode->i_blksize is set to PAGESIZE eg. 4096 independent of the > > actual block size of the file system? > > Well, why not? The field tells what is a good chunk size to read or > write for maximum performance. If the I/O is done in PAGESIZE cache > page units, then that's the best number to use. But we already know that from PAGE_SIZE, this seems like a complete waste. > I suppose in the very first unix filesystems, the field may have > meant filesystem block size, which was identical to the highest > performing read/write size, and that may account for its name. > > >b) the number of blocks is counted in 512 Bytes and not in the > > actual blocksize of the filesystem? > > I can't see how the number of actual blocks would be helpful, > especially since as described above, we don't even know how big they > are. We don't even know that they're fixed size or that a concept of > a block even exists. Counting in 512 byte units was just a mistake. The correct units to count in are sb->s_blocksize. It's a little tricky to change that now but it still may happen. > >(is this for historical reasons??) > > That would be my guess. Though I can't think of any particular > programs that would measure a file by multiplying this number by 512. The original NEC floppy disk controllers used 512 byte sectors. > In any case, the inode fields are defined as they are because they > implement a standard stat() interface that includes these same > numbers. We can fix things up in cp_old/new_stat if we want. -- Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
On Sunday 27 May 2001 15:32, Edgar Toernig wrote: > Daniel Phillips wrote: > > It won't, the open for "." is handled in the VFS, not the > > filesystem - it will open the directory. (Without needing to be > > told it's a directory via O_DIRECTORY.) If you do open("magicdev") > > you'll get the device, because that's handled by magicdevfs. > > You really mean that "magicdev" is a directory and: > > open("magicdev/.", O_RDONLY); > open("magicdev", O_RDONLY); > > would both succeed but open different objects? Yes, and: open("magicdev/.", O_RDONLY | O_DIRECTORY); open("magicdev", O_RDONLY | O_DIRECTORY); will both succeed and open the same object. > > I'm not claiming there isn't breakage somewhere, > > you break UNIX fundamentals. But I'm quite relieved now because I'm > pretty sure that something like that will never go into the kernel. OK, I'll take that as "I couldn't find a piece of code that breaks, so it's on to the legal issues". SUS doesn't seem to have a lot to say about this. The nearest thing to a ruling I found was "The special filename dot refers to the directory specified by its predecessor". Which is not the same thing as: open("foo", O_RDONLY) == open ("foo/.", O_RDONLY) I don't know about POSIX (I don't have it: a pox on standards organizations that don't make their standards freely available) but SUS doesn't seem to forbid this. -- Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
On Thursday 24 May 2001 22:59, Edgar Toernig wrote: > Daniel Phillips wrote: > > > > Readdir fills in a directory type, so ls sees it as a directory > > > > and does the right thing. On the other hand, we know we're on > > > > a device filesystem so we will next open the name as a regular > > > > file, and find ISCHR or ISBLK: good. > > > > > > ??? The kernel may know it, but the app? Or do you really want > > > to give different stat data on stat(2) and fstat(2)? These flags > > > are currently used by archive/backup prgs. It's a hint that > > > these files are not regular files and shouldn't be opened for > > > reading. Having a 'd' would mean that they would really try to > > > enter the directory and save it's contents. Don't know what > > > happens in this case to your "special" files ;-) > > > > I guess that's much like the question 'what happens in proc?'. > > And that's already bad enough. Most of the "files" in proc should > be fifos! And using proc as an excuse to introduce another set of > magic dirs? No, thanks. Wait a second, I thought proc was here to stay. Wait another second, device nodes are already magic. Magic is magic, just choose your color ;-) This set of magic dirs is supposed to clean things up, not mess things up. We already saw how the side-effects-on-open problem in ls -l goes away. There's a much bigger problem I'd love to deal with: the 'no hierarchy can please everybody' problem. In database terms, a hierarchy is an insufficiently general model for real-world problems; in other words, hierarchies never worked. Tables work. That's where I'm trying to go with this, so please bear with me. This is not just a solution in search of a problem. > > Correct me if I'm wrong, but what we learn from the proc example > > is that tarring your whole source tree starting at / is not > > something you want to do. > > IMHO it would be better to fix proc instead of adding more magic. At > the moment you have to exclude /proc. You want to add /dev.
Well, actually no, ls -R, tar, zip, etc, work pretty well with the scheme I've described. > And > next? Exclude all $HOME/dev (in case process name spaces get added)? > Or make fifos magic too and add all of them to the exclude list? But > there's no central place for fifos. So lets add more magic :-( No, no, no, agreed and sometimes magic is good. It's not deep magic. The only new thing here is the interpretation of the O_DIRECTORY flag, or rather, the lack of it. > > What *won't* happen is, you won't get side effects from opening > > your serial ports (you'd have to open them without O_DIRECTORY > > to get that) so that seems like a little step forward. > > As already said: depending on O_DIRECTORY breaks POSIX compliance > and that alone should kill this idea... Thanks, two good points: - libc5 will get confused when doing ls in /magicdev - POSIX specifically forbids this I'll put this away until I've specifically dug into both of them. OK, over and out, thanks for your commentary. /me peruses man pages Oops, oh wait, there's already another open point: your breakage examples both rely on opening ".". You're right, "." should always be a directory and I believe that's enforced by the VFS. So we don't have an example of breakage yet. -- Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
On Friday 25 May 2001 00:00, Hans Reiser wrote: > Daniel Phillips wrote: > > I suppose I'm just reiterating the obvious, but we should > > eventually have a generic filesystem transaction API at the VFS > > level, once we have enough data points to know what the One True > > API should be. > > Daniel, implementing transactions is not a trivial thing as you > probably know. It requires that you resolve such issues as, what > happens if the user forgets to close the transaction, issues of > lock/transaction duration, of transaction batching, of levels of > isolation, of concurrent transactions modifying global fs metadata > and some but not all of those concurrent transactions receiving a > rollback, and of permissions relating to keeping transactions open. > I would encourage you to participate in the reiser4 design discussion > we will be having over the next 6 months, and give us your opinions. > Josh will be leading that design effort for the ReiserFS team. Graciously accepted. Coming up with something sensible in a mere 6 months would be a minor miracle. ;-) - what happens if the user forgets to close the transaction? I plan to set a checkpoint there (because the transaction got too big) and log the fact that it's open. - issues of lock/transaction duration Once again relying on checkpoints, when the transaction gets uncomfortably big for cache, set a checkpoint. I haven't thought about locks - transaction batching 1) Explicit transaction batch close 2) Cache gets past a certain fullness. In both cases, no new transactions are allowed to start and as soon as all current ones are closed we close the batch. - of levels of isolation - concurrent transactions modifying global fs metadata and some but not all of those concurrent transactions receiving a rollback First I was going to write 'huh?' here, then I realized you're talking about real database ops, not just filesystem ops. 
I had in mind something more modest: transactions are 'mv', 'read/write' (if the 'atomic read/write' is set), other filesystem operations I've forgotten, and anything the user puts between open_xact and close_xact. You are raising the ante a little ;-) In my case (Tux2) I could do an efficient rollback to the beginning of the batch (phase), then I would have had to have kept an in-memory log of the transactions for selective replay. With a journal log you can obviously do the same thing, but perhaps more efficiently if your journal design supports undo/redo. The above is a pure flight of fancy, we won't be seeing anything so fancy as an API across filesystems. - permissions relating to keeping transactions open. We can see this one in the light of a simple filesystem transaction: what happens if we are in the middle of a mv and someone changes the permissions? Go with the starting or ending permissions? Well, the database side of this is really interesting, but to get something generic across filesystems, the scope pretty well has to be limited to journal-type transactions, don't you think? -- Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
On Thursday 24 May 2001 23:26, Alexander Viro wrote: > On Thu, 24 May 2001, Edgar Toernig wrote: > > > What *won't* happen is, you won't get side effects from opening > > > your serial ports (you'd have to open them without O_DIRECTORY > > > to get that) so that seems like a little step forward. > > > > As already said: depending on O_DIRECTORY breaks POSIX compliance > > and that alone should kill this idea... > > What really kills that idea is the fact that you can trick > applications into opening your serial ports _without_ O_DIRECTORY. Err, I thought we already had that problem, but worse: an ordinary ls -l will do it. This way, we harmlessly list the device's properties instead. -- Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
On Thursday 24 May 2001 16:39, Oliver Xymoron wrote: > On Thu, 24 May 2001, Marko Kreen wrote: > > On Thu, May 24, 2001 at 02:23:27AM +0200, Edgar Toernig wrote: > > > Daniel Phillips wrote: > > > > > > It's going to be marked 'd', it's a directory, not a file. > > > > > > > > > > Aha. So you lose the S_ISCHR/BLK attribute. > > > > > > > > Readdir fills in a directory type, so ls sees it as a directory > > > > and does the right thing. On the other hand, we know we're on > > > > a device filesystem so we will next open the name as a regular > > > > file, and find ISCHR or ISBLK: good. > > > > > > ??? The kernel may know it, but the app? Or do you really want > > > to give different stat data on stat(2) and fstat(2)? These flags > > > are currently used by archive/backup prgs. It's a hint that > > > these files are not regular files and shouldn't be opened for > > > reading. Having a 'd' would mean that they would really try to > > > enter the directory and save it's contents. Don't know what > > > happens in this case to your "special" files ;-) > > > > IMHO the CHR/BLK is not needed. Think of /proc. In the future, > > the backup tools will be told to ignore /dev, that's all. > > The /dev dir should not be special. At least not to the kernel. I > have device files in places other than /dev, and you probably do too > (hint: anonymous FTP). True. If we're using a special filesystem for devices we can express the desired restriction in terms of 'don't back up this filesystem type' or 'don't go outside the root filesystem'. -- Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
On Tuesday 22 May 2001 22:10, Andreas Dilger wrote: > Peter Braam writes: > > File system journal recovery can corrupt a snapshot, because it > > copies data that needs to be preserved in a snapshot. During > > journal replay such data may be copied again, but the source can > > have new data already. > > The way it is implemented in reiserfs is to wait for existing > transactions to complete, entirely flush the journal and block all > new transactions from starting. Stephen implemented a journal flush > API to do this for ext3, but the hooks to call it from LVM are not in > place yet. This way the journal is totally empty at the time the > snapshot is done, so the read-only copy does not need to do journal > recovery, so no problems can arise. I suppose I'm just reiterating the obvious, but we should eventually have a generic filesystem transaction API at the VFS level, once we have enough data points to know what the One True API should be. -- Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
On Thursday 24 May 2001 02:23, Edgar Toernig wrote: > Daniel Phillips wrote: > > > > It's going to be marked 'd', it's a directory, not a file. > > > > > > Aha. So you lose the S_ISCHR/BLK attribute. > > > > Readdir fills in a directory type, so ls sees it as a directory and > > does the right thing. On the other hand, we know we're on a device > > filesystem so we will next open the name as a regular file, and > > find ISCHR or ISBLK: good. > > ??? The kernel may know it, but the app? Or do you really want to > give different stat data on stat(2) and fstat(2)? These flags are > currently used by archive/backup prgs. It's a hint that these files > are not regular files and shouldn't be opened for reading. > Having a 'd' would mean that they would really try to enter the > directory and save it's contents. Don't know what happens in this > case to your "special" files ;-) I guess that's much like the question 'what happens in proc?'. Recursively entering the device directory is ok as long as everything inside it is ok. I tried zipping /proc/bus -r and what I got is what I'd expect if I'd cat'ed every non-directory entry. This is what I expected. Maybe it's not right - zipping /proc/kcore is kind of interesting. Regardless, we are no worse than proc here. In fact, since we don't anticipate putting an elephant like kcore in as a device property, we're a little nicer to get along with. Correct me if I'm wrong, but what we learn from the proc example is that tarring your whole source tree starting at / is not something you want to do. Just extend that idea to /dev - however, if you do it, it will produce pretty reasonable results. What *won't* happen is, you won't get side effects from opening your serial ports (you'd have to open them without O_DIRECTORY to get that) so that seems like a little step forward. I'm still thinking about some of your other comments. 
-- Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
On Wednesday 23 May 2001 06:19, Edgar Toernig wrote: > IMO the whole idea of arguments following the device name is junk > (incl a "/ctrl"). You know I didn't suggest that, right? I find it pretty strange too, but I'm listening to hear the technical arguments. > Just think about the implications of the original "/dev/ttyS0/19200" > suggestion. It sounds nice and tempting. But which programs will > benefit. Which gets confused. What will be cleaned up. After some > thoughts you'll find out that it's useless ;-) You know I didn't suggest that either, right? But I'm with you, I don't like it at'all, not least because we might change baud rate on the fly. > And with special "ctrl" devices (ie /dev/ttyS0 and /dev/ttyS0ctrl): > This _may_ work for some kind of devices. But serial ports are one > example where it simply will _not_. It requires that you know the > name of the device. For ttys this is often not the case. > Even if you manage to get some name for stdin for example - now I > should simply attach a "ctrl" to that name to get a control channel??? > At least dangerous. If I'm lucky I only get an EPERM... Again, I'll provide a sympathetic ear, but it wasn't my suggestion. > Ciao, ET. And you were referring to who? -- Daniel - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
On Wednesday 23 May 2001 06:19, Edgar Toernig wrote: > Daniel Phillips wrote: > > On Tuesday 22 May 2001 17:24, Oliver Xymoron wrote: > > > On Mon, 21 May 2001, Daniel Phillips wrote: > > > > On Monday 21 May 2001 19:16, Oliver Xymoron wrote: > > > > > What I'd like to see: > > > > > > > > > > - An interface for registering an array of related devices > > > > > (almost always two: raw and ctl) and their legacy device > > > > > numbers with a single userspace callout that does whatever > > > > > /dev/ creation needs to be done. Thus, naming and permissions > > > > > live in user space. No "device node is also a directory" > > > > > weirdness... > > > > > > > > Could you be specific about what is weird about it? > > > > > > *boggle* > > > > > >[general sense of unease] > > I fully agree with Oliver. It's an abomination. We are, or at least, I am, investigating this question purely on technical grounds - name calling is a noop. I'd be happy to find a real reason why this is a bad idea but so far none has been presented. Don't get me wrong, the fact that people I respect have reservations about the idea does mean something to me, but this still needs to be investigated properly. Now on to the technical content... > > > I don't think it's likely to be even workable. Just consider the > > > directory entry for a moment - is it going to be marked d or > > > [cb]? > > > > It's going to be marked 'd', it's a directory, not a file. > > Aha. So you lose the S_ISCHR/BLK attribute. Readdir fills in a directory type, so ls sees it as a directory and does the right thing. On the other hand, we know we're on a device filesystem so we will next open the name as a regular file, and find ISCHR or ISBLK: good. The rule for this filesystem is: if you open with O_DIRECTORY then directory operations are permitted, nothing else. If you open without O_DIRECTORY then directory operations are forbidden (as usual) and normal device semantics apply. 
If there is weirdness anywhere, it's right here with this rule. The
question is: what if anything breaks?

> > > If it doesn't have the directory bit set, Midnight commander
> > > won't let me look at it, and I wouldn't blame cd or ls for
> > > complaining. If it does have the 'd' bit set, I wouldn't blame
> > > cp, tar, find, or a million other programs if they did the wrong
> > > thing. They've had 30 years to expect that files aren't
> > > directories. They're going to act weird.
> >
> > No problem, it's a directory.
>
> Directories are not allowed to be read from/written to. The VFS may
> support it, but it's not (current) UNIX.

Here, we obey this rule: if you open it with O_DIRECTORY then you can't
read from or write to it.

> > > Linus has been kicking this idea around for a couple years now
> > > and it's still a cute solution looking for a problem. It just
> > > doesn't belong in UNIX.
> >
> > Hmm, ok, do we still have any *technical* reasons?
>
> So with your definition, I have a fs-object that is marked as a
> directory but opening it opens a device. Pretty nice..

No, you have to open it without O_DIRECTORY to get your device fd
handle.

> How I'm supposed to list it's contents? open+readdir?

Nothing breaks here, ls works as it always did. This is what ls does:

open("foobar", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 3
fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
fcntl64(0x3, 0x2, 0x1, 0x2) = -1 ENOSYS (Function not implemented)
fcntl(3, F_SETFD, FD_CLOEXEC) = 0
brk(0x805b000) = 0x805b000
getdents64(0x3, 0x8058270, 0x1000, 0x26) = -1 ENOSYS (Function not implemented)
getdents(3, /* 2 entries */, 2980) = 28
getdents(3, /* 0 entries */, 2980) = 0
close(3) = 0

Note that ls doesn't do anything as inconvenient as opening foobar as a
normal file first, expecting that operation to fail.

> But the open has nasty side effects.
> So you have a directory that you are not allowed
> to list (because of the possible side effects) but is allowed to be
> read from/written to maybe even issue ioctls to?.

No, you would get side effects only if you open as a regular file. I'd
agree that that sucks, but that's not what we're trying to fix just now.

> And you call that sane???

I would hope it seems saner now, after the clarification. Please, if you
know something that actually breaks, tell me.

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
in the body of a message to [EMAIL PROTECTED]
Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
On Tuesday 22 May 2001 19:49, Oliver Xymoron wrote:
> On Tue, 22 May 2001, Daniel Phillips wrote:
> > > I don't think it's likely to be even workable. Just consider the
> > > directory entry for a moment - is it going to be marked d or
> > > [cb]?
> >
> > It's going to be marked 'd', it's a directory, not a file.
>
> Are we talking about the same proposal? The one where I can open
> /dev/dsp and /dev/dsp/ctl? But I can still do 'cat /dev/hda >
> /dev/dsp'?

We already support read/write on directories in the VFS, that's not a
problem.

> It's still a file. If it's not a file anymore, it ain't UNIX.

It's a file with the directory bit set, I believe that's UNIX.

> > > If it doesn't have the directory bit set, Midnight commander
> > > won't let me look at it, and I wouldn't blame cd or ls for
> > > complaining. If it does have the 'd' bit set, I wouldn't blame
> > > cp, tar, find, or a million other programs if they did the wrong
> > > thing. They've had 30 years to expect that files aren't
> > > directories. They're going to act weird.
> >
> > No problem, it's a directory.
> >
> > > Linus has been kicking this idea around for a couple years now
> > > and it's still a cute solution looking for a problem. It just
> > > doesn't belong in UNIX.
> >
> > Hmm, ok, do we still have any *technical* reasons?
>
> If you define *technical* to not include design, sure.

Sorry, I don't see what you mean, do you mean the design is difficult?

> Oh, did I mention unnecessary, solvable in userspace?

That's exactly the point: the generic filesystem allows all the
funny-shaped stuff to be dealt with in user space. The filesystem itself
is lovely and clean.

BTW, I didn't realize I was reinventing Linus's wheel, this just seemed
very obvious and natural to me. So I had to believe there's a technical
obstacle somewhere. Has anyone written code to demonstrate the idea?
--
Daniel
Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
On Tuesday 22 May 2001 17:24, Oliver Xymoron wrote:
> On Mon, 21 May 2001, Daniel Phillips wrote:
> > On Monday 21 May 2001 19:16, Oliver Xymoron wrote:
> > > What I'd like to see:
> > >
> > > - An interface for registering an array of related devices
> > > (almost always two: raw and ctl) and their legacy device numbers
> > > with a single userspace callout that does whatever /dev/ creation
> > > needs to be done. Thus, naming and permissions live in user
> > > space. No "device node is also a directory" weirdness...
> >
> > Could you be specific about what is weird about it?
>
> *boggle*
>
> [general sense of unease]
>
> I don't think it's likely to be even workable. Just consider the
> directory entry for a moment - is it going to be marked d or [cb]?

It's going to be marked 'd', it's a directory, not a file.

> If it doesn't have the directory bit set, Midnight commander won't
> let me look at it, and I wouldn't blame cd or ls for complaining. If
> it does have the 'd' bit set, I wouldn't blame cp, tar, find, or a
> million other programs if they did the wrong thing. They've had 30
> years to expect that files aren't directories. They're going to act
> weird.

No problem, it's a directory.

> Linus has been kicking this idea around for a couple years now and
> it's still a cute solution looking for a problem. It just doesn't
> belong in UNIX.

Hmm, ok, do we still have any *technical* reasons?

--
Daniel
Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
On Monday 21 May 2001 19:16, Oliver Xymoron wrote:
> What I'd like to see:
>
> - An interface for registering an array of related devices (almost
> always two: raw and ctl) and their legacy device numbers with a
> single userspace callout that does whatever /dev/ creation needs to
> be done. Thus, naming and permissions live in user space. No "device
> node is also a directory" weirdness...

Could you be specific about what is weird about it?

> ...which is overkill in the vast majority of cases.

--
Daniel
Re: [RFD w/info-PATCH] device arguments from lookup, partion code in userspace
On Monday 21 May 2001 14:43, [EMAIL PROTECTED] wrote:
> How about:
>
> # mkpart /dev/sda /dev/mypartition -o size=1024k,type=swap
> # ls /dev/mypartition
> base  size  device  type
>
> Generally, we shouldn't care which order the kernel enumerates
> devices in or which device number gets assigned internally. If
> we did need to care, we'd just do:
>
> # echo 666 >/dev/mypartition/number
>
> Only a single thing is of interest.
> What is the communication between user space and kernel
> that transports device identities?

It doesn't change, the same symbolic names still work. What's happening
in my example is, we've gotten rid of the can't-get-there-from-here
device naming hierarchy. It should be clear by now that we can't capture
'physical device location' and 'device function' in one tree. So
instead, 'physical device' is a property of 'logical device'. The tree
is now optional.

> Note that there is user (human) / user space (programs) / kernel.
>
> This user has interesting machinery in his hands,
> but his programs have only strings (path names, fake or not)
> to give to the kernel in open() and mount() calls.
>
> Now the device path is so complicated that the user is unable to
> describe it using a path name. devfs made an attempt listing
> controller, lun, etc etc but /dev/ide/host0/bus1/target1/lun0/disc is
> not very attractive, and things only get worse.

Yes, we flatten that by making host, bus, target and lun all properties
of /proc/ide/hda. Our mistake up to now is that we've tried to carry the
logical view and physical view of the device in one name, or
equivalently, in path+name. Let the physical device be a property of the
logical device and we no longer have our thumb tied to our nose.

> When I go to a bookshop to buy a book, I can do so without specifying
> all of Author, Editors, Title, Publisher, Date, ISBN, nr of pages,
> ... A few items suffice. Often the Title alone will do.
> We want an interface where the kernel exports what it has to offer
> and the user can pick. Yes, that Zip drive - never mind the bus.
> But can distinguish - Yes, that USB Zip drive, not the one
> on the parallel port.

100% agreed. IOW, when the device *does* move we can usually deduce
where it's moved to, so let's update the hda's bus location
automatically whenever we can (log a message!) and only bother the user
about it if it's ambiguous. For good measure, have a system setting that
says 'on a scale of 0 to 5, this is how interested I am in being
bothered about the fact that a device seems to have moved'.

> The five minute hack would number devices 1, 2, 3 in order of
> detection, offer the detection message in
> /devices//detectionmessage and a corresponding device node in
> /devices//devicenode. The sysadmin figures out what is what,
> makes a collection of symlinks with his favorite names, and everybody
> is happy.
>
> Until the next reboot. Or until device removal and addition.
> There must be a way to give permanence to an association
> between name and device. Symlinks into a virtual filesystem
> like /devices are not good enough. Turning the five minute
> hack into a ten minute hack we take the md5sum of the part
> of the bootmessage that is expected to be the same the next time
> we encounter this device and use that as device number.
>
> I think a system somewhat in this style could be made to work well.

Yes, we are advocating the same thing. I didn't mention that the device
properties are supposed to be persistent, did I? If you accept the idea
of persistent device properties then the obvious thing to do is to match
them up against the detected devices.

I didn't want to bring up the persistency thing right away because it
begs the question of where you store the persistent data for the root
device. Until the namespace issue is resolved this is mainly a
distraction.
--
Daniel
Re: [RFD w/info-PATCH] device arguments from lookup, partion code in userspace
On Monday 21 May 2001 10:14, Lars Marowsky-Bree wrote:
> On 2001-05-19T16:25:47, Daniel Phillips <[EMAIL PROTECTED]> said:
> > How about:
> >
> > # mkpart /dev/sda /dev/mypartition -o size=1024k,type=swap
> > # ls /dev/mypartition
> > base  size  device  type
> > # cat /dev/mypartition/size
> > 1048576
> > # cat /dev/mypartition/device
> > /dev/sda
> > # mke2fs /dev/mypartition
>
> Ek. You want to run mke2fs on a _directory_ ?

Could you be specific about what is wrong with that? Assuming that this
device directory lives on a special purpose filesystem?

> If anything, /dev/mypartition/realdev

Then every fstab in the world has to change, not to mention adding
verbosity to interactive commands.

--
Daniel
Re: [RFD w/info-PATCH] device arguments from lookup, partion code in userspace
On Saturday 19 May 2001 08:23, Ben LaHaise wrote:
> /dev/sda/offset=1024,limit=2048
> -> open a device that gives a view of sda at an
> offset of 1KB to 2KB

Whatever we end up with, can we express it in terms of base, size,
please?

--
Daniel
Re: [RFD w/info-PATCH] device arguments from lookup, partion code in userspace
On Saturday 19 May 2001 13:37, Eric W. Biederman wrote:
> For creating partitions you might want to do:
> cat 1024 2048 > /dev/sda/newpartition

How about:

# mkpart /dev/sda /dev/mypartition -o size=1024k,type=swap
# ls /dev/mypartition
base  size  device  type
# cat /dev/mypartition/size
1048576
# cat /dev/mypartition/device
/dev/sda
# mke2fs /dev/mypartition

The information that was specified is persistent in /dev. We can
rearrange our physical devices any way we want without affecting the
name we chose in /dev. When the kernel enumerates devices at startup,
our persistent information had better match or we will have to take some
corrective action. Generally, we shouldn't care which order the kernel
enumerates devices in or which device number gets assigned internally.
If we did need to care, we'd just do:

# echo 666 >/dev/mypartition/number

setting a persistent device minor number. The major number is inherited
via the partition's /device property. To set the minor number back to
'don't care':

# rm /dev/mypartition/number

By taking the physical device off the top of the food chain we gain the
flexibility of being able to move the device from bus to bus, for
example, and only the partition's device property changes, nothing in
our fstab. It's no great leap to set things up so that not even the
/device property would need to change.

Note that we can have a hierarchy of partitions this way if we want to,
since /dev/mypartition is just another block device.

--
Daniel
Re: ext3 for 2.4
On Friday 18 May 2001 11:10, Alexander Viro wrote:
> On Thu, 17 May 2001, Daniel Phillips wrote:
> > Well, if you look how I did the index, it works with blocks and
> > buffers while still staying entirely in the page cache. This was
> > Stephen's suggestion, and it integrates reliably with Al's
> > page-oriented code. So I'm mixing pages and blocks together and
> > it's working pretty well.
>
> ... or, in immortal words of Hans, "Yura, run the benchmarks"...
>
> > BTW, the parts of Al's patch that I converted from pages to blocks
> > got shorter and easier to read.
>
> No offense, but your code is a twisted mess to the degree
> when bugs are hard to see just because of obfuscation.

Al, I was referring to your code, not my code ;-)

> Please, find and describe an obvious bug in ext2_add_entry()
> (dx-2.4.4-6.pcache version). After deobfuscation it becomes
> immediately visible - the only reason why it doesn't stick out like a
> sore thumb is that code is a spaghettish mess.
>
> Generally, when one has to draw a flowchart to figure out
> what happens in a function and to find lifetimes of local variables
> (couple of dozens of them) it means only one thing: function is
> _crap_.
>
> And yes, flowchart is what I finally had to resort to.
> Daniel, I don't care if you consider writing hairy code as a DSW and
> frankly, I'm less than impressed by the S of D being demonstrated in
> that particular case.

But Al, don't tease me, what bug?

--
Daniel
Re: ext3 for 2.4
On Thursday 17 May 2001 17:53, Andrew Morton wrote:
> It's probably worth thinking about adding a fourth journalling
> mode: `journal=none'. Roll it all up into a single codebase
> and call it ext4.

Or ext5 (= ext2 + ext3).

> It rather depends on where the buffercache ends up. ext3 is
> a client of JBD (nee JFS). JBD does *block* level journalling.
> Any major change at that level will take rather some adjusting
> to.

Well, if you look how I did the index, it works with blocks and buffers
while still staying entirely in the page cache. This was Stephen's
suggestion, and it integrates reliably with Al's page-oriented code. So
I'm mixing pages and blocks together and it's working pretty well.

BTW, the parts of Al's patch that I converted from pages to blocks got
shorter and easier to read. I'm now working on some code to handle
non-data blocks in a similar way, so if this works out it could make the
conversion an awful lot less painful for you.

--
Daniel
Re: ext3 for 2.4
On Thursday 17 May 2001 13:20, Andrew Morton wrote:
> Summary: ext3 works, page_launder() doesn't :)
>
> The tree is based on the porting work which Peter Braam did. It's
> in cvs in Jeff Garzik's home on sourceforge. Info on CVS is at
> http://sourceforge.net/cvs/?group_id=3242 - the module name
> is `ext3'. There's a README there which describes how to
> apply the patchset.

Congratulations to all. Naturally, Ext3 will need a fast directory
index, and quickly too, before people start running benchmarks against
ReiserFS and XFS. :-)

Could you take a look at my indexing patch and see what the journalling
issues are? (If any) I have three flavors for you to choose from:

1) Good old buffer cache
2) Page cache, block oriented
3) Page cache, blocks and pages

The first two are from the same patch, with a compilation option:

http://nl.linux.org/~phillips/htree/dx.testme-2.4.4

And the third is a combination of two patches:

ftp://ftp.math.psu.edu/pub/viro/ext2-dir-patch-S4.gz
http://nl.linux.org/~phillips/htree/dx.pcache-2.4.4-6

Please take a look and see which style fits best. The pcache patch is
the forward-looking one; it's preferred.

--
Daniel