Re: [NFS] [PATCH] Make UDF exportable
On Wednesday February 6, [EMAIL PROTECTED] wrote:
> > +	dotdot.d_name.name = "..";
> > +	dotdot.d_name.len = 2;
> > +
> > +	lock_kernel();
> > +	if (!udf_find_entry(child->d_inode, &dotdot, &fibh, &cfi))
> > +		goto out_unlock;
> >
> > Have you ever tried this? I think this could never work. UDF
> > doesn't have an entry named ".." in a directory. You have to search
> > for an entry that has the FID_FILE_CHAR_PARENT bit set in
> > fileCharacteristics. Maybe you could hack-around udf_find_entry()
> > to recognize the ".." dentry and do the search accordingly.
>
> Probably not. I just tested that I could read files and navigate the
> directory structure. However, looking into UDF, I think you are
> right - it will fail. I have extended udf_find_entry() to do an
> explicit check based on fileCharacteristics as you propose.
>
> How do I actually test this case?

- Mount the filesystem from the server.
- 'cd' a few directories down into the filesystem.
- reboot the server(1)
- on the client, 'ls -l'.

(1) A full reboot isn't needed. Just unexport, unmount, remount,
re-export on the server. Alternately, use a non-linux client and cd
down into the filesystem and "ls -l ..".

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
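The parent-entry search being discussed can be sketched in userspace C. The struct and helper below are purely illustrative (they are not the kernel's udf_find_entry() or its real descriptor layout); only the FID_FILE_CHAR_PARENT flag value, 0x08, comes from the UDF/ECMA-167 specification:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* UDF directories carry no literal ".." name; the parent link is the
 * file identifier descriptor whose fileCharacteristics has this bit. */
#define FID_FILE_CHAR_PARENT 0x08

/* Hypothetical, simplified view of a file identifier descriptor. */
struct fid {
	uint8_t file_characteristics;
	uint8_t name_len;
	const char *name;
};

/* Find the ".." entry: scan for the FID flagged as the parent
 * pointer rather than matching on a name. */
static const struct fid *find_parent(const struct fid *fids, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (fids[i].file_characteristics & FID_FILE_CHAR_PARENT)
			return &fids[i];
	return NULL;
}
```

This is the shape of the check the extended udf_find_entry() would do internally when handed the ".." dentry.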
Re: [patch] VFS: extend /proc/mounts
On Thursday January 17, [EMAIL PROTECTED] wrote:
> On Jan 17 2008 00:43, Karel Zak wrote:
> > Seems like a plain bad idea to me. There will be any number of
> > home-made /proc/mounts parsers and we don't know what they do.
> > So, let's use /proc/mounts_v2 ;-)
>
> Was not it like "don't use /proc for new things"?

I thought it was "don't use /proc for new things that aren't process
related". And as the mount table is per process...

A host has a bunch of mounted filesystems (struct super_block), and
each process has some subset of these stitched together into a mount
tree (struct vfsmount / struct namespace).

There needs to be something in /proc that exposes the vfsmount tree.
Arguably there should be something else - maybe in sysfs - that
provides access to the struct super_block object. And there needs to
be a clear way to relate information from one with information from
the other.

In the tradition of "stat", "statm", "status", maybe the former should
be /proc/$PID/mountm :-)

Hey, I just found /proc/X/mountstats. How does this fit in to the big
picture?

NeilBrown
Re: Race between generic_forget_inode() and sync_sb_inodes()?
Hi David,

On Friday November 30, [EMAIL PROTECTED] wrote:
> I came across this because I've been making changes to XFS to avoid
> the inode hash, and I've found that I need to remove the inode from
> the dirty list when setting I_WILL_FREE to avoid this race. I can't
> see how this race is avoided when inodes are hashed, so I'm wondering
> if we've just been lucky or there's something that I'm missing that
> means the above does not occur.

Looking at inode.c in 2.6.23-mm1, in generic_forget_inode, I see code:

	if (!hlist_unhashed(&inode->i_hash)) {
		if (!(inode->i_state & (I_DIRTY|I_SYNC)))
			list_move(&inode->i_list, &inode_unused);

so it looks to me like: If the inode is hashed and dirty, then move it
(off the s_dirty list) to inode_unused.

So it seems to me that generic_forget_inode also finds it needs to
remove the inode from the dirty list when setting I_WILL_FREE.

Maybe we are looking at different kernel versions? Maybe I
misunderstood your problem?

NeilBrown
Re: Race between generic_forget_inode() and sync_sb_inodes()?
On Friday November 30, [EMAIL PROTECTED] wrote:
> On Fri, Nov 30, 2007 at 09:07:06AM +1100, Neil Brown wrote:
> > Hi David,
> >
> > On Friday November 30, [EMAIL PROTECTED] wrote:
> > > I came across this because I've been making changes to XFS to
> > > avoid the inode hash, and I've found that I need to remove the
> > > inode from the dirty list when setting I_WILL_FREE to avoid this
> > > race. I can't see how this race is avoided when inodes are
> > > hashed, so I'm wondering if we've just been lucky or there's
> > > something that I'm missing that means the above does not occur.
> >
> > Looking at inode.c in 2.6.23-mm1, in generic_forget_inode, I see code:
> >
> > 	if (!hlist_unhashed(&inode->i_hash)) {
> > 		if (!(inode->i_state & (I_DIRTY|I_SYNC)))
> > 			list_move(&inode->i_list, &inode_unused);
> >
> > so it looks to me like: If the inode is hashed and dirty, then move
> > it (off the s_dirty list) to inode_unused.
>
> That check is for if the inode is _not_ dirty or being sync, right?
> Or have I just not had enough coffee this morning? :-)

And I cannot even blame the lack of coffee as I don't drink it.

My second guess is that we have been "lucky", which is hard to
believe. I wonder if iput (and even iget) should BUG on I_WILL_FREE
as well...

Perplexed.

NeilBrown
Re: Massive slowdown when re-querying large nfs dir - CORRECTION
On Thursday November 8, [EMAIL PROTECTED] wrote:
> Not really a credible difference as the reported difference is
> between two *clients* and the speed of getattr vs lookup would
> depend on the *server*.

Sorry, my bad. I misread your original problem description. It would
appear to be a server difference.

Maybe an "strace -tt" of the nfs server might show some significant
difference.

NeilBrown
Re: Massive slowdown when re-querying large nfs dir
On Tuesday November 6, [EMAIL PROTECTED] wrote:
> On Tue, 6 Nov 2007 14:28:11 +0300 Al Boldi [EMAIL PROTECTED] wrote:
> > Al Boldi wrote:
> > > There is a massive (3-18x) slowdown when re-querying a large nfs
> > > dir (2k+ entries) using a simple "ls -l".
> > >
> > > On 2.6.23 client and server running userland rpc.nfs.V2:
> > >   first try:  time -p ls -l (2k+ entry dir) in ~2.5sec
> > >   more tries: time -p ls -l (2k+ entry dir) in ~8sec
> > >   first try:  time -p ls -l (5k+ entry dir) in ~9sec
> > >   more tries: time -p ls -l (5k+ entry dir) in ~180sec
> > >
> > > On 2.6.23 client and 2.4.31 server running userland rpc.nfs.V2:
> > >   first try:  time -p ls -l (2k+ entry dir) in ~2.5sec
> > >   more tries: time -p ls -l (2k+ entry dir) in ~7sec
> > >   first try:  time -p ls -l (5k+ entry dir) in ~8sec
> > >   more tries: time -p ls -l (5k+ entry dir) in ~43sec
> > >
> > > Remounting the nfs-dir on the client resets the problem.
> > >
> > > Any ideas?
> >
> > Ok, I played some more with this, and it turns out that nfsV3 is a
> > lot faster. But, this does not explain why the 2.4.31 kernel is
> > still over 4-times faster than 2.6.23.
> >
> > Can anybody explain what's going on?
>
> Sure, Neil can! ;)

Nuh. He said "userland rpc.nfs.Vx". I only do kernel-land NFS. In
these days of high specialisation, each line of code is owned by a
different person, and finding the right person is hard.

I would suggest getting a 'tcpdump -s0' trace and seeing (with
wireshark) what is different between the various cases.

NeilBrown
Re: Proposal to improve filesystem/block snapshot interaction
On Tuesday October 30, [EMAIL PROTECTED] wrote:
> Of course snapshot cow elements may be part of more generic element
> trees. In general there may be more than one consumer of block usage
> hints in a given filesystem's element tree, and their locations in
> that tree are not predictable. This means the block extents mentioned
> in the usage hints need to be subject to the block mapping algorithms
> provided by the element tree. As those algorithms are currently
> implemented using bio mapping and splitting, the easiest and simplest
> way to reuse those algorithms is to add new bio flags.

So are you imagining that you might have distinct snapshotable
elements, and that some of these might be combined by e.g. RAID0 into
a larger device, then a filesystem is created on that?

I ask because my first thought was that the sort of communication you
want seems like it would be just between a filesystem and the block
device that it talks directly to, and as you are particularly
interested in XFS and XVM, you could come up with whatever protocol
you want for those two to talk to each other, prototype it, iron out
all the issues, then say "We've got this really cool thing to make
snapshots much faster - wanna share?" and thus be presenting from a
position of more strength (the old 'code talks' mantra).

> First we need a mechanism to indicate that a bio is a hint rather
> than a real IO. Perhaps the easiest way is to add a new flag to the
> bi_rw field:
>
> #define BIO_RW_HINT	5	/* bio is a hint not a real io; no pages */

Reminds me of the new approach to issue_flush_fn, which is just to
have a zero-length barrier bio (is that implemented yet? I lost
track). But different, as a zero-length barrier has zero length, and
your hints have a very meaningful length.

> Next we'll need three bio hint types with the following semantics.
>
> BIO_HINT_ALLOCATE
>     The bio's block extent will soon be written by the filesystem
>     and any COW that may be necessary to achieve that should begin
>     now. If the COW is going to fail, the bio should fail. Note that
>     this provides a way for the filesystem to manage when and how
>     failures to COW are reported.

Would it make sense to allow the bi_sector to be changed by the device
and to have that change honoured? i.e. "Please allocate 128 blocks,
maybe 'here'" / "OK, 128 blocks allocated, but they are actually over
'there'."

If the device is tracking what space is and isn't used, it might make
life easier for it to do the allocation. Maybe even have a variant
"Allocate 128 blocks, I don't care where".

Is this bio supposed to block until the copy has happened? Or only
until the space for the copy has been allocated and possibly
committed? Or must it return without doing any IO at all?

> BIO_HINT_RELEASE
>     The bio's block extent is no longer in use by the filesystem and
>     will not be read in the future. Any storage used to back the
>     extent may be released without any threat to filesystem or data
>     integrity.

If the allocation unit of the storage device (e.g. a few MB) does not
match the allocation unit of the filesystem (e.g. a few KB) then for
this to be useful either the storage device must start recording tiny
allocations, or the filesystem should re-release areas as they grow.
i.e. when releasing a range of a device, look in the filesystem's
usage records for the largest surrounding free space, and release all
of that.

Would this be a burden on the filesystems? Is my imagined disparity
between block sizes valid? Would it be just as easy for the storage
device to track small allocations/deallocations?

> BIO_HINT_DONTCOW
>     (the Bart Simpson BIO). The bio's block extent is not needed in
>     mounted snapshots and does not need to be subjected to COW.

This seems like a much more domain-specific function than the other
two, which themselves could be more generally useful (I'm imagining
using hints from them to e.g. accelerate RAID reconstruction).

Surely the correct thing to do with the log is to put it on a separate
device which itself isn't snapshotted. If you have a storage manager
that is smart enough to handle these sorts of things, maybe the
functionality you want is "Give me a subordinate device which is not
snapshotted, size X", then journal to that virtual device. I guess
that is equally domain specific, but the difference is that if you try
to read from the DONTCOW part of the snapshot, you get bad old data,
whereas if you try to access the subordinate device of a snapshot, you
get an IO error - which is probably safer.

> Comments?

On the whole it seems reasonably sane, providing you are from the
school which believes that volume managers and filesystems should be
kept separate :-)

NeilBrown
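To make the three hint semantics concrete, here is a toy userspace model of the proposal: the enum names echo the proposed hint types, but the COW-tracking "device" and all of its logic are invented for illustration only (nothing here is the actual kernel interface being proposed):

```c
#include <assert.h>

/* The three hint types from the proposal. */
enum bio_hint { HINT_ALLOCATE, HINT_RELEASE, HINT_DONTCOW };

/* Toy COW device tracking 64 fixed-size extents. */
struct cow_dev {
	unsigned char cow_needed[64]; /* 1 = a write here must COW first */
	unsigned char allocated[64];  /* 1 = backing store is allocated  */
};

static int send_hint(struct cow_dev *d, enum bio_hint h,
		     unsigned start, unsigned len)
{
	for (unsigned b = start; b < start + len; b++) {
		switch (h) {
		case HINT_ALLOCATE:
			/* begin the COW now, so a later write cannot
			 * fail for lack of space */
			d->allocated[b] = 1;
			d->cow_needed[b] = 0;
			break;
		case HINT_RELEASE:
			/* extent unused by the fs: drop backing store */
			d->allocated[b] = 0;
			break;
		case HINT_DONTCOW:
			/* never COW this extent (e.g. a journal) */
			d->cow_needed[b] = 0;
			break;
		}
	}
	return 0; /* a real device could fail HINT_ALLOCATE here */
}
```

The interesting design point Neil raises maps directly onto this model: a smarter device could also *choose* where `HINT_ALLOCATE` lands and report the new location back, rather than honouring the caller's extent exactly.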
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Thursday October 25, [EMAIL PROTECTED] wrote:
> On Mon, 22 Oct 2007, Pekka Enberg wrote:
> > On 10/22/07, Hugh Dickins [EMAIL PROTECTED] wrote:
> > > Only ramdisk and shmem have been returning
> > > AOP_WRITEPAGE_ACTIVATE. Both of those set BDI_CAP_NO_WRITEBACK.
> > > ramdisk never returned it if !wbc->for_reclaim. I contend that
> > > shmem shouldn't either: it's a special code to get the LRU
> > > rotation right, not useful elsewhere. Though
> > > Documentation/filesystems/vfs.txt does imply wider use.
> >
> > I think this is where people use the phrase "go figure" ;)
>
> Heh. As far as I can tell, the implication of wider use was added by
> Neil in commit 341546f5ad6fce584531f744853a5807a140f2a9 "Update some
> VFS documentation", so perhaps he might know? Neil?

I just read the code, tried to understand it, translated that
understanding into English, and put that in vfs.txt. It is very
possible that what I wrote didn't match the intention of the author,
but it seemed to match the behaviour of the code.

The patch looks like it makes perfect sense to me.

Before the change, ->writepage could return AOP_WRITEPAGE_ACTIVATE
without unlocking the page, and this has precisely the effect of:

	ClearPageReclaim();	(if the call path was through pageout)
	SetPageActive();	(if the call was through shrink_page_list)
	unlock_page();

With the patch, the ->writepage method does the SetPageActive and the
unlock_page, which on the whole seems cleaner. We seem to have lost a
call to ClearPageReclaim - I don't know if that is significant.

> Special, hidden, undocumented, secret hack! Then in 2.6.7 Andrew
> stole his own secret and used it when concocting ramdisk_writepage.
> Oh, and NFS made some kind of use of it in 2.6.6 only. Then Neil
> revealed the secret to the uninitiated in 2.6.17: now, what's the
> appropriate punishment for that?

Surely the punishment should be for writing hidden undocumented hacks
in the first place! I vote we just make him maintainer for the whole
kernel - that will keep him so busy that he will never have a chance
to do it again :-)

> --- 2.6.24-rc1/Documentation/filesystems/vfs.txt	2007-10-24 07:15:11.0 +0100
> +++ linux/Documentation/filesystems/vfs.txt	2007-10-24 08:42:07.0 +0100
> @@ -567,9 +567,7 @@ struct address_space_operations {
>        If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
>        try too hard if there are problems, and may choose to write out
>        other pages from the mapping if that is easier (e.g. due to
> -      internal dependencies). If it chooses not to start writeout, it
> -      should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
> -      calling ->writepage on that page.
> +      internal dependencies).

It seems that the new requirement is that if the address_space chooses
not to write out the page, it should now call SetPageActive(). If
that is the case, I think it should be explicit in the documentation -
please?

NeilBrown
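If the reading above is right, a ->writepage that declines to write under the new rules would look roughly like this kernel-style pseudocode (not compilable on its own: `too_busy()` and `do_real_write()` are invented placeholders; only SetPageActive()/unlock_page() are the real calls under discussion):

```c
/* Sketch only: a writepage that chooses not to start writeout under
 * the post-patch rules.  Instead of returning AOP_WRITEPAGE_ACTIVATE
 * with the page still locked, it activates and unlocks the page
 * itself, then returns 0. */
static int example_writepage(struct page *page,
			     struct writeback_control *wbc)
{
	if (wbc->sync_mode == WB_SYNC_NONE && too_busy()) {
		SetPageActive(page);	/* keep it off the inactive list */
		unlock_page(page);
		return 0;		/* no special return code needed */
	}
	return do_real_write(page, wbc);
}
```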
Re: [PATCH 00/19] export operations rewrite
On Friday September 14, [EMAIL PROTECTED] wrote:
> This patchset is a medium scale rewrite of the export operations
> interface. The goal is to make the interface less complex, and
> easier to understand from the filesystem side, as well as preparing
> generic support for exporting of 64bit inode numbers.

These 19 patches

	Acked-by: NeilBrown [EMAIL PROTECTED]

I didn't do a very thorough review this time, but I did the previous
time, and the one problem I found (module dependencies) has been
addressed - Thanks.

NeilBrown

> This touches all nfs exporting filesystems, and I've done testing on
> all of the filesystems I have here locally (xfs, ext2, ext3,
> reiserfs, jfs)
>
> Compared to the last version I've fixed the white space issues that
> checkpatch.pl complained about.
>
> Note that this patch series is against mainline. There will be some
> xfs changes landing in -mm soon that revamp lots of the code touched
> here. They should hopefully include the first patch in the series so
> it can be simply dropped, but the xfs conversion will need some
> smaller updates. I will send this update as soon as the xfs tree
> updates get pulled into -mm.
Re: [RFC] block_device_operations prototype changes
On Monday August 27, [EMAIL PROTECTED] wrote:
> * a bug (AFAICT) in md.c - we open raid components r/w and if it
>   fails, it fails. Good for our purposes, but... how about raid0 on
>   read-only devices? In any case, we have a ready place to store
>   mode_t, so it's not a problem for getting the right value to
>   blkdev_put(). FWIW, I think that allowing fallback to r/o (and
>   making the entire array r/o, of course) would be a good thing to
>   have. Any comments from drivers/md folks?

I've never heard any suggestion of anybody wanting to include a
readonly device in an md array - I guess one could imagine an array of
CDROMs, but I doubt anyone would actually create one.

However I agree that falling back to read-only (particularly if the
error value indicated that this was the only problem) would be a
sensible thing to do in some circumstances.

More interestingly: one could argue that when the md array is switched
to read-only, each component device should be reopened as read-only.
So I would really like the interface to have a way to switch a device
between read-only and read-write without closing and re-opening.
Though I guess opening and then closing as you suggest below would be
OK.

> * open_bdev_excl()/close_bdev_excl(). Needs an extra argument for
>   the latter. Two interesting callers:
>   * kill_block_super() - need to store relevant mode_t in
>     superblock, along with reference to bdev. Note that just looking
>     at sb->s_flags is *not* enough - some filesystems go read-only
>     on errors and that changes ->s_flags. Another side of that is
>     explicit remount from r/o to r/w and there we have a bug - we do
>     _not_ tell the driver that something had happened. Proper fix is
>     simple enough - bdget() + blkdev_get() for write + blkdev_put()
>     with old mode_t (i.e. FMODE_READ) once we are done remounting
>     (and similar for explicit remount to r/o).

I would *really* like to see this. When the root filesystem gets
mounted read-only, I want md to know so that it can mark the array as
'clean' straight away. I really don't like having a reboot-notifier
to mark all arrays as 'clean' on shutdown.

So each device would be responsible for keeping a count of 'readonly'
and 'read-write' accesses? Or would that be in common code?

I look forward to this all getting cleaned up!

NeilBrown
Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version
It just occurred to me:

If i_version is 64bit, then knfsd would need to be careful when
reading it on a 32bit host. What are the locking rules?

Presumably it is only updated under i_mutex protection, but having to
get i_mutex to read it would seem a little heavy handed.

Should it use a seqlock like i_size? Could we use the same seqlock
that i_size uses, or would we need a separate one?

NeilBrown
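The seqlock-style read being suggested can be modelled in userspace. In the kernel this would be read_seqbegin()/read_seqretry() on whatever seqlock ends up protecting i_version (as i_size_read() does for i_size); the struct and function below are purely illustrative, and a real multi-threaded version would also need memory barriers:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: a writer increments `seq` before and after updating
 * i_version, so `seq` is odd while a write is in progress. */
struct versioned {
	unsigned seq;       /* even = stable, odd = write in progress */
	uint64_t i_version;
};

/* A 32-bit host cannot read a 64-bit value atomically, so retry
 * until we observe the same even sequence number on both sides of
 * the read - the seqlock read pattern. */
static uint64_t read_version(const struct versioned *v)
{
	unsigned s;
	uint64_t val;

	do {
		s = v->seq;          /* would be read_seqbegin()  */
		val = v->i_version;  /* may tear without the loop */
	} while ((s & 1) || s != v->seq); /* would be read_seqretry() */
	return val;
}
```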
Re: vm/fs meetup details
On Friday July 6, [EMAIL PROTECTED] wrote:
> Hi,
>
> On Fri, Jul 06, 2007 at 05:57:49PM +0200, Jörn Engel wrote:
> > ... Interesting idea. Is it possible to attach several address
> > spaces to an inode? That would cure some headaches.
>
> GFS2 already uses something like this, in fact by having a second
> inode to contain the second address space. That's a bit of a hack
> but we can't ...
>
> So that would certainly be an issue that I'd like to discuss to see
> what can be worked out in that area,
>
> Steve.

Maybe the question here is: What support should common code provide
for caching indexing metadata?

Common code already provides the page cache, which is very nice for
caching file data. Some filesystems use the blockdev page cache to
cache index metadata by physical address. But I think that
increasingly filesystems want to cache index metadata by some sort of
virtual address.

A second page cache address space would be suitable if the addresses
were dense, and would be acceptable if the blocks were page-sized (or
larger). But for non-dense, non-page-sized blocks, a radix tree of
pages is less than ideal (I think).

My filesystem (LaFS, which is actually beginning to work thanks to
Novell's HackWeek) uses non-dense, non-page-sized blocks both for file
indexing and for directories, and while I have a working solution for
each case, there is room for improvements that might fit well with
other filesystems too.

NeilBrown
Re: how do versioning filesystems take snapshot of opened files?
On Tuesday July 3, [EMAIL PROTECTED] wrote:
> Getting a snapshot that is useful with respect to application data
> requires help from the application.

Certainly. The app needs to be shut down or paused prior to the
snapshot and then started up again after the snapshot is taken.

Alternately, the app needs to be able to cope with unexpected system
shutdown (aka crash), and the same ability will allow it to cope with
an atomic snapshot. It may be able to recover more efficiently from
an expected shutdown, so being able to tell the app about an impending
snapshot is probably a good idea, but it should be advisory only.

NeilBrown
Re: Patent or not patent a new idea
On Tuesday June 26, [EMAIL PROTECTED] wrote:
> Posting it here seems the best thing to do. To the inventor goes
> naming privilege and I'm calling this one "softer raid". It is a
> form of storage raid implemented in software, as contrasted to
> software and hardware raid which are dependent on using required
> hardware.
>
> To create a loop filesystem is straightforward. The commands are dd,
> mkfs.*, mount -o loop. Basically what I propose is that the image
> file is copied to another harddisk (in the case of ide, not on the
> same cable) and it too is mounted in conjunction with the original,
> with cooperation. When a read request for a block of say 100k is
> made, the kernel pulls 50k from each disk - maybe a simple doubling
> of throughput.

Sounds a lot like RAIF - ask google for details.

NeilBrown
Re: Patent or not patent a new idea
On Tuesday June 26, [EMAIL PROTECTED] wrote:
> Neil Brown wrote:
> > Sounds a lot like RAIF - ask google for details.
>
> I did not know about RAIF. RAIF merges separate filesystems? That is
> a good idea in itself. My idea is for a driver that provides a
> filesystem from image files it controls. Because it knows these
> resources it uses access in tandem to attain performance.

??? (reads original description in more detail). So... the filesystem
images are identical in both copies, and the interesting bit is that
the image is just a file on some filesystem.

So could I implement your idea by:

	dd if=/dev/zero of=/1/bigfile count=lotsandlots
	dd if=/dev/zero of=/2/bigfile count=lotsandlots
	losetup /dev/loop1 /1/bigfile
	losetup /dev/loop2 /2/bigfile
	mdadm -C /dev/md0 --level=raid1 --raid-disks=2 /dev/loop1 /dev/loop2
	mkfs /dev/md0
	mount /dev/md0 /space

??

Why would you bother? Well, I do it a lot for code testing, but I
would be hard pressed to make a case for that sort of config in
production.

You said:
  1) the disks need not be the same size or from the same manufacturer;
  2) the supporting code would be cross-platform.

md/raid already works happily with different sized drives from
different manufacturers (for raid1, it only uses as much space as the
smaller drive provides; for raid0 it uses it all). I don't know what
you mean by '2'.

So I still cannot see anything particularly new. What am I missing?

NeilBrown
Re: Patent or not patent a new idea
On Tuesday June 26, [EMAIL PROTECTED] wrote:
> Thanks for the brief howto there. I'll install the mdadm suite and
> experiment. It seems like a userspace driver?

mdadm is a userspace tool for managing the 'md' driver which is in
the linux kernel.

> > I don't know what you mean by '2'.
>
> 2 meant in the way that a driver for a PC raid card is not much use
> on a Sparc.

Ahh... That is one of the great things about Linux. You code in C,
and the driver works on every architecture (providing you are
reasonably careful, and the arch actually supports the hardware).

:-)

NeilBrown
Re: [patch 1/3] add the fsblock layer
On Sunday June 24, [EMAIL PROTECTED] wrote:
> +#define PG_blocks	20	/* Page has block mappings */
> +

I've only had a very quick look, but this line looks *very* wrong.
You should be using PG_private.

There should never be any confusion about whether ->private has
buffers or blocks attached, as the only routines that ever look in
->private are address_space operations (or should be. I think 'NULL'
is sometimes special cased, as in try_to_release_page. It would be
good to do some preliminary work and tidy all that up).

Why do you think you need PG_blocks?

NeilBrown
Re: [AppArmor 39/45] AppArmor: Profile loading and manipulation, pathname matching
On Friday June 22, [EMAIL PROTECTED] wrote:
> Yes. Your use case is different than mine. My use case is being able
> to protect data reliably. Yours?

Saying "protect data" is nearly meaningless without a threat model.
I bet you don't try to protect data from a direct nuclear hit, or a
court order.

AA has a fairly clear threat model. It involves a flaw in a program
being used by an external agent to cause it to use privileges it
would not normally exercise to subvert privacy or integrity. I think
this model matches a lot of real threats that real sysadmins have
real concerns about. I believe that the design of AA addresses this
model quite well.

What is your threat model? Maybe it is just different.

NeilBrown
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Friday June 1, [EMAIL PROTECTED] wrote:
> On Thu, May 31, 2007 at 02:31:21PM -0400, Phillip Susi wrote:
> > David Chinner wrote:
> > > That sounds like a good idea - we can leave the existing
> > > WRITE_BARRIER behaviour unchanged and introduce a new
> > > WRITE_ORDERED behaviour that only guarantees ordering. The
> > > filesystem can then choose which to use where appropriate
> >
> > So what if you want a synchronous write, but DON'T care about the
> > order?
>
> submit_bio(WRITE_SYNC, bio);
>
> Already there, already used by XFS, JFS and direct I/O.

Are you sure?

You seem to be saying that WRITE_SYNC causes the write to be safe on
media before the request returns. That isn't my understanding. I
think (from comments near the definition and a quick grep through the
code) that WRITE_SYNC expedites the delivery of the request through
the elevator, but doesn't do anything special about getting it onto
the media. It essentially says "Submit this request now, don't wait
for more requests to bundle with it for better bandwidth utilisation".

NeilBrown
Re: [patch 0/2] i_version update
On Wednesday May 30, [EMAIL PROTECTED] wrote:
> On Fri, May 25, 2007 at 06:25:31PM +0200, Jean noel Cordenner wrote:
> > The aim is to fulfill a NFSv4 requirement for rfc3530:
> >
> > 5.5. Mandatory Attributes - Definitions
> >
> >   Name    #  DataType  Access  Description
> >   ------- -- --------- ------- -----------
> >   change  3  uint64    READ    A value created by the server that
> >                                the client can use to determine if
> >                                file data, directory contents or
> >                                attributes of the object
>
> File data writes are included in this list of things that need to
> increment the version field. Hence to fulfill the crash requirement,
> that implies server data writes either need to be synchronous or
> journalled...

I think that would be excessive.

The important property of the 'change' number is:
  If the 'change' number is still the same, then the data and
  metadata etc is still the same.

The converse of this (if the data+metadata is still the same then the
'change' is still the same) is valuable but less critical. Having the
'change' change when it doesn't need to will defeat client-side
caching and so will reduce performance, but not correctness.

So after a crash, I think it is perfectly acceptable to assign a
change number that is known to be different to anything previously
supplied if there is any doubt about recent change history.

e.g. suppose we had a filesystem with 1-second resolution mtime, and
an in-memory 'change' counter that was incremented on every change.
When we load an inode from storage, we initialise the counter to:

  -1                  : if the mtime is earlier than current_seconds
  current_nanoseconds : if the mtime is equal to current_seconds

We arrange that when the ctime changes, the change number is reset to
0. Then when the 'change' number of an inode is required, we use the
bottom 32 bits of the 'change' counter and the 32 bits of the mtime.

This will provide a change number that normally changes only when the
file changes and doesn't require any extra storage on disk. The
change number will change inappropriately only when the inode has
fallen out of cache and is being reloaded, which is either after a
crash (hopefully rare) or when a file hasn't been used for a while,
implying that it is unlikely that any client has it in cache.

So in summary: I think it is impossible to have a change number that
changes *only* when content changes (as your 'crash' example
suggests), and it is quite achievable to have a change number that
changes rarely when the content doesn't change.

NeilBrown
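The core of the scheme sketched above - a 64-bit change attribute built from the mtime (seconds) in the high half and an in-memory per-change counter in the low half - can be modelled in a few lines of userspace C. The struct and function names are invented for illustration, and the crash/reload initialisation is omitted:

```c
#include <assert.h>
#include <stdint.h>

/* Toy inode: 1-second mtime plus an in-memory change counter. */
struct toy_inode {
	uint32_t mtime_sec;
	uint32_t change_ctr; /* reset to 0 when mtime ticks forward */
};

/* Record a modification at time `now_sec`. */
static void toy_modify(struct toy_inode *i, uint32_t now_sec)
{
	if (now_sec != i->mtime_sec) {
		i->mtime_sec = now_sec;
		i->change_ctr = 0;  /* new second: counter restarts */
	} else {
		i->change_ctr++;    /* same second: counter keeps the
				     * change attribute moving */
	}
}

/* The NFSv4 'change' value: mtime in the top 32 bits, counter in
 * the bottom 32 - no extra on-disk storage needed. */
static uint64_t toy_change_attr(const struct toy_inode *i)
{
	return ((uint64_t)i->mtime_sec << 32) | i->change_ctr;
}
```

Two writes within the same second yield distinct change values (via the counter), and any write in a later second yields a distinct value (via the mtime), which is exactly the "changes whenever content changes" half of the guarantee.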
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Tuesday May 29, [EMAIL PROTECTED] wrote:
> Neil Brown wrote:
> > md/dm modules could keep count of requests as has been suggested
> > (though that would be a fairly big change for raid0 as it
> > currently doesn't know when a request completes - bi_endio goes
> > directly to the filesystem).
>
> Are you sure? I believe that dm handles bi_endio because it waits
> for all in progress bio to complete before switching tables.

I was talking about md/raid0, not dm-stripe. md/raid0 (and md/linear)
currently never know that a request has completed.

NeilBrown
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Monday May 28, [EMAIL PROTECTED] wrote: There are two things I'm not sure you covered. First, disks which don't support flush but do have a cache dirty status bit you can poll at times like shutdown. If there are no drivers which support these, it can be ignored. There are really devices like that? So to implement a flush, you have to stop sending writes and wait and poll - maybe poll every millisecond? That wouldn't be very good for performance; maybe you just wouldn't bother with barriers on that sort of device? Which reminds me: What is the best way to turn off barriers? Several filesystems have -o nobarriers or -o barriers=0, or the inverse. md/raid currently uses barriers to write metadata, and there is no way to turn that off. I'm beginning to wonder if that is best. Maybe barrier support should be a function of the device. i.e. the filesystem or whatever always sends barrier requests where it thinks it is appropriate, and the block device tries to honour them to the best of its ability, but if you run blockdev --enforce-barriers=no /dev/sda then you lose some reliability guarantees, but gain some throughput (a bit like the 'async' export option for nfsd). Second, NAS (including nbd?). Is there enough information to handle this really right? NAS means lots of things, including NFS and CIFS where this doesn't apply. For 'nbd', it is entirely up to the protocol. If the protocol allows a barrier flag to be sent to the server, then barriers should just work. If it doesn't, then either the server disables write-back caching, or flushes every request, or you lose all barrier guarantees. For 'iscsi', I guess it works just the same as SCSI... NeilBrown
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Monday May 28, [EMAIL PROTECTED] wrote: On Mon, May 28, 2007 at 12:57:53PM +1000, Neil Brown wrote: What exactly do you want to know, and why do you care? If someone explicitly mounts -o barrier and the underlying device cannot do it, then we want to issue a warning or reject the mount. I guess that makes sense. But apparently you cannot tell what a device supports until you write to it. So maybe you need to write some metadata with a barrier, then ask the device what its barrier status is. The options might be: YES - barriers are fully handled NO - best effort, but due to missing device features, it might not work DISABLED - admin has requested that barriers be ignored. ?? The idea is that every struct block_device supports barriers. If the underlying hardware doesn't support them directly, then they get simulated by draining the queue and issuing a flush. Ok. But you also seem to be implying that there will be devices that cannot support barriers. It seems there will always be hardware that doesn't meet specs. If a device doesn't support SYNCHRONIZE_CACHE or FUA, then implementing barriers all the way to the media would be hard.. Even if all devices do eventually support barriers, it may take some time before we reach that goal. Why not start by making it easy to determine what the capabilities of each device are. This can then be removed once we reach the holy grail I'd rather not add something that we plan to remove. We currently have -EOPNOTSUP. I don't think there is much point having more than that. I would really like to get to the stage where -EOPNOTSUP is never returned. If a filesystem cares, it could 'ask' as suggested above. What would be a good interface for asking? What if the truth changes (as can happen with md or dm)? NeilBrown
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Monday May 28, [EMAIL PROTECTED] wrote: Neil Brown writes: [...] Thus the general sequence might be: a/ issue all preceding writes. b/ issue the commit write with BIO_RW_BARRIER c/ wait for the commit to complete. If it was successful - done. If it failed other than with EOPNOTSUPP, abort else continue d/ wait for all 'preceding writes' to complete e/ call blkdev_issue_flush f/ issue commit write without BIO_RW_BARRIER g/ wait for commit write to complete if it failed, abort h/ call blkdev_issue_flush DONE. Steps b and c can be left out if it is known that the device does not support barriers. The only way to discover this is to try and see if it fails. I don't think any filesystem follows all these steps. It seems that steps b/ -- h/ are quite generic, and can be implemented once in a generic code (with some synchronization mechanism like wait-queue at d/). Yes and no. It depends on what you mean by preceding write. If you implement this in the filesystem, the filesystem can wait only for those writes where it has an ordering dependency. If you implement it in common code, then you have to wait for all writes that were previously issued. e.g. If you have two different filesystems on two different partitions on the one device, why should writes in one filesystem wait for a barrier issued in the other filesystem. If you have a single filesystem with one thread doing lots of over-writes (no metadata changes) and another doing lots of metadata changes (requiring journalling and barriers) why should the data write be held up by the metadata updates? So I'm not actually convinced that doing this in common code is the best approach. But it is the easiest. The common code should provide the barrier and flushing primitives, and the filesystem gets to use them however it likes. NeilBrown
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Thanks everyone for your input. There were some very valuable observations in the various emails. I will try to pull most of it together and bring out what seem to be the important points. 1/ A BIO_RW_BARRIER request should never fail with -EOPNOTSUP. This is certainly a very attractive position - it makes the interface cleaner and makes life easier for filesystems and other clients of the block interface. Currently filesystems handle -EOPNOTSUP by a/ resubmitting the request without the BARRIER (after waiting for earlier requests to complete) and b/ possibly printing an error message to the kernel logs. The block layer can do both of these just as easily and it does make sense to do it there. md/dm modules could keep count of requests as has been suggested (though that would be a fairly big change for raid0 as it currently doesn't know when a request completes - bi_endio goes directly to the filesystem). However I think the idea of a zero-length BIO_RW_BARRIER would be a good option. raid0 could send one of these down each device, and when they all return, the barrier request can be sent to its target device(s). I think this is a worthy goal that we should work towards. 2/ Maybe barriers provide stronger semantics than are required. All write requests are synchronised around a barrier write. This is often more than is required and apparently can cause a measurable slowdown. Also the FUA for the actual commit write might not be needed. It is important for consistency that the preceding writes are in safe storage before the commit write, but it is not so important that the commit write is immediately safe on storage. That isn't needed until a 'sync' or 'fsync' or similar. One possible alternative is: - writes can overtake barriers, but barrier cannot overtake writes. - flush before the barrier, not after. This is considerably weaker, and hence cheaper.
But I think it is enough for all filesystems (providing it is still an option to call blkdev_issue_flush on 'fsync'). Another alternative would be to tag each bio as being in a particular barrier-group. Then bios in different groups could overtake each other in either direction, but a BARRIER request must be totally ordered w.r.t. other requests in the barrier group. This would require an extra bio field, and would give the filesystem more appearance of control. I'm not yet sure how much it would really help... It would allow us to set FUA on all bios with a non-zero barrier-group. That would mean we don't have to flush the entire cache, just those blocks that are critical but I'm still not sure it's a good idea. Of course, these weaker rules would only apply inside the elevator. Once the request goes to the device we need to work with what the device provides, which probably means total-ordering around the barrier. I think this requires more discussion before a way forward is clear. 3/ Do we need explicit control of the 'ordered' mode? Consider a SCSI device that has NV RAM cache. mode_sense reports that write-back is enabled, so _FUA or _FLUSH will be used. But as it is *NV* ram, QUEUE_ORDERED_DRAIN is really the best mode. But it seems there is no way to query this information. Using _FLUSH causes the NVRAM to be flushed to media which is a terrible performance problem. Setting SYNC_NV doesn't work on the particular device in question. We currently tell customers to mount with -o nobarriers, but that really feels like the wrong solution. We should be telling the scsi device "don't flush". An advantage of 'nobarriers' is it can go in /etc/fstab. Where would you record that a SCSI drive should be set to QUEUE_ORDERED_DRAIN ?? I think the implementation priorities here are: 1/ implement a zero-length BIO_RW_BARRIER option. 2/ Use it (or otherwise) to make all dm and md modules handle barriers (and loop?).
3/ Devise and implement appropriate fall-backs within the block layer so that -EOPNOTSUP is never returned. 4/ Remove unneeded cruft from filesystems (and elsewhere). Comments? Thanks, NeilBrown
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Friday May 25, [EMAIL PROTECTED] wrote: 2007/5/25, Neil Brown [EMAIL PROTECTED]: - Are there other bits that we could handle better? BIO_RW_FAILFAST? BIO_RW_SYNC? What exactly do they mean? BIO_RW_FAILFAST: means low-level driver shouldn't do much (or no) error recovery. Mainly used by multipath targets to avoid long SCSI recovery. This should just be propagated when passing requests on. Is it much or no? Would it be reasonable to use this for reads from a non-degraded raid1? What about writes? What I would really like is some clarification on what sort of errors get retried, how often, and how much timeout there is.. And does the 'error' code returned in ->bi_end_io allow us to differentiate media errors from other errors yet? BIO_RW_SYNC: means this is a bio of a synchronous request. I don't know whether there are more uses to it but this at least causes queues to be flushed immediately instead of waiting for more requests for a short time. Should also just be passed on. Otherwise performance gets poor since something above will rather wait for the current request/bio to complete instead of sending more. Yes, this one is pretty straightforward.. I mentioned it more as a reminder to myself that I really should support it in raid5 :-( NeilBrown
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Monday May 28, [EMAIL PROTECTED] wrote: On Mon, May 28, 2007 at 11:30:32AM +1000, Neil Brown wrote: Thanks everyone for your input. There were some very valuable observations in the various emails. I will try to pull most of it together and bring out what seem to be the important points. 1/ A BIO_RW_BARRIER request should never fail with -EOPNOTSUP. Sounds good to me, but how do we test to see if the underlying device supports barriers? Do we just assume that they do and only change behaviour if -o nobarrier is specified in the mount options? What exactly do you want to know, and why do you care? The idea is that every struct block_device supports barriers. If the underlying hardware doesn't support them directly, then they get simulated by draining the queue and issuing a flush. Theoretically there could be devices which have a write-back cache that cannot be flushed, and you couldn't implement barriers on such a device. So throw it out and buy another? As far as I can tell, the only thing XFS does differently with devices that don't support barriers is that it prints a warning message to the kernel logs. If the underlying device printed the message when it detected that barriers couldn't be supported, XFS wouldn't need to care at all. NeilBrown
[RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
This mail is about an issue that has been of concern to me for quite a while and I think it is (well past) time to air it more widely and try to come to a resolution. This issue is how write barriers (the block-device kind, not the memory-barrier kind) should be handled by the various layers. The following is my understanding, which could well be wrong in various specifics. Corrections and other comments are more than welcome. What are barriers? == Barriers (as generated by requests with BIO_RW_BARRIER) are intended to ensure that the data in the barrier request is not visible until all writes submitted earlier are safe on the media, and that the data is safe on the media before any subsequently submitted requests are visible on the device. This is achieved by tagging requests in the elevator (or any other request queue) so that no re-ordering is performed around a BIO_RW_BARRIER request, and by sending appropriate commands to the device so that any write-behind caching is defeated by the barrier request. Alongside BIO_RW_BARRIER is blkdev_issue_flush which calls q->issue_flush_fn. This can be used to achieve similar effects. There is no guarantee that a device can support BIO_RW_BARRIER - it is always possible that a request will fail with EOPNOTSUPP. Conversely, blkdev_issue_flush must be supported on any device that uses write-behind caching (if it cannot be supported, then write-behind caching should be turned off, at least by default). We can think of there being three types of devices: 1/ SAFE. With a SAFE device, there is no write-behind cache, or if there is it is non-volatile. Once a write completes it is completely safe. Such a device does not require barriers or ->issue_flush_fn, and can respond to them either by a no-op or with -EOPNOTSUPP (the former is preferred). 2/ FLUSHABLE. A FLUSHABLE device may have a volatile write-behind cache. This cache can be flushed with a call to blkdev_issue_flush. It may not support barrier requests. 3/ BARRIER.
A BARRIER device supports both blkdev_issue_flush and BIO_RW_BARRIER. Either may be used to synchronise any write-behind cache to non-volatile storage (media). Handling of SAFE and FLUSHABLE devices is essentially the same and can work on a BARRIER device. The BARRIER device has the option of more efficient handling. How does a filesystem use this? === A filesystem will often have a concept of a 'commit' block which makes an assertion about the correctness of other blocks in the filesystem. In the most gross sense, this could be the writing of the superblock of an ext2 filesystem, with the dirty bit clear. This write commits all other writes to the filesystem that precede it. More subtle/useful is the commit block in a journal as with ext3 and others. This write commits some number of preceding writes in the journal or elsewhere. The filesystem will want to ensure that all preceding writes are safe before writing the barrier block. There are two ways to achieve this. 1/ Issue all 'preceding writes', wait for them to complete (bi_endio called), call blkdev_issue_flush, issue the commit write, wait for it to complete, call blkdev_issue_flush a second time. (This is needed for FLUSHABLE) 2/ Set the BIO_RW_BARRIER bit in the write request for the commit block. (This is more efficient on BARRIER). The second, while much easier, can fail. So a filesystem should be prepared to deal with that failure by falling back to the first option. Thus the general sequence might be: a/ issue all preceding writes. b/ issue the commit write with BIO_RW_BARRIER c/ wait for the commit to complete. If it was successful - done. If it failed other than with EOPNOTSUPP, abort else continue d/ wait for all 'preceding writes' to complete e/ call blkdev_issue_flush f/ issue commit write without BIO_RW_BARRIER g/ wait for commit write to complete if it failed, abort h/ call blkdev_issue_flush DONE. Steps b and c can be left out if it is known that the device does not support barriers.
The only way to discover this is to try and see if it fails. I don't think any filesystem follows all these steps. ext3 has the right structure, but it doesn't include steps e and h. reiserfs is similar. It does have a call to blkdev_issue_flush, but that is only on the fsync path, so it isn't really protecting general journal commits. XFS - I'm less sure. I think it does 'a' then 'd', then 'b' or 'f' depending on whether it thinks the device handles barriers, and finally 'g'. I haven't looked at other filesystems. So for devices that support BIO_RW_BARRIER, and for devices that don't need any flush, they work OK, but for devices that need flushing, but don't support BIO_RW_BARRIER, none of them work. This should be easy to fix.
RE: Software raid0 will crash the file-system, when each disk is 5TB
On Thursday May 17, [EMAIL PROTECTED] wrote: XOR it (0^0=1), and hence fills up the host disk. Uhmm... you need to check your maths. $ perl -e 'printf "%d\n", 0^0;' 0 :-) NeilBrown
RE: Software raid0 will crash the file-system, when each disk is 5TB
On Friday May 18, [EMAIL PROTECTED] wrote: Fix confirmed, filled the whole 11T hard disk, without crashing. I presume this would go into 2.6.22 Yes, and probably 2.6.21.y, though the patch will be slightly different, see below. Thanks again. And thank-you for pursuing this with me. NeilBrown

---
Avoid overflow in raid0 calculation with large components.

If a raid0 has a component device larger than 4TB, and is accessed on a 32bit machine, then as 'chunk' is unsigned long, chunk << chunksize_bits can overflow (this can be as high as the size of the device in KB). chunk itself will not overflow (without triggering a BUG). So change 'chunk' to be sector_t, and get rid of the 'BUG' as it becomes impossible to hit.

Cc: Jeff Zheng [EMAIL PROTECTED]
Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/raid0.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff .prev/drivers/md/raid0.c ./drivers/md/raid0.c
--- .prev/drivers/md/raid0.c	2007-05-17 10:33:30.0 +1000
+++ ./drivers/md/raid0.c	2007-05-17 16:14:12.0 +1000
@@ -415,7 +415,7 @@ static int raid0_make_request (request_q
 	raid0_conf_t *conf = mddev_to_conf(mddev);
 	struct strip_zone *zone;
 	mdk_rdev_t *tmp_dev;
-	unsigned long chunk;
+	sector_t chunk;
 	sector_t block, rsect;
 	const int rw = bio_data_dir(bio);
@@ -470,7 +470,6 @@ static int raid0_make_request (request_q
 	sector_div(x, zone->nb_dev);
 	chunk = x;
-	BUG_ON(x != (sector_t)chunk);
 	x = block >> chunksize_bits;
 	tmp_dev = zone->dev[sector_div(x, zone->nb_dev)];
Re: + knfsd-exportfs-add-exportfsh-header-fix.patch added to -mm tree
On Wednesday May 16, [EMAIL PROTECTED] wrote: Any ideas what are the minimum export operation(s) that cifs would need to add to export under nfsd? It was not clear to me after reading the Exporting document in Documentation directory. You need to be able to map a dentry to a filehandle (you get about 20 bytes) and back again. If CIFS provides some fix-length identifier for files, then you might be able to do it. If not, you cannot really do it at all. And I suspect the latter. (There are other requirements, like get_parent, but we could probably work around those if we really needed to). Theoretically, you could make it work with NFSv4 and volatile file handles, but I doubt it would really work in practice. I don't think the volatile concept stretches as far as you would need. Probably the best way to nfs-export a CIFS filesystem is to use the user-space nfs server. It caches recently used filenames and uses a filehandle which is a hash of the name. It works on a best-effort basis, and if, for example, the server restarts, you will lose connections to open files. While it is not perfect, it can be very useful in some situations. NeilBrown
RE: Software raid0 will crash the file-system, when each disk is 5TB
On Wednesday May 16, [EMAIL PROTECTED] wrote: Here is the information of the created raid0. Hope it is enough. Thanks. Everything looks fine here. The only difference of any significance between the working and non-working configurations is that in the non-working, the component devices are larger than 2Gig, and hence have sector offsets greater than 32 bits. This does cause a slightly different code path in one place, but I cannot see it making a difference. But maybe it does. What architecture is this running on? What C compiler are you using? Can you try with this patch? It is the only thing that I can find that could conceivably go wrong. Thanks, NeilBrown

Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/raid0.c | 1 +
 1 file changed, 1 insertion(+)

diff .prev/drivers/md/raid0.c ./drivers/md/raid0.c
--- .prev/drivers/md/raid0.c	2007-05-17 10:33:30.0 +1000
+++ ./drivers/md/raid0.c	2007-05-17 10:34:02.0 +1000
@@ -461,6 +461,7 @@ static int raid0_make_request (request_q
 	while (block >= (zone->zone_offset + zone->size))
 		zone++;
+	BUG_ON(block < zone->zone_offset);
 	sect_in_chunk = bio->bi_sector & ((chunk_size<<1) -1);
RE: Software raid0 will crash the file-system, when each disk is 5TB
On Thursday May 17, [EMAIL PROTECTED] wrote: The only difference of any significance between the working and non-working configurations is that in the non-working, the component devices are larger than 2Gig, and hence have sector offsets greater than 32 bits. Do u mean 2T here?, but in both configurations, the component devices are larger than 2T (2.25T & 5.5T). Yes, I meant 2T, and yes, the components are always over 2T. So I'm at a complete loss. The raid0 code follows the same paths and does the same things and uses 64bit arithmetic where needed. So I have no idea how there could be a difference between these two cases. I'm at a loss... NeilBrown
Re: + knfsd-exportfs-add-exportfsh-header-fix.patch added to -mm tree
On Wednesday May 16, [EMAIL PROTECTED] wrote: If CIFS provides some fix-length identifier for files, then you might be able to do it Most CIFS servers (Windows on NTFS, Samba etc.) can return a unique identifier (a 64 bit inode number), in conjunction with the volume id, that is probably good enough ... right? This can be returned on various calls (level 0x03EE file_internal_info - returns only this number). If reverse lookup is required - ie given a unique identifier what is its path name - there are probably a few different ways to handle this but presumably local filesystems run into the same issue. Yes, that unique identifier sounds like it would be suitable to put in the filehandle. But reverse lookup is definitely required. Providing you can turn this into a 'struct inode *' in the filesystem, the code in exportfs/ can help turn that into a fully connected dentry. NeilBrown
RE: Software raid0 will crash the file-system, when each disk is 5TB
On Thursday May 17, [EMAIL PROTECTED] wrote: I tried the patch, same problem show up, but no bug_on report Is there any other things I can do? What is the nature of the corruption? Is it data in a file that is wrong when you read it back, or does the filesystem metadata get corrupted? Can you try the configuration that works, and sha1sum the files after you have written them to make sure that they really are correct? My thought here is maybe there is a bad block on one device, and the block is used for data in the 'working' config, and for metadata in the 'broken' config. Can you try a degraded raid10 configuration. e.g. mdadm -C /dev/md1 --level=10 --raid-disks=4 /dev/first missing \ /dev/second missing That will lay out the data in exactly the same place as with raid0, but will use totally different code paths to access it. If you still get a problem, then it isn't in the raid0 code. Maybe try version 1 metadata (mdadm --metadata=1). I doubt that would make a difference, but as I am grasping at straws already, it may be a straw worth trying. NeilBrown
RE: Software raid0 will crash the file-system, when each disk is 5TB
On Wednesday May 16, [EMAIL PROTECTED] wrote: On Thu, 17 May 2007, Neil Brown wrote: On Thursday May 17, [EMAIL PROTECTED] wrote: The only difference of any significance between the working and non-working configurations is that in the non-working, the component devices are larger than 2Gig, and hence have sector offsets greater than 32 bits. Do u mean 2T here?, but in both configurations, the component devices are larger than 2T (2.25T & 5.5T). Yes, I meant 2T, and yes, the components are always over 2T. 2T decimal or 2T binary? Either. The smallest was actually 2.75T (typo above). Precisely it was 2929641472 kilobytes or 5859282944 sectors or 0x15D3D9000 sectors. So it is over 32bits already... Uhm, I just noticed something. 'chunk' is unsigned long, and when it gets shifted up, we might lose bits. That could still happen with the 4*2.75T arrangement, but is much more likely in the 2*5.5T arrangement. Jeff, can you try this patch? Thanks. NeilBrown

Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/raid0.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff .prev/drivers/md/raid0.c ./drivers/md/raid0.c
--- .prev/drivers/md/raid0.c	2007-05-17 10:33:30.0 +1000
+++ ./drivers/md/raid0.c	2007-05-17 15:02:15.0 +1000
@@ -475,7 +475,7 @@ static int raid0_make_request (request_q
 		x = block >> chunksize_bits;
 		tmp_dev = zone->dev[sector_div(x, zone->nb_dev)];
 	}
-	rsect = (((chunk << chunksize_bits) + zone->dev_offset)<<1)
+	rsect = ((((sector_t)chunk << chunksize_bits) + zone->dev_offset)<<1)
 		+ sect_in_chunk;
 	bio->bi_bdev = tmp_dev->bdev;
RE: Software raid0 will crash the file-system, when each disk is 5TB
On Thursday May 17, [EMAIL PROTECTED] wrote: Uhm, I just noticed something. 'chunk' is unsigned long, and when it gets shifted up, we might lose bits. That could still happen with the 4*2.75T arrangement, but is much more likely in the 2*5.5T arrangement. Actually, it cannot be a problem with the 4*2.75T arrangement. chunk << chunksize_bits will not exceed the size of the underlying device *in kilobytes*. In that case that is 0xAE9EC800 which will fit in a 32bit long. We don't double it to make sectors until after we add zone->dev_offset, which is sector_t and so 64bit arithmetic is used. So I'm quite certain this bug will cause exactly the problems experienced!! Jeff, can you try this patch? Don't bother about the other tests I mentioned, just try this one. Thanks. NeilBrown

Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/raid0.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff .prev/drivers/md/raid0.c ./drivers/md/raid0.c
--- .prev/drivers/md/raid0.c	2007-05-17 10:33:30.0 +1000
+++ ./drivers/md/raid0.c	2007-05-17 15:02:15.0 +1000
@@ -475,7 +475,7 @@ static int raid0_make_request (request_q
 		x = block >> chunksize_bits;
 		tmp_dev = zone->dev[sector_div(x, zone->nb_dev)];
 	}
-	rsect = (((chunk << chunksize_bits) + zone->dev_offset)<<1)
+	rsect = ((((sector_t)chunk << chunksize_bits) + zone->dev_offset)<<1)
 		+ sect_in_chunk;
 	bio->bi_bdev = tmp_dev->bdev;
Re: Software raid0 will crash the file-system, when each disk is 5TB
On Wednesday May 16, [EMAIL PROTECTED] wrote: Anybody have a clue? No... When a raid0 array is assembled, quite a lot of messages get printed about number of zones and hash_spacing etc. Can you collect and post those. Both for the failing case (2*5.5T) and the working case (4*2.75T) if possible. NeilBrown
Re: [NFS] [PATCH 0/18] export operations rewrite
On Sunday May 13, [EMAIL PROTECTED] wrote: On Mon, May 07, 2007 at 08:51:00AM +0100, Christoph Hellwig wrote: On Mon, May 07, 2007 at 10:06:10AM +1000, Neil Brown wrote: On Monday May 7, [EMAIL PROTECTED] wrote: Would you be able to respin that second patch series with one of those changes? Of course it is actually the first series of patches that introduces this problem. So maybe just a full respin?? I'll take a look at how to solve this issues. I'll take some more time until I can look at this. Can you send of the first patch series to Linus in the meantime? Oh.. the problem is in the second series, isn't it. I had somehow convinced myself it was in the first. And the merge window has now closed, but I'll submit it to -mm shortly. Sorry 'bout that. NeilBrown
Re: [NFS] [PATCH 0/18] export operations rewrite
On Friday May 4, [EMAIL PROTECTED] wrote: On Tuesday May 1, [EMAIL PROTECTED] wrote: - the calling conversion on the decode side where we first call ->decode_fh to split the filehandle into two blobs only to recurse back into exportfs and then recurse back into the filesystem seems rather odd. By having two methods to get the dentry and parent directly from the on the wire file handle this big callstack collapses to a very simple one. This is the bit I was particularly missing. I see now how this aspect was awkward before, and how your changes make the flow clearer. Getting rid of s_export_op->find_exported_dentry is something I'm very happy about. There was actually a bug lurking there that I never got around to fixing. The code:
-	/* Ok, we can export it */;
-	if (!inode->i_sb->s_export_op->find_exported_dentry)
-		inode->i_sb->s_export_op->find_exported_dentry =
-			find_exported_dentry;
assumes that the address of find_exported_dentry will never change. But if you unload nfsd.ko, then re-load it, it could be different, yet a filesystem could still have the old value in its s_export_op. That would be bad. I'm glad it is gone. ...but now I remember why it was there... I didn't want filesystems to depend on exportfs.ko. nfsd depends on exportfs, but currently filesystems don't. They can be compiled into the kernel without including exportfs. With your patches in place, I got a compile error with a config in which nfsd is a module (and so exportfs is) but ext2 and ext3 are compiled in. The error was that generic_fh_to_dentry and generic_fh_to_parent were not defined. Now those two functions together with exportfs_d_alloc are fairly small and could go directly in fs/something.c. They could almost be inlined in linux/exportfs.h. Would you be able to respin that second patch series with one of those changes?
Thanks, NeilBrown
Re: [NFS] [PATCH 0/18] export operations rewrite
On Monday May 7, [EMAIL PROTECTED] wrote: Would you be able to respin that second patch series with one of those changes? Of course it is actually the first series of patches that introduces this problem. So maybe just a full respin?? Thanks, NeilBrown
Re: [PATCH 1/1] nfsctl: Use vfs_path_lookup
On Sunday May 6, [EMAIL PROTECTED] wrote: use vfs_path_lookup instead of open-coding the necessary functionality. Signed-off-by: Josef 'Jeff' Sipek [EMAIL PROTECTED] Acked-by: NeilBrown [EMAIL PROTECTED] Thanks, NeilBrown

---
 fs/nfsctl.c | 16 ++--
 1 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/fs/nfsctl.c b/fs/nfsctl.c
index c043136..51f1b31 100644
--- a/fs/nfsctl.c
+++ b/fs/nfsctl.c
@@ -23,19 +23,15 @@
 static struct file *do_open(char *name, int flags)
 {
 	struct nameidata nd;
+	struct vfsmount *mnt;
 	int error;

-	nd.mnt = do_kern_mount("nfsd", 0, "nfsd", NULL);
+	mnt = do_kern_mount("nfsd", 0, "nfsd", NULL);
+	if (IS_ERR(mnt))
+		return (struct file *)mnt;

-	if (IS_ERR(nd.mnt))
-		return (struct file *)nd.mnt;
-
-	nd.dentry = dget(nd.mnt->mnt_root);
-	nd.last_type = LAST_ROOT;
-	nd.flags = 0;
-	nd.depth = 0;
-
-	error = path_walk(name, &nd);
+	error = vfs_path_lookup(mnt->mnt_root, mnt, name, 0, &nd);
+	mntput(mnt);	/* drop do_kern_mount reference */
 	if (error)
 		return ERR_PTR(error);
--
1.5.2.rc1.20.g86b9
Re: [NFS] [PATCH 0/18] export operations rewrite
On Tuesday May 1, [EMAIL PROTECTED] wrote: On Fri, Mar 30, 2007 at 01:34:53PM +1000, Neil Brown wrote: On Saturday March 17, [EMAIL PROTECTED] wrote: less than 2 weeks later more than one month later :) Thanks for your explanations. - the calling convention on the decode side where we first call ->decode_fh to split the filehandle into two blobs only to recurse back into exportfs and then recurse back into the filesystem seems rather odd. By having two methods to get the dentry and parent directly from the on-the-wire file handle this big callstack collapses to a very simple one. This is the bit I was particularly missing. I see now how this aspect was awkward before, and how your changes make the flow clearer. Getting rid of s_export_op->find_exported_dentry is something I'm very happy about. There was actually a bug lurking there that I never got around to fixing. The code:

-	/* Ok, we can export it */;
-	if (!inode->i_sb->s_export_op->find_exported_dentry)
-		inode->i_sb->s_export_op->find_exported_dentry =
-			find_exported_dentry;

assumes that the address of find_exported_dentry will never change. But if you unload nfsd.ko, then re-load it, it could be different, yet a filesystem could still have the old value in its s_export_op. That would be bad. I'm glad it is gone. Thanks, NeilBrown
Re: [patch 06/44] mm: trim more holes
On Tuesday April 24, [EMAIL PROTECTED] wrote: If prepare_write fails with AOP_TRUNCATED_PAGE, or if commit_write fails, then we may have failed the write operation despite prepare_write having instantiated blocks past i_size. Fix this, and consolidate the trimming into one place. ..

@@ -2025,40 +2012,53 @@ generic_file_buffered_write(struct kiocb
 					cur_iov, iov_offset, bytes);
 		flush_dcache_page(page);
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
-		if (status == AOP_TRUNCATED_PAGE) {
-			page_cache_release(page);
-			continue;
+		if (unlikely(status < 0))
+			goto fs_write_aop_error;
+		if (unlikely(copied != bytes)) {
+			status = -EFAULT;
+			goto fs_write_aop_error;
 		}

It isn't clear to me that you are handling the case status == AOP_TRUNCATED_PAGE here. AOP_TRUNCATED_PAGE is > 0 (0x80001 to be precise). Maybe ->commit_write cannot return AOP_TRUNCATED_PAGE. If that is true, then a comment to that effect (i.e. that the old code was wrong) in the change log might ease review. Or did I miss something? Thanks, NeilBrown
Re: [patch 12/44] fs: introduce write_begin, write_end, and perform_write aops
On Tuesday April 24, [EMAIL PROTECTED] wrote:

+  write_begin: This is intended as a replacement for prepare_write. Called
+by the generic buffered write code to ask the filesystem to prepare
+to write len bytes at the given offset in the file. flags is a field
+for AOP_FLAG_xxx flags, described in include/linux/mm.h.

Putting "This is intended as a replacement..." there seems a bit dangerous. It could well accidentally remain when the documentation for prepare_write gets removed. I would make it a separate paragraph and flesh it out. And include text from prepare_write before that gets removed.

  write_begin: This is intended as a replacement for prepare_write. The key differences being that:
    - it returns a locked page (in *pagep) rather than being given a pre-locked page;
    - it can pass arbitrary state to write_end rather than having to hide stuff in some filesystem-internal data structure;
    - the (largely undocumented) flags option.

  Called by the generic buffered write code to ask an address_space to prepare to write len bytes at the given offset in the file. The address_space should check that the write will be able to complete, by allocating space if necessary and doing any other internal housekeeping. If the write will update parts of any basic-blocks on storage, then those blocks should be pre-read (if they haven't been read already) so that the updated blocks can be written out properly.

  The possible flags are listed in include/linux/fs.h (not mm.h) and include AOP_FLAG_UNINTERRUPTIBLE: it is unclear how this should be used. No current code handles it.

(together with the rest...)

+
+The filesystem must return the locked pagecache page for the caller
+to write into.
+
+A void * may be returned in fsdata, which then gets passed into
+write_end.
+
+Returns < 0 on failure, in which case all cleanup must be done and
+write_end not called. 0 on success, in which case write_end must
+be called.
As you are not including perform_write in the current patchset, maybe it is best not to include the documentation yet either?

+  perform_write: This is a single-call, bulk version of write_begin/write_end
+operations. It is only used in the buffered write path (write_begin
+must still be implemented), and not for in-kernel writes to pagecache.
+It takes an iov_iter structure, which provides a descriptor for the
+source data (and has associated iov_iter_xxx helpers to operate on
+that data). There are also file, mapping, and pos arguments, which
+specify the destination of the data.
+
+Returns 0 on failure if nothing was written out, otherwise returns
+the number of bytes copied into pagecache.
+
+fs/libfs.c provides a reasonable template to start with, demonstrating
+iov_iter routines, and iteration over the destination pagecache.
+

NeilBrown
Re: [patch 12/44] fs: introduce write_begin, write_end, and perform_write aops
On Tuesday April 24, [EMAIL PROTECTED] wrote: BTW. AOP_FLAG_UNINTERRUPTIBLE can be used by filesystems to avoid an initial read or other sequence they might be using to handle the case of a short write. ecryptfs uses it, others can too. For buffered writes, this doesn't get passed in (unless they are coming from kernel space), so I was debating whether to have it at all. However, in the previous API, _nobody_ had to worry about short writes, so this flag means I avoid making an API that is fundamentally less performant in some situations. Ahhh I think I get it now. In general, the address_space must cope with the possibility that fewer than the expected number of bytes is copied. This may leave parts of the page with invalid data. This can be handled by pre-loading the page with valid data, however this may cause a significant performance cost. The write_begin/write_end interface provides two mechanisms by which this case can be handled more efficiently.

1/ The AOP_FLAG_UNINTERRUPTIBLE flag declares that the write will not be partial (maybe a different name? AOP_FLAG_NO_PARTIAL). If that is set, inefficient preparation can be avoided. However the most common write paths will never set this flag.

2/ The return from write_end can declare that fewer bytes have been accepted. e.g. part of the page may have been loaded from backing store, overwriting some of the newly written bytes. If this return value is reduced, a new write_begin/write_end cycle may be called to attempt to write the bytes again.

Also

+  write_end: After a successful write_begin, and data copy, write_end must
+be called. len is the original len passed to write_begin, and copied
+is the amount that was able to be copied (they must be equal if
+write_begin was called with intr == 0).
+

That should be "... called without AOP_FLAG_UNINTERRUPTIBLE being set". And "that was able to be copied" is misleading, as the copy is not done in write_end. Maybe "that was accepted". It seems to make sense now.
I might try re-reviewing the patches based on this improved understanding... only a public holiday looms :-) NeilBrown
Re: [patch 16/44] rd convert to new aops
On Tuesday April 24, [EMAIL PROTECTED] wrote: On Tue, Apr 24, 2007 at 12:11:19PM +0100, Christoph Hellwig wrote: On Tue, Apr 24, 2007 at 01:05:49PM +0200, Nick Piggin wrote: On Tue, Apr 24, 2007 at 11:46:47AM +0100, Christoph Hellwig wrote: + page = __grab_cache_page(mapping, index); btw, __grab_cache_page should probably get a more descriptive and non-__-prefixed name now that it's used all over the place. Agreed. Suggestions? ;) find_or_create_cache_page, given that it's like find_or_create_page + add_to_page_cache? find_or_create_page adds to page cache as well, though :P I would really like to see the word 'lock' in there, as it does return a locked page, and when I first saw __grab_cache_page recently there was an unlock_page afterwards and I couldn't see where the lock happened, and I was confused for a little while. NeilBrown
Re: end to end error recovery musings
On Friday February 23, [EMAIL PROTECTED] wrote: On Fri, Feb 23, 2007 at 05:37:23PM -0700, Andreas Dilger wrote: Probably the only sane thing to do is to remember the bad sectors and avoid attempting reading them; that would mean marking automatic versus explicitly requested requests to determine whether or not to filter them against a list of discovered bad blocks. And clearing this list when the sector is overwritten, as it will almost certainly be relocated at the disk level. For that matter, a huge win would be to have the MD RAID layer rewrite only the bad sector (in hopes of the disk relocating it) instead of failing the whole disk. Otherwise, a few read errors on different disks in a RAID set can take the whole system offline. Apologies if this is already done in recent kernels... Yes, current md does this. And having a way of making this list available to both the filesystem and to a userspace utility, so they can more easily deal with doing a forced rewrite of the bad sector, after determining which file is involved and perhaps doing something intelligent (up to and including automatically requesting a backup system to fetch a backup version of the file, and if it can be determined that the file shouldn't have been changed since the last backup, automatically fixing up the corrupted data block :-). - Ted So we want a clear path for media read errors from the device up to user-space. Stacked devices (like md) would do appropriate mappings maybe (for raid0/linear at least. Other levels wouldn't tolerate errors). There would need to be a limit on the number of 'bad blocks' that is recorded. Maybe a mechanism to clear old bad blocks from the list is needed. Maybe if generic_make_request gets a request for a block which overlaps a 'bad block' it returns an error immediately. Do we want a path in the other direction to handle write errors?
The file system could say "Don't worry too much if this block cannot be written, just return an error and I will write it somewhere else"? This might allow md not to fail a whole drive if there is a single write error. Or is that completely unnecessary, as all modern devices do bad-block relocation for us? Is there any need for a bad-block-relocating layer in md or dm? What about corrected-error counts? Drives provide them with SMART. The SCSI layer could provide some as well. md can do a similar thing to some extent. Whether these are actually useful predictors of pending failure is unclear, but there could be some value. e.g. after a certain number of recovered errors raid5 could trigger a background consistency check, or a filesystem could trigger a background fsck, should it support that. Lots of interesting questions... not so many answers. NeilBrown
Re: mismatch between 2.6.19 and nfs-utils-1.0.10 nfsctl_arg structure???
On Wednesday February 21, [EMAIL PROTECTED] wrote: [EMAIL PROTECTED] wrote on 2007-02-21 00:04:40: You will need patch f988443a84528bd30c2f474efa5e2c511959f19b [1] or run "mount -t nfsd nfsd /proc/fs/nfs/nfsd" before starting mountd. I applied the patch, and attempted the mount cmd above. I assume you mistyped the directory path, because /proc/fs/nfs/nfsd does not exist, so I used /proc/fs/nfsd Sorry, yes. You chose the right path. Unfortunately it has not made any difference. The differences are not significant. Really? Surely if userspace uses this order

	struct nfsctl_uidmap	u_umap;
	struct nfsctl_fhparm	u_getfh;
	struct nfsctl_fdparm	u_getfd;
	struct nfsctl_fsparm	u_getfs;

but kernelspace expects this

	struct nfsctl_fdparm	u_getfd;
	struct nfsctl_fsparm	u_getfs;

They are inside a 'union'. Order is unimportant. then we have significant differences? But if you're sure, then what else can be wrong? My /etc/exports file contains (adding all 'unsafe' options I can find):

/p (rw,sync,no_root_squash,no_subtree_check,crossmnt,insecure,nohide,insecure_locks,no_acl)

I tried with and without (empty) hosts.allow and hosts.deny files. I'm running out of ideas :-( Run "strace -o /tmp/strace -s 1000 -p `pidof mountd`" on the server, then attempt the mount on the client. Then interrupt the 'strace' and send me /tmp/strace. NeilBrown
Re: mismatch between 2.6.19 and nfs-utils-1.0.10 nfsctl_arg structure???
On Tuesday February 20, [EMAIL PROTECTED] wrote: Hi, I'm trying to get an NFS server set up on our embedded 2.6.19 kernel system. I've chosen to use the kernel nfsd module, and am using the latest nfs-utils. It all starts running, but I get this:

mount -t nfs 127.0.0.1:/etc /tmp/try
mount: 127.0.0.1:/etc failed, reason given by server: Permission denied
mount: nfsmount failed: Bad file descriptor
mount: Mounting 127.0.0.1:/etc on /tmp/try failed: Bad file descriptor

Having added a bit of debugging, /var/log/messages shows this:

daemon.warn mountd[545]: getfh.c:73: getfh_size (/etc)
daemon.warn mountd[545]: nfsctl.c:26: syscall nfsctl
daemon.warn mountd[545]: nfsctl.c:28: syscall nfsctl = -1
daemon.warn mountd[545]: getfh failed: Operation not permitted

You will need patch f988443a84528bd30c2f474efa5e2c511959f19b [1] or run "mount -t nfsd nfsd /proc/fs/nfs/nfsd" before starting mountd. Looking in detail at the syscall made, I see that the structure definition for nfsctl_arg is different between kernel and nfs-utils. Huh? What am I missing here? Any pointers appreciated. Wouter. The differences are not significant.
NeilBrown

[1] http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=f988443a84528bd30c2f474efa5e2c511959f19b

Kernel 2.6.19.1: include/linux/nfsd/syscall.h:

struct nfsctl_arg {
	int			ca_version;	/* safeguard */
	union {
		struct nfsctl_svc	u_svc;
		struct nfsctl_client	u_client;
		struct nfsctl_export	u_export;
		struct nfsctl_fdparm	u_getfd;
		struct nfsctl_fsparm	u_getfs;
		<snip>
		void			*u_ptr;
	} u;

nfs-utils 1.0.10: support/include/nfs/nfs.h:

struct nfsctl_arg {
	int			ca_version;	/* safeguard */
	union {
		struct nfsctl_svc	u_svc;
		struct nfsctl_client	u_client;
		struct nfsctl_export	u_export;
		struct nfsctl_uidmap	u_umap;
		struct nfsctl_fhparm	u_getfh;
		struct nfsctl_fdparm	u_getfd;
		struct nfsctl_fsparm	u_getfs;
	} u;
Re: [RFC 0/28] Patches to pass vfsmount to LSM inode security hooks
On Tuesday February 6, [EMAIL PROTECTED] wrote: On Mon, Feb 05, 2007 at 07:20:35PM -0800, Andreas Gruenbacher wrote: It's actually not hard to fix, and nfsd would look a little less weird. But what would this add, what do pathnames mean in the context of nfsd, and would nfsd actually become less weird? It's not actually a pathname we care about, but a vfsmount + dentry combo. That one means as much in nfsd as elsewhere. We want nfsd to obey r/o or noatime mount flags if /export/foo is exported with them but /foo not. Even better would be to change nfsd so it creates its own non-visible vfsmount for the filesystems it exports. What would be the benefit of having private non-visible vfsmounts? Sounds like a recipe for confusion? It is possible that mountd might start doing bind-mounts to create the 'pseudo filesystem' thing for NFSv4, but they would be very visible (under /var/lib/nfs/v4root or something). So having its own vfsmount might make sense, but I don't get 'non-visible'. NeilBrown
Re: [RFC 0/28] Patches to pass vfsmount to LSM inode security hooks
On Tuesday February 6, [EMAIL PROTECTED] wrote: On Tue, Feb 06, 2007 at 12:51:52AM -0800, Trond Myklebust wrote: Who cares? There is no way to export a partial directory, and in any case the subtree_check crap is borken beyond repair (see cross-directory renames which lead to actual changes to the filehandle - broken, broken, broken). It's getting a little off-topic for this thread, but is there a chance we could just kill that crap after a reasonable deprecation period? Probably. A couple of years? nfs-utils 1.1.0 is due to stop subtree_check from being the default. After that is released, the next kernel can start printing warnings that it might stop working in a couple of years. NeilBrown
Re: [PATCH] Fix d_path for lazy unmounts
On Friday February 2, [EMAIL PROTECTED] wrote: Hello, here is a bugfix to d_path. Please apply (after 2.6.20). Looks good! Just a couple of little comments (to prove that I have really read it and thought about it :-) Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED] Reviewed-by: NeilBrown [EMAIL PROTECTED]

Index: linux-2.6/fs/dcache.c
===
--- linux-2.6.orig/fs/dcache.c
+++ linux-2.6/fs/dcache.c
@@ -1739,45 +1739,43 @@ shouldnt_be_hashed:
  * @rootmnt: vfsmnt to which the root dentry belongs
  * @buffer: buffer to return value in
  * @buflen: buffer length
+ * @fail_deleted: what to return for deleted files
  *
- * Convert a dentry into an ASCII path name. If the entry has been deleted
- * the string " (deleted)" is appended. Note that this is ambiguous.
+ * Convert a dentry into an ASCII path name. If the entry has been deleted,
+ * then if @fail_deleted is true, ERR_PTR(-ENOENT) is returned. Otherwise,
+ * the string " (deleted)" is appended. Note that this is ambiguous.

The behaviour in the face of a lazy unmount should be clarified in this comment. And I cannot help wondering if that behaviour should - in some cases - be 'fail'. i.e. if sys_getcwd is called on a directory that is no longer connected to the root, it isn't clear to me that it should return without an error. Without your patch it can return garbage, which is clearly wrong. With your patch it will return a relative path name, which is also wrong (it isn't a valid path that leads to the current working directory). I would suggest that 'fail_deleted' be (e.g.) changed to 'fail_condition' where two conditions are defined

#define DPATH_FAIL_DELETED	1
#define DPATH_FAIL_DISCONNECTED	2

and both these are passed in by sys_getcwd.

@@ -1791,33 +1789,49 @@ static char * __d_path(struct dentry *d
 		parent = dentry->d_parent;
 		prefetch(parent);
 		namelen = dentry->d_name.len;
-		buflen -= namelen + 1;
-		if (buflen < 0)
+		if (buflen <= namelen)
 			goto Elong;

This bothers me.
You appear to be comparing buflen with namelen, but are about to change buflen by 'namelen + 1'. It looks like a bug. In reality, you are comparing buflen < namelen+1 but spelling it as buflen <= namelen. I would prefer the full spelling with least room for confusion.

-		end -= namelen;
-		memcpy(end, dentry->d_name.name, namelen);
-		*--end = '/';
-		retval = end;
+		buflen -= namelen + 1;
+		buffer -= namelen;
+		memcpy(buffer, dentry->d_name.name, namelen);
+		*--buffer = '/';

This wouldn't be my preference either. It is important that 'buflen' and 'buffer' move in step, but when I see

	buflen -= namelen + 1;
	buffer -= namelen;

it looks like they aren't. Maybe:

+		buflen -= namelen + 1;
+		buffer -= namelen + 1;
+		memcpy(buffer+1, dentry->d_name.name, namelen);
+		*buffer = '/';

or am I being too picky? NeilBrown
Re: File system enhancement handled above the file system level.
On Wednesday November 29, [EMAIL PROTECTED] wrote: Neil Brown wrote: "hlfsd" stands for Home Link File System Daemon, and a google search on hlfsd finds it quite nicely. --- So we would configure /tmp to be a type of NFS server served by hlfsd which would point to some real tmp space. In our case we want to segregate by 'group' or a process's 'label'. So people in the same group/label would share a common 'tmp'. For an NFS server, the only information that is available is the uid and a list of groupids. If you wanted it to be group based, you would need to create some rule for choosing which group. Maybe groups with gid between 1 and 11000 could be "name-space-groups" and you could administratively assure that each user is only ever in one of these. This might require changing the newgrp command if people use that. Given that, it should be trivial to modify hlfsd to pick the gid in the right range and stick it in the returned symlink. If you want to base the selection on something else, you would have to write your own filesystem, but that shouldn't be too hard (about an hour's work I suspect). You would want to be able to determine the criteria directly from the task structure though. I don't think that digging around in the environment, which is somewhere in userspace, is a good idea. But you will presumably be using group ids to control access to these tmp directories, so using groups to control their location is probably acceptable too?? NeilBrown