Re: [PATCH 1/2] Make cramfs little endian only
On Thu, 6 Dec 2007, Jörn Engel wrote: > > How about shifting and masking _before_ converting to host endianness? > > static inline u32 cramfs_offset(struct cramfs_inode *inode) > { > return le32_to_cpu(node->namelen_offset >> > CRAMFS_NAMELEN_WIDTH); > } Stop this idiocy. The code I sent out was correct. You're just making it worse. "le32_to_cpu()" is a no-op on little-endian machines. So your change won't make any difference there. But the point is, that if we want the disk layout to be the *same* for both big-endian and little-endian, we need to switch the word as it is loaded from memory, and not do *any* operations on it before we've done that equalization. This is not something unusual. It's bog-standard procedure for a lot of filesystems. And sparse will help you (and would have complained about your code that tries to shift a little-endian entity before changing it into a CPU-endian one). Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] Make cramfs little endian only
On Thu, 6 Dec 2007, Andi Drebes wrote: > > This requires changing the on-disk-structure (even the current "little endian > only" one). That makes no sense. Your whole patch is about making it little-endian only. If you do that, then big-endian is screwed. I'm ok with that, and support it. But then you just make sure that the little-endian bits are the same, and now you're *done*. > For little endian machines, the data arranged in the following way: > > |o02.o01.n06.n05.n04.n03.n02.n01|o10.o09.o08.o07.o06.o05.o04.o03| > |o18.o17.o16.o15.o14.o13.o12.o11|o26.o25.o24.o23.o22.o21.o20.o19| That's a singularly confused way or looking at it. The point is, the first byte in a little-endian machine is the lowest bits, so the *correct* way of looking at it is to see the above as one 32-bit word, and then it looks like this: bit-in-word: 31 .. 6 5 .. 0 bit-in-bitfield o25 .. o0 n5 .. n0 and my code works *perfectly*. When you do static inline u32 cramfs_offset(struct cramfs_inode *inode) { return le32_to_cpu(node->namelen_offset) >> CRAMFS_NAMELEN_WIDTH; } you are getting *exactly* the bits "o25..o0" that you want (ie the offset). > So masking and then shifting *the whole* masked data doesn't solve the > problem. Yes it does. Try it. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] Make cramfs little endian only
On Wed, 5 Dec 2007, Andi Drebes wrote: > > +#ifdef __BIG_ENDIAN > + > +/* Converts a cramfs_inode's offset field > + from little endianto cpu endian */ > +static u32 cramfs_offset_to_cpu(struct cramfs_inode *inode) > +{ > + u8 *inode_bytes = (u8 *)inode; > + return ((inode_bytes[8] & 0xc0) >> 6) | (inode_bytes[9] << 2) | > + (inode_bytes[10] << 10) | (inode_bytes[11] << 18); > +} ... > +#elif defined(__LITTLE_ENDIAN) > + > +/* Converts a cramfs_inode's offset field > + from little endian to cpu endian */ > +static u32 cramfs_offset_to_cpu(struct cramfs_inode *inode) > +{ > + return inode->offset; > +} No, no, what I meant about not having any #ifdef __LITTLE_ENDIAN was to do the code do the same thing *regardless* of endianness. In other words, a simple: struct cramfs_inode { __le32 mode_uid;/* CRAMFS_MODE_WIDTH:CRAMFS_UID_WIDTH */ __le32 size_gid;/* CRAMFS_SIZE_WIDTH:CRAMFS_GID_WIDTH */ __le32 namelen_offset; /* CRAMFS_NAMELEN_WIDTH:CRAMFS_OFFSET_WIDTH */ }; #define CRAMFS_UID_MASK ((1ul << CRAMFS_UID_WIDTH)-1) #define CRAMFS_GID_MASK ((1ul << CRAMFS_GID_WIDTH)-1) #define CRAMFS_NAMELEN_MASK ((1ul << CRAMFS_NAMELEN_WIDTH)-1) static inline u32 cramfs_mode(struct cramfs_inode *inode) { return le32_to_cpu(node->mode_uid) >> CRAMFS_UID_WIDTH; } static inline u32 cramfs_uid(struct cramfs_inode *inode) { return le32_to_cpu(node->mode_uid) & CRAMFS_UID_MASK; } static inline u32 cramfs_size(struct cramfs_inode *inode) { return le32_to_cpu(node->size_gid) >> CRAMFS_GID_WIDTH; } static inline u32 cramfs_gid(struct cramfs_inode *inode) { return le32_to_cpu(node->size_gid) & CRAMFS_GID_MASK; } static inline u32 cramfs_offset(struct cramfs_inode *inode) { return le32_to_cpu(node->namelen_offset) >> CRAMFS_NAMELEN_WIDTH; } static inline u32 cramfs_namelen(struct cramfs_inode *inode) { return le32_to_cpu(node->namelen_offset) & CRAMFS_NAMELEN_MASK; } See? No #ifdef's required, no different code-paths, and the code generated will be pretty much optimal too. (No, the above is not tested in any way, shape, or form, and no, I didn't double-check that I actually extracted the right bits, but you get the idea). Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] Make cramfs little endian only
On Tue, 4 Dec 2007, Andi Drebes wrote: > > Perhaps I'm missing somehting, but I think for cramfs, unfortunately, > there has to be this statement. The bitfields in the cramfs_inode structure > cause some problems. I agree that bitfields can be painful, but they should likely be just rewritten to be accesses using actual masks and shifts. The thing is, bitfields aren't actually endianness safe *anyway*, in that a compiler may end up using a *different* bit order than the byte order. So you cannot really use bitfields reliably on things like that (although Linux has a notion of a "__[BIG|LITTLE]_ENDIAN_BITFIELD", if you really want to). Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] Make cramfs little endian only
On Tue, 4 Dec 2007, Andi Drebes wrote: > > Sure. This saves some definitions (and lines of code)... > Here's the new patch (tested on the same machines mentioned in the first > message). > I tried to move as many lines as possible out of the endian dependent section. This really is the totally wrong way to do this. You should *not* convert inodes to CPU-endian mode at all. You should *keep* them in the native mode, and then just use "le32_to_cpu()" when actually using them. Basically, if you ever have a #ifdef __BIG_ENDIAN or similar in the source code, you're simply doing something wrong. Btw, sparse can be a big help for things like this, by just marking the actual disk data structures as being the right type (ie "__le32" and friends), and then you can verify that any users will use "le32_to_cpu()" as required, because sparse will warn about bad endianness. So don't copy things around and change them from one mode to the other. Keep all the data structures in native LE ordering, and only when you really need to use them for some real data do you need to do the conversion. Now, for example, your code has two different "struct cramfs_inode" that you use: the unconverted and the converted one. That's confusing, but it's also error-prone. If you just have one type - the unconverted little-endian one - you have none of those issues. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] cramfs: Add mount option "swapendian"
On Fri, 16 Nov 2007, Andi Drebes wrote: > > I'll write a new patch for inclusion in the mainline kernel that makes > cramfs "little endian only". For people who really want to be able to > mount filesystems with both kinds of endianness there will be a seperate > patch (not intended to be merged into mainline) available somewhere on > the net (my website or so). It will definitely be marked as non-official > in order to prevent people from creating images in big endian. > Officially, cramfs should only support little endian images. Good. We do actually have precedence for exactly this kind of situation before: it's what happened to ext2 as well (people were trying to push a switch-endian version, and I said no) and the m68k people had their own patches for a while but we're *so* much better off from being fixed-endian that it's not even funny. > If squashfs will be merged, cramfs should be marked as obsolete. Sounds like that, yes. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] cramfs: Add mount option "swapendian"
On Thu, 15 Nov 2007, Andi Drebes wrote: > > Actually, in my eyes, it would be better to create a new version of cramfs > that standardizes the endianness and the block size. Perhaps even more importantly, you can do a much better job. The thing is, when I did cramfs originally, I had a total "senior moment", and the reason I didn't compress the metadata was that it appeared hard to do and fill it in later (ie compressing the metadata would mean that the location of the data changes - and since the metadata obviously contains pointers to the data, there's a chicken-and-egg problem there). But it should be *trivial* to compress the metadata too if the code just were to put the metadata at the beginning of the image, and the real data at the end, and then you can build up the image from both ends and you *can* have a fixed starting point for the data (up at the top of the image) even though you are changing the size of the metadata by compression. But I literally designed and wrote the thing in a couple of days, and I really didn't think it through right. As a result, the metadata may be dense, but it's totally uncompressed. It would have been better to allow a less dense basic format (allowing bigger uid/gid values, and offsets and file sizes), but compress it. So a "v2" cramfs would be a great idea. Not just for fixing endianness etc. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] cramfs: Add mount option "swapendian"
On Thu, 15 Nov 2007, Christoph Hellwig wrote: > > Actually there are as lots of initrd are cramfs. This means you'd need > to update mkcramfs all big endian machines out there. So? Normally the initrd goes with the kernel anyway. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] cramfs: Add mount option "swapendian"
On Thu, 15 Nov 2007, Linus Torvalds wrote: > > It would be *much* better to just standardize on one endianness, and be > done with it. That way there are no config options, no confusion, and the > code is smaller, simpler, and faster. .. it's also statically checkable with tools like "sparse", so it avoids bugs not only by being simpler, but by simply being fundamentally more robust to start with. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] cramfs: Add mount option "swapendian"
On Thu, 15 Nov 2007, Andi Drebes wrote: > > This patch introduces the mount option "swapendian" for cramfs. > When this option is set, cramfs is able to mount an image that > was created on a machine whose endianness differs from the mounting > machine's one. Please don't do it this way. It would be *much* better to just standardize on one endianness, and be done with it. That way there are no config options, no confusion, and the code is smaller, simpler, and faster. Because nn unconditional byte swap is generally faster than a conditional non-byte-swap! So can you please just make it little-endian? There can't be that many big-endian machines that really care about old cramfs images.. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 07/30] IGET: Stop BEFS from using iget() and read_inode()
On Mon, 1 Oct 2007, Christoph Hellwig wrote: > > befs_lookup, which the above gem is from, returns a dentry *. Ahh, ok. Then it actually makes sense. Although I'd prefer it if people planned on writing code like that more along the lines of error = PTR_ERR(inode); if (IS_ERR(inode) return ERR_PTR(error); because let's face it, the compiler can turn this into nice code, and humans can read it much better even if it's got an "unnecessary" extra variable etc.. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 07/30] IGET: Stop BEFS from using iget() and read_inode()
On Mon, 1 Oct 2007, Zach Brown wrote: > > If you're soliciting opinions, I think I tend to prefer the feel of the > code paths after the changes. I don't know the benefits of the change > are worth the risk in unmaintained file systems, though. > > > + return ERR_PTR(PTR_ERR(inode)); > > This caught my eye. Surely we can do better :). It seems to happen a > few times in the patches, the instance in this patch was the first that > I noticed. Yeah. The above obviously *should* be the same as return inode; apart from a few casts. But if the casts matter, there's something else wrong. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wed, 19 Sep 2007, Rene Herman wrote: > > I do feel larger blocksizes continue to make sense in general though. Packet > writing on CD/DVD is a problem already today since the hardware needs 32K or > 64K blocks and I'd expect to see more of these and similiar situations when > flash gets (even) more popular which it sort of inevitably is going to be. .. that's what scatter-gather exists for. What's so hard with just realizing that physical memory isn't contiguous? It's why we have MMU's. It's why we have scatter-gather. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wed, 19 Sep 2007, Rene Herman wrote: > > Well, not so sure about that. What if one of your expected uses for example is > video data storage -- lots of data, especially for multiple streams, and needs > still relatively fast machinery. Why would you care for the overhead af > _small_ blocks? .. so work with an extent-based filesystem instead. 16k blocks are total idiocy. If this wasn't about a "support legacy customers", I think the whole patch-series has been a total waste of time. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wed, 19 Sep 2007, Nathan Scott wrote: > > FWIW (and I hate to let reality get in the way of a good conspiracy) - > all SGI systems have always defaulted to using 4K blocksize filesystems; Yes. And I've been told that: > there's very few customers who would use larger .. who apparently would like to move to x86-64. That was what people implied at the kernel summit. >especially as the Linux > kernel limitations in this area are well known. There's no "16K mess" > that SGI is trying to clean up here (and SGI have offered both IA64 and > x86_64 systems for some time now, so not sure how you came up with that > whacko theory). Well, if that is the case, then I vote that we drop the whole patch-series entirely. It clearly has no reason for existing at all. There is *no* valid reason for 16kB blocksizes unless you have legacy issues. The performance issues have nothing to do with the block-size, and should be solvable by just making sure that your stupid "state of the art" crap SCSI controller gets contiguous physical memory, which is best done in the read-ahead code. So get your stories straight, people. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Tue, 18 Sep 2007, Andrea Arcangeli wrote: > > Many? I can't recall anything besides PF_MEMALLOC and the decision > that the VM is oom. *All* of the buddy bitmaps, *all* of the GPF_ATOMIC, *all* of the zone watermarks, everything that we depend on every single day, is in the end just about statistically workable. We do 1- and 2-order allocations all the time, and we "know" they work. Yet Nick (and this whole *idiotic* thread) has all been about how they cannot work. > In general every time reliability has a low priority than performance > I've an hard time to enjoy it. This is not about performance. Never has been. It's about SGI wanting a way out of their current 16kB mess. The way to fix performance is to move to x86-64, and use 4kB pages and be happy. However, the SGI people want a 16kB (and possibly bigger) crap-option for their people who are (often _already_) running some special case situation that nobody else cares about. It's not about "performance". If it was, they would never have used ia64 in the first place. It's about special-case users that do odd things. Nobody sane would *ever* argue for 16kB+ blocksizes in general. Linus PS. Yes, I realize that there's a lot of insane people out there. However, we generally don't do kernel design decisions based on them. But we can pat the insane users on the head and say "we won't guarantee it works, but if you eat your prozac, and don't bother us, go do your stupid things". - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Tue, 18 Sep 2007, Nick Piggin wrote: > > ROFL! Yeah of course, how could I have forgotten about our trusty OOM killer > as the solution to the fragmentation problem? It would only have been funnier > if you had said to reboot every so often when memory gets fragmented :) Can we please stop this *idiotic* thread. Nick, you and some others seem to be arguing based on a totally flawed base, namely: - we can guarantee anything at all in the VM - we even care about the 16kB blocksize - second-class citizenry is "bad" The fact is, *none* of those things are true. The VM doesn't guarantee anything, and is already very much about statistics in many places. You seem to be arguing as if Christoph was introducing something new and unacceptable, when it's largely just more of the same. And the fact is, nobody but SGI customers would ever want the 16kB blocksize. IOW - NONE OF THIS MATTERS! Can you guys stop this inane thread already, or at least take it private between you guys, instead of forcing everybody else to listen in on your flamefest. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Sun, 16 Sep 2007, Jörn Engel wrote: > > My approach is to have one for mount points and ramfs/tmpfs/sysfs/etc. > which are pinned for their entire lifetime and another for regular > files/inodes. One could take a three-way approach and have > always-pinned, often-pinned and rarely-pinned. > > We won't get never-pinned that way. That sounds pretty good. The problem, of course, is that most of the time, the actual dentry allocation itself is done before you really know which case the dentry will be in, and the natural place for actually giving the dentry lifetime hint is *not* at "d_alloc()", but when we "instantiate" it with d_add() or d_instantiate(). But it turns out that most of the filesystems we care about already use a special case of "d_add()" that *already* replaces the dentry with another one in some cases: "d_splice_alias()". So I bet that if we just taught "d_splice_alias()" to look at the inode, and based on the inode just re-allocate the dentry to some other slab cache, we'd already handle a lot of the cases! And yes, you'd end up with the reallocation overhead quite often, but at least it would now happen only when filling in a dentry, not in the (*much* more critical) cached lookup path. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Sun, 16 Sep 2007, Jörn Engel wrote: > > I have been toying with the idea of having seperate caches for pinned > and movable dentries. Downside of such a patch would be the number of > memcpy() operations when moving dentries from one cache to the other. Totally inappropriate. I bet 99% of all "dentry_lookup()" calls involve turning the last dentry from having a count of zero ("movable") to having a count of 1 ("pinned"). So such an approach would fundamentally be broken. It would slow down all normal dentry lookups, since the *common* case for leaf dentries is that they have a zero count. So it's much better to do it on a "directory/file" basis, on the assumption that files are *mostly* movable (or just freeable). The fact that they aren't always (ie while kept open etc), is likely statistically not all that important. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] block_device_operations prototype changes
On Mon, 27 Aug 2007, Al Viro wrote: > > I have the beginning of that series done and the rest mapped out in enough > details to implement it over this week. If somebody has objections, > questions or comments - yell. >From your description, I have no objections - everything sounds good. My only concern is how painful the patch ends up being (and a worry about whether this will affect a metric truck-load of external modules? That said, I can't really see us worrying about those) Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Adding a security parameter to VFS functions
On Wed, 15 Aug 2007, David Howells wrote: > > Would you object greatly to functions like vfs_mkdir() gaining a security > parameter? What I'm thinking of is this: > > int vfs_mkdir(struct inode *dir, struct dentry *dentry, int mode, > struct security *security) I personally consider this an affront to everythign that is decent. Why the *hell* would mkdir() be so magical as to need something like that? Make it something sane, like a "struct nameidata" instead, and make it at least try to look like the path creation that is done by "open()". Or create a "struct file *" or something. I can imagine having "mkdir()" being passed similar data as "open()" (ie "lookup()"), but I cannot _possibly_ imagine it ever being valid to pass in something totally made-up to just mkdir(), and nothing else. There's something fundamentally wrong there. What makes mkdir() so magical? Also, what about all the other ops? Why is mkdir() special, but not "mknod()"? Why is "mkdir()" special, but not "rmdir()"? Really, none of this seems to make any sense unless you describe what is so magical about mkdir(). Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Interface for the new fallocate() system call
On Thu, 29 Mar 2007, Jan Engelhardt wrote: > > I have to disagree, since wrapping it into a struct and copying the struct > in kernelspace from userspace requires more code. Not just more code, but more security issues too. Passing system call arguments by value means that there are no subtle security issues - the value you use is the value you got. But once you pass-by-reference, you have to make damn sure that you do the proper user space accesses and verify the pointer correctly. User-space (aka "user-supplied") pointers are just more dangerous. We obviously can't avoid them, but they need much more care than just a random value directly passed in a register. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Fix d_path for lazy unmounts
On Wed, 14 Feb 2007, Andreas Gruenbacher wrote: > > Mountpoints are reported relative to the chroot if they are reachable from > the > chroot, and relative to the namespace they are defined in otherwise. This is > big nonsense, but it's unclear to me how to best fix it: Well, it's also what a traditional "pwd" implementation would do, so it's not "nonsense" in that sense. > - don't report unreachable mount points, > - somehow indicate which mountpoints are reachable and which are not, > like by prepending a question flag? We could prepend another '/' (so that you'd have a path that starts with "//"). That's still a legal path, but it's also somethign that even POSIX says is valid to mean something else (eg "//ftp/.." or "//socket/.." to escape into another namespace). But the fact is, some things just want a path. And it's generally *better* to get them a - path that looks ok and starts from '/' than it is to give them something that looks strange and doesn't start from root (because the latter gives many many more possible attack vectors: if somebody actually looks up the path, a bad user can much more easily fake a relative path than fake an absolute one). - the path we've historically always given them. > What's the point in reporting the rootfs at all -- it's never reachable to an > ordinary process? All the paths are generally useful for USER INFORMATION. That's the primary use of paths for anything but "getcwd()". And the primary use for "getcwd()" is to do the same thing that any traditional cwd implementation has done, except faster (and _possibly_ better, but compatibility is more important than extensions - so the "better" is mainly an issue about non-readable or non-executable path component that we can show, and about being able to tell _how_ you got to a point that has multiple ways of getting there). Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] pipefs unique inode numbers
On Tue, 30 Jan 2007, Jeff Layton wrote: > > Also, that patch would break many 32-bit programs not compiled with large > offsets when run in compatibility mode on a 64-bit kernel. If they were to > do a stat on this inode, it would likely generate an EOVERFLOW error since > the pointer address would probably not fit in a 32 bit field. > > That problem was the whole impetus for this set of patches. Well, we have that problem with the slowly incrementing "last_ino" too. Should we make "last_ino" be "static unsigned int" instead of "long"? Does anybody actually even use the old stat() with 32-bit interfaces? We warn for it, and we've done so for a long time.. I dont' remember people even complaining about the warning, so.. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] pipefs unique inode numbers
On Tue, 30 Jan 2007, Bodo Eggert wrote: > > change pipefs to use a unique inode number equal to the memory > address unless it would be truncated. I *really* wouldn't want to expose kernel addresses to user space, it just ends up being a piece of data that they shouldn't have. If we have some security issue, this is just too much kernel information that a bad user could get at. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel bug: Bad page state: related to generic symlink code and mmap
On Sat, 20 Aug 2005, Al Viro wrote: > > That looks OK except for > * ncpfs fix is actually missing here Well, the thing is, with the change to page_follow_link() and page_put_link(), ncpfs should now work fine - it doesn't need any fixing any more. It was makign an assumption that used to be incorrect, but that assumption became ok with the calling convention change - now the page_follow_link() and page_put_link() functions work fine even if the page cache might get invalidated in between the calls. So with your patch, things should be fine (and external filesystems will generate warnings on compiles, but should work fine, since the change maintains the ABI even if the API changed - which is pretty unusual). Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel bug: Bad page state: related to generic symlink code and mmap
On Fri, 19 Aug 2005, Anton Altaparmakov wrote: > > Yes, sure. I have applied your patch to our 2.6.11.4 tree (with the one > liner change I emailed you just now) and have kicked off a compile. Actually, hold on. The original patch had another problem: it returned an uninitialized "page" pointer when page_getlink() failed. This one should have that fixed, and has converted a few other filesystems. Most of them trivially, but I took the opportunity to just simplify NFS while I was at it, since it now has no reason to need to save off the "struct page *" any more. It's still not tested, but at least I've looked at it a bit more ;) Linus --- diff --git a/fs/autofs/symlink.c b/fs/autofs/symlink.c --- a/fs/autofs/symlink.c +++ b/fs/autofs/symlink.c @@ -12,11 +12,12 @@ #include "autofs_i.h" -static int autofs_follow_link(struct dentry *dentry, struct nameidata *nd) +/* Nothing to release.. */ +static void *autofs_follow_link(struct dentry *dentry, struct nameidata *nd) { char *s=((struct autofs_symlink *)dentry->d_inode->u.generic_ip)->data; nd_set_link(nd, s); - return 0; + return NULL; } struct inode_operations autofs_symlink_inode_operations = { diff --git a/fs/cifs/cifsfs.h b/fs/cifs/cifsfs.h --- a/fs/cifs/cifsfs.h +++ b/fs/cifs/cifsfs.h @@ -83,8 +83,8 @@ extern int cifs_dir_notify(struct file * extern struct dentry_operations cifs_dentry_ops; /* Functions related to symlinks */ -extern int cifs_follow_link(struct dentry *direntry, struct nameidata *nd); -extern void cifs_put_link(struct dentry *direntry, struct nameidata *nd); +extern void *cifs_follow_link(struct dentry *direntry, struct nameidata *nd); +extern void cifs_put_link(struct dentry *direntry, struct nameidata *nd, void *); extern int cifs_readlink(struct dentry *direntry, char __user *buffer, int buflen); extern int cifs_symlink(struct inode *inode, struct dentry *direntry, diff --git a/fs/cifs/link.c b/fs/cifs/link.c --- a/fs/cifs/link.c +++ b/fs/cifs/link.c @@ -92,7 +92,7 @@ cifs_hl_exit: return rc; } -int +void * cifs_follow_link(struct dentry *direntry, struct nameidata *nd) { struct inode *inode = direntry->d_inode; @@ -148,7 +148,7 @@ out: out_no_free: FreeXid(xid); nd_set_link(nd, target_path); - return 0; + return NULL;/* No cookie */ } int @@ -330,7 +330,7 @@ cifs_readlink(struct dentry *direntry, c return rc; } -void cifs_put_link(struct dentry *direntry, struct nameidata *nd) +void cifs_put_link(struct dentry *direntry, struct nameidata *nd, void *cookie) { char *p = nd_get_link(nd); if (!IS_ERR(p)) diff --git a/fs/ext2/symlink.c b/fs/ext2/symlink.c --- a/fs/ext2/symlink.c +++ b/fs/ext2/symlink.c @@ -21,11 +21,11 @@ #include "xattr.h" #include -static int ext2_follow_link(struct dentry *dentry, struct nameidata *nd) +static void *ext2_follow_link(struct dentry *dentry, struct nameidata *nd) { struct ext2_inode_info *ei = EXT2_I(dentry->d_inode); nd_set_link(nd, (char *)ei->i_data); - return 0; + return NULL; } struct inode_operations ext2_symlink_inode_operations = { diff --git a/fs/ext3/symlink.c b/fs/ext3/symlink.c --- a/fs/ext3/symlink.c +++ b/fs/ext3/symlink.c @@ -23,11 +23,11 @@ #include #include "xattr.h" -static int ext3_follow_link(struct dentry *dentry, struct nameidata *nd) +static void * ext3_follow_link(struct dentry *dentry, struct nameidata *nd) { struct ext3_inode_info *ei = EXT3_I(dentry->d_inode); nd_set_link(nd, (char*)ei->i_data); - return 0; + return NULL; } struct inode_operations ext3_symlink_inode_operations = { diff --git a/fs/namei.c b/fs/namei.c --- a/fs/namei.c +++ b/fs/namei.c @@ -501,6 +501,7 @@ struct path { static inline int __do_follow_link(struct path *path, struct nameidata *nd) { int error; + void *cookie; struct dentry *dentry = path->dentry; touch_atime(path->mnt, dentry); @@ -508,13 +509,15 @@ static inline int __do_follow_link(struc if (path->mnt == nd->mnt) mntget(path->mnt); - error = dentry->d_inode->i_op->follow_link(dentry, nd); - if (!error) { + cookie = dentry->d_inode->i_op->follow_link(dentry, nd); + error = PTR_ERR(cookie); + if (!IS_ERR(cookie)) { char *s = nd_get_link(nd); + error = 0; if (s) error = __vfs_follow_link(nd, s); if (dentry->d_inode->i_op->put_link) - dentry->d_inode->i_op->put_link(dentry, nd); + dentry->d_inode->i_op->put_link(dentry, nd, cookie); } dput(dentry); mntput(path->mnt); @@ -2344,15 +2347,17 @@ out: int generic_readlink(struct dentry *dentry, char __user *buffer, int buflen) { struct nameidata nd; - int res; + void *cookie; + nd.dept
Re: Kernel bug: Bad page state: related to generic symlink code and mmap
On Fri, 19 Aug 2005, Anton Altaparmakov wrote: > > You are throwing away the error return from vfs_readlink(). I suspect you > wanted: > > + cookie = ERR_PTR(res); Yes, thanks. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel bug: Bad page state: related to generic symlink code and mmap
On Fri, 19 Aug 2005, Anton Altaparmakov wrote: > > It does disable link caching. But I didn't make this up. This is exactly > what smbfs uses. I just copied smbfs given ncpfs copies almost everything > smbfs does anyway... Can you test whether the untested test-patch I sent out seems to work for your case that bugs out? You'll probably get a few new "initialization from incompatible pointer type" warnings, but I do believe that they should be harmless at least on normal architectures. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel bug: Bad page state: related to generic symlink code and mmap
On Fri, 19 Aug 2005, Al Viro wrote: > > FWIW, I'd rather take page_symlink(), page_symlink_inode_operations, > page_put_link(), page_follow_link_light(), page_readlink(), page_getlink(), > generic_readlink() and vfs_readlink() to the same place where these guys > would live. They all belong together and none of them has any business > in fs/namei.c. Options: fs/libfs.c or separate library since fs/libfs.c > is getting crowded. Linus, do you have any objections to that or suggestions > on filename here? I'm not sure if this merits a new file or new organization (hey, fs/lib/xxx might be good in theory). In particular, I had actually been hoping to release 2.6.13 today, but this seems like a valid thing to hold things up for - but not if we're going to re-organize things. The one thing that strikes me is that we might actually have less pain if we just changed the calling convention for follow_link/put_link slightly instead of creating a new library function. The existing "page cache" thing really _does_ work very well, and would work fine for NFS and ncpfs too, if we just allowed an extra cookie to be passed along from "follow_link()" to "put_link()". A patch like this (totally untested, and you'd need to update any filesystems that don't use the regular page_follow_link interface) would seem to clean up the mess nicely.. The basic change is that follow_link() returns a error-pointer cookie instead of just zero or error, and that is passed into put_link(). That simple calling convention change solves all problems. It so _happens_ that any old binary code also continues to work (the cookie will be zero, and put_link will ignore it), so it shouldn't even break any unconverted stuff (unless they mix using their own functions _and_ using the helpher functions, which is of course possible). The "shouldn't break nonconverted filesystems" makes me think this is a safe change. Comments? NOTE NOTE NOTE! Let me say again that it's untested. It might not break nonconverted filesystems, but it equally well migth break even the converted ones ;) Linus diff --git a/fs/namei.c b/fs/namei.c --- a/fs/namei.c +++ b/fs/namei.c @@ -501,6 +501,7 @@ struct path { static inline int __do_follow_link(struct path *path, struct nameidata *nd) { int error; + void *cookie; struct dentry *dentry = path->dentry; touch_atime(path->mnt, dentry); @@ -508,13 +509,15 @@ static inline int __do_follow_link(struc if (path->mnt == nd->mnt) mntget(path->mnt); - error = dentry->d_inode->i_op->follow_link(dentry, nd); - if (!error) { + cookie = dentry->d_inode->i_op->follow_link(dentry, nd); + error = PTR_ERR(cookie); + if (!IS_ERR(cookie)) { char *s = nd_get_link(nd); + error = 0; if (s) error = __vfs_follow_link(nd, s); if (dentry->d_inode->i_op->put_link) - dentry->d_inode->i_op->put_link(dentry, nd); + dentry->d_inode->i_op->put_link(dentry, nd, cookie); } dput(dentry); mntput(path->mnt); @@ -2345,14 +2348,17 @@ int generic_readlink(struct dentry *dent { struct nameidata nd; int res; + void *cookie; + nd.depth = 0; - res = dentry->d_inode->i_op->follow_link(dentry, &nd); - if (!res) { + cookie = dentry->d_inode->i_op->follow_link(dentry, &nd); + if (!IS_ERR(cookie)) { res = vfs_readlink(dentry, buffer, buflen, nd_get_link(&nd)); if (dentry->d_inode->i_op->put_link) - dentry->d_inode->i_op->put_link(dentry, &nd); + dentry->d_inode->i_op->put_link(dentry, &nd, cookie); + cookie = ERR_PTR(0); } - return res; + return PTR_ERR(cookie); } int vfs_follow_link(struct nameidata *nd, const char *link) @@ -2395,23 +2401,19 @@ int page_readlink(struct dentry *dentry, return res; } -int page_follow_link_light(struct dentry *dentry, struct nameidata *nd) +void *page_follow_link_light(struct dentry *dentry, struct nameidata *nd) { struct page *page; nd_set_link(nd, page_getlink(dentry, &page)); - return 0; + return page; } -void page_put_link(struct dentry *dentry, struct nameidata *nd) +void page_put_link(struct dentry *dentry, struct nameidata *nd, void *cookie) { if (!IS_ERR(nd_get_link(nd))) { - struct page *page; - page = find_get_page(dentry->d_inode->i_mapping, 0); - if (!page) - BUG(); + struct page *page = cookie; kunmap(page); page_cache_release(page); - page_cache_release(page); } } diff --git a/include/linux/fs.h b/include/linux/fs.h --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -993,8 +993,8 @@ struct inod
Re: Kernel bug: Bad page state: related to generic symlink code and mmap
On Fri, 19 Aug 2005, Linus Torvalds wrote: > > - document this as a fundamental fact, and apply the ncpfs patch. Local >filesystems can still continue to use the generic helper functions >(all other users _are_ local filesystems). Actually, looking at the ncpfs patch, I'd rather not apply that patch as-is. It looks like it will totally disable symlink caching, which would be kind of sad. Somebody willing to do the same thing NFS does? NFS hides away the "struct page *" pointer at the end of the page data, so that when we pass the pointer to the virtual address around, we can trivially look up the "struct page". An alternative is to make the symlink address-space use a gfp_mask of GFP_KERNEL (no highmem), at which point ncpfs_put_link() just becomes something like void ncpfs_put_link(struct dentry *dentry, struct nameidata *nd) { char *addr = nd_get_link(nd); if (!IS_ERR(addr)) page_cache_release(virt_to_page(addr)); } which is pretty ugly, but even simpler than the NFS trick. Anybody? Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel bug: Bad page state: related to generic symlink code and mmap
On Fri, 19 Aug 2005, Linus Torvalds wrote: > > The generic "page cache for symlinks" code does _not_ support invalidating > the cache while it's being used. A local filesystem will obviously never > invalidate the cache at all. > > Hmm.. NFS _does_ use the page cache for symlinks [..] Looking more and more at this, I'm convinced this is it. Basically, page_follow_link_light() and page_put_link() depend on the fact that the page in the page cache is the same one the whole time: page_follow_link_light() will increment the page count of the page it finds at offset 0, and page_put_link() will decrement it. If the page has changed, they increment/decrement different pages. There's two ways to fix this: - document this as a fundamental fact, and apply the ncpfs patch. Local filesystems can still continue to use the generic helper functions (all other users _are_ local filesystems). - make "nameidata" contain not just the virtual addresses of the names, but also have a "struct page *pages[MAX_NESTED_LINKS + 1]", and save away the page there. That will fix ncpfs, and we could then make NFS also use the generic routines. I suspect that #1 is the prudent one. We have a patch already, and we don't want to grow nameidata. I'll commit a comment at the head of page_follow_link_light() too. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel bug: Bad page state: related to generic symlink code and mmap
On Fri, 19 Aug 2005, Anton Altaparmakov wrote: > > struct page *page; > page = find_get_page(dentry->d_inode->i_mapping, 0); > if (!page) > > BUG(); Something has truncated the mapping. My guess is that you had a cache invalidate event that removed the page from the mapping while it was being used. That might also explain why this only happens for ncpfs. I bet that in the other cases, the mapping was also invalidated, but re-filled immediately, and your strace slowed the other process down enough that it didn't get to re-fill the cache it invalidated or something like that. The fact that it happens only for cross-volume things might also be explained that way - is there something ncpfs does when switching volumes that might trigger a cache invalidate for another volume (either on the client side or the server side - maybe the server tends to try to break the connection for the old volume when you start traversing a new one?) The generic "page cache for symlinks" code does _not_ support invalidating the cache while it's being used. A local filesystem will obviously never invalidate the cache at all. Hmm.. NFS _does_ use the page cache for symlinks, but uses it slightly differently: instead of relying on the page cache entry being the same when freeing the page, it just caches the page it looked up in the page cache (ie "nfs_follow_link()" does look up the page cache, but then hides the page pointer inside the page data itself (uglee), and thus does not depend on the mapping staying the same (nfs_put_link() just takes the page from the symlink data). Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] inode->u.nfs_i allocated separately
On Wed, 27 Jun 2001, Alexander Viro wrote: > > We get inode initilization (generic parts) spread all over the place and > sooner or later it's going to bite us, for one thing. I don't really think so. The "struct inode" has become less and less important as far as the VFS layer is concerned, and most of the VFS layer functions handling it are really all just helper functions anyway. Instead of calling "get_empty_inode()", you'd just call "initialize_inode()" instead. The current exceptions to this is the "iget()" family of interfaces, but those are on the killing block anyway, due to other problems. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [PATCH][RFC] inode->u.nfs_i allocated separately
On Wed, 27 Jun 2001, Ben LaHaise wrote: > On Wed, 27 Jun 2001, Linus Torvalds wrote: > > > If we're going to do this for a major filesystem, then I'd really just > > rather see this being done generically during 2.5.x, making "u" be a > > pointer only, and having the generic "iput()" just always free the dang > > thing. > > Are you certain that adding yet another level of pointer indirection here > is a good idea? The whole cache miss and indirection issue is exactly why > almost all systems in the world make use of a VNODE style interface. Oh, I'm not. But if we do this for NFS, then we'd better think hard about whether we should (a) do it at all or (b) do it for everything. The alternative, of course, is to just have different allocation-pools for different filesystems. There's nothing that says that we should be able to re-use an inode allocation done for one filesystem for another filesystem. This would be my favourite approach, in fact. Instead of having struct inode { generic fields union { specific-ext2 struct specific-nfs struct ... } } we could _easily_ have the setup struct ext2_inode { struct inode inode; /* Generic fields */ specific-ext2 struct; /* specific fields */ }; and then when ext2 calls down to the generic VFS layer it just passes &ext2_inode->inode down, and when it gets a "struct inode *" it uses "inode_to_ext2()" to convert it to an ext2 inode pointer. This is what the "struct list_head" thing does, and it works remarkably well. It allows for embedding a list (or a hundred) into any object. The above would take the same approach, and allow embedding an inode (and maybe several) into any object. Advantages: no extra memory use, no indirection, no memory allocation overhead. Disadvantages: ?? Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [PATCH][RFC] inode->u.nfs_i allocated separately
If we're going to do this for a major filesystem, then I'd really just rather see this being done generically during 2.5.x, making "u" be a pointer only, and having the generic "iput()" just always free the dang thing. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: Bad interaction between sync and truncate
On Tue, 12 Jun 2001, Alexander Viro wrote: > > We need a bit more - set_blocksize() is another source of the same problem, > AFAICS. How about the following? Makes sense. The "set_blocksize()" thing looks like it could have bit us before - it didn't check "page->mapping", so it could make mapped pages do bad bad things already. Although setting the blocksize on something that is already mounted is obviously bad in itself, so maybe the old problems just weren't a real issue ;) Applied. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: Bad interaction between sync and truncate
On Tue, 12 Jun 2001, Alexander Viro wrote: > > OK, let me describe the race again: And let me descibe the BUG again. > * block_flushpage() makes highmem page anonymous. > * at some point, after the buffers are not busy, but before page_launder() > sees the page again we get invalidate_buffers() (normally - from umount()). And this is buggy. We already _have_ code to check that invalidate_buffers shouldn't touch page-cache buffers. That one-liner was just buggy. - if (bh->b_page->mapping) + if (!bh->b_pprev) Problem solved the right way, logically, so that invalidate_buffers() does the same thing regardless of whether the mapping has been removed or not, instead of having pseudo-random behaviour. No? Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: Bad interaction between sync and truncate
On Tue, 12 Jun 2001, Alexander Viro wrote: > > invalidate_buffers() will happily take such buffer_head from LRU lists > and put it on free list. With ->b_data still pointing to the highmem > page. Hey, that's easy to fix. We have a real bug, and the real bug is that we have them on the dirty list, and we already have this /* Part of a mapping? */ if (bh->b_page->mapping) continue; special case. It's just that the test is wrong. It should probably be something like /* Not hashed? */ if (list_empty(bh->b_hash) continue; instead (yeah, yeah, I know the buffer heads don't use the list structures, I wish it did. The real way to check for the bh being hashed to check "bh->b_pprev" for NULL, but I wish it wasn't). Don't blame the VM layer, I think page_launder() and friends do the right thing. It's the act of invalidation that just isn't supposed to look at non-hashed buffers. (That makes a lot more sense than the current test anyway: if the buffer isn't hashed, then it isn't reachable by "dev", so it clearly also shouldn't be invalidated by "dev"). Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: Bad interaction between sync and truncate
On Tue, 12 Jun 2001, Alexander Viro wrote: > > On Tue, 12 Jun 2001, Linus Torvalds wrote: > > > release the page cache part of the page count, but we'll leave the buffers > > untouched because somebody is still using them. It's really nothing but a > > "the page is still busy, we'll delay the free" thing. > > Erm... Sorry, but what is going to free them? It will not happen until > all buffer_heads are on the free list. And AFAICS it won't happen until > we call set_blocksize() or invalidate_buffers(). What? They'll get free'd the next time through the page_launder() code, when there is no longer anybody trying to push them out or otherwise keeping their counts elevated. "page->buffers". That's the important part. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: Bad interaction between sync and truncate
On Tue, 12 Jun 2001, Alexander Viro wrote: > > On Tue, 12 Jun 2001, Steve Lord wrote: > > > We now return to truncate_complete_page() and remove the page from the inode, > > the only reference to the page is now the buffer head. > > The real problem is in block_flushpage() - we blindly convert page into > buffer cache one, polluting the latter with highmem pages. Tell me the problem again.. Yes, we will convert a mapped page into a "buffer page", but we will _not_ add any of the buffers to any LRU lists or anything like that, so the buffers won't be reachable from anything that they weren't reachable from before already. So either the buffers had been reachable before - in which case the anonymization adds no new cases, or they had not been reachable before - in which case the anonymization will basically "lose" the buffers until the page can later be reclaimed. > Do we really need that logics at all? Reuse of page cache pages for > buffer cache, that is... We _do_ need the logic. And don't think of it as a "reuse of page cache pages" issue. Think of it as a pure memory management issue - we will release the page cache part of the page count, but we'll leave the buffers untouched because somebody is still using them. It's really nothing but a "the page is still busy, we'll delay the free" thing. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
On Tue, 22 May 2001, Andreas Dilger wrote: > > Actually, the LVM snapshot interface has (optional) hooks into the filesystem > to ensure that it is consistent at the time the snapshot is created. Note that this is still fundamentally a broken interface: the filesystem may not _have_ a block device underneath it, yet you might very well like to do defragmentation and backup none-the-less. Also, lvm snapshots are fundamentally limited to read-only data, which means that the LVM interfaces cannot be used for defragmentation and lazy fsck etc anyway. You _have_ to do those at a filesystem level. disk snapshots are useful, but they are not the answer. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
On Tue, 22 May 2001, Jan Harkes wrote: > > something like, > > ssize_t kioctl(int fd, int type, int cmd, void *inbuf, size_t inlen, > void *outbuf, size_t outlen); > > As far as functionality and errors it works like read/write in a single > call, pretty much what Richard proposed earlier with a new 'transaction' > syscall. Maybe type is not needed, and cmd can be part of the inbuf in > which case it would be identical. I'd rather have type and cmd there, simply because right now the "cmd" passed in to the ioctl is not well-defined, as several different drivers use the same numbers for different things (which is why I want to expand that to to get uniqueness). Also, I think the cmd is separate from the data, so I don't think it necessarily makes sense to mix the two. Even if we want to have an ASCII command, I'd think that should be separate from the arguments, ie we'd have ssize_t kioctl(int fd, const char *cmd, const void *inbuf ... instead of trying to mix them. This is especially true as the "inbuf" would be a user-mode pointer, while "cmd" would come from kernel space (whether in the form of a number pair or as a kernel string). Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
On Tue, 22 May 2001, Matthew Wilcox wrote: > > > command + rw-transaction: "dear device please mangle this data" > >(crypto processors come to mind...) > > I can't think of a reasonable tool-based approach to this, but I can > definitely see that a program could use this well. It simply requires > that you use the filp to store your state. Nope. You can (and people do, quite often) share filps. So you can't associate state with it. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
On Mon, 21 May 2001, Alan Cox wrote: > > > Sure. But we have to do two syscalls only if ioctl has both in- and out- > > arguments that way. Moreover, we are talking about non-trivial in- arguments. > > How many of these are in hotspots? > > There is also a second question. How do you ensure the read is for the right > data when you are sharing a file handle with another thread.. > > ioctl() has the nice property that an in/out ioctl is implicitly synchronized I don't think we can generically replace ioctl's with read-write, and we shouldn't bend over backwards even _trying_. The important thing would be to give them more structure, and as far as I'm personally concerned I don't even care if the user-level access method ends up being the same old thing. After all, we have magic numbers everywhere: even a system call uses magic numbers for the syscall entry numbering. The thing that makes system call numbers nice is that the number gets turned into a more structured thing with proper type checking and well-defined semantics very very early on indeed. It shouldn't be impossible to do the same thing to ioctl numbers. Nastier, yes. No question about it. But we don't necessarily have to redesign the whole approach - we only want to re-design the internal kernel interfaces. That, in turn, might be as simple as changing the ioctl incoming arguments of into a structure like . Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
On Mon, 21 May 2001, Ingo Molnar wrote: > > On Sun, 20 May 2001, Alexander Viro wrote: > > > Linus, as much as I'd like to agree with you, you are hopeless > > optimist. 90% of drivers contain code written by stupid gits. > > 90% of drivers contain code written by people who do driver development in > their spare time, with limited resources, most of the time serving as a > learning excercise. And they do this freely and for fun. Accusing them of > being 'stupid gits' is just micharacterising the situation. I would disagree with both of you. The problem is not whether people do it with limited resources or time, or whether they are stupid or not. The problem is that if you expect to get nice code, you have to have nice interfaces and infratructure. And ioctl's aren't it. The reason we _can_ write beautiful filesystems these days is that the VFS layer _supports_ it. In fact, the VFS layer has tons of infrastructure and structure that makes it _hard_ to write bad filesystem code (which is not to say that we don't have ugly code there - but much of it is due to historically not having had quite the same level of infrastructure). If we had nice infrastructure to make ioctl's more palatable, we could probably make do even with the current binary-number interfaces, simply because people would use the infrastructure without ever even _seeing_ how lacking the user-level accesses are. But that absolutely _requires_ that the driver writers should never see the silly "pass a random number and a random argument type" kind of interface with no structure or infrastructure in place. Because right now even _good_ programmers make a mess of the fact that they get passed a bad interface. Think of it this way: the user interface to opening a file is "open()" with pathnames and magic flags. But a filesystem never even _sees_ that interface, it sees a very nicely structured setup where all the argument parsing and locking has already been done for it, and the magic flags don't even exist any more as far as the low-level FS is concerned. Which is why filesystems _can_ be clean. In contrast, ioctl's are passed through directly, with no help to make them clean. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
Davem, check the last thing, please. On Sun, 20 May 2001, Alexander Viro wrote: > > On Sun, 20 May 2001, Linus Torvalds wrote: > > > > How about moratorium on new ioctls in the meanwhile? Whatever we do in > > > fs/ioctl.c, it _will_ take time. > > > > Ehh.. Telling people "don't do that" simply doesn't work. Not if they can > > do it easily anyway. Things really don't get fixed unless people have a > > certain pain-level to induce it to get fixed. > > Umm... How about the following: you hit delete on patches that introduce > new ioctls, I help to provide required level of pain. Deal? It still doesn't work. That only makes people complain about my fascist tendencies. See the thread about device numbers, where Alan just says "ok, I'll do it without Linus then". The whole point of open source is that I don't have that kind of power. I can only guide, but the most powerful guide is by guiding the _design_, not micro-managing. > BTW, -pre4 got new bunch of ioctls. On procfs, no less. I know. David has zero taste. Davem, why didn't you just make new entries in /proc/bus/pci and let people do "mmap(/proc/bus/pci//mem)" instead of having idiotic ioctl's to set "this is a IO handle" and "this is a MEM handle"? This particular braindamage is not too late to fix.. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
On Sun, 20 May 2001, Alexander Viro wrote: > > On Sun, 20 May 2001, Matthew Wilcox wrote: > > > On Sun, May 20, 2001 at 03:11:53PM -0400, Alexander Viro wrote: > > > Pheeew... Could you spell "about megabyte of stuff in ioctl.c"? > > > > No. > > > > $ ls -l arch/*/kernel/ioctl32*.c > > -rw-r--r--1 willywilly 22479 Jan 24 16:59 >arch/mips64/kernel/ioctl32.c > > -rw-r--r--1 willywilly 109475 May 18 16:39 >arch/parisc/kernel/ioctl32.c > > -rw-r--r--1 willywilly 117605 Feb 1 20:35 >arch/sparc64/kernel/ioctl32.c > > > > only about 100k. > > You are missing all x86-only drivers. Now, the point is that it _is_ doable, and by doing it in one standard place (instead of letting each architecture fight it on its own) we'd expose the problem better, and maybe get rid of some of those architecture-specific ones. For example, right now the fact that part of the work _has_ been done by things like Sparc64 has not actually had any advantages: the sparc64 work has not allowed people to say "let's try to merge this work", because it has not been globally relevant, and a sparc64-only file has not been a single point of contact that could be used to clean up things. In contrast, a generic file has the possibility of creating new VFS or device-level interfaces. You can catch block device ioctl's and turn them into proper block device requests - and send them down the right request queue. Suddenly a block device driver doesn't just get READ/WRITE requests, it gets EJECT/SERIALIZE requests too. Without having to add magic ioctl's that are specific to just one device driver. So by having a common point of access, you can actually encourage _fixing_ some of the problems. Historically, sparc64 etc have not been able to do that - they can only try to convert different ioctl's into another format and then re-submitting it.. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
On Sun, 20 May 2001, Alexander Viro wrote: > > Pheeew... Could you spell "about megabyte of stuff in ioctl.c"? I agree. But it would certainly force people to think about this. And it may turn out that a lot of it can be streamlined, and not that much ends up being used very much. It would also allow a single place of catching the generic ones, and as such be a place to try to make things like the network ioctl's more regular: setting things like network device duplex with _real_ interfaces instead of hiding it in ioctl routines. Also, note that many ioctl's actually do have fairly regular meaning, and that it _is_ possible to catch a number of them with those regular things: switch (_IOC_TYPE(number)) { case 'x': xfs_ioctl(..); and actually try to enforce the things that Documentation/ioctl-number.txt tries to document. And make the clashes _explicit_ and thus make people have more incentive to really try to fix it. > How about moratorium on new ioctls in the meanwhile? Whatever we do in > fs/ioctl.c, it _will_ take time. Ehh.. Telling people "don't do that" simply doesn't work. Not if they can do it easily anyway. Things really don't get fixed unless people have a certain pain-level to induce it to get fixed. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
On Sun, 20 May 2001, Russell King wrote: > > On Sun, May 20, 2001 at 11:46:33AM -0700, Linus Torvalds wrote: > > Nobody will expect the above to work, and everybody will agree that the > > above is a BUG if the read() call will actually follow the pointer. > > I didn't say anything about read(). I said write(). Obviously it > wouldn't work for read()! No, but the point is, everybody _would_ consider it a bug if a low-level driver "write()" did anything but touched the explicit buffer. Code like that would not pass through anybody's yuck-o-meter. People would point fingers and say "That is not a legal write() function". Anybody who tried to make write() follow pointers would be laughed at as a stupid git. Anybody who makes "ioctl()" do the same is just following years of standard practice, and the yuck-o-meter doesn't even register. THAT is the importance of psychology. Technology is meaningless. What matters is how people _think_ of it. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
On Sun, 20 May 2001, David Woodhouse wrote: > > If that had been done right the first time, you wouldn't have had to either. > For that matter, it's often the case that if the ioctl had been done right > the first time, nobody would have had to fix it up for any architecture. The problem with ioctl's is, let me repeat, not technology. It's people. ioctl's are a way to do ugly things. That's what they have ALWAYS been. And because of that, people don't care about following the rules - if ioctl's followed the rules, they wouldn't _be_ ioctls in the first place, but instead have a good interface (say, read()/write()). Basically, ioctl's will _never_ be done right, because of the way people think about them. They are a back door. They are by design typeless and without rules. They are, in fact, the Microsoft of UNIX. The only way to fix ioctl's is to force people to think about them in another way. Because if you don't, there is always going to be another driver writer who adds his own ioctl because it's the easy way to do whatever he wants without giving it a second of _design_ thought. Now, a good way to force the issue may be to just remove the "ioctl" function pointer from the file operations structure altogether. We don't have to force peopel to use "read/write" - we can just make it clear that ioctl's _have_ to be wrapped, and that the only ioctl's that are acceptable are the ones that are well-designed enough to be wrappable. So we'd have a "linux/fs/ioctl.c" that would do all the wrapping, and would also be able to do all the stuff that is currently done by pretty much every single architecture out there (ie emulation of ioctl's for different native modes). It would probably not be that horrible. Many ioctl's are probably not all that much used by any real programs any more. The most common ones by far are the tty ones - and the truly generic ones like "FIONREAD" that it actually would make sense to generalize more. Catching stuff like EJECT at a higher layer and turning THOSE kinds of things into real block device operations would clean up drivers and make them more uniform. Would fs/ioctl.c be an ugly mess of some special cases? Yes. But would that make the ugliness explicit and possibly easier to try to manage and fix? Very probably. And it would mean that driver writers could not just say "fuck design, I'm going to do this my own really ugly way". Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
On Sun, 20 May 2001, Russell King wrote: > > On Sat, May 19, 2001 at 08:26:20PM -0700, Linus Torvalds wrote: > > You're missing the point. > > I don't think Richard is actually. I think Richard has hit a nail > dead on its head. > > > It's ok to do "read()/write()" on structures. > > Ok, we can read()/write() structures. So someone invents the following > structure: > > struct foo { > int cmd; > void *data; > } foo; > > Now they use write(fd, &foo, sizeof(foo)); Haven't they just swapped > the ioctl() interface for write() instead? Wrong. Nobody will expect the above to work, and everybody will agree that the above is a BUG if the read() call will actually follow the pointer. Read my email. And read the last line: "psychology is important". Step #1 in programming: understand people. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
On Sat, 19 May 2001, Richard Gooch wrote: > > Matthew Wilcox writes: > > On Sat, May 19, 2001 at 10:22:55PM -0400, Richard Gooch wrote: > > > The transaction(2) syscall can be just as easily abused as ioctl(2) in > > > this respect. > > > > But read() and write() cannot. > > Sure they can. I can pass a pointer to a structure to either of them. You're missing the point. It's ok to do "read()/write()" on structures. In fact, people do that all the time (and then they complain about the file not being portable ;) The problem with ioctl is that not only are people passing ioctl's pointers to structures, but: - they're not telling how big the structure is - the structure can have pointers to other places - sometimes it modifies the structure passed in None of which are "network-nice". Basically, ioctl() is historically used as a "pass any crap into driver , and the driver - and ONLY the driver - will know what to do with it". And when _only_ a driver knows what the arguments mean, upper layers can't encapsulate them. Upper layers cannot make a packet of the argument and send it over the network to another machine. Upper layers cannot do sanity-checking on things like "is this argument a valid pointer". Which means, for example, that not only can you not send the ioctl arguments anywhere, but ioctl's have also historically been a hot-bed of bugs. Example traditional ioctl bugs: use kernel pointers to access the argument (because it just happens to work on x86, never mind the fact that if the argument is bad you'll get a kernel oops and/or a serious security error). Other example: different drivers/f ilesystems implementing the same ioctl, but disagreeing on what the argument means (is it a pointer to an integer argument, or the integer itself?). Now, the advantage of using read()/write() is (a) that it's unambiguous where the argument comes from and how big it is and (b) because of that the _psychology_ is different. You don't get into this "pass random crap around, let the kernel modify user data structures directly" mentality. And psychology is important. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [RFD w/info-PATCH] device arguments from lookup, partion codeinuserspace
On Sat, 19 May 2001, Brad Boyer wrote: > > If I understand the status of stuff correctly, I think this would make it > a lot more painful to admin if it became a requirement to use initrd on > everything just to be able to boot. Don't get too hung up on initrd. Symbolic links really _are_ workable ways to basically cache the information across boots, and the real problems are elsewhere. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
On Sat, 19 May 2001, Alan Cox wrote: > > > Now that I'm awake and refreshed, yeah, that's awful. But > > echo "hot-add,slot=5,device=/dev/sda" >/dev/md0/control *is* sane. Heck, > > the system can even send back result codes that way. > > Only to an English speaker. I suspect Quebec City canadians would prefer a > different command set. I was waiting for the "anglo-saxon" argument. I don't think it's a valid argument. You already have "/dev". You already have english names for the numbers in ioctl's (and let's not be mentally dishonest and say "numbers are cross-cultural", because NOBODY MUST EVER USE THE RAW NUMBERS - you have to use the anglo-saxon #define'd names because the numbers aren't even cross-platform on Linux, much less portable to other systems). So the "English is bad" argument is a complete non-argument. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [RFD w/info-PATCH] device arguments from lookup, partion codein userspace
On Sat, 19 May 2001, Ben LaHaise wrote: > > 1. Generic lookup method and argument parsiing (fs/lookupargs.c) Looks sane. > 2. Restricted block device (drivers/block/blkrestrict.c) This is not very user-friendly, but along with symlinks this makes perfect sense. It would make partition handling a _lot_ simpler. Note, however, that I think the "restricted block device" is a much more generic issue than just block devices. I've already discussed with Alan the possibility of making _all_ file descriptors have the notion of "restrictions", notably the "start, end" kind of things. It is very useful for other things too - imagine opening /dev/mem, and wanting to pass a restricted portiong of it to other processes with the standard file descriptor passing facilities (think "secure DGA" for the X server, but also think untrusted users that can read parts of shared files etc - a suid program that opens a file, restricts it, drops privileges and knows that the program can only access a specific part of the file) > 3. Userspace partition code proposal Yes and no. I absolutely thihnk the idea that users actually _using_ these names is a horrible one, and fraught with potential for much too easy mistakes that end up being disastrous. But having symlinks that are created by a special program would be ok. [ Also, note how symlinks would make the point of initrd completely moot. You don't have to have initrd to initialize the thing, you can initialize the thing at installation time and when doing fdisk, and the symlinks would act as the permanent markers. ] HOWEVER, you have to realize that there are serious security and maintenance issues here, and I think your idea breaks down completely because of that. The thing is, you only have permissions on a "per-object" basis, and it's common practice to have different permissions for different partitions. Your scheme does not allow this. Which means that it is fundamentally broken. Sorry. So don't go overboard. The name-based thing is useful, but it's useful for only certain things. And you must _never_ forget the security and management issues. For example, if you can open a serial port in the first place, you can set its baud-rate. So it's ok to make baud-rate part of the name. And once you have permission to read /dev/fd0 it doesn't make sense to limit you to one particular format. So it's ok to have the disk format be part of the name. But it's not possible to make the partition be a "name" issue. Because while you obviously need different names, you _also_ need different permissions. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)
On Sat, 19 May 2001, Alexander Viro wrote: > > Folks, before you get all excited about cramming side effects into > open(2), consider the following case: Your argument is stupid, imnsho. Side-effects are perfectly fine if they are _local_ to the file descriptor. Your example is contrieved and idiotic. Filename extensions would not replace ioctl's. But they are wonderful ways to avoid unnecessary binary name-spaces, like the ones we have with "callout" TTY names, and the one that the fb people had. For example, do a "ls -l /dev/fd0*", and ponder. Also, realize that we have these hard-coded names in _addition_ to the magic ioctl to set even more parameters. These are all stupid and bad, and it would have been a _lot_ cleaner to be able to do open("/dev/fd0/H1440", O_RDWR).. or open("/dev/fd0/HD,18,85", O_RDWD) to open special non-standard high-density modes. We already did this, in a very limited and stupid way, by encoding the minor number and generating a standard naming scheme. We can do the same thing in a _much_ more generic way by just realizing that we wanted the open to be name-based in the first place. These are _not_ side effects. They are very much naming conventions. If I want to open a the floppy in one of the special extended modes, it makes a LOT more sense to just open it with the naming, than to open a "generic" floppy device only to them use a magic and very unreadable ioctl to set the mode of the device. In short, I don't buy your arguments for one single second. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: The INN/mmap bug
On Mon, 18 Sep 2000, Alexander Viro wrote: > Umm... OK. Let me put it that way: > * uptodate pages should never become non-uptodate. Agreed. That "ClearPageUptodate()" thing definitely looks like a bug. > * we do multiple read requests on the same buffer. Right now, yes. And that is a bad bug. > * the only thing saving us from data loss is that non-uptodate > pages never lose buffers when they have data newer than on disk (uptodate > pages do, but they never have data _older_ than on disk). Thus we can > afford rereading a buffer on non-uptodate page if bh was lost and never > have to reread it on uptodate page. Right. Basically we will drop the buffers only after they have been written out, so normal read/write can never lose data. It's only the case of shared mmap's that can lose data due to the stupid "let's read it in from disk again" thing. > * look at the explanation above and see if it looks brittle. It > really needs to be documented Oh, I certainly agree. And I also agree that we should work on making the "create_page_buffers" thing more easily used - the code duplication in the different parts of fs/buffer.c is quite horrible. It tends to be the same kind of logic, except it has small differences in four-five different versions. I think it would be better to do the code part first, actually: once that is done, a single comment above "create_page_buffers" will explain everything that is going on.. > That's what makes me unhappy about the current situation + obvious > fixes. It works, but the proof is... well, not pretty. Oh, agreed. I think we should clean up the code. I looked at it yesterday, and it didn't look all that horribly bad, but I lost interest. I don't know if it is worth doing before 2.4.x, as the current code certainly should work with the small changes already proposed. Linus - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]