Re: [patch] epoll use a single inode ...
Hi, > Well, PowerPC "dcbt" does prefetch() correctly, it doesn't ever raise > exceptions, doesn't have any side effects, takes only enough CPU to > decode the address, and is ignored if it would have to do anything > other than load the cacheline from RAM. Prefetch streams are halted > when they reach the end of a page boundary (no trapping to the MMU) > and if the TLB entry isn't present then they would asynchronously > abort. It depends on the implementation and the HID bit settings. Some do walk the MMU hashtable if it isnt in the TLB. Anton - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
> OK, 200 cycles... > > But what is the cost of the conditional branch you added in prefetch(x) ? Much less than the tablewalk. On ppc64 a tablewalk of an address that is not populated in the hashtable will incur 2 cacheline lookups (primary and secondary buckets). This plus the MMU state machine overhead adds up. Cue Linus rant about PowerPC MMU :) Anton - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
> Yeah, I'm not at all surprised. Any implementation of "prefetch" that > doesn't just turn into a no-op if the TLB entry doesn't exist (which makes > them weaker for *actual* prefetching) will generally have a hard time with > a NULL pointer. Exactly because it will try to do a totally unnecessary > TLB fill - and since most CPU's will not cache negative TLB entries, that > unnecessary TLB fill will be done over and over and over again.. Yeah this is exactly what we were seeing :) Anton - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Mar 08, 2007, at 02:24:04, Eric Dumazet wrote: Kyle Moffett a écrit : Prefetching is also fairly critical on a Power4 or G5 PowerPC system as they have a long memory latency; an L2-cache miss can cost 200+ cycles. On such systems the "dcbt" prefetch instruction brings in a single 128-byte cacheline and has no serializing effects whatsoever, making it ideal for use in a linked-list- traversal inner loop. OK, 200 cycles... But what is the cost of the conditional branch you added in prefetch (x) ? if (!x) return; (correctly predicted or not, but do powerPC have a BTB ?) Well, PowerPC "dcbt" does prefetch() correctly, it doesn't ever raise exceptions, doesn't have any side effects, takes only enough CPU to decode the address, and is ignored if it would have to do anything other than load the cacheline from RAM. Prefetch streams are halted when they reach the end of a page boundary (no trapping to the MMU) and if the TLB entry isn't present then they would asynchronously abort. This is a suitable prefetch implementation for G5: #define prefetch(x) __asm__ __volatile__ ("dcbt 0, %0"::"r"(x)); On GCC4+ the "__builtin_prefetch" built-in functions is probably mildly better tuned for most architectures: #define PREFETCH_RD 0 #define PREFETCH_WR 1 #define PREFETCH_LOCALITY_NONE 0 #define PREFETCH_LOCALITY_LOW 1 #define PREFETCH_LOCALITY_MED 2 #define PREFETCH_LOCALITY_HIGH 3 #define prefetch(x, args...) __builtin_prefetch(x ,##args) The pertinent GCC docs: This function is used to minimize cache-miss latency by moving data into a cache before it is accessed. You can insert calls to __builtin_prefetch into code for which you know addresses of data in memory that is likely to be accessed soon. If the target supports them, data prefetch instructions will be generated. If the prefetch is done early enough before the access then the data will be in the cache by the time it is accessed. The value of addr is the address of the memory to prefetch. There are two optional arguments, rw and locality. The value of rw is a compile-time constant one or zero; one means that the prefetch is preparing for a write to the memory address and zero, the default, means that the prefetch is preparing for a read. The value locality must be a compile-time constant integer between zero and three. A value of zero means that the data has no temporal locality, so it need not be left in the cache after the access. A value of three means that the data has a high degree of temporal locality and should be left in all levels of cache possible. Values of one and two mean, respectively, a low or moderate degree of temporal locality. The default is three. for (i = 0; i < n; i++) { a[i] = a[i] + b[i]; __builtin_prefetch ([i+j], 1, 1); __builtin_prefetch ([i+j], 0, 1); /* ... */ } Data prefetch does not generate faults if addr is invalid, but the address expression itself must be valid. For example, a prefetch of p->next will not fault if p->next is not a valid address, but evaluation will fault if p is not a valid address. If the target does not support data prefetch, the address expression is evaluated if it includes side effects but no other code is generated and GCC does not issue a warning. Cheers, Kyle Moffett - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
+ its done only when the path is needed.). Real filesystems probably + dont want to use it, because their dentries are present in global Pedantry: it's and don't? -Bob - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Thu, 08 Mar 2007 08:24:04 +0100, Eric Dumazet said: > > But what is the cost of the conditional branch you added in prefetch(x) ? > > if (!x) return; > > (correctly predicted or not, but do powerPC have a BTB ?) > > About the NULL 'potential problem', maybe we could use a dummy nil (but > mapped) object, and use its address in lists, ie compare for instead of > NULL. This would avoid : > > - The conditional test in some prefetch() implementations > - The potential TLB problem with the NULL value. You avoid those two, but add the very high likelyhood that a statement of the form 'if (!x) {}' will creep back in and bug out. I doubt that changing the form of a basic C idiom to save a few cycles is worth it, especially when the (!x) form can be tested without a memory fetch, but (x != nil) may require fetching 'nil' (remember - the x86 is a very popular chipset, but register starved - if the loop is any size at all, 'nil' may require a reload each time around, costing you the win you thought you had pgpWRgqXbzEAw.pgp Description: PGP signature
Re: [patch] epoll use a single inode ...
On Thu, 8 Mar 2007, Eric Dumazet wrote: > > Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]> > CC: Christoph Hellwig <[EMAIL PROTECTED]> > CC: Linus Torvalds <[EMAIL PROTECTED]> Acked-by: Linus Torvalds <[EMAIL PROTECTED]> Except you should fix the subject line when you send it out to Andrew ;) Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
I'm sorry about complaining again, but.. > > + /* > + * Some filesystems want to provide dentry's pathname themselves, > + * instead of pre-building names at dentry creation. > + */ It's not _some_ filesystems. If real filesystem did this we'd be in horrible trouble. It's "synthetic, non-mountable" filesystem. Except for this the patch looks fine to me, thanks Eric! - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
Thanks again for your feedback Christoph :) I think I addressed all your remarks. Thank you [PATCH] Delay the dentry name generation on sockets and pipes. 1) Introduces a new method in 'struct dentry_operations'. This method called d_dname() might be called from d_path() to build a pathname for special filesystems. It is called without locks. Future patches (if we succeed in having one common dentry for all pipes) may need to change prototype of this method, but we now use : char *d_dname(struct dentry *dentry, char *buffer, int buflen) 2) Use this new method for sockets : No more sprintf() at socket creation. This is delayed up to the moment someone does an access to /proc/pid/fd/... 3) Use this new method for pipes : No more sprintf() at pipe creation. This is delayed up to the moment someone does an access to /proc/pid/fd/... A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a *nice* speedup on my Pentium(M) 1.6 Ghz : 3.090 s instead of 3.450 s Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]> CC: Christoph Hellwig <[EMAIL PROTECTED]> CC: Linus Torvalds <[EMAIL PROTECTED]> CC: Davide Libenzi CC: Al Viro <[EMAIL PROTECTED]> Documentation/filesystems/Locking |2 ++ Documentation/filesystems/vfs.txt | 12 +++- fs/dcache.c |7 +++ fs/pipe.c | 15 +-- include/linux/dcache.h|1 + net/socket.c | 17 ++--- 6 files changed, 40 insertions(+), 14 deletions(-) --- linux-2.6.21-rc3/include/linux/dcache.h 2007-03-07 17:23:55.0 +0100 +++ linux-2.6.21-rc3-ed/include/linux/dcache.h 2007-03-08 11:57:41.0 +0100 @@ -133,6 +133,7 @@ struct dentry_operations { int (*d_delete)(struct dentry *); void (*d_release)(struct dentry *); void (*d_iput)(struct dentry *, struct inode *); + char *(*d_dname)(struct dentry *, char *, int); }; /* the dentry parameter passed to d_hash and d_compare is the parent --- linux-2.6.21-rc3/Documentation/filesystems/vfs.txt 2007-03-08 10:14:38.0 +0100 +++ linux-2.6.21-rc3-ed/Documentation/filesystems/vfs.txt 2007-03-08 10:34:34.0 +0100 @@ -827,7 +827,7 @@ This describes how a filesystem can over operations. Dentries and the dcache are the domain of the VFS and the individual filesystem implementations. Device drivers have no business here. These methods may be set to NULL, as they are either optional or -the VFS uses a default. As of kernel 2.6.13, the following members are +the VFS uses a default. As of kernel 2.6.22, the following members are defined: struct dentry_operations { @@ -837,6 +837,7 @@ struct dentry_operations { int (*d_delete)(struct dentry *); void (*d_release)(struct dentry *); void (*d_iput)(struct dentry *, struct inode *); + char *(*d_dname)(struct dentry *, char *, int); }; d_revalidate: called when the VFS needs to revalidate a dentry. This @@ -859,6 +860,15 @@ struct dentry_operations { VFS calls iput(). If you define this method, you must call iput() yourself + d_dname: called when the pathname of a dentry should be generated. + Usefull for some pseudo filesystems (sockfs, pipefs, ...) to delay + pathname generation. (Instead of doing it when dentry is created, + its done only when the path is needed.). Real filesystems probably + dont want to use it, because their dentries are present in global + dcache hash, so their hash should be an invariant. As no lock is + held, d_dname() should not try to modify the dentry itself, unless + appropriate SMP safety is used. + Each dentry has a pointer to its parent dentry, as well as a hash list of child dentries. Child dentries are basically like files in a directory. --- linux-2.6.21-rc3/Documentation/filesystems/Locking 2007-03-08 10:29:04.0 +0100 +++ linux-2.6.21-rc3-ed/Documentation/filesystems/Locking 2007-03-08 10:29:04.0 +0100 @@ -15,6 +15,7 @@ prototypes: int (*d_delete)(struct dentry *); void (*d_release)(struct dentry *); void (*d_iput)(struct dentry *, struct inode *); + char *(*d_dname)((struct dentry *dentry, char *buffer, int buflen); locking rules: none have BKL @@ -25,6 +26,7 @@ d_compare:no yes no no d_delete: yes no yes no d_release: no no no yes d_iput:no no no yes +d_dname: no no no no --- inode_operations --- prototypes: --- linux-2.6.21-rc3/fs/dcache.c2007-03-07 17:23:55.0 +0100 +++ linux-2.6.21-rc3-ed/fs/dcache.c 2007-03-08 11:57:41.0 +0100 @@ -1823,6 +1823,13 @@ char *
Re: [patch] epoll use a single inode ...
On Thu, Mar 08, 2007 at 10:42:21AM +0100, Eric Dumazet wrote: > @@ -1823,6 +1823,9 @@ char * d_path(struct dentry *dentry, str > struct vfsmount *rootmnt; > struct dentry *root; > > + if (dentry->d_op && dentry->d_op->d_dname) > + return (dentry->d_op->d_dname)(dentry, buf, buflen); Please don't put braces around the function pointer call. Also I think we really want a comment here why we need this call. > +static char * pipefs_dname(struct dentry *dentry, char *buffer, int buflen) normally this would be: static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen) > +static char * sockfs_dname(struct dentry *dentry, char *buffer, int buflen) same here. > - char name[32]; > > - this.len = sprintf(name, "[%lu]", SOCK_INODE(sock)->i_ino); > - this.name = name; > + this.len = 0; > + this.name = ""; > this.hash = 0; Can you add a helper to init this one? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Thursday 08 March 2007 09:56, Christoph Hellwig wrote: > This patch needs a lot more documentation. It needs some really big > comments on why this should never ever be used for a real filesystem > (real as in user mountable), and probably add an assert for that > invariant somewhere. Please also update Documentation/filesystems/Locking > and Documentation/filesystems/vfs.txt for it. Thank you for your feedback Christoph Here is the version I was about to push to -mm, so if nobody complains, I ask Andrew to push it to mm so that it can reach 2.6.22 target [PATCH] Delay the dentry name generation on sockets and pipes. 1) Introduces a new method in 'struct dentry_operations'. This method called d_dname() might be called from d_path() to build a pathname for special filesystems. It is called without locks. Future patches (if we succeed in having one common dentry for all pipes) may need to change prototype of this method, but we now use : char *d_dname(struct dentry *dentry, char *buffer, int buflen) 2) Use this new method for sockets : No more sprintf() at socket creation. This is delayed up to the moment someone does an access to /proc/pid/fd/... 3) Use this new method for pipes : No more sprintf() at pipe creation. This is delayed up to the moment someone does an access to /proc/pid/fd/... A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a *nice* speedup on my Pentium(M) 1.6 Ghz : 3.090 s instead of 3.450 s Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]> CC: Christoph Hellwig <[EMAIL PROTECTED]> CC: Linus Torvalds <[EMAIL PROTECTED]> CC: Davide Libenzi CC: Al Viro <[EMAIL PROTECTED]> Documentation/filesystems/Locking |2 ++ Documentation/filesystems/vfs.txt | 12 +++- fs/dcache.c |3 +++ fs/pipe.c | 12 +--- include/linux/dcache.h|1 + net/socket.c | 13 ++--- 6 files changed, 36 insertions(+), 7 deletions(-) --- linux-2.6.21-rc3/include/linux/dcache.h 2007-03-07 17:23:55.0 +0100 +++ linux-2.6.21-rc3-ed/include/linux/dcache.h 2007-03-08 10:14:38.0 +0100 @@ -133,6 +133,7 @@ struct dentry_operations { int (*d_delete)(struct dentry *); void (*d_release)(struct dentry *); void (*d_iput)(struct dentry *, struct inode *); + char * (*d_dname)(struct dentry *, char *, int); }; /* the dentry parameter passed to d_hash and d_compare is the parent --- linux-2.6.21-rc3/Documentation/filesystems/vfs.txt 2007-03-08 10:14:38.0 +0100 +++ linux-2.6.21-rc3-ed/Documentation/filesystems/vfs.txt 2007-03-08 10:34:34.0 +0100 @@ -827,7 +827,7 @@ This describes how a filesystem can over operations. Dentries and the dcache are the domain of the VFS and the individual filesystem implementations. Device drivers have no business here. These methods may be set to NULL, as they are either optional or -the VFS uses a default. As of kernel 2.6.13, the following members are +the VFS uses a default. As of kernel 2.6.22, the following members are defined: struct dentry_operations { @@ -837,6 +837,7 @@ struct dentry_operations { int (*d_delete)(struct dentry *); void (*d_release)(struct dentry *); void (*d_iput)(struct dentry *, struct inode *); + char * (*d_dname)(struct dentry *, char *, int); }; d_revalidate: called when the VFS needs to revalidate a dentry. This @@ -859,6 +860,15 @@ struct dentry_operations { VFS calls iput(). If you define this method, you must call iput() yourself + d_dname: called when the pathname of a dentry should be generated. + Usefull for some pseudo filesystems (sockfs, pipefs, ...) to delay + pathname generation. (Instead of doing it when dentry is created, + its done only when the path is needed.). Real filesystems probably + dont want to use it, because their dentries are present in global + dcache hash, so their hash should be an invariant. As no lock is + held, d_dname() should not try to modify the dentry itself, unless + appropriate SMP safety is used. + Each dentry has a pointer to its parent dentry, as well as a hash list of child dentries. Child dentries are basically like files in a directory. --- linux-2.6.21-rc3/Documentation/filesystems/Locking 2007-03-08 10:29:04.0 +0100 +++ linux-2.6.21-rc3-ed/Documentation/filesystems/Locking 2007-03-08 10:29:04.0 +0100 @@ -15,6 +15,7 @@ prototypes: int (*d_delete)(struct dentry *); void (*d_release)(struct dentry *); void (*d_iput)(struct dentry *, struct inode *); + char * (*d_dname)((struct dentry *dentry, char *buffer, int buflen); locking rules: none have BKL @@ -25,6 +26,7 @@ d_compare:no yes no no d_delete: yes no yes no d_release: no
Re: [patch] epoll use a single inode ...
This patch needs a lot more documentation. It needs some really big comments on why this should never ever be used for a real filesystem (real as in user mountable), and probably add an assert for that invariant somewhere. Please also update Documentation/filesystems/Locking and Documentation/filesystems/vfs.txt for it. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On 3/7/07, Linus Torvalds <[EMAIL PROTECTED]> wrote: No, I just checked, and Intel's own optimization manual makes it clear that you should be careful. They talk about performance penalties due to resource constraints - which makes tons of sense with a core that is good at handling its own resources and could quite possibly use those resources better to actually execute the loads and stores deeper down the instruction pipeline. Certainly you should be careful -- and that usually means leaving it up to the compiler. But hinting to the compiler can help; there may be an analogue of the (un)likely macros waiting to be implemented for loop prefetch. And while out-of-order execution and fancy hardware prefetch streams greatly reduce the need for explicit prefetch in general code, there's no substitute for the cache _bypassing_ instructions when trying to avoid excessive cache eviction in DDoS situations. For instance, if I wind up working on a splay-tree variant of Robert Olsson's trie/hash work, I'll try to measure the effect of using SSE2 non-temporal stores to write half-open connections to the leaves of the tree. That may give some additional improvement in the ability to keep servicing real load during a SYN flood. So it's not just 3DNow! making AMD look bad, or Intel would obviously suggest people use it out of the wazoo ;) Intel puts a lot of effort into educating compiler writers about the optimal prefetch insertion strategies for particular cache architectures. At the same time, they put out the orange cones to warn people off of hand-tuning prefetch placement using micro-benchmarks. People did that when 3DNow! first came out, with predictable consequences. > XScale gets it right. Blah. XScale isn't even an OoO CPU, *of*course* it needs prefetching. Calling that "getting it right" is ludicrous. If anything, it gets things so wrong that prefetching is *required* for good performance. That's not an accident. Hardware prefetch units cost a lot in power consumption. Omitting the hardware prefetch unit and drastically simplifying the pipeline is how they got a design whose clock they could crank into the stratosphere and still run on battery power. And in the network processor space, they can bolt a myriad of on-chip microengines and still have some prayer of accurately simulating the patterns of internal bus cycles. Errors in simulation can still be fixed up with prefetch instruction placement to put memory accesses from the XScale core into phases where the data path processors aren't working so hard. Moreover, because they're embedded targets and rarely have to run third-party binaries originally compiled for older cores, it didn't really cost them anything to say, "Sorry, this chip's performance is going to suck if your compiler's prefetch insertion isn't properly tuned." The _only_ cost is a slightly less dense instruction stream. That's not trivial but it's manageable; you budget for it, and the increase in I-cache power consumption is more than made up for by the reduction in erroneous data prefetches (hardware prefetch gets it wrong a substantial fraction of the time). I'm talking about real CPU's with real memory pipelines that already do prefetching in hardware. The better the core is, the less the prefetch helps (and often the more it hurts in comparison to how much it helps). The more sophisticated the core is, the less software prefetch instructions help. But more sophisticated isn't always better; it depends on your target applications. But if you mean "doesn't try to fill the TLB on data prefetches", then yes, that's generally the right thing to do. AOL. > (Oddly, Prescott seems to have initiated a page table walk on DTLB miss > during software prefetch -- just one of many weird Prescott flaws.) Netburst in general is *very* happy to do speculative TLB fills, I think. Design by micro-benchmark. :-) They set out to push the headline MHz and the real memory bandwidth to the limit in Prescott, and they succeeded (data at http://www.digit-life.com/articles2/rmma/rmma-p4.html). At a horrible cost in power per clock, and no gain in real application performance. So NetBurst died a horrible death, and now we have "Intel Core" -- P6 warmed over, with caches sized such that for most applications the second core ought to be used solely to soak up control path overheads. Windows runs better on dual-core machines because the NT kernel will happily eat an entire core doing memory bookkeeping. Linux could take a hint here and use the second core largely for interrupt handling and force-priming the L2 cache on task switch. (Prefetch instructions aren't much use here, precisely because they give up on DTLB miss.) Any kernel code paths that are known to stall a lot because of usually-cold-cache access patterns (TCP connection establishment, for instance) can also be punted over to the second core. If you're feeling industrious, use non-temporal memory accesses
Re: [patch] epoll use a single inode ...
On Thu, 8 Mar 2007, Eric Dumazet wrote: Signed-off-by: Eric Dumazet [EMAIL PROTECTED] CC: Christoph Hellwig [EMAIL PROTECTED] CC: Linus Torvalds [EMAIL PROTECTED] Acked-by: Linus Torvalds [EMAIL PROTECTED] Except you should fix the subject line when you send it out to Andrew ;) Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Thu, 08 Mar 2007 08:24:04 +0100, Eric Dumazet said: But what is the cost of the conditional branch you added in prefetch(x) ? if (!x) return; (correctly predicted or not, but do powerPC have a BTB ?) About the NULL 'potential problem', maybe we could use a dummy nil (but mapped) object, and use its address in lists, ie compare for nil instead of NULL. This would avoid : - The conditional test in some prefetch() implementations - The potential TLB problem with the NULL value. You avoid those two, but add the very high likelyhood that a statement of the form 'if (!x) {}' will creep back in and bug out. I doubt that changing the form of a basic C idiom to save a few cycles is worth it, especially when the (!x) form can be tested without a memory fetch, but (x != nil) may require fetching 'nil' (remember - the x86 is a very popular chipset, but register starved - if the loop is any size at all, 'nil' may require a reload each time around, costing you the win you thought you had pgpWRgqXbzEAw.pgp Description: PGP signature
Re: [patch] epoll use a single inode ...
+ its done only when the path is needed.). Real filesystems probably + dont want to use it, because their dentries are present in global Pedantry: it's and don't? -Bob - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Mar 08, 2007, at 02:24:04, Eric Dumazet wrote: Kyle Moffett a écrit : Prefetching is also fairly critical on a Power4 or G5 PowerPC system as they have a long memory latency; an L2-cache miss can cost 200+ cycles. On such systems the dcbt prefetch instruction brings in a single 128-byte cacheline and has no serializing effects whatsoever, making it ideal for use in a linked-list- traversal inner loop. OK, 200 cycles... But what is the cost of the conditional branch you added in prefetch (x) ? if (!x) return; (correctly predicted or not, but do powerPC have a BTB ?) Well, PowerPC dcbt does prefetch() correctly, it doesn't ever raise exceptions, doesn't have any side effects, takes only enough CPU to decode the address, and is ignored if it would have to do anything other than load the cacheline from RAM. Prefetch streams are halted when they reach the end of a page boundary (no trapping to the MMU) and if the TLB entry isn't present then they would asynchronously abort. This is a suitable prefetch implementation for G5: #define prefetch(x) __asm__ __volatile__ (dcbt 0, %0::r(x)); On GCC4+ the __builtin_prefetch built-in functions is probably mildly better tuned for most architectures: #define PREFETCH_RD 0 #define PREFETCH_WR 1 #define PREFETCH_LOCALITY_NONE 0 #define PREFETCH_LOCALITY_LOW 1 #define PREFETCH_LOCALITY_MED 2 #define PREFETCH_LOCALITY_HIGH 3 #define prefetch(x, args...) __builtin_prefetch(x ,##args) The pertinent GCC docs: This function is used to minimize cache-miss latency by moving data into a cache before it is accessed. You can insert calls to __builtin_prefetch into code for which you know addresses of data in memory that is likely to be accessed soon. If the target supports them, data prefetch instructions will be generated. If the prefetch is done early enough before the access then the data will be in the cache by the time it is accessed. The value of addr is the address of the memory to prefetch. There are two optional arguments, rw and locality. The value of rw is a compile-time constant one or zero; one means that the prefetch is preparing for a write to the memory address and zero, the default, means that the prefetch is preparing for a read. The value locality must be a compile-time constant integer between zero and three. A value of zero means that the data has no temporal locality, so it need not be left in the cache after the access. A value of three means that the data has a high degree of temporal locality and should be left in all levels of cache possible. Values of one and two mean, respectively, a low or moderate degree of temporal locality. The default is three. for (i = 0; i n; i++) { a[i] = a[i] + b[i]; __builtin_prefetch (a[i+j], 1, 1); __builtin_prefetch (b[i+j], 0, 1); /* ... */ } Data prefetch does not generate faults if addr is invalid, but the address expression itself must be valid. For example, a prefetch of p-next will not fault if p-next is not a valid address, but evaluation will fault if p is not a valid address. If the target does not support data prefetch, the address expression is evaluated if it includes side effects but no other code is generated and GCC does not issue a warning. Cheers, Kyle Moffett - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
Yeah, I'm not at all surprised. Any implementation of prefetch that doesn't just turn into a no-op if the TLB entry doesn't exist (which makes them weaker for *actual* prefetching) will generally have a hard time with a NULL pointer. Exactly because it will try to do a totally unnecessary TLB fill - and since most CPU's will not cache negative TLB entries, that unnecessary TLB fill will be done over and over and over again.. Yeah this is exactly what we were seeing :) Anton - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
OK, 200 cycles... But what is the cost of the conditional branch you added in prefetch(x) ? Much less than the tablewalk. On ppc64 a tablewalk of an address that is not populated in the hashtable will incur 2 cacheline lookups (primary and secondary buckets). This plus the MMU state machine overhead adds up. Cue Linus rant about PowerPC MMU :) Anton - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
Hi, Well, PowerPC dcbt does prefetch() correctly, it doesn't ever raise exceptions, doesn't have any side effects, takes only enough CPU to decode the address, and is ignored if it would have to do anything other than load the cacheline from RAM. Prefetch streams are halted when they reach the end of a page boundary (no trapping to the MMU) and if the TLB entry isn't present then they would asynchronously abort. It depends on the implementation and the HID bit settings. Some do walk the MMU hashtable if it isnt in the TLB. Anton - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On 3/7/07, Linus Torvalds [EMAIL PROTECTED] wrote: No, I just checked, and Intel's own optimization manual makes it clear that you should be careful. They talk about performance penalties due to resource constraints - which makes tons of sense with a core that is good at handling its own resources and could quite possibly use those resources better to actually execute the loads and stores deeper down the instruction pipeline. Certainly you should be careful -- and that usually means leaving it up to the compiler. But hinting to the compiler can help; there may be an analogue of the (un)likely macros waiting to be implemented for loop prefetch. And while out-of-order execution and fancy hardware prefetch streams greatly reduce the need for explicit prefetch in general code, there's no substitute for the cache _bypassing_ instructions when trying to avoid excessive cache eviction in DDoS situations. For instance, if I wind up working on a splay-tree variant of Robert Olsson's trie/hash work, I'll try to measure the effect of using SSE2 non-temporal stores to write half-open connections to the leaves of the tree. That may give some additional improvement in the ability to keep servicing real load during a SYN flood. So it's not just 3DNow! making AMD look bad, or Intel would obviously suggest people use it out of the wazoo ;) Intel puts a lot of effort into educating compiler writers about the optimal prefetch insertion strategies for particular cache architectures. At the same time, they put out the orange cones to warn people off of hand-tuning prefetch placement using micro-benchmarks. People did that when 3DNow! first came out, with predictable consequences. XScale gets it right. Blah. XScale isn't even an OoO CPU, *of*course* it needs prefetching. Calling that getting it right is ludicrous. If anything, it gets things so wrong that prefetching is *required* for good performance. That's not an accident. Hardware prefetch units cost a lot in power consumption. Omitting the hardware prefetch unit and drastically simplifying the pipeline is how they got a design whose clock they could crank into the stratosphere and still run on battery power. And in the network processor space, they can bolt a myriad of on-chip microengines and still have some prayer of accurately simulating the patterns of internal bus cycles. Errors in simulation can still be fixed up with prefetch instruction placement to put memory accesses from the XScale core into phases where the data path processors aren't working so hard. Moreover, because they're embedded targets and rarely have to run third-party binaries originally compiled for older cores, it didn't really cost them anything to say, Sorry, this chip's performance is going to suck if your compiler's prefetch insertion isn't properly tuned. The _only_ cost is a slightly less dense instruction stream. That's not trivial but it's manageable; you budget for it, and the increase in I-cache power consumption is more than made up for by the reduction in erroneous data prefetches (hardware prefetch gets it wrong a substantial fraction of the time). I'm talking about real CPU's with real memory pipelines that already do prefetching in hardware. The better the core is, the less the prefetch helps (and often the more it hurts in comparison to how much it helps). The more sophisticated the core is, the less software prefetch instructions help. But more sophisticated isn't always better; it depends on your target applications. But if you mean doesn't try to fill the TLB on data prefetches, then yes, that's generally the right thing to do. AOL. (Oddly, Prescott seems to have initiated a page table walk on DTLB miss during software prefetch -- just one of many weird Prescott flaws.) Netburst in general is *very* happy to do speculative TLB fills, I think. Design by micro-benchmark. :-) They set out to push the headline MHz and the real memory bandwidth to the limit in Prescott, and they succeeded (data at http://www.digit-life.com/articles2/rmma/rmma-p4.html). At a horrible cost in power per clock, and no gain in real application performance. So NetBurst died a horrible death, and now we have Intel Core -- P6 warmed over, with caches sized such that for most applications the second core ought to be used solely to soak up control path overheads. Windows runs better on dual-core machines because the NT kernel will happily eat an entire core doing memory bookkeeping. Linux could take a hint here and use the second core largely for interrupt handling and force-priming the L2 cache on task switch. (Prefetch instructions aren't much use here, precisely because they give up on DTLB miss.) Any kernel code paths that are known to stall a lot because of usually-cold-cache access patterns (TCP connection establishment, for instance) can also be punted over to the second core. If you're feeling industrious, use non-temporal memory accesses judiciously in
Re: [patch] epoll use a single inode ...
This patch needs a lot more documentation. It needs some really big comments on why this should never ever be used for a real filesystem (real as in user mountable), and probably add an assert for that invariant somewhere. Please also update Documentation/filesystems/Locking and Documentation/filesystems/vfs.txt for it. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Thursday 08 March 2007 09:56, Christoph Hellwig wrote: This patch needs a lot more documentation. It needs some really big comments on why this should never ever be used for a real filesystem (real as in user mountable), and probably add an assert for that invariant somewhere. Please also update Documentation/filesystems/Locking and Documentation/filesystems/vfs.txt for it. Thank you for your feedback Christoph Here is the version I was about to push to -mm, so if nobody complains, I ask Andrew to push it to mm so that it can reach 2.6.22 target [PATCH] Delay the dentry name generation on sockets and pipes. 1) Introduces a new method in 'struct dentry_operations'. This method called d_dname() might be called from d_path() to build a pathname for special filesystems. It is called without locks. Future patches (if we succeed in having one common dentry for all pipes) may need to change prototype of this method, but we now use : char *d_dname(struct dentry *dentry, char *buffer, int buflen) 2) Use this new method for sockets : No more sprintf() at socket creation. This is delayed up to the moment someone does an access to /proc/pid/fd/... 3) Use this new method for pipes : No more sprintf() at pipe creation. This is delayed up to the moment someone does an access to /proc/pid/fd/... A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a *nice* speedup on my Pentium(M) 1.6 Ghz : 3.090 s instead of 3.450 s Signed-off-by: Eric Dumazet [EMAIL PROTECTED] CC: Christoph Hellwig [EMAIL PROTECTED] CC: Linus Torvalds [EMAIL PROTECTED] CC: Davide Libenzi davidel@xmailserver.org CC: Al Viro [EMAIL PROTECTED] Documentation/filesystems/Locking |2 ++ Documentation/filesystems/vfs.txt | 12 +++- fs/dcache.c |3 +++ fs/pipe.c | 12 +--- include/linux/dcache.h|1 + net/socket.c | 13 ++--- 6 files changed, 36 insertions(+), 7 deletions(-) --- linux-2.6.21-rc3/include/linux/dcache.h 2007-03-07 17:23:55.0 +0100 +++ linux-2.6.21-rc3-ed/include/linux/dcache.h 2007-03-08 10:14:38.0 +0100 @@ -133,6 +133,7 @@ struct dentry_operations { int (*d_delete)(struct dentry *); void (*d_release)(struct dentry *); void (*d_iput)(struct dentry *, struct inode *); + char * (*d_dname)(struct dentry *, char *, int); }; /* the dentry parameter passed to d_hash and d_compare is the parent --- linux-2.6.21-rc3/Documentation/filesystems/vfs.txt 2007-03-08 10:14:38.0 +0100 +++ linux-2.6.21-rc3-ed/Documentation/filesystems/vfs.txt 2007-03-08 10:34:34.0 +0100 @@ -827,7 +827,7 @@ This describes how a filesystem can over operations. Dentries and the dcache are the domain of the VFS and the individual filesystem implementations. Device drivers have no business here. These methods may be set to NULL, as they are either optional or -the VFS uses a default. As of kernel 2.6.13, the following members are +the VFS uses a default. As of kernel 2.6.22, the following members are defined: struct dentry_operations { @@ -837,6 +837,7 @@ struct dentry_operations { int (*d_delete)(struct dentry *); void (*d_release)(struct dentry *); void (*d_iput)(struct dentry *, struct inode *); + char * (*d_dname)(struct dentry *, char *, int); }; d_revalidate: called when the VFS needs to revalidate a dentry. This @@ -859,6 +860,15 @@ struct dentry_operations { VFS calls iput(). If you define this method, you must call iput() yourself + d_dname: called when the pathname of a dentry should be generated. + Usefull for some pseudo filesystems (sockfs, pipefs, ...) to delay + pathname generation. (Instead of doing it when dentry is created, + its done only when the path is needed.). Real filesystems probably + dont want to use it, because their dentries are present in global + dcache hash, so their hash should be an invariant. As no lock is + held, d_dname() should not try to modify the dentry itself, unless + appropriate SMP safety is used. + Each dentry has a pointer to its parent dentry, as well as a hash list of child dentries. Child dentries are basically like files in a directory. --- linux-2.6.21-rc3/Documentation/filesystems/Locking 2007-03-08 10:29:04.0 +0100 +++ linux-2.6.21-rc3-ed/Documentation/filesystems/Locking 2007-03-08 10:29:04.0 +0100 @@ -15,6 +15,7 @@ prototypes: int (*d_delete)(struct dentry *); void (*d_release)(struct dentry *); void (*d_iput)(struct dentry *, struct inode *); + char * (*d_dname)((struct dentry *dentry, char *buffer, int buflen); locking rules: none have BKL @@ -25,6 +26,7 @@ d_compare:no yes no no d_delete: yes no yes no d_release: no
Re: [patch] epoll use a single inode ...
On Thu, Mar 08, 2007 at 10:42:21AM +0100, Eric Dumazet wrote: @@ -1823,6 +1823,9 @@ char * d_path(struct dentry *dentry, str struct vfsmount *rootmnt; struct dentry *root; + if (dentry-d_op dentry-d_op-d_dname) + return (dentry-d_op-d_dname)(dentry, buf, buflen); Please don't put braces around the function pointer call. Also I think we really want a comment here why we need this call. +static char * pipefs_dname(struct dentry *dentry, char *buffer, int buflen) normally this would be: static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen) +static char * sockfs_dname(struct dentry *dentry, char *buffer, int buflen) same here. - char name[32]; - this.len = sprintf(name, [%lu], SOCK_INODE(sock)-i_ino); - this.name = name; + this.len = 0; + this.name = ; this.hash = 0; Can you add a helper to init this one? - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
Thanks again for your feedback Christoph :) I think I addressed all your remarks. Thank you [PATCH] Delay the dentry name generation on sockets and pipes. 1) Introduces a new method in 'struct dentry_operations'. This method called d_dname() might be called from d_path() to build a pathname for special filesystems. It is called without locks. Future patches (if we succeed in having one common dentry for all pipes) may need to change prototype of this method, but we now use : char *d_dname(struct dentry *dentry, char *buffer, int buflen) 2) Use this new method for sockets : No more sprintf() at socket creation. This is delayed up to the moment someone does an access to /proc/pid/fd/... 3) Use this new method for pipes : No more sprintf() at pipe creation. This is delayed up to the moment someone does an access to /proc/pid/fd/... A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a *nice* speedup on my Pentium(M) 1.6 Ghz : 3.090 s instead of 3.450 s Signed-off-by: Eric Dumazet [EMAIL PROTECTED] CC: Christoph Hellwig [EMAIL PROTECTED] CC: Linus Torvalds [EMAIL PROTECTED] CC: Davide Libenzi davidel@xmailserver.org CC: Al Viro [EMAIL PROTECTED] Documentation/filesystems/Locking |2 ++ Documentation/filesystems/vfs.txt | 12 +++- fs/dcache.c |7 +++ fs/pipe.c | 15 +-- include/linux/dcache.h|1 + net/socket.c | 17 ++--- 6 files changed, 40 insertions(+), 14 deletions(-) --- linux-2.6.21-rc3/include/linux/dcache.h 2007-03-07 17:23:55.0 +0100 +++ linux-2.6.21-rc3-ed/include/linux/dcache.h 2007-03-08 11:57:41.0 +0100 @@ -133,6 +133,7 @@ struct dentry_operations { int (*d_delete)(struct dentry *); void (*d_release)(struct dentry *); void (*d_iput)(struct dentry *, struct inode *); + char *(*d_dname)(struct dentry *, char *, int); }; /* the dentry parameter passed to d_hash and d_compare is the parent --- linux-2.6.21-rc3/Documentation/filesystems/vfs.txt 2007-03-08 10:14:38.0 +0100 +++ linux-2.6.21-rc3-ed/Documentation/filesystems/vfs.txt 2007-03-08 10:34:34.0 +0100 @@ -827,7 +827,7 @@ This describes how a filesystem can over operations. Dentries and the dcache are the domain of the VFS and the individual filesystem implementations. Device drivers have no business here. These methods may be set to NULL, as they are either optional or -the VFS uses a default. As of kernel 2.6.13, the following members are +the VFS uses a default. As of kernel 2.6.22, the following members are defined: struct dentry_operations { @@ -837,6 +837,7 @@ struct dentry_operations { int (*d_delete)(struct dentry *); void (*d_release)(struct dentry *); void (*d_iput)(struct dentry *, struct inode *); + char *(*d_dname)(struct dentry *, char *, int); }; d_revalidate: called when the VFS needs to revalidate a dentry. This @@ -859,6 +860,15 @@ struct dentry_operations { VFS calls iput(). If you define this method, you must call iput() yourself + d_dname: called when the pathname of a dentry should be generated. + Usefull for some pseudo filesystems (sockfs, pipefs, ...) to delay + pathname generation. (Instead of doing it when dentry is created, + its done only when the path is needed.). Real filesystems probably + dont want to use it, because their dentries are present in global + dcache hash, so their hash should be an invariant. As no lock is + held, d_dname() should not try to modify the dentry itself, unless + appropriate SMP safety is used. + Each dentry has a pointer to its parent dentry, as well as a hash list of child dentries. Child dentries are basically like files in a directory. --- linux-2.6.21-rc3/Documentation/filesystems/Locking 2007-03-08 10:29:04.0 +0100 +++ linux-2.6.21-rc3-ed/Documentation/filesystems/Locking 2007-03-08 10:29:04.0 +0100 @@ -15,6 +15,7 @@ prototypes: int (*d_delete)(struct dentry *); void (*d_release)(struct dentry *); void (*d_iput)(struct dentry *, struct inode *); + char *(*d_dname)((struct dentry *dentry, char *buffer, int buflen); locking rules: none have BKL @@ -25,6 +26,7 @@ d_compare:no yes no no d_delete: yes no yes no d_release: no no no yes d_iput:no no no yes +d_dname: no no no no --- inode_operations --- prototypes: --- linux-2.6.21-rc3/fs/dcache.c2007-03-07 17:23:55.0 +0100 +++ linux-2.6.21-rc3-ed/fs/dcache.c 2007-03-08 11:57:41.0 +0100 @@ -1823,6 +1823,13 @@ char
Re: [patch] epoll use a single inode ...
I'm sorry about complaining again, but.. + /* + * Some filesystems want to provide dentry's pathname themselves, + * instead of pre-building names at dentry creation. + */ It's not _some_ filesystems. If real filesystem did this we'd be in horrible trouble. It's synthetic, non-mountable filesystem. Except for this the patch looks fine to me, thanks Eric! - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
Kyle Moffett a écrit : Prefetching is also fairly critical on a Power4 or G5 PowerPC system as they have a long memory latency; an L2-cache miss can cost 200+ cycles. On such systems the "dcbt" prefetch instruction brings in a single 128-byte cacheline and has no serializing effects whatsoever, making it ideal for use in a linked-list-traversal inner loop. OK, 200 cycles... But what is the cost of the conditional branch you added in prefetch(x) ? if (!x) return; (correctly predicted or not, but do powerPC have a BTB ?) About the NULL 'potential problem', maybe we could use a dummy nil (but mapped) object, and use its address in lists, ie compare for instead of NULL. This would avoid : - The conditional test in some prefetch() implementations - The potential TLB problem with the NULL value. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Wed, 7 Mar 2007, Michael K. Edwards wrote: > > People's prejudices against prefetch instructions are sometimes > traceable to the 3DNow! prefetch(w) botch, which some processors > "support" as no-ops and others are too aggressive about (Opteron > prefetches are reputed to be "strong", i. e., not dropped on DTLB > miss). No, I just checked, and Intel's own optimization manual makes it clear that you should be careful. They talk about performance penalties due to resource constraints - which makes tons of sense with a core that is good at handling its own resources and could quite possibly use those resources better to actually execute the loads and stores deeper down the instruction pipeline. So it's not just 3DNow! making AMD look bad, or Intel would obviously suggest people use it out of the wazoo ;) > XScale gets it right. Blah. XScale isn't even an OoO CPU, *of*course* it needs prefetching. Calling that "getting it right" is ludicrous. If anything, it gets things so wrong that prefetching is *required* for good performance. I'm talking about real CPU's with real memory pipelines that already do prefetching in hardware. The better the core is, the less the prefetch helps (and often the more it hurts in comparison to how much it helps). But if you mean "doesn't try to fill the TLB on data prefetches", then yes, that's generally the right thing to do. > (Oddly, Prescott seems to have initiated a page table walk on DTLB miss > during software prefetch -- just one of many weird Prescott flaws.) Netburst in general is *very* happy to do speculative TLB fills, I think. > I'm guessing Pentium M and its descendants (Core Solo and Duo) get it > right but I'm having a hell of a time finding out for sure. Can any of > the x86 experts answer this? I just suspect that the upside for Core 2 Due is likely fairly low. The L2 cache is good, the memory re-ordering is working.. I doubt "prefetch" helps in generic code that much for things like linked list following, you should probably limit it to code that has *known* access patterns and you know it's not going to be in the cache. (In other words, I bet prefetching can help a lot with MMX/media kind of code, I doubt it's a huge win for "for_each_entry()") Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Mar 07, 2007, at 20:25:14, Michael K. Edwards wrote: On 3/7/07, Linus Torvalds <[EMAIL PROTECTED]> wrote In general, using software prefetching is just a stupid idea, unless - the prefetch really is very strict (ie for a linked list you do exactly the above kinds of things to make sure that you don't try to prefetch the non-existent end entry) AND - the CPU is stupid (in-order in particular). I think Intel even suggests in their optimization manuals to *not* do software prefetching, because hw can usually simply do better without it. Not the XScale -- it performs quite poorly without prefetch, as people who have run ARMv5-optimized binaries on it can testify. The Intel XScale(r) core prefetch load instruction is a true prefetch instruction because the load destination is the data or mini-data cache and not a register. Compilers for processors which have data caches, but do not support prefetch, sometimes use a load instruction to preload the data cache. This technique has the disadvantages of using a register to load data and requiring additional registers for subsequent preloads and thus increasing register pressure. By contrast, the prefetch can be used to reduce register pressure instead of increasing it. The prefetch load is a hint instruction and does not guarantee that the data will be loaded. Whenever the load would cause a fault or a table walk, then the processor will ignore the prefetch instruction, the fault or table walk, and continue processing the next instruction. This is particularly advantageous in the case where a linked list or recursive data structure is terminated by a NULL pointer. Prefetching the NULL pointer will not fault program flow. Prefetching is also fairly critical on a Power4 or G5 PowerPC system as they have a long memory latency; an L2-cache miss can cost 200+ cycles. On such systems the "dcbt" prefetch instruction brings in a single 128-byte cacheline and has no serializing effects whatsoever, making it ideal for use in a linked-list-traversal inner loop. Cheers, Kyle Moffett - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On 3/7/07, Linus Torvalds <[EMAIL PROTECTED]> wrote Yeah, I'm not at all surprised. Any implementation of "prefetch" that doesn't just turn into a no-op if the TLB entry doesn't exist (which makes them weaker for *actual* prefetching) will generally have a hard time with a NULL pointer. Exactly because it will try to do a totally unnecessary TLB fill - and since most CPU's will not cache negative TLB entries, that unnecessary TLB fill will be done over and over and over again.. Data prefetch instructions should indeed avoid page table walks. (Instruction prefetch mechanisms often do induce table walks on ITLB miss.) Not just because of the null pointer case, but because it's quite normal to run off the end of an array in a loop with an embedded prefetch instruction. If you have an extra instruction issue unit that shares the same DTLB, and you know you will really want that data, you can sometimes use it to force DTLB preloads by doing an actual data fetch from the foreseeable page. This is potentially one of the best uses of chip multi-threading on an architecture like Sun's Niagara. (I don't think Intel's hyper-threading works for this purpose; the DTLB is shared but the entries are marked as owned by one thread or the other. HT can be used for L2 cache prefetching, although the results so far seem to be mixed: http://www.cgo.org/cgo2004/papers/02_80_Kim_D_REVISED.pdf) In general, using software prefetching is just a stupid idea, unless - the prefetch really is very strict (ie for a linked list you do exactly the above kinds of things to make sure that you don't try to prefetch the non-existent end entry) AND - the CPU is stupid (in-order in particular). I think Intel even suggests in their optimization manuals to *not* do software prefetching, because hw can usually simply do better without it. Not the XScale -- it performs quite poorly without prefetch, as people who have run ARMv5-optimized binaries on it can testify. From the XScale Core Developer's Manual: The Intel XScale(r) core has a true prefetch load instruction (PLD). The purpose of this instruction is to preload data into the data and mini-data caches. Data prefetching allows hiding of memory transfer latency while the processor continues to execute instructions. The prefetch is important to compiler and assembly code because judicious use of the prefetch instruction can enormously improve throughput performance of the core. Data prefetch can be applied not only to loops but also to any data references within a block of code. Prefetch also applies to data writing when the memory type is enabled as write allocate The Intel XScale(r) core prefetch load instruction is a true prefetch instruction because the load destination is the data or mini-data cache and not a register. Compilers for processors which have data caches, but do not support prefetch, sometimes use a load instruction to preload the data cache. This technique has the disadvantages of using a register to load data and requiring additional registers for subsequent preloads and thus increasing register pressure. By contrast, the prefetch can be used to reduce register pressure instead of increasing it. The prefetch load is a hint instruction and does not guarantee that the data will be loaded. Whenever the load would cause a fault or a table walk, then the processor will ignore the prefetch instruction, the fault or table walk, and continue processing the next instruction. This is particularly advantageous in the case where a linked list or recursive data structure is terminated by a NULL pointer. Prefetching the NULL pointer will not fault program flow. People's prejudices against prefetch instructions are sometimes traceable to the 3DNow! prefetch(w) botch, which some processors "support" as no-ops and others are too aggressive about (Opteron prefetches are reputed to be "strong", i. e., not dropped on DTLB miss). XScale gets it right. So do most Pentium 4's using the SSE prefetches, according to the IA-32 optimization manual. (Oddly, Prescott seems to have initiated a page table walk on DTLB miss during software prefetch -- just one of many weird Prescott flaws.) I'm guessing Pentium M and its descendants (Core Solo and Duo) get it right but I'm having a hell of a time finding out for sure. Can any of the x86 experts answer this? Cheers, - Michael - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Wed, 7 Mar 2007, Anton Blanchard wrote: > > Funny you mention this. We found some noticeable ppc64 regressions when > moving the dcache to standard list macros and had to do this to fix it > up: > > static inline void prefetch(const void *x) > { > if (unlikely(!x)) > return; > > __asm__ __volatile__ ("dcbt 0,%0" : : "r" (x)); > } > > Urgh :) Yeah, I'm not at all surprised. Any implementation of "prefetch" that doesn't just turn into a no-op if the TLB entry doesn't exist (which makes them weaker for *actual* prefetching) will generally have a hard time with a NULL pointer. Exactly because it will try to do a totally unnecessary TLB fill - and since most CPU's will not cache negative TLB entries, that unnecessary TLB fill will be done over and over and over again.. In general, using software prefetching is just a stupid idea, unless - the prefetch really is very strict (ie for a linked list you do exactly the above kinds of things to make sure that you don't try to prefetch the non-existent end entry) AND - the CPU is stupid (in-order in particular). I think Intel even suggests in their optimization manuals to *not* do software prefetching, because hw can usually simply do better without it. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
Hi, > Well, I hope a prefetch(NULL) is OK because we are doing millions of > them (see include/linux/list.h :) ) Funny you mention this. We found some noticeable ppc64 regressions when moving the dcache to standard list macros and had to do this to fix it up: static inline void prefetch(const void *x) { if (unlikely(!x)) return; __asm__ __volatile__ ("dcbt 0,%0" : : "r" (x)); } Urgh :) Anton - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Wed, 7 Mar 2007, Eric Dumazet wrote: > > I was thinking about being able to cache the name into the dentry, do you > think it's worth the pain ? (its not SMP safe for example...) Actually, it *can* be SMP-safe, if you do it right. Something like len = dentry->d_name.len; if (!len) { len = snprintf(dentry->d_iname, DNAME_INLINE_LEN_MIN, "pipe:[%lu]", dentry->d_inode->i_ino); smp_wmb(); dentry->d_name.len = len; } if (len >= buflen) len = buflen-1; memcpy(buffer, dentry->d_iname, len); buffer[len] = 0; return buffer; should work, although it depends on the fact that our snprintf() implementation should be "stable" (ie if snprintf() modifies the buffer temporarily as it goes along, that would break, but I think our vsnprintf is good in that respect). So you could have two different CPU's doing the snprintf() on the same buffer at the same time (and assigning the length at the same time), but since they'll write the same thing, you don't really care. It's a bit subtle, though. And probably not really worth it. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
> Not only might memcpy() do a "prefetch for read" on the source for some > architectures (which in turn may end up being slow for an address that > isn't in the TLB, like NULL), but you depend on a very much internal Well, I hope a prefetch(NULL) is OK because we are doing millions of them (see include/linux/list.h :) ) > detail, since it *could* have been using something like > > memcpy(dname, name->name, name->len+1); > Yes very true, I will change that and push to Andrew for mm testing I was thinking about being able to cache the name into the dentry, do you think it's worth the pain ? (its not SMP safe for example...) static char * pipefs_dname(struct dentry *dentry, char *buffer, int buflen) { if (!dentry->d_iname[0]) snprintf(dentry->d_iname, DNAME_INLINE_LEN_MIN, "pipe:[%lu]", dentry->d_inode->i_ino); strlcpy(buffer, dentry->d_iname, buflen); return buffer; } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Wed, 7 Mar 2007, Eric Dumazet wrote: > > OK no problem here is the patch without messing f_path.mnt All right. I like it. Definitely worth putting into -mm, or just re-sending to me after 2.6.21 is out (I'll forget all about it otherwise). I have one more worry, namely this:: - char name[32]; - this.len = sprintf(name, "[%lu]", SOCK_INODE(sock)->i_ino); - this.name = name; + this.len = 0; + this.name = NULL; I think that's fine, and it *happens* to work, but it happens to work just because then d_alloc() will do: memcpy(dname, name->name, name->len); dname[name->len] = 0; and passing in NULL to memcpy() is generally ok when len is 0. HOWEVER. Not only might memcpy() do a "prefetch for read" on the source for some architectures (which in turn may end up being slow for an address that isn't in the TLB, like NULL), but you depend on a very much internal detail, since it *could* have been using something like memcpy(dname, name->name, name->len+1); instead, and expected to get the final '\0' character from the source string. So I would actually much prefer it to be written as this.len = 0; this.name = ""; just because it's safer. But other than that small detail, I think this is not only an optimization, it's an actual cleanup, and we migth some day want to use something like this for some other things too (eg maybe this kind of approach is usable for /proc/ entries too, to avoid instantiating them). As to avoiding the mntget(), I'm not violently opposed to it, but I do think that it's a totally unrelated matter, so even if it's decided it's worth it, it should be done as a separate patch regardless. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Wednesday 07 March 2007 18:45, Linus Torvalds wrote: > On Wed, 7 Mar 2007, Eric Dumazet wrote: > > sockets already uses file->private_data. > > > > But calls to read()/write() (not send()/recv()) still need to go through > > the dentry, before entering socket land. > > Sure. The dentry and the inode need to *exist*, but they can be one single > static dentry/inode per "file descriptor type". > > We always pass in the "struct file *" to read/write too, since we need it > anyway for things like file control information (eg "is it a nonblocking > read or write" kinds of things). > > So I'm not suggesting a NULL dentry/inode, I'm suggesting a single static > one per type. > > And yeah, it may be harder than it looks. Some things "know" that all the > relevant info is in the inode, so they just pass in the inode. In the pipe > layer, for example, you'd need to change free_pipe_info() and > alloc_pipe_info() to pass in the file descriptor instead, same goes for > pipe_release(). But the "struct file *" is always available, it's just > that since the code was originally written to have all the info in the > inode, some of the code isn't set up to use it or pass it on.. > > But your patch is independent of that, and looks fine. Except I don't like > this part: > > - file->f_path.mnt = mntget(sock_mnt); > + file->f_path.mnt = NULL; > > since I'd be much happer with always having f_path.mnt available, the same > way we should always have f_path.dentry there. Yes, but mntget()/mntput() are protected against NULL. I was quite happy to remove two locked operations :) I didnt found a way to crash (yet) my patched machine :) > > (Btw, your patch is *not* going to work with the file->f_private_data > approach, because d_path() is not passed down the "file *" thing. So we'd > need to do that, and that's more intrusive (it can be NULL, since for > things like cwd/pwd we don't have a "struct file"). I tried this path today and failed... Too many changes to do (nameidata) to propagate a 'struct file *' appropriately... > > But I like your patch as a totally independent thing. "It just makes > sense". > > (Apart from the f_path.mnt thing, which I think was something else ;) OK no problem here is the patch without messing f_path.mnt (benchmark results not really different on my little machine, SMP kernel but one CPU only... maybe because lock suffix is changed by a nop) [PATCH] Delay the dentry name generation on sockets and pipes. 1) Introduces a new method in 'struct dentry_operations'. This method called d_dname() might be called from d_path() to be able to provide a dentry name for special filesystems. It is called without locks. Future patches (if we succeed in having one common dentry for all pipes) may need to change prototype of this method, but we now use : char *d_dname(struct dentry *dentry, char *buffer, int buflen) 2) Use this new method for sockets : No more sprintf() at socket creation. This is delayed up to the moment someone does an access to /proc/pid/fd/... 3) Use this new method for pipes : No more sprintf() at pipe creation. This is delayed up to the moment someone does an access to /proc/pid/fd/... A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a *nice* speedup on my Pentium(M) 1.6 Ghz : 3.090 s instead of 3.450 s Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]> fs/dcache.c|3 +++ fs/pipe.c | 12 +--- include/linux/dcache.h |1 + net/socket.c | 13 ++--- 4 files changed, 23 insertions(+), 6 deletions(-) --- linux-2.6.21-rc3/include/linux/dcache.h 2007-03-07 17:23:55.0 +0100 +++ linux-2.6.21-rc3-ed/include/linux/dcache.h 2007-03-07 17:27:39.0 +0100 @@ -133,6 +133,7 @@ struct dentry_operations { int (*d_delete)(struct dentry *); void (*d_release)(struct dentry *); void (*d_iput)(struct dentry *, struct inode *); + char * (*d_dname)(struct dentry *, char *, int); }; /* the dentry parameter passed to d_hash and d_compare is the parent --- linux-2.6.21-rc3/fs/dcache.c2007-03-07 17:23:55.0 +0100 +++ linux-2.6.21-rc3-ed/fs/dcache.c 2007-03-07 17:28:46.0 +0100 @@ -1823,6 +1823,9 @@ char * d_path(struct dentry *dentry, str struct vfsmount *rootmnt; struct dentry *root; + if (dentry->d_op && dentry->d_op->d_dname) + return (dentry->d_op->d_dname)(dentry, buf, buflen); + read_lock(>fs->lock); rootmnt = mntget(current->fs->rootmnt); root = dget(current->fs->root); --- linux-2.6.21-rc3/fs/pipe.c 2007-03-07 17:42:36.0 +0100 +++ linux-2.6.21-rc3-ed/fs/pipe.c 2007-03-07 18:54:33.0 +0100 @@ -841,8 +841,15 @@ static int pipefs_delete_dentry(struct d return 0; } +static char * pipefs_dname(struct dentry *dentry, char *buffer, int buflen) +{ + snprintf(buffer, buflen, "pipe:[%lu]", dentry->d_inode->i_ino); +
Re: [patch] epoll use a single inode ...
On Wed, 7 Mar 2007, Eric Dumazet wrote: > > sockets already uses file->private_data. > > But calls to read()/write() (not send()/recv()) still need to go through the > dentry, before entering socket land. Sure. The dentry and the inode need to *exist*, but they can be one single static dentry/inode per "file descriptor type". We always pass in the "struct file *" to read/write too, since we need it anyway for things like file control information (eg "is it a nonblocking read or write" kinds of things). So I'm not suggesting a NULL dentry/inode, I'm suggesting a single static one per type. And yeah, it may be harder than it looks. Some things "know" that all the relevant info is in the inode, so they just pass in the inode. In the pipe layer, for example, you'd need to change free_pipe_info() and alloc_pipe_info() to pass in the file descriptor instead, same goes for pipe_release(). But the "struct file *" is always available, it's just that since the code was originally written to have all the info in the inode, some of the code isn't set up to use it or pass it on.. But your patch is independent of that, and looks fine. Except I don't like this part: - file->f_path.mnt = mntget(sock_mnt); + file->f_path.mnt = NULL; since I'd be much happer with always having f_path.mnt available, the same way we should always have f_path.dentry there. (Btw, your patch is *not* going to work with the file->f_private_data approach, because d_path() is not passed down the "file *" thing. So we'd need to do that, and that's more intrusive (it can be NULL, since for things like cwd/pwd we don't have a "struct file"). But I like your patch as a totally independent thing. "It just makes sense". (Apart from the f_path.mnt thing, which I think was something else ;) Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
(resending with a more convenient attachment) Please find enclosed the following patch, to prepare this path. [PATCH] Delay the dentry name generation on sockets and pipes. 1) Introduces a new method in 'struct dentry_operations'. This method called d_dname() might be called from d_path() to be able to provide a dentry name for special filesystems. It is called without locks. Future patches (if we succeed in having one common dentry for all pipes) may need to change prototype of this method, but we now use : char *d_dname(struct dentry *dentry, char *buffer, int buflen) 2) Use this new method for sockets : No more sprintf() at socket creation. This is delayed up to the moment someone does an access to /proc/pid/fd/... We also avoid mntput()/mntget() on sock_mnt 3) Use this new method for pipes : No more sprintf() at pipe creation. This is delayed up to the moment someone does an access to /proc/pid/fd/... We also avoid mntput()/mntget() on pipe_mnt A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a *nice* speedup on my Pentium(M) 1.6 Ghz : 3.090 s instead of 3.450 s Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]> fs/dcache.c | 3 +++ fs/pipe.c | 16 +++- include/linux/dcache.h | 1 + net/socket.c | 15 +++ 4 files changed, 26 insertions(+), 9 deletions(-) --- linux-2.6.21-rc3/include/linux/dcache.h 2007-03-07 17:23:55.0 +0100 +++ linux-2.6.21-rc3-ed/include/linux/dcache.h 2007-03-07 17:27:39.0 +0100 @@ -133,6 +133,7 @@ struct dentry_operations { int (*d_delete)(struct dentry *); void (*d_release)(struct dentry *); void (*d_iput)(struct dentry *, struct inode *); + char * (*d_dname)(struct dentry *, char *, int); }; /* the dentry parameter passed to d_hash and d_compare is the parent --- linux-2.6.21-rc3/fs/dcache.c2007-03-07 17:23:55.0 +0100 +++ linux-2.6.21-rc3-ed/fs/dcache.c 2007-03-07 17:28:46.0 +0100 @@ -1823,6 +1823,9 @@ char * d_path(struct dentry *dentry, str struct vfsmount *rootmnt; struct dentry *root; + if (dentry->d_op && dentry->d_op->d_dname) + return (dentry->d_op->d_dname)(dentry, buf, buflen); + read_lock(>fs->lock); rootmnt = mntget(current->fs->rootmnt); root = dget(current->fs->root); --- linux-2.6.21-rc3/fs/pipe.c 2007-03-07 17:42:36.0 +0100 +++ linux-2.6.21-rc3-ed/fs/pipe.c 2007-03-07 18:01:40.0 +0100 @@ -841,8 +841,15 @@ static int pipefs_delete_dentry(struct d return 0; } +static char * pipefs_dname(struct dentry *dentry, char *buffer, int buflen) +{ + snprintf(buffer, buflen, "pipe:[%lu]", dentry->d_inode->i_ino); + return buffer; +} + static struct dentry_operations pipefs_dentry_operations = { .d_delete = pipefs_delete_dentry, + .d_dname= pipefs_dname, }; static struct inode * get_pipe_inode(void) @@ -888,7 +895,6 @@ struct file *create_write_pipe(void) struct inode *inode; struct file *f; struct dentry *dentry; - char name[32]; struct qstr this; f = get_empty_filp(); @@ -899,8 +905,8 @@ struct file *create_write_pipe(void) if (!inode) goto err_file; - this.len = sprintf(name, "[%lu]", inode->i_ino); - this.name = name; + this.len = 0; + this.name = NULL; this.hash = 0; err = -ENOMEM; dentry = d_alloc(pipe_mnt->mnt_sb->s_root, ); @@ -915,7 +921,7 @@ struct file *create_write_pipe(void) */ dentry->d_flags &= ~DCACHE_UNHASHED; d_instantiate(dentry, inode); - f->f_path.mnt = mntget(pipe_mnt); + f->f_path.mnt = NULL; f->f_path.dentry = dentry; f->f_mapping = inode->i_mapping; @@ -949,7 +955,7 @@ struct file *create_read_pipe(struct fil return ERR_PTR(-ENFILE); /* Grab pipe from the writer */ - f->f_path.mnt = mntget(wrf->f_path.mnt); + f->f_path.mnt = NULL; f->f_path.dentry = dget(wrf->f_path.dentry); f->f_mapping = wrf->f_path.dentry->d_inode->i_mapping; --- linux-2.6.21-rc3/net/socket.c 2007-03-07 17:37:56.0 +0100 +++ linux-2.6.21-rc3-ed/net/socket.c2007-03-07 17:54:53.0 +0100 @@ -314,8 +314,16 @@ static int sockfs_delete_dentry(struct d dentry->d_flags |= DCACHE_UNHASHED; return 0; } + +static char * sockfs_dname(struct dentry *dentry, char *buffer, int buflen) +{ + snprintf(buffer, buflen, "socket:[%lu]", dentry->d_inode->i_ino); + return buffer; +} + static struct dentry_operations sockfs_dentry_operations = { .d_delete = sockfs_delete_dentry, + .d_dname = sockfs_dname, }; /* @@ -356,10 +364,9 @@ static int sock_alloc_fd(struct file **f static int sock_attach_fd(struct socket *sock, struct file *file) { struct qstr this; - char
Re: [patch] epoll use a single inode ...
On Wednesday 07 March 2007 18:02, Linus Torvalds wrote: > On Wed, 7 Mar 2007, Eric Dumazet wrote: > > I would definitly *love* saving dentries for pipes (and sockets too), but > > how are you going to get the inode ? > > Don't use an inode at all. Lovely :) > > > pipes()/sockets() can use read()/write()/rw_verify_area() and thus need > > file->f_path.dentry->d_inode (so each pipe needs a separate dentry) > > No, at least pipes could easily just use "file->f_private_data" instead. > > Now, sockets really do want the inode (or it would be really really big > changes), but pipes really just want a "struct pipe_inode_info" pointer, > which we could hide away directly in the file descriptor itself. sockets already uses file->private_data. But calls to read()/write() (not send()/recv()) still need to go through the dentry, before entering socket land. > > That's what Davide already did (on my suggestion) for signalfd - there's a > *single* inode, and the real data is in the per-fd f_private_data. Please find enclosed the following patch, to prepare this path. [PATCH] Delay the dentry name generation on sockets and pipes. 1) Introduces a new method in 'struct dentry_operations'. This method called d_dname() might be called from d_path() to be able to provide a dentry name for special filesystems. It is called without locks. Future patches (if we succeed in having one common dentry for all pipes) may need to change prototype of this method, but we now use : char *d_dname(struct dentry *dentry, char *buffer, int buflen) 2) Use this new method for sockets : No more sprintf() at socket creation. This is delayed up to the moment someone does an access to /proc/pid/fd/... We also avoid mntput()/mntget() on sock_mnt 3) Use this new method for pipes : No more sprintf() at pipe creation. This is delayed up to the moment someone does an access to /proc/pid/fd/... We also avoid mntput()/mntget() on pipe_mnt A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a *nice* speedup on my Pentium(M) 1.6 Ghz : 3.090 s instead of 3.450 s Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]> fs/dcache.c|3 +++ fs/pipe.c | 16 +++- include/linux/dcache.h |1 + net/socket.c | 15 +++ 4 files changed, 26 insertions(+), 9 deletions(-) --- linux-2.6.21-rc3/include/linux/dcache.h 2007-03-07 17:23:55.0 +0100 +++ linux-2.6.21-rc3-ed/include/linux/dcache.h 2007-03-07 17:27:39.0 +0100 @@ -133,6 +133,7 @@ struct dentry_operations { int (*d_delete)(struct dentry *); void (*d_release)(struct dentry *); void (*d_iput)(struct dentry *, struct inode *); + char * (*d_dname)(struct dentry *, char *, int); }; /* the dentry parameter passed to d_hash and d_compare is the parent --- linux-2.6.21-rc3/fs/dcache.c 2007-03-07 17:23:55.0 +0100 +++ linux-2.6.21-rc3-ed/fs/dcache.c 2007-03-07 17:28:46.0 +0100 @@ -1823,6 +1823,9 @@ char * d_path(struct dentry *dentry, str struct vfsmount *rootmnt; struct dentry *root; + if (dentry->d_op && dentry->d_op->d_dname) + return (dentry->d_op->d_dname)(dentry, buf, buflen); + read_lock(>fs->lock); rootmnt = mntget(current->fs->rootmnt); root = dget(current->fs->root); --- linux-2.6.21-rc3/fs/pipe.c 2007-03-07 17:42:36.0 +0100 +++ linux-2.6.21-rc3-ed/fs/pipe.c 2007-03-07 18:01:40.0 +0100 @@ -841,8 +841,15 @@ static int pipefs_delete_dentry(struct d return 0; } +static char * pipefs_dname(struct dentry *dentry, char *buffer, int buflen) +{ + snprintf(buffer, buflen, "pipe:[%lu]", dentry->d_inode->i_ino); + return buffer; +} + static struct dentry_operations pipefs_dentry_operations = { .d_delete = pipefs_delete_dentry, + .d_dname = pipefs_dname, }; static struct inode * get_pipe_inode(void) @@ -888,7 +895,6 @@ struct file *create_write_pipe(void) struct inode *inode; struct file *f; struct dentry *dentry; - char name[32]; struct qstr this; f = get_empty_filp(); @@ -899,8 +905,8 @@ struct file *create_write_pipe(void) if (!inode) goto err_file; - this.len = sprintf(name, "[%lu]", inode->i_ino); - this.name = name; + this.len = 0; + this.name = NULL; this.hash = 0; err = -ENOMEM; dentry = d_alloc(pipe_mnt->mnt_sb->s_root, ); @@ -915,7 +921,7 @@ struct file *create_write_pipe(void) */ dentry->d_flags &= ~DCACHE_UNHASHED; d_instantiate(dentry, inode); - f->f_path.mnt = mntget(pipe_mnt); + f->f_path.mnt = NULL; f->f_path.dentry = dentry; f->f_mapping = inode->i_mapping; @@ -949,7 +955,7 @@ struct file *create_read_pipe(struct fil return ERR_PTR(-ENFILE); /* Grab pipe from the writer */ - f->f_path.mnt = mntget(wrf->f_path.mnt); + f->f_path.mnt = NULL; f->f_path.dentry = dget(wrf->f_path.dentry); f->f_mapping = wrf->f_path.dentry->d_inode->i_mapping; --- linux-2.6.21-rc3/net/socket.c 2007-03-07 17:37:56.0 +0100 +++ linux-2.6.21-rc3-ed/net/socket.c 2007-03-07 17:54:53.0
Re: [patch] epoll use a single inode ...
On Wed, 7 Mar 2007, Eric Dumazet wrote: > > Crazy ideas : (some readers are going to kill me) First off, as noted earlier, you don't need crazy ideas. But: > 1) Use the low order bit of f_path.dentry to say : this pointer is not a > pointer to a dentry but the inode pointer (with the low order bit set to 1) No, we don't want to do that. There's a lot of code that wants to just follow the dentry/inode chain unconditionally, and we want to have people able to just do i_size_read(file->f_path.dentry->d_inode); kind of thing, without having to check anything in between. It's costly enough that we have a chain of pointers, it's much *worse* if we then have to make conditionals etc (and forgetting is an oops). > 2) file->f_path.dentry set to NULL for this special files (so that we dont > need to dput() and cache line ping pong the common dentry each time we > __fput() a pipe/socket. I agree abot the ping pong, but I suspect it's a lot less than the current pingpong on dcache_lock, which is a lot hotter than a pipe-dentry counter would be. If it becomes a big issue, we could just have a fixed number of dentries, and hash them out some way (to dilute the ping-ponging). So making things NULL would just be horrible for everybody, since we have lots of generic code that just looks up the dentry and inode, and we don't want to make that worse. (I'm sure using "filp->f_private_data" has some downsides too, but it seems fairly simple at least for pipes. The biggest problem is likely that we currently use the mutex in the inode in "filp->f_dentry->d_inode" as a way to protect the inode->i_pipe pointer itself, and there is no equivalent locking at the "struct file *" level at all. We might have to make the "f_ep_lock" spinlock be unconditional) Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Wed, 7 Mar 2007, Eric Dumazet wrote: > > I would definitly *love* saving dentries for pipes (and sockets too), but how > are you going to get the inode ? Don't use an inode at all. > pipes()/sockets() can use read()/write()/rw_verify_area() and thus need > file->f_path.dentry->d_inode (so each pipe needs a separate dentry) No, at least pipes could easily just use "file->f_private_data" instead. Now, sockets really do want the inode (or it would be really really big changes), but pipes really just want a "struct pipe_inode_info" pointer, which we could hide away directly in the file descriptor itself. That's what Davide already did (on my suggestion) for signalfd - there's a *single* inode, and the real data is in the per-fd f_private_data. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Wed, Mar 07, 2007 at 08:16:14AM +0100, Eric Dumazet wrote: > Crazy ideas : (some readers are going to kill me) > > 1) Use the low order bit of f_path.dentry to say : this pointer is not a > pointer to a dentry but the inode pointer (with the low order bit set to 1) > > OR > > 2) file->f_path.dentry set to NULL for this special files (so that we dont > need to dput() and cache line ping pong the common dentry each time we > __fput() a pipe/socket. No way on either one. f_path.dentry always beeing there is an assumption we make all over the place, and changing that would be a big regression for code qualityand reliability all over the place. Face it folks, memory is generally cheap, and we're not going to uglify huge amounts of code to shave of a little bit. [and that is only in reply to this one, the single dentry optimizations for epoll and friends are perfecltly fine from the high level view] > Same trick could be used for file->f_path.mnt, because there is a big SMP > cache line ping/pong to maintain a mnt_count on pipe/sockets mountpoint > while these file systems cannot be un-mounted) Same thing as above. We might do a hack to not refcount these vfsmounts, but we definitively want to keep the invariant of f_path.mnt never beeing NULL. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Wed, Mar 07, 2007 at 08:16:14AM +0100, Eric Dumazet wrote: Crazy ideas : (some readers are going to kill me) 1) Use the low order bit of f_path.dentry to say : this pointer is not a pointer to a dentry but the inode pointer (with the low order bit set to 1) OR 2) file-f_path.dentry set to NULL for this special files (so that we dont need to dput() and cache line ping pong the common dentry each time we __fput() a pipe/socket. No way on either one. f_path.dentry always beeing there is an assumption we make all over the place, and changing that would be a big regression for code qualityand reliability all over the place. Face it folks, memory is generally cheap, and we're not going to uglify huge amounts of code to shave of a little bit. [and that is only in reply to this one, the single dentry optimizations for epoll and friends are perfecltly fine from the high level view] Same trick could be used for file-f_path.mnt, because there is a big SMP cache line ping/pong to maintain a mnt_count on pipe/sockets mountpoint while these file systems cannot be un-mounted) Same thing as above. We might do a hack to not refcount these vfsmounts, but we definitively want to keep the invariant of f_path.mnt never beeing NULL. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Wed, 7 Mar 2007, Eric Dumazet wrote: I would definitly *love* saving dentries for pipes (and sockets too), but how are you going to get the inode ? Don't use an inode at all. pipes()/sockets() can use read()/write()/rw_verify_area() and thus need file-f_path.dentry-d_inode (so each pipe needs a separate dentry) No, at least pipes could easily just use file-f_private_data instead. Now, sockets really do want the inode (or it would be really really big changes), but pipes really just want a struct pipe_inode_info pointer, which we could hide away directly in the file descriptor itself. That's what Davide already did (on my suggestion) for signalfd - there's a *single* inode, and the real data is in the per-fd f_private_data. Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Wed, 7 Mar 2007, Eric Dumazet wrote: Crazy ideas : (some readers are going to kill me) First off, as noted earlier, you don't need crazy ideas. But: 1) Use the low order bit of f_path.dentry to say : this pointer is not a pointer to a dentry but the inode pointer (with the low order bit set to 1) No, we don't want to do that. There's a lot of code that wants to just follow the dentry/inode chain unconditionally, and we want to have people able to just do i_size_read(file-f_path.dentry-d_inode); kind of thing, without having to check anything in between. It's costly enough that we have a chain of pointers, it's much *worse* if we then have to make conditionals etc (and forgetting is an oops). 2) file-f_path.dentry set to NULL for this special files (so that we dont need to dput() and cache line ping pong the common dentry each time we __fput() a pipe/socket. I agree abot the ping pong, but I suspect it's a lot less than the current pingpong on dcache_lock, which is a lot hotter than a pipe-dentry counter would be. If it becomes a big issue, we could just have a fixed number of dentries, and hash them out some way (to dilute the ping-ponging). So making things NULL would just be horrible for everybody, since we have lots of generic code that just looks up the dentry and inode, and we don't want to make that worse. (I'm sure using filp-f_private_data has some downsides too, but it seems fairly simple at least for pipes. The biggest problem is likely that we currently use the mutex in the inode in filp-f_dentry-d_inode as a way to protect the inode-i_pipe pointer itself, and there is no equivalent locking at the struct file * level at all. We might have to make the f_ep_lock spinlock be unconditional) Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Wednesday 07 March 2007 18:02, Linus Torvalds wrote: On Wed, 7 Mar 2007, Eric Dumazet wrote: I would definitly *love* saving dentries for pipes (and sockets too), but how are you going to get the inode ? Don't use an inode at all. Lovely :) pipes()/sockets() can use read()/write()/rw_verify_area() and thus need file-f_path.dentry-d_inode (so each pipe needs a separate dentry) No, at least pipes could easily just use file-f_private_data instead. Now, sockets really do want the inode (or it would be really really big changes), but pipes really just want a struct pipe_inode_info pointer, which we could hide away directly in the file descriptor itself. sockets already uses file-private_data. But calls to read()/write() (not send()/recv()) still need to go through the dentry, before entering socket land. That's what Davide already did (on my suggestion) for signalfd - there's a *single* inode, and the real data is in the per-fd f_private_data. Please find enclosed the following patch, to prepare this path. [PATCH] Delay the dentry name generation on sockets and pipes. 1) Introduces a new method in 'struct dentry_operations'. This method called d_dname() might be called from d_path() to be able to provide a dentry name for special filesystems. It is called without locks. Future patches (if we succeed in having one common dentry for all pipes) may need to change prototype of this method, but we now use : char *d_dname(struct dentry *dentry, char *buffer, int buflen) 2) Use this new method for sockets : No more sprintf() at socket creation. This is delayed up to the moment someone does an access to /proc/pid/fd/... We also avoid mntput()/mntget() on sock_mnt 3) Use this new method for pipes : No more sprintf() at pipe creation. This is delayed up to the moment someone does an access to /proc/pid/fd/... We also avoid mntput()/mntget() on pipe_mnt A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a *nice* speedup on my Pentium(M) 1.6 Ghz : 3.090 s instead of 3.450 s Signed-off-by: Eric Dumazet [EMAIL PROTECTED] fs/dcache.c|3 +++ fs/pipe.c | 16 +++- include/linux/dcache.h |1 + net/socket.c | 15 +++ 4 files changed, 26 insertions(+), 9 deletions(-) --- linux-2.6.21-rc3/include/linux/dcache.h 2007-03-07 17:23:55.0 +0100 +++ linux-2.6.21-rc3-ed/include/linux/dcache.h 2007-03-07 17:27:39.0 +0100 @@ -133,6 +133,7 @@ struct dentry_operations { int (*d_delete)(struct dentry *); void (*d_release)(struct dentry *); void (*d_iput)(struct dentry *, struct inode *); + char * (*d_dname)(struct dentry *, char *, int); }; /* the dentry parameter passed to d_hash and d_compare is the parent --- linux-2.6.21-rc3/fs/dcache.c 2007-03-07 17:23:55.0 +0100 +++ linux-2.6.21-rc3-ed/fs/dcache.c 2007-03-07 17:28:46.0 +0100 @@ -1823,6 +1823,9 @@ char * d_path(struct dentry *dentry, str struct vfsmount *rootmnt; struct dentry *root; + if (dentry-d_op dentry-d_op-d_dname) + return (dentry-d_op-d_dname)(dentry, buf, buflen); + read_lock(current-fs-lock); rootmnt = mntget(current-fs-rootmnt); root = dget(current-fs-root); --- linux-2.6.21-rc3/fs/pipe.c 2007-03-07 17:42:36.0 +0100 +++ linux-2.6.21-rc3-ed/fs/pipe.c 2007-03-07 18:01:40.0 +0100 @@ -841,8 +841,15 @@ static int pipefs_delete_dentry(struct d return 0; } +static char * pipefs_dname(struct dentry *dentry, char *buffer, int buflen) +{ + snprintf(buffer, buflen, pipe:[%lu], dentry-d_inode-i_ino); + return buffer; +} + static struct dentry_operations pipefs_dentry_operations = { .d_delete = pipefs_delete_dentry, + .d_dname = pipefs_dname, }; static struct inode * get_pipe_inode(void) @@ -888,7 +895,6 @@ struct file *create_write_pipe(void) struct inode *inode; struct file *f; struct dentry *dentry; - char name[32]; struct qstr this; f = get_empty_filp(); @@ -899,8 +905,8 @@ struct file *create_write_pipe(void) if (!inode) goto err_file; - this.len = sprintf(name, [%lu], inode-i_ino); - this.name = name; + this.len = 0; + this.name = NULL; this.hash = 0; err = -ENOMEM; dentry = d_alloc(pipe_mnt-mnt_sb-s_root, this); @@ -915,7 +921,7 @@ struct file *create_write_pipe(void) */ dentry-d_flags = ~DCACHE_UNHASHED; d_instantiate(dentry, inode); - f-f_path.mnt = mntget(pipe_mnt); + f-f_path.mnt = NULL; f-f_path.dentry = dentry; f-f_mapping = inode-i_mapping; @@ -949,7 +955,7 @@ struct file *create_read_pipe(struct fil return ERR_PTR(-ENFILE); /* Grab pipe from the writer */ - f-f_path.mnt = mntget(wrf-f_path.mnt); + f-f_path.mnt = NULL; f-f_path.dentry = dget(wrf-f_path.dentry); f-f_mapping = wrf-f_path.dentry-d_inode-i_mapping; --- linux-2.6.21-rc3/net/socket.c 2007-03-07 17:37:56.0 +0100 +++ linux-2.6.21-rc3-ed/net/socket.c 2007-03-07 17:54:53.0 +0100 @@ -314,8 +314,16 @@ static int
Re: [patch] epoll use a single inode ...
(resending with a more convenient attachment) Please find enclosed the following patch, to prepare this path. [PATCH] Delay the dentry name generation on sockets and pipes. 1) Introduces a new method in 'struct dentry_operations'. This method called d_dname() might be called from d_path() to be able to provide a dentry name for special filesystems. It is called without locks. Future patches (if we succeed in having one common dentry for all pipes) may need to change prototype of this method, but we now use : char *d_dname(struct dentry *dentry, char *buffer, int buflen) 2) Use this new method for sockets : No more sprintf() at socket creation. This is delayed up to the moment someone does an access to /proc/pid/fd/... We also avoid mntput()/mntget() on sock_mnt 3) Use this new method for pipes : No more sprintf() at pipe creation. This is delayed up to the moment someone does an access to /proc/pid/fd/... We also avoid mntput()/mntget() on pipe_mnt A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a *nice* speedup on my Pentium(M) 1.6 Ghz : 3.090 s instead of 3.450 s Signed-off-by: Eric Dumazet [EMAIL PROTECTED] fs/dcache.c | 3 +++ fs/pipe.c | 16 +++- include/linux/dcache.h | 1 + net/socket.c | 15 +++ 4 files changed, 26 insertions(+), 9 deletions(-) --- linux-2.6.21-rc3/include/linux/dcache.h 2007-03-07 17:23:55.0 +0100 +++ linux-2.6.21-rc3-ed/include/linux/dcache.h 2007-03-07 17:27:39.0 +0100 @@ -133,6 +133,7 @@ struct dentry_operations { int (*d_delete)(struct dentry *); void (*d_release)(struct dentry *); void (*d_iput)(struct dentry *, struct inode *); + char * (*d_dname)(struct dentry *, char *, int); }; /* the dentry parameter passed to d_hash and d_compare is the parent --- linux-2.6.21-rc3/fs/dcache.c2007-03-07 17:23:55.0 +0100 +++ linux-2.6.21-rc3-ed/fs/dcache.c 2007-03-07 17:28:46.0 +0100 @@ -1823,6 +1823,9 @@ char * d_path(struct dentry *dentry, str struct vfsmount *rootmnt; struct dentry *root; + if (dentry-d_op dentry-d_op-d_dname) + return (dentry-d_op-d_dname)(dentry, buf, buflen); + read_lock(current-fs-lock); rootmnt = mntget(current-fs-rootmnt); root = dget(current-fs-root); --- linux-2.6.21-rc3/fs/pipe.c 2007-03-07 17:42:36.0 +0100 +++ linux-2.6.21-rc3-ed/fs/pipe.c 2007-03-07 18:01:40.0 +0100 @@ -841,8 +841,15 @@ static int pipefs_delete_dentry(struct d return 0; } +static char * pipefs_dname(struct dentry *dentry, char *buffer, int buflen) +{ + snprintf(buffer, buflen, pipe:[%lu], dentry-d_inode-i_ino); + return buffer; +} + static struct dentry_operations pipefs_dentry_operations = { .d_delete = pipefs_delete_dentry, + .d_dname= pipefs_dname, }; static struct inode * get_pipe_inode(void) @@ -888,7 +895,6 @@ struct file *create_write_pipe(void) struct inode *inode; struct file *f; struct dentry *dentry; - char name[32]; struct qstr this; f = get_empty_filp(); @@ -899,8 +905,8 @@ struct file *create_write_pipe(void) if (!inode) goto err_file; - this.len = sprintf(name, [%lu], inode-i_ino); - this.name = name; + this.len = 0; + this.name = NULL; this.hash = 0; err = -ENOMEM; dentry = d_alloc(pipe_mnt-mnt_sb-s_root, this); @@ -915,7 +921,7 @@ struct file *create_write_pipe(void) */ dentry-d_flags = ~DCACHE_UNHASHED; d_instantiate(dentry, inode); - f-f_path.mnt = mntget(pipe_mnt); + f-f_path.mnt = NULL; f-f_path.dentry = dentry; f-f_mapping = inode-i_mapping; @@ -949,7 +955,7 @@ struct file *create_read_pipe(struct fil return ERR_PTR(-ENFILE); /* Grab pipe from the writer */ - f-f_path.mnt = mntget(wrf-f_path.mnt); + f-f_path.mnt = NULL; f-f_path.dentry = dget(wrf-f_path.dentry); f-f_mapping = wrf-f_path.dentry-d_inode-i_mapping; --- linux-2.6.21-rc3/net/socket.c 2007-03-07 17:37:56.0 +0100 +++ linux-2.6.21-rc3-ed/net/socket.c2007-03-07 17:54:53.0 +0100 @@ -314,8 +314,16 @@ static int sockfs_delete_dentry(struct d dentry-d_flags |= DCACHE_UNHASHED; return 0; } + +static char * sockfs_dname(struct dentry *dentry, char *buffer, int buflen) +{ + snprintf(buffer, buflen, socket:[%lu], dentry-d_inode-i_ino); + return buffer; +} + static struct dentry_operations sockfs_dentry_operations = { .d_delete = sockfs_delete_dentry, + .d_dname = sockfs_dname, }; /* @@ -356,10 +364,9 @@ static int sock_alloc_fd(struct file **f static int sock_attach_fd(struct socket *sock, struct file *file) { struct qstr this; - char name[32]; - this.len =
Re: [patch] epoll use a single inode ...
On Wed, 7 Mar 2007, Eric Dumazet wrote: sockets already uses file-private_data. But calls to read()/write() (not send()/recv()) still need to go through the dentry, before entering socket land. Sure. The dentry and the inode need to *exist*, but they can be one single static dentry/inode per file descriptor type. We always pass in the struct file * to read/write too, since we need it anyway for things like file control information (eg is it a nonblocking read or write kinds of things). So I'm not suggesting a NULL dentry/inode, I'm suggesting a single static one per type. And yeah, it may be harder than it looks. Some things know that all the relevant info is in the inode, so they just pass in the inode. In the pipe layer, for example, you'd need to change free_pipe_info() and alloc_pipe_info() to pass in the file descriptor instead, same goes for pipe_release(). But the struct file * is always available, it's just that since the code was originally written to have all the info in the inode, some of the code isn't set up to use it or pass it on.. But your patch is independent of that, and looks fine. Except I don't like this part: - file-f_path.mnt = mntget(sock_mnt); + file-f_path.mnt = NULL; since I'd be much happer with always having f_path.mnt available, the same way we should always have f_path.dentry there. (Btw, your patch is *not* going to work with the file-f_private_data approach, because d_path() is not passed down the file * thing. So we'd need to do that, and that's more intrusive (it can be NULL, since for things like cwd/pwd we don't have a struct file). But I like your patch as a totally independent thing. It just makes sense. (Apart from the f_path.mnt thing, which I think was something else ;) Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Wednesday 07 March 2007 18:45, Linus Torvalds wrote: On Wed, 7 Mar 2007, Eric Dumazet wrote: sockets already uses file-private_data. But calls to read()/write() (not send()/recv()) still need to go through the dentry, before entering socket land. Sure. The dentry and the inode need to *exist*, but they can be one single static dentry/inode per file descriptor type. We always pass in the struct file * to read/write too, since we need it anyway for things like file control information (eg is it a nonblocking read or write kinds of things). So I'm not suggesting a NULL dentry/inode, I'm suggesting a single static one per type. And yeah, it may be harder than it looks. Some things know that all the relevant info is in the inode, so they just pass in the inode. In the pipe layer, for example, you'd need to change free_pipe_info() and alloc_pipe_info() to pass in the file descriptor instead, same goes for pipe_release(). But the struct file * is always available, it's just that since the code was originally written to have all the info in the inode, some of the code isn't set up to use it or pass it on.. But your patch is independent of that, and looks fine. Except I don't like this part: - file-f_path.mnt = mntget(sock_mnt); + file-f_path.mnt = NULL; since I'd be much happer with always having f_path.mnt available, the same way we should always have f_path.dentry there. Yes, but mntget()/mntput() are protected against NULL. I was quite happy to remove two locked operations :) I didnt found a way to crash (yet) my patched machine :) (Btw, your patch is *not* going to work with the file-f_private_data approach, because d_path() is not passed down the file * thing. So we'd need to do that, and that's more intrusive (it can be NULL, since for things like cwd/pwd we don't have a struct file). I tried this path today and failed... Too many changes to do (nameidata) to propagate a 'struct file *' appropriately... But I like your patch as a totally independent thing. It just makes sense. (Apart from the f_path.mnt thing, which I think was something else ;) OK no problem here is the patch without messing f_path.mnt (benchmark results not really different on my little machine, SMP kernel but one CPU only... maybe because lock suffix is changed by a nop) [PATCH] Delay the dentry name generation on sockets and pipes. 1) Introduces a new method in 'struct dentry_operations'. This method called d_dname() might be called from d_path() to be able to provide a dentry name for special filesystems. It is called without locks. Future patches (if we succeed in having one common dentry for all pipes) may need to change prototype of this method, but we now use : char *d_dname(struct dentry *dentry, char *buffer, int buflen) 2) Use this new method for sockets : No more sprintf() at socket creation. This is delayed up to the moment someone does an access to /proc/pid/fd/... 3) Use this new method for pipes : No more sprintf() at pipe creation. This is delayed up to the moment someone does an access to /proc/pid/fd/... A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a *nice* speedup on my Pentium(M) 1.6 Ghz : 3.090 s instead of 3.450 s Signed-off-by: Eric Dumazet [EMAIL PROTECTED] fs/dcache.c|3 +++ fs/pipe.c | 12 +--- include/linux/dcache.h |1 + net/socket.c | 13 ++--- 4 files changed, 23 insertions(+), 6 deletions(-) --- linux-2.6.21-rc3/include/linux/dcache.h 2007-03-07 17:23:55.0 +0100 +++ linux-2.6.21-rc3-ed/include/linux/dcache.h 2007-03-07 17:27:39.0 +0100 @@ -133,6 +133,7 @@ struct dentry_operations { int (*d_delete)(struct dentry *); void (*d_release)(struct dentry *); void (*d_iput)(struct dentry *, struct inode *); + char * (*d_dname)(struct dentry *, char *, int); }; /* the dentry parameter passed to d_hash and d_compare is the parent --- linux-2.6.21-rc3/fs/dcache.c2007-03-07 17:23:55.0 +0100 +++ linux-2.6.21-rc3-ed/fs/dcache.c 2007-03-07 17:28:46.0 +0100 @@ -1823,6 +1823,9 @@ char * d_path(struct dentry *dentry, str struct vfsmount *rootmnt; struct dentry *root; + if (dentry-d_op dentry-d_op-d_dname) + return (dentry-d_op-d_dname)(dentry, buf, buflen); + read_lock(current-fs-lock); rootmnt = mntget(current-fs-rootmnt); root = dget(current-fs-root); --- linux-2.6.21-rc3/fs/pipe.c 2007-03-07 17:42:36.0 +0100 +++ linux-2.6.21-rc3-ed/fs/pipe.c 2007-03-07 18:54:33.0 +0100 @@ -841,8 +841,15 @@ static int pipefs_delete_dentry(struct d return 0; } +static char * pipefs_dname(struct dentry *dentry, char *buffer, int buflen) +{ + snprintf(buffer, buflen, pipe:[%lu], dentry-d_inode-i_ino); + return buffer; +} + static struct dentry_operations
Re: [patch] epoll use a single inode ...
On Wed, 7 Mar 2007, Eric Dumazet wrote: OK no problem here is the patch without messing f_path.mnt All right. I like it. Definitely worth putting into -mm, or just re-sending to me after 2.6.21 is out (I'll forget all about it otherwise). I have one more worry, namely this:: - char name[32]; - this.len = sprintf(name, [%lu], SOCK_INODE(sock)-i_ino); - this.name = name; + this.len = 0; + this.name = NULL; I think that's fine, and it *happens* to work, but it happens to work just because then d_alloc() will do: memcpy(dname, name-name, name-len); dname[name-len] = 0; and passing in NULL to memcpy() is generally ok when len is 0. HOWEVER. Not only might memcpy() do a prefetch for read on the source for some architectures (which in turn may end up being slow for an address that isn't in the TLB, like NULL), but you depend on a very much internal detail, since it *could* have been using something like memcpy(dname, name-name, name-len+1); instead, and expected to get the final '\0' character from the source string. So I would actually much prefer it to be written as this.len = 0; this.name = ; just because it's safer. But other than that small detail, I think this is not only an optimization, it's an actual cleanup, and we migth some day want to use something like this for some other things too (eg maybe this kind of approach is usable for /proc/pid entries too, to avoid instantiating them). As to avoiding the mntget(), I'm not violently opposed to it, but I do think that it's a totally unrelated matter, so even if it's decided it's worth it, it should be done as a separate patch regardless. Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
Not only might memcpy() do a prefetch for read on the source for some architectures (which in turn may end up being slow for an address that isn't in the TLB, like NULL), but you depend on a very much internal Well, I hope a prefetch(NULL) is OK because we are doing millions of them (see include/linux/list.h :) ) detail, since it *could* have been using something like memcpy(dname, name-name, name-len+1); Yes very true, I will change that and push to Andrew for mm testing I was thinking about being able to cache the name into the dentry, do you think it's worth the pain ? (its not SMP safe for example...) static char * pipefs_dname(struct dentry *dentry, char *buffer, int buflen) { if (!dentry-d_iname[0]) snprintf(dentry-d_iname, DNAME_INLINE_LEN_MIN, pipe:[%lu], dentry-d_inode-i_ino); strlcpy(buffer, dentry-d_iname, buflen); return buffer; } - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Wed, 7 Mar 2007, Eric Dumazet wrote: I was thinking about being able to cache the name into the dentry, do you think it's worth the pain ? (its not SMP safe for example...) Actually, it *can* be SMP-safe, if you do it right. Something like len = dentry-d_name.len; if (!len) { len = snprintf(dentry-d_iname, DNAME_INLINE_LEN_MIN, pipe:[%lu], dentry-d_inode-i_ino); smp_wmb(); dentry-d_name.len = len; } if (len = buflen) len = buflen-1; memcpy(buffer, dentry-d_iname, len); buffer[len] = 0; return buffer; should work, although it depends on the fact that our snprintf() implementation should be stable (ie if snprintf() modifies the buffer temporarily as it goes along, that would break, but I think our vsnprintf is good in that respect). So you could have two different CPU's doing the snprintf() on the same buffer at the same time (and assigning the length at the same time), but since they'll write the same thing, you don't really care. It's a bit subtle, though. And probably not really worth it. Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
Hi, Well, I hope a prefetch(NULL) is OK because we are doing millions of them (see include/linux/list.h :) ) Funny you mention this. We found some noticeable ppc64 regressions when moving the dcache to standard list macros and had to do this to fix it up: static inline void prefetch(const void *x) { if (unlikely(!x)) return; __asm__ __volatile__ (dcbt 0,%0 : : r (x)); } Urgh :) Anton - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Wed, 7 Mar 2007, Anton Blanchard wrote: Funny you mention this. We found some noticeable ppc64 regressions when moving the dcache to standard list macros and had to do this to fix it up: static inline void prefetch(const void *x) { if (unlikely(!x)) return; __asm__ __volatile__ (dcbt 0,%0 : : r (x)); } Urgh :) Yeah, I'm not at all surprised. Any implementation of prefetch that doesn't just turn into a no-op if the TLB entry doesn't exist (which makes them weaker for *actual* prefetching) will generally have a hard time with a NULL pointer. Exactly because it will try to do a totally unnecessary TLB fill - and since most CPU's will not cache negative TLB entries, that unnecessary TLB fill will be done over and over and over again.. In general, using software prefetching is just a stupid idea, unless - the prefetch really is very strict (ie for a linked list you do exactly the above kinds of things to make sure that you don't try to prefetch the non-existent end entry) AND - the CPU is stupid (in-order in particular). I think Intel even suggests in their optimization manuals to *not* do software prefetching, because hw can usually simply do better without it. Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On 3/7/07, Linus Torvalds [EMAIL PROTECTED] wrote Yeah, I'm not at all surprised. Any implementation of prefetch that doesn't just turn into a no-op if the TLB entry doesn't exist (which makes them weaker for *actual* prefetching) will generally have a hard time with a NULL pointer. Exactly because it will try to do a totally unnecessary TLB fill - and since most CPU's will not cache negative TLB entries, that unnecessary TLB fill will be done over and over and over again.. Data prefetch instructions should indeed avoid page table walks. (Instruction prefetch mechanisms often do induce table walks on ITLB miss.) Not just because of the null pointer case, but because it's quite normal to run off the end of an array in a loop with an embedded prefetch instruction. If you have an extra instruction issue unit that shares the same DTLB, and you know you will really want that data, you can sometimes use it to force DTLB preloads by doing an actual data fetch from the foreseeable page. This is potentially one of the best uses of chip multi-threading on an architecture like Sun's Niagara. (I don't think Intel's hyper-threading works for this purpose; the DTLB is shared but the entries are marked as owned by one thread or the other. HT can be used for L2 cache prefetching, although the results so far seem to be mixed: http://www.cgo.org/cgo2004/papers/02_80_Kim_D_REVISED.pdf) In general, using software prefetching is just a stupid idea, unless - the prefetch really is very strict (ie for a linked list you do exactly the above kinds of things to make sure that you don't try to prefetch the non-existent end entry) AND - the CPU is stupid (in-order in particular). I think Intel even suggests in their optimization manuals to *not* do software prefetching, because hw can usually simply do better without it. Not the XScale -- it performs quite poorly without prefetch, as people who have run ARMv5-optimized binaries on it can testify. From the XScale Core Developer's Manual: quote The Intel XScale(r) core has a true prefetch load instruction (PLD). The purpose of this instruction is to preload data into the data and mini-data caches. Data prefetching allows hiding of memory transfer latency while the processor continues to execute instructions. The prefetch is important to compiler and assembly code because judicious use of the prefetch instruction can enormously improve throughput performance of the core. Data prefetch can be applied not only to loops but also to any data references within a block of code. Prefetch also applies to data writing when the memory type is enabled as write allocate The Intel XScale(r) core prefetch load instruction is a true prefetch instruction because the load destination is the data or mini-data cache and not a register. Compilers for processors which have data caches, but do not support prefetch, sometimes use a load instruction to preload the data cache. This technique has the disadvantages of using a register to load data and requiring additional registers for subsequent preloads and thus increasing register pressure. By contrast, the prefetch can be used to reduce register pressure instead of increasing it. The prefetch load is a hint instruction and does not guarantee that the data will be loaded. Whenever the load would cause a fault or a table walk, then the processor will ignore the prefetch instruction, the fault or table walk, and continue processing the next instruction. This is particularly advantageous in the case where a linked list or recursive data structure is terminated by a NULL pointer. Prefetching the NULL pointer will not fault program flow. /quote People's prejudices against prefetch instructions are sometimes traceable to the 3DNow! prefetch(w) botch, which some processors support as no-ops and others are too aggressive about (Opteron prefetches are reputed to be strong, i. e., not dropped on DTLB miss). XScale gets it right. So do most Pentium 4's using the SSE prefetches, according to the IA-32 optimization manual. (Oddly, Prescott seems to have initiated a page table walk on DTLB miss during software prefetch -- just one of many weird Prescott flaws.) I'm guessing Pentium M and its descendants (Core Solo and Duo) get it right but I'm having a hell of a time finding out for sure. Can any of the x86 experts answer this? Cheers, - Michael - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Mar 07, 2007, at 20:25:14, Michael K. Edwards wrote: On 3/7/07, Linus Torvalds [EMAIL PROTECTED] wrote In general, using software prefetching is just a stupid idea, unless - the prefetch really is very strict (ie for a linked list you do exactly the above kinds of things to make sure that you don't try to prefetch the non-existent end entry) AND - the CPU is stupid (in-order in particular). I think Intel even suggests in their optimization manuals to *not* do software prefetching, because hw can usually simply do better without it. Not the XScale -- it performs quite poorly without prefetch, as people who have run ARMv5-optimized binaries on it can testify. The Intel XScale(r) core prefetch load instruction is a true prefetch instruction because the load destination is the data or mini-data cache and not a register. Compilers for processors which have data caches, but do not support prefetch, sometimes use a load instruction to preload the data cache. This technique has the disadvantages of using a register to load data and requiring additional registers for subsequent preloads and thus increasing register pressure. By contrast, the prefetch can be used to reduce register pressure instead of increasing it. The prefetch load is a hint instruction and does not guarantee that the data will be loaded. Whenever the load would cause a fault or a table walk, then the processor will ignore the prefetch instruction, the fault or table walk, and continue processing the next instruction. This is particularly advantageous in the case where a linked list or recursive data structure is terminated by a NULL pointer. Prefetching the NULL pointer will not fault program flow. Prefetching is also fairly critical on a Power4 or G5 PowerPC system as they have a long memory latency; an L2-cache miss can cost 200+ cycles. On such systems the dcbt prefetch instruction brings in a single 128-byte cacheline and has no serializing effects whatsoever, making it ideal for use in a linked-list-traversal inner loop. Cheers, Kyle Moffett - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Wed, 7 Mar 2007, Michael K. Edwards wrote: People's prejudices against prefetch instructions are sometimes traceable to the 3DNow! prefetch(w) botch, which some processors support as no-ops and others are too aggressive about (Opteron prefetches are reputed to be strong, i. e., not dropped on DTLB miss). No, I just checked, and Intel's own optimization manual makes it clear that you should be careful. They talk about performance penalties due to resource constraints - which makes tons of sense with a core that is good at handling its own resources and could quite possibly use those resources better to actually execute the loads and stores deeper down the instruction pipeline. So it's not just 3DNow! making AMD look bad, or Intel would obviously suggest people use it out of the wazoo ;) XScale gets it right. Blah. XScale isn't even an OoO CPU, *of*course* it needs prefetching. Calling that getting it right is ludicrous. If anything, it gets things so wrong that prefetching is *required* for good performance. I'm talking about real CPU's with real memory pipelines that already do prefetching in hardware. The better the core is, the less the prefetch helps (and often the more it hurts in comparison to how much it helps). But if you mean doesn't try to fill the TLB on data prefetches, then yes, that's generally the right thing to do. (Oddly, Prescott seems to have initiated a page table walk on DTLB miss during software prefetch -- just one of many weird Prescott flaws.) Netburst in general is *very* happy to do speculative TLB fills, I think. I'm guessing Pentium M and its descendants (Core Solo and Duo) get it right but I'm having a hell of a time finding out for sure. Can any of the x86 experts answer this? I just suspect that the upside for Core 2 Due is likely fairly low. The L2 cache is good, the memory re-ordering is working.. I doubt prefetch helps in generic code that much for things like linked list following, you should probably limit it to code that has *known* access patterns and you know it's not going to be in the cache. (In other words, I bet prefetching can help a lot with MMX/media kind of code, I doubt it's a huge win for for_each_entry()) Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
Kyle Moffett a écrit : Prefetching is also fairly critical on a Power4 or G5 PowerPC system as they have a long memory latency; an L2-cache miss can cost 200+ cycles. On such systems the dcbt prefetch instruction brings in a single 128-byte cacheline and has no serializing effects whatsoever, making it ideal for use in a linked-list-traversal inner loop. OK, 200 cycles... But what is the cost of the conditional branch you added in prefetch(x) ? if (!x) return; (correctly predicted or not, but do powerPC have a BTB ?) About the NULL 'potential problem', maybe we could use a dummy nil (but mapped) object, and use its address in lists, ie compare for nil instead of NULL. This would avoid : - The conditional test in some prefetch() implementations - The potential TLB problem with the NULL value. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
Eric Dumazet a écrit : I would definitly *love* saving dentries for pipes (and sockets too), but how are you going to get the inode ? pipes()/sockets() can use read()/write()/rw_verify_area() and thus need file->f_path.dentry->d_inode (so each pipe needs a separate dentry) Are you suggesting adding a new "struct file_operations" member to get the inode ? Or re-intoducing an inode pointer in struct file ? Crazy ideas : (some readers are going to kill me) 1) Use the low order bit of f_path.dentry to say : this pointer is not a pointer to a dentry but the inode pointer (with the low order bit set to 1) OR 2) file->f_path.dentry set to NULL for this special files (so that we dont need to dput() and cache line ping pong the common dentry each time we __fput() a pipe/socket. Same trick could be used for file->f_path.mnt, because there is a big SMP cache line ping/pong to maintain a mnt_count on pipe/sockets mountpoint while these file systems cannot be un-mounted) If dentry is NULL, we get the inode pointer from an overlay of struct file_ra_statef_ra; (because for this special files readahead is unused) This adds some conditional branches of course, but being able to save ram and better use cpu caches might be worth them. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Wed, 7 Mar 2007, Eric Dumazet wrote: > I would definitly *love* saving dentries for pipes (and sockets too), but how > are you going to get the inode ? I was not planning to touch anything but epoll, signalfd and timerfd files. > pipes()/sockets() can use read()/write()/rw_verify_area() and thus need > file->f_path.dentry->d_inode (so each pipe needs a separate dentry) Currently, they use a single inode, and multiple dentries (to give the name of the class). But this could be changed to a single dentry like Linus was suggesting. I'll wait for Al's reply before doing anything. Memory saving can be something, on top of the already big one of avoiding code duplication. - Davide - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
Linus Torvalds a écrit : I assume that the *only* reason for having multiple dentries is really just the output in /proc//fd/, right? Or is there any other reason to have separate dentries for these pseudo-files? It's a bit sad to waste that much memory (and time) on something like that. I bet that the dentry setup is a noticeable part of the whole sigfd()/timerfd() setup. It's likely also a big part of any memory footprint if you have lots of them. So how about just doing: - do a single dentry - make a "struct file_operations" member function that prints out the name of the thing in /proc//fd/, and which *defaults* to just doing the d_path() on the dentry, but special filesystems like this could do something else (like print out a fake inode number from the "file->f_private_data" information) There seems to really be no downsides to that approach. No existing filesystem will even notice (they'll all have NULL in the new f_op member), and it would allow pipes etc to be sped up and use less memory. I would definitly *love* saving dentries for pipes (and sockets too), but how are you going to get the inode ? pipes()/sockets() can use read()/write()/rw_verify_area() and thus need file->f_path.dentry->d_inode (so each pipe needs a separate dentry) Are you suggesting adding a new "struct file_operations" member to get the inode ? Or re-intoducing an inode pointer in struct file ? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Tue, 6 Mar 2007, Linus Torvalds wrote: > > > On Tue, 6 Mar 2007, Davide Libenzi wrote: > > > > I'm OK with everything that avoid code duplication due to those fake > > inodes. The change can be localized inside the existing API, so it doesn't > > really affect me externally. > > Can you try with the first patch version that doesn't do anything special > at all, and just uses a single dentry. > > Yeah, the dentry name will be identical, and so you would see something > like "7 -> signalfd:signalfd" when you do "ls -l /proc//fd/" on a > task that has such a special file descriptor (with no way to tell > different timerfd's and signalfd's apart), but I think it's better to > start off simple than to overdesign things. > > And trying to tell them apart sounds a bit like overdesign, if only > because I really don't see why anybody would really *care*. So it's a > timer for poll/select/epoll - why care about anything else? The code would not change/shrink much with the single dentry. We'd save memory, and we'd lose the capability of seeing aino:[CLASS]. Both ways are fine with me. - Davide - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Tue, 6 Mar 2007, Davide Libenzi wrote: > > I'm OK with everything that avoid code duplication due to those fake > inodes. The change can be localized inside the existing API, so it doesn't > really affect me externally. Can you try with the first patch version that doesn't do anything special at all, and just uses a single dentry. Yeah, the dentry name will be identical, and so you would see something like "7 -> signalfd:signalfd" when you do "ls -l /proc//fd/" on a task that has such a special file descriptor (with no way to tell different timerfd's and signalfd's apart), but I think it's better to start off simple than to overdesign things. And trying to tell them apart sounds a bit like overdesign, if only because I really don't see why anybody would really *care*. So it's a timer for poll/select/epoll - why care about anything else? If it really really turns out that people care, we know how we can do it. We'd hook into "proc_fd_link()" and we'd allow a per-file mntget/dget that we could use to let special filesystems create fake entries on demand. So it's not impossible, it's just likely simply not needed. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Tue, 6 Mar 2007, Linus Torvalds wrote: > > [ Al Viro added to Cc: as the arbiter of good taste in the VFS layer. He > has veto powers even over my proposals ;^] > > On Tue, 6 Mar 2007, Davide Libenzi wrote: > > > > I currently have the dentry to carry the name of the file* class it is > > linked to. I'd prefer to keep it that way, unless there are huge factors > > against. > > I assume that the *only* reason for having multiple dentries is really > just the output in /proc//fd/, right? Or is there any other reason to > have separate dentries for these pseudo-files? > > It's a bit sad to waste that much memory (and time) on something like > that. I bet that the dentry setup is a noticeable part of the whole > sigfd()/timerfd() setup. It's likely also a big part of any memory > footprint if you have lots of them. > > So how about just doing: > - do a single dentry > - make a "struct file_operations" member function that prints out the >name of the thing in /proc//fd/, and which *defaults* to just >doing the d_path() on the dentry, but special filesystems like this >could do something else (like print out a fake inode number from the >"file->f_private_data" information) > > There seems to really be no downsides to that approach. No existing > filesystem will even notice (they'll all have NULL in the new f_op > member), and it would allow pipes etc to be sped up and use less memory. I'm OK with everything that avoid code duplication due to those fake inodes. The change can be localized inside the existing API, so it doesn't really affect me externally. - Davide - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
[ Al Viro added to Cc: as the arbiter of good taste in the VFS layer. He has veto powers even over my proposals ;^] On Tue, 6 Mar 2007, Davide Libenzi wrote: > > I currently have the dentry to carry the name of the file* class it is > linked to. I'd prefer to keep it that way, unless there are huge factors > against. I assume that the *only* reason for having multiple dentries is really just the output in /proc//fd/, right? Or is there any other reason to have separate dentries for these pseudo-files? It's a bit sad to waste that much memory (and time) on something like that. I bet that the dentry setup is a noticeable part of the whole sigfd()/timerfd() setup. It's likely also a big part of any memory footprint if you have lots of them. So how about just doing: - do a single dentry - make a "struct file_operations" member function that prints out the name of the thing in /proc//fd/, and which *defaults* to just doing the d_path() on the dentry, but special filesystems like this could do something else (like print out a fake inode number from the "file->f_private_data" information) There seems to really be no downsides to that approach. No existing filesystem will even notice (they'll all have NULL in the new f_op member), and it would allow pipes etc to be sped up and use less memory. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Tue, 6 Mar 2007, Avi Kivity wrote: > Right. For kvmfs this isn't a problem as there's a 1:1 relationship > between synthetic inodes and dentries. Perhaps d_alloc_anon() could be > extended to avoid the scan if it's a problem. I currently have the dentry to carry the name of the file* class it is linked to. I'd prefer to keep it that way, unless there are huge factors against. - Davide - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Tue, 6 Mar 2007, Avi Kivity wrote: Right. For kvmfs this isn't a problem as there's a 1:1 relationship between synthetic inodes and dentries. Perhaps d_alloc_anon() could be extended to avoid the scan if it's a problem. I currently have the dentry to carry the name of the file* class it is linked to. I'd prefer to keep it that way, unless there are huge factors against. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
[ Al Viro added to Cc: as the arbiter of good taste in the VFS layer. He has veto powers even over my proposals ;^] On Tue, 6 Mar 2007, Davide Libenzi wrote: I currently have the dentry to carry the name of the file* class it is linked to. I'd prefer to keep it that way, unless there are huge factors against. I assume that the *only* reason for having multiple dentries is really just the output in /proc/pid/fd/, right? Or is there any other reason to have separate dentries for these pseudo-files? It's a bit sad to waste that much memory (and time) on something like that. I bet that the dentry setup is a noticeable part of the whole sigfd()/timerfd() setup. It's likely also a big part of any memory footprint if you have lots of them. So how about just doing: - do a single dentry - make a struct file_operations member function that prints out the name of the thing in /proc/pid/fd/, and which *defaults* to just doing the d_path() on the dentry, but special filesystems like this could do something else (like print out a fake inode number from the file-f_private_data information) There seems to really be no downsides to that approach. No existing filesystem will even notice (they'll all have NULL in the new f_op member), and it would allow pipes etc to be sped up and use less memory. Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Tue, 6 Mar 2007, Linus Torvalds wrote: [ Al Viro added to Cc: as the arbiter of good taste in the VFS layer. He has veto powers even over my proposals ;^] On Tue, 6 Mar 2007, Davide Libenzi wrote: I currently have the dentry to carry the name of the file* class it is linked to. I'd prefer to keep it that way, unless there are huge factors against. I assume that the *only* reason for having multiple dentries is really just the output in /proc/pid/fd/, right? Or is there any other reason to have separate dentries for these pseudo-files? It's a bit sad to waste that much memory (and time) on something like that. I bet that the dentry setup is a noticeable part of the whole sigfd()/timerfd() setup. It's likely also a big part of any memory footprint if you have lots of them. So how about just doing: - do a single dentry - make a struct file_operations member function that prints out the name of the thing in /proc/pid/fd/, and which *defaults* to just doing the d_path() on the dentry, but special filesystems like this could do something else (like print out a fake inode number from the file-f_private_data information) There seems to really be no downsides to that approach. No existing filesystem will even notice (they'll all have NULL in the new f_op member), and it would allow pipes etc to be sped up and use less memory. I'm OK with everything that avoid code duplication due to those fake inodes. The change can be localized inside the existing API, so it doesn't really affect me externally. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Tue, 6 Mar 2007, Davide Libenzi wrote: I'm OK with everything that avoid code duplication due to those fake inodes. The change can be localized inside the existing API, so it doesn't really affect me externally. Can you try with the first patch version that doesn't do anything special at all, and just uses a single dentry. Yeah, the dentry name will be identical, and so you would see something like 7 - signalfd:signalfd when you do ls -l /proc/pid/fd/ on a task that has such a special file descriptor (with no way to tell different timerfd's and signalfd's apart), but I think it's better to start off simple than to overdesign things. And trying to tell them apart sounds a bit like overdesign, if only because I really don't see why anybody would really *care*. So it's a timer for poll/select/epoll - why care about anything else? If it really really turns out that people care, we know how we can do it. We'd hook into proc_fd_link() and we'd allow a per-file mntget/dget that we could use to let special filesystems create fake entries on demand. So it's not impossible, it's just likely simply not needed. Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Tue, 6 Mar 2007, Linus Torvalds wrote: On Tue, 6 Mar 2007, Davide Libenzi wrote: I'm OK with everything that avoid code duplication due to those fake inodes. The change can be localized inside the existing API, so it doesn't really affect me externally. Can you try with the first patch version that doesn't do anything special at all, and just uses a single dentry. Yeah, the dentry name will be identical, and so you would see something like 7 - signalfd:signalfd when you do ls -l /proc/pid/fd/ on a task that has such a special file descriptor (with no way to tell different timerfd's and signalfd's apart), but I think it's better to start off simple than to overdesign things. And trying to tell them apart sounds a bit like overdesign, if only because I really don't see why anybody would really *care*. So it's a timer for poll/select/epoll - why care about anything else? The code would not change/shrink much with the single dentry. We'd save memory, and we'd lose the capability of seeing aino:[CLASS]. Both ways are fine with me. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
Linus Torvalds a écrit : I assume that the *only* reason for having multiple dentries is really just the output in /proc/pid/fd/, right? Or is there any other reason to have separate dentries for these pseudo-files? It's a bit sad to waste that much memory (and time) on something like that. I bet that the dentry setup is a noticeable part of the whole sigfd()/timerfd() setup. It's likely also a big part of any memory footprint if you have lots of them. So how about just doing: - do a single dentry - make a struct file_operations member function that prints out the name of the thing in /proc/pid/fd/, and which *defaults* to just doing the d_path() on the dentry, but special filesystems like this could do something else (like print out a fake inode number from the file-f_private_data information) There seems to really be no downsides to that approach. No existing filesystem will even notice (they'll all have NULL in the new f_op member), and it would allow pipes etc to be sped up and use less memory. I would definitly *love* saving dentries for pipes (and sockets too), but how are you going to get the inode ? pipes()/sockets() can use read()/write()/rw_verify_area() and thus need file-f_path.dentry-d_inode (so each pipe needs a separate dentry) Are you suggesting adding a new struct file_operations member to get the inode ? Or re-intoducing an inode pointer in struct file ? - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Wed, 7 Mar 2007, Eric Dumazet wrote: I would definitly *love* saving dentries for pipes (and sockets too), but how are you going to get the inode ? I was not planning to touch anything but epoll, signalfd and timerfd files. pipes()/sockets() can use read()/write()/rw_verify_area() and thus need file-f_path.dentry-d_inode (so each pipe needs a separate dentry) Currently, they use a single inode, and multiple dentries (to give the name of the class). But this could be changed to a single dentry like Linus was suggesting. I'll wait for Al's reply before doing anything. Memory saving can be something, on top of the already big one of avoiding code duplication. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
Eric Dumazet a écrit : I would definitly *love* saving dentries for pipes (and sockets too), but how are you going to get the inode ? pipes()/sockets() can use read()/write()/rw_verify_area() and thus need file-f_path.dentry-d_inode (so each pipe needs a separate dentry) Are you suggesting adding a new struct file_operations member to get the inode ? Or re-intoducing an inode pointer in struct file ? Crazy ideas : (some readers are going to kill me) 1) Use the low order bit of f_path.dentry to say : this pointer is not a pointer to a dentry but the inode pointer (with the low order bit set to 1) OR 2) file-f_path.dentry set to NULL for this special files (so that we dont need to dput() and cache line ping pong the common dentry each time we __fput() a pipe/socket. Same trick could be used for file-f_path.mnt, because there is a big SMP cache line ping/pong to maintain a mnt_count on pipe/sockets mountpoint while these file systems cannot be un-mounted) If dentry is NULL, we get the inode pointer from an overlay of struct file_ra_statef_ra; (because for this special files readahead is unused) This adds some conditional branches of course, but being able to save ram and better use cpu caches might be worth them. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Tue, 6 Mar 2007, Avi Kivity wrote: > > /* File callbacks that implement the eventpoll file behaviour */ > > static const struct file_operations eventpoll_fops = { > > .release= ep_eventpoll_close, > > @@ -763,15 +767,18 @@ > > * using the inode number. > > */ > > error = -ENOMEM; > > - sprintf(name, "[%lu]", inode->i_ino); > > + sprintf(name, "[%p]", ep); > > this.name = name; > > this.len = strlen(name); > > - this.hash = inode->i_ino; > > + this.hash = 0; > > dentry = d_alloc(eventpoll_mnt->mnt_sb->s_root, ); > > if (!dentry) > > goto eexit_4; > > dentry->d_op = _dentry_operations; > > - d_add(dentry, inode); > > + /* Do not publish this dentry inside the global dentry hash table */ > > + dentry->d_flags &= ~DCACHE_UNHASHED; > > + d_instantiate(dentry, inode); > > + > > > > I've used d_alloc_anon() in similar code in kvmfs. You lose the ep > pointer in the name, but as a kernel address it isn't particularly > useful to userspace. That ends up calling two times the find alias for an unhashed dentry, no? And this linearly scans the inode's dentry list, no? - Davide - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
Davide Libenzi a écrit : * using the inode number. */ error = -ENOMEM; - sprintf(name, "[%lu]", inode->i_ino); + sprintf(name, "[%p]", ep); this.name = name; this.len = strlen(name); Small remark : you can avoid strlen(), since sprintf gives you the value :) this.len = sprintf(name, "[%p]", ep); Also, your patch description is not complete : you forgot to say that epoll dentries are not anymore hashed into global dcache hashtable :) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch] epoll use a single inode ...
Epoll does not keep any private data attached to its inode, so there'd be no need to allocate one inode per fd. For epoll, the inode is just a placeholder for the file operations and could be shared by all instances. I'd like to use the same optimization even for the upcoming file-based objects, so if you see problems let me know. One that Al was pointing out was that an fstat(2) over an epoll fd would show the same st_ino. IMO that should be fine since an fstat(2) over an epoll fd is not something you want to do in any case and expecting meaningfull results. Signed-off-by: Davide Libenzi - Davide eventpoll.c | 36 1 file changed, 32 insertions(+), 4 deletions(-) Index: linux-2.6.20.ep2/fs/eventpoll.c === --- linux-2.6.20.ep2.orig/fs/eventpoll.c2007-03-04 14:40:01.0 -0800 +++ linux-2.6.20.ep2/fs/eventpoll.c 2007-03-05 13:03:52.0 -0800 @@ -258,6 +258,7 @@ int maxevents, long timeout); static int eventpollfs_delete_dentry(struct dentry *dentry); static struct inode *ep_eventpoll_inode(void); +static struct inode *ep_create_inode(void); static int eventpollfs_get_sb(struct file_system_type *fs_type, int flags, const char *dev_name, void *data, struct vfsmount *mnt); @@ -279,6 +280,9 @@ /* Virtual fs used to allocate inodes for eventpoll files */ static struct vfsmount *eventpoll_mnt __read_mostly; +/* Placeholder inode for eventpoll fds */ +static struct inode *eventpoll_inode; + /* File callbacks that implement the eventpoll file behaviour */ static const struct file_operations eventpoll_fops = { .release= ep_eventpoll_close, @@ -763,15 +767,18 @@ * using the inode number. */ error = -ENOMEM; - sprintf(name, "[%lu]", inode->i_ino); + sprintf(name, "[%p]", ep); this.name = name; this.len = strlen(name); - this.hash = inode->i_ino; + this.hash = 0; dentry = d_alloc(eventpoll_mnt->mnt_sb->s_root, ); if (!dentry) goto eexit_4; dentry->d_op = _dentry_operations; - d_add(dentry, inode); + /* Do not publish this dentry inside the global dentry hash table */ + dentry->d_flags &= ~DCACHE_UNHASHED; + d_instantiate(dentry, inode); + file->f_path.mnt = mntget(eventpoll_mnt); file->f_path.dentry = dentry; file->f_mapping = inode->i_mapping; @@ -1555,6 +1562,11 @@ static int eventpollfs_delete_dentry(struct dentry *dentry) { + /* +* We faked vfs to believe the dentry was hashed when we created it. +* Now we restore the flag so that dput() will work correctly. +*/ + dentry->d_flags |= DCACHE_UNHASHED; return 1; } @@ -1562,6 +1574,17 @@ static struct inode *ep_eventpoll_inode(void) { + + return igrab(eventpoll_inode); +} + +/* + * A single inode exist for all eventpoll files. On the contrary of pipes, + * eventpoll inodes has no per-instance data associated, so we can avoid + * the allocation of multiple of them. + */ +static struct inode *ep_create_inode(void) +{ int error = -ENOMEM; struct inode *inode = new_inode(eventpoll_mnt->mnt_sb); @@ -1626,10 +1649,14 @@ /* Mount the above commented virtual file system */ eventpoll_mnt = kern_mount(_fs_type); - error = PTR_ERR(eventpoll_mnt); if (IS_ERR(eventpoll_mnt)) goto epanic; + /* Create the single instance of inode for all eventpoll fds */ + eventpoll_inode = ep_create_inode(); + if (IS_ERR(eventpoll_inode)) + goto epanic; + DNPRINTK(3, (KERN_INFO "[%p] eventpoll: successfully initialized.\n", current)); return 0; @@ -1642,6 +1669,7 @@ static void __exit eventpoll_exit(void) { /* Undo all operations done inside eventpoll_init() */ + iput(eventpoll_inode); unregister_filesystem(_fs_type); mntput(eventpoll_mnt); kmem_cache_destroy(pwq_cache); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch] epoll use a single inode ...
Epoll does not keep any private data attached to its inode, so there'd be no need to allocate one inode per fd. For epoll, the inode is just a placeholder for the file operations and could be shared by all instances. I'd like to use the same optimization even for the upcoming file-based objects, so if you see problems let me know. One that Al was pointing out was that an fstat(2) over an epoll fd would show the same st_ino. IMO that should be fine since an fstat(2) over an epoll fd is not something you want to do in any case and expecting meaningfull results. Signed-off-by: Davide Libenzi davidel@xmailserver.org - Davide eventpoll.c | 36 1 file changed, 32 insertions(+), 4 deletions(-) Index: linux-2.6.20.ep2/fs/eventpoll.c === --- linux-2.6.20.ep2.orig/fs/eventpoll.c2007-03-04 14:40:01.0 -0800 +++ linux-2.6.20.ep2/fs/eventpoll.c 2007-03-05 13:03:52.0 -0800 @@ -258,6 +258,7 @@ int maxevents, long timeout); static int eventpollfs_delete_dentry(struct dentry *dentry); static struct inode *ep_eventpoll_inode(void); +static struct inode *ep_create_inode(void); static int eventpollfs_get_sb(struct file_system_type *fs_type, int flags, const char *dev_name, void *data, struct vfsmount *mnt); @@ -279,6 +280,9 @@ /* Virtual fs used to allocate inodes for eventpoll files */ static struct vfsmount *eventpoll_mnt __read_mostly; +/* Placeholder inode for eventpoll fds */ +static struct inode *eventpoll_inode; + /* File callbacks that implement the eventpoll file behaviour */ static const struct file_operations eventpoll_fops = { .release= ep_eventpoll_close, @@ -763,15 +767,18 @@ * using the inode number. */ error = -ENOMEM; - sprintf(name, [%lu], inode-i_ino); + sprintf(name, [%p], ep); this.name = name; this.len = strlen(name); - this.hash = inode-i_ino; + this.hash = 0; dentry = d_alloc(eventpoll_mnt-mnt_sb-s_root, this); if (!dentry) goto eexit_4; dentry-d_op = eventpollfs_dentry_operations; - d_add(dentry, inode); + /* Do not publish this dentry inside the global dentry hash table */ + dentry-d_flags = ~DCACHE_UNHASHED; + d_instantiate(dentry, inode); + file-f_path.mnt = mntget(eventpoll_mnt); file-f_path.dentry = dentry; file-f_mapping = inode-i_mapping; @@ -1555,6 +1562,11 @@ static int eventpollfs_delete_dentry(struct dentry *dentry) { + /* +* We faked vfs to believe the dentry was hashed when we created it. +* Now we restore the flag so that dput() will work correctly. +*/ + dentry-d_flags |= DCACHE_UNHASHED; return 1; } @@ -1562,6 +1574,17 @@ static struct inode *ep_eventpoll_inode(void) { + + return igrab(eventpoll_inode); +} + +/* + * A single inode exist for all eventpoll files. On the contrary of pipes, + * eventpoll inodes has no per-instance data associated, so we can avoid + * the allocation of multiple of them. + */ +static struct inode *ep_create_inode(void) +{ int error = -ENOMEM; struct inode *inode = new_inode(eventpoll_mnt-mnt_sb); @@ -1626,10 +1649,14 @@ /* Mount the above commented virtual file system */ eventpoll_mnt = kern_mount(eventpoll_fs_type); - error = PTR_ERR(eventpoll_mnt); if (IS_ERR(eventpoll_mnt)) goto epanic; + /* Create the single instance of inode for all eventpoll fds */ + eventpoll_inode = ep_create_inode(); + if (IS_ERR(eventpoll_inode)) + goto epanic; + DNPRINTK(3, (KERN_INFO [%p] eventpoll: successfully initialized.\n, current)); return 0; @@ -1642,6 +1669,7 @@ static void __exit eventpoll_exit(void) { /* Undo all operations done inside eventpoll_init() */ + iput(eventpoll_inode); unregister_filesystem(eventpoll_fs_type); mntput(eventpoll_mnt); kmem_cache_destroy(pwq_cache); - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
Davide Libenzi a écrit : * using the inode number. */ error = -ENOMEM; - sprintf(name, [%lu], inode-i_ino); + sprintf(name, [%p], ep); this.name = name; this.len = strlen(name); Small remark : you can avoid strlen(), since sprintf gives you the value :) this.len = sprintf(name, [%p], ep); Also, your patch description is not complete : you forgot to say that epoll dentries are not anymore hashed into global dcache hashtable :) - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] epoll use a single inode ...
On Tue, 6 Mar 2007, Avi Kivity wrote: /* File callbacks that implement the eventpoll file behaviour */ static const struct file_operations eventpoll_fops = { .release= ep_eventpoll_close, @@ -763,15 +767,18 @@ * using the inode number. */ error = -ENOMEM; - sprintf(name, [%lu], inode-i_ino); + sprintf(name, [%p], ep); this.name = name; this.len = strlen(name); - this.hash = inode-i_ino; + this.hash = 0; dentry = d_alloc(eventpoll_mnt-mnt_sb-s_root, this); if (!dentry) goto eexit_4; dentry-d_op = eventpollfs_dentry_operations; - d_add(dentry, inode); + /* Do not publish this dentry inside the global dentry hash table */ + dentry-d_flags = ~DCACHE_UNHASHED; + d_instantiate(dentry, inode); + I've used d_alloc_anon() in similar code in kvmfs. You lose the ep pointer in the name, but as a kernel address it isn't particularly useful to userspace. That ends up calling two times the find alias for an unhashed dentry, no? And this linearly scans the inode's dentry list, no? - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/