Re: [patch] epoll use a single inode ...

2007-03-08 Thread Anton Blanchard

Hi,

> Well, PowerPC "dcbt" does prefetch() correctly, it doesn't ever raise  
> exceptions, doesn't have any side effects, takes only enough CPU to  
> decode the address, and is ignored if it would have to do anything  
> other than load the cacheline from RAM.  Prefetch streams are halted  
> when they reach the end of a page boundary (no trapping to the MMU)  
> and if the TLB entry isn't present then they would asynchronously  
> abort.  

It depends on the implementation and the HID bit settings. Some do walk
the MMU hashtable if it isnt in the TLB.

Anton
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-08 Thread Anton Blanchard

> OK, 200 cycles...
> 
> But what is the cost of the conditional branch you added in prefetch(x) ?

Much less than the tablewalk. On ppc64 a tablewalk of an address that is
not populated in the hashtable will incur 2 cacheline lookups (primary
and secondary buckets). This plus the MMU state machine overhead adds up.

Cue Linus rant about PowerPC MMU :)

Anton
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-08 Thread Anton Blanchard

> Yeah, I'm not at all surprised. Any implementation of "prefetch" that 
> doesn't just turn into a no-op if the TLB entry doesn't exist (which makes 
> them weaker for *actual* prefetching) will generally have a hard time with 
> a NULL pointer. Exactly because it will try to do a totally unnecessary 
> TLB fill - and since most CPU's will not cache negative TLB entries, that 
> unnecessary TLB fill will be done over and over and over again..

Yeah this is exactly what we were seeing :)

Anton
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-08 Thread Kyle Moffett

On Mar 08, 2007, at 02:24:04, Eric Dumazet wrote:

Kyle Moffett a écrit :
Prefetching is also fairly critical on a Power4 or G5 PowerPC  
system as they have a long memory latency; an L2-cache miss can  
cost 200+ cycles.  On such systems the "dcbt" prefetch instruction  
brings in a single 128-byte cacheline and has no serializing  
effects whatsoever, making it ideal for use in a linked-list- 
traversal inner loop.


OK, 200 cycles...

But what is the cost of the conditional branch you added in prefetch 
(x) ?


if (!x) return;

(correctly predicted or not, but do powerPC have a BTB ?)


Well, PowerPC "dcbt" does prefetch() correctly, it doesn't ever raise  
exceptions, doesn't have any side effects, takes only enough CPU to  
decode the address, and is ignored if it would have to do anything  
other than load the cacheline from RAM.  Prefetch streams are halted  
when they reach the end of a page boundary (no trapping to the MMU)  
and if the TLB entry isn't present then they would asynchronously  
abort.  This is a suitable prefetch implementation for G5:


  #define prefetch(x) __asm__ __volatile__ ("dcbt 0, %0"::"r"(x));

On GCC4+ the "__builtin_prefetch" built-in functions is probably  
mildly better tuned for most architectures:


  #define PREFETCH_RD 0
  #define PREFETCH_WR 1
  #define PREFETCH_LOCALITY_NONE 0
  #define PREFETCH_LOCALITY_LOW  1
  #define PREFETCH_LOCALITY_MED  2
  #define PREFETCH_LOCALITY_HIGH 3
  #define prefetch(x, args...) __builtin_prefetch(x ,##args)

The pertinent GCC docs:
This function is used to minimize cache-miss latency by moving data  
into a cache before it is accessed. You can insert calls to  
__builtin_prefetch into code for which you know addresses of data  
in memory that is likely to be accessed soon. If the target  
supports them, data prefetch instructions will be generated. If the  
prefetch is done early enough before the access then the data will  
be in the cache by the time it is accessed.


The value of addr is the address of the memory to prefetch. There  
are two optional arguments, rw and locality. The value of rw is a  
compile-time constant one or zero; one means that the prefetch is  
preparing for a write to the memory address and zero, the default,  
means that the prefetch is preparing for a read. The value locality  
must be a compile-time constant integer between zero and three. A  
value of zero means that the data has no temporal locality, so it  
need not be left in the cache after the access. A value of three  
means that the data has a high degree of temporal locality and  
should be left in all levels of cache possible. Values of one and  
two mean, respectively, a low or moderate degree of temporal  
locality. The default is three.

for (i = 0; i < n; i++) {
a[i] = a[i] + b[i];
__builtin_prefetch ([i+j], 1, 1);
__builtin_prefetch ([i+j], 0, 1);
/* ... */
}

Data prefetch does not generate faults if addr is invalid, but the  
address expression itself must be valid. For example, a prefetch of  
p->next will not fault if p->next is not a valid address, but  
evaluation will fault if p is not a valid address.


If the target does not support data prefetch, the address  
expression is evaluated if it includes side effects but no other  
code is generated and GCC does not issue a warning.


Cheers,
Kyle Moffett

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-08 Thread Bob Copeland

+   its done only when the path is needed.). Real filesystems probably



+   dont want to use it, because their dentries are present in global


Pedantry: it's and don't?

-Bob
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-08 Thread Valdis . Kletnieks
On Thu, 08 Mar 2007 08:24:04 +0100, Eric Dumazet said:
>
> But what is the cost of the conditional branch you added in prefetch(x) ?
> 
> if (!x) return;
> 
> (correctly predicted or not, but do powerPC have a BTB ?)
> 
> About the NULL 'potential problem', maybe we could use a dummy nil (but 
> mapped) object, and use its address in lists, ie compare for  instead of 
> NULL. This would avoid :
> 
> - The conditional test in some prefetch() implementations
> - The potential TLB problem with the NULL value.

You avoid those two, but add the very high likelyhood that a statement of the
form 'if (!x) {}' will creep back in and bug out.  I doubt that changing
the form of a basic C idiom to save a few cycles is worth it, especially when
the (!x) form can be tested without a memory fetch, but (x != nil) may require
fetching 'nil' (remember - the x86 is a very popular chipset, but register
starved - if the loop is any size at all, 'nil' may require a reload each
time around, costing you the win you thought you had


pgpWRgqXbzEAw.pgp
Description: PGP signature


Re: [patch] epoll use a single inode ...

2007-03-08 Thread Linus Torvalds


On Thu, 8 Mar 2007, Eric Dumazet wrote:
> 
> Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>
> CC: Christoph Hellwig <[EMAIL PROTECTED]>
> CC: Linus Torvalds <[EMAIL PROTECTED]>

Acked-by: Linus Torvalds <[EMAIL PROTECTED]>

Except you should fix the subject line when you send it out to Andrew ;)

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-08 Thread Christoph Hellwig

I'm sorry about complaining again, but..

>  
> + /*
> +  * Some filesystems want to provide dentry's pathname themselves,
> +  * instead of pre-building names at dentry creation.
> +  */

It's not _some_ filesystems.  If real filesystem did this we'd be in
horrible trouble.  It's "synthetic, non-mountable" filesystem.

Except for this the patch looks fine to me, thanks Eric!
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-08 Thread Eric Dumazet
Thanks again for your feedback Christoph :)

I think I addressed all your remarks.

Thank you

[PATCH] Delay the dentry name generation on sockets and pipes.

1) Introduces a new method in 'struct dentry_operations'. This method called 
d_dname() might be called from d_path() to build a pathname 
for special filesystems. It is called without locks.

Future patches (if we succeed in having one common dentry for all pipes) may 
need to change prototype of this method, but we now use :
char *d_dname(struct dentry *dentry, char *buffer, int buflen)


2) Use this new method for sockets : No more sprintf() at socket creation. 
This is delayed up to the moment someone does an access to /proc/pid/fd/...

3) Use this new method for pipes : No more sprintf() at pipe creation. This is 
delayed up to the moment someone does an access to /proc/pid/fd/...

A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a 
*nice* speedup on my Pentium(M) 1.6 Ghz :

3.090 s instead of 3.450 s

Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>
CC: Christoph Hellwig <[EMAIL PROTECTED]>
CC: Linus Torvalds <[EMAIL PROTECTED]>
CC: Davide Libenzi 
CC: Al Viro <[EMAIL PROTECTED]>

 Documentation/filesystems/Locking |2 ++
 Documentation/filesystems/vfs.txt |   12 +++-
 fs/dcache.c   |7 +++
 fs/pipe.c |   15 +--
 include/linux/dcache.h|1 +
 net/socket.c  |   17 ++---
 6 files changed, 40 insertions(+), 14 deletions(-)

--- linux-2.6.21-rc3/include/linux/dcache.h 2007-03-07 17:23:55.0 
+0100
+++ linux-2.6.21-rc3-ed/include/linux/dcache.h  2007-03-08 11:57:41.0 
+0100
@@ -133,6 +133,7 @@ struct dentry_operations {
int (*d_delete)(struct dentry *);
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
+   char *(*d_dname)(struct dentry *, char *, int);
 };
 
 /* the dentry parameter passed to d_hash and d_compare is the parent
--- linux-2.6.21-rc3/Documentation/filesystems/vfs.txt  2007-03-08 
10:14:38.0 +0100
+++ linux-2.6.21-rc3-ed/Documentation/filesystems/vfs.txt   2007-03-08 
10:34:34.0 +0100
@@ -827,7 +827,7 @@ This describes how a filesystem can over
 operations. Dentries and the dcache are the domain of the VFS and the
 individual filesystem implementations. Device drivers have no business
 here. These methods may be set to NULL, as they are either optional or
-the VFS uses a default. As of kernel 2.6.13, the following members are
+the VFS uses a default. As of kernel 2.6.22, the following members are
 defined:
 
 struct dentry_operations {
@@ -837,6 +837,7 @@ struct dentry_operations {
int (*d_delete)(struct dentry *);
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
+   char *(*d_dname)(struct dentry *, char *, int);
 };
 
   d_revalidate: called when the VFS needs to revalidate a dentry. This
@@ -859,6 +860,15 @@ struct dentry_operations {
VFS calls iput(). If you define this method, you must call
iput() yourself
 
+  d_dname: called when the pathname of a dentry should be generated.
+   Usefull for some pseudo filesystems (sockfs, pipefs, ...) to delay
+   pathname generation. (Instead of doing it when dentry is created,
+   its done only when the path is needed.). Real filesystems probably
+   dont want to use it, because their dentries are present in global
+   dcache hash, so their hash should be an invariant. As no lock is
+   held, d_dname() should not try to modify the dentry itself, unless
+   appropriate SMP safety is used.
+
 Each dentry has a pointer to its parent dentry, as well as a hash list
 of child dentries. Child dentries are basically like files in a
 directory.
--- linux-2.6.21-rc3/Documentation/filesystems/Locking  2007-03-08 
10:29:04.0 +0100
+++ linux-2.6.21-rc3-ed/Documentation/filesystems/Locking   2007-03-08 
10:29:04.0 +0100
@@ -15,6 +15,7 @@ prototypes:
int (*d_delete)(struct dentry *);
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
+   char *(*d_dname)((struct dentry *dentry, char *buffer, int buflen);
 
 locking rules:
none have BKL
@@ -25,6 +26,7 @@ d_compare:no  yes no  
no 
 d_delete:  yes no  yes no
 d_release: no  no  no  yes
 d_iput:no  no  no  yes
+d_dname:   no  no  no  no
 
 --- inode_operations --- 
 prototypes:
--- linux-2.6.21-rc3/fs/dcache.c2007-03-07 17:23:55.0 +0100
+++ linux-2.6.21-rc3-ed/fs/dcache.c 2007-03-08 11:57:41.0 +0100
@@ -1823,6 +1823,13 @@ char * 

Re: [patch] epoll use a single inode ...

2007-03-08 Thread Christoph Hellwig
On Thu, Mar 08, 2007 at 10:42:21AM +0100, Eric Dumazet wrote:
> @@ -1823,6 +1823,9 @@ char * d_path(struct dentry *dentry, str
>   struct vfsmount *rootmnt;
>   struct dentry *root;
>  
> + if (dentry->d_op && dentry->d_op->d_dname)
> + return (dentry->d_op->d_dname)(dentry, buf, buflen);

Please don't put braces around the function pointer call.  Also I
think we really want a comment here why we need this call.

> +static char * pipefs_dname(struct dentry *dentry, char *buffer, int buflen)

normally this would be:


static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)

> +static char * sockfs_dname(struct dentry *dentry, char *buffer, int buflen)

same here.

> - char name[32];
>  
> - this.len = sprintf(name, "[%lu]", SOCK_INODE(sock)->i_ino);
> - this.name = name;
> + this.len = 0;
> + this.name = "";
>   this.hash = 0;

Can you add a helper to init this one?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-08 Thread Eric Dumazet
On Thursday 08 March 2007 09:56, Christoph Hellwig wrote:
> This patch needs a lot more documentation.  It needs some really big
> comments on why this should never ever be used for a real filesystem
> (real as in user mountable), and probably add an assert for that
> invariant somewhere.  Please also update Documentation/filesystems/Locking
> and Documentation/filesystems/vfs.txt for it.

Thank you for your feedback Christoph

Here is the version I was about to push to -mm, so if nobody complains, I ask 
Andrew to push it to mm so that it can reach 2.6.22 target

[PATCH] Delay the dentry name generation on sockets and pipes.

1) Introduces a new method in 'struct dentry_operations'. This method called 
d_dname() might be called from d_path() to build a pathname 
for special filesystems. It is called without locks.

Future patches (if we succeed in having one common dentry for all pipes) may 
need to change prototype of this method, but we now use :
char *d_dname(struct dentry *dentry, char *buffer, int buflen)


2) Use this new method for sockets : No more sprintf() at socket creation. 
This is delayed up to the moment someone does an access to /proc/pid/fd/...

3) Use this new method for pipes : No more sprintf() at pipe creation. This is 
delayed up to the moment someone does an access to /proc/pid/fd/...

A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a 
*nice* speedup on my Pentium(M) 1.6 Ghz :

3.090 s instead of 3.450 s

Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>
CC: Christoph Hellwig <[EMAIL PROTECTED]>
CC: Linus Torvalds <[EMAIL PROTECTED]>
CC: Davide Libenzi 
CC: Al Viro <[EMAIL PROTECTED]>

 Documentation/filesystems/Locking |2 ++
 Documentation/filesystems/vfs.txt |   12 +++-
 fs/dcache.c   |3 +++
 fs/pipe.c |   12 +---
 include/linux/dcache.h|1 +
 net/socket.c  |   13 ++---
 6 files changed, 36 insertions(+), 7 deletions(-)
--- linux-2.6.21-rc3/include/linux/dcache.h 2007-03-07 17:23:55.0 
+0100
+++ linux-2.6.21-rc3-ed/include/linux/dcache.h  2007-03-08 10:14:38.0 
+0100
@@ -133,6 +133,7 @@ struct dentry_operations {
int (*d_delete)(struct dentry *);
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
+   char * (*d_dname)(struct dentry *, char *, int);
 };
 
 /* the dentry parameter passed to d_hash and d_compare is the parent
--- linux-2.6.21-rc3/Documentation/filesystems/vfs.txt  2007-03-08 
10:14:38.0 +0100
+++ linux-2.6.21-rc3-ed/Documentation/filesystems/vfs.txt   2007-03-08 
10:34:34.0 +0100
@@ -827,7 +827,7 @@ This describes how a filesystem can over
 operations. Dentries and the dcache are the domain of the VFS and the
 individual filesystem implementations. Device drivers have no business
 here. These methods may be set to NULL, as they are either optional or
-the VFS uses a default. As of kernel 2.6.13, the following members are
+the VFS uses a default. As of kernel 2.6.22, the following members are
 defined:
 
 struct dentry_operations {
@@ -837,6 +837,7 @@ struct dentry_operations {
int (*d_delete)(struct dentry *);
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
+   char * (*d_dname)(struct dentry *, char *, int);
 };
 
   d_revalidate: called when the VFS needs to revalidate a dentry. This
@@ -859,6 +860,15 @@ struct dentry_operations {
VFS calls iput(). If you define this method, you must call
iput() yourself
 
+  d_dname: called when the pathname of a dentry should be generated.
+   Usefull for some pseudo filesystems (sockfs, pipefs, ...) to delay
+   pathname generation. (Instead of doing it when dentry is created,
+   its done only when the path is needed.). Real filesystems probably
+   dont want to use it, because their dentries are present in global
+   dcache hash, so their hash should be an invariant. As no lock is
+   held, d_dname() should not try to modify the dentry itself, unless
+   appropriate SMP safety is used.
+
 Each dentry has a pointer to its parent dentry, as well as a hash list
 of child dentries. Child dentries are basically like files in a
 directory.
--- linux-2.6.21-rc3/Documentation/filesystems/Locking  2007-03-08 
10:29:04.0 +0100
+++ linux-2.6.21-rc3-ed/Documentation/filesystems/Locking   2007-03-08 
10:29:04.0 +0100
@@ -15,6 +15,7 @@ prototypes:
int (*d_delete)(struct dentry *);
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
+   char * (*d_dname)((struct dentry *dentry, char *buffer, int buflen);
 
 locking rules:
none have BKL
@@ -25,6 +26,7 @@ d_compare:no  yes no  
no 
 d_delete:  yes no  yes no
 d_release: no

Re: [patch] epoll use a single inode ...

2007-03-08 Thread Christoph Hellwig
This patch needs a lot more documentation.  It needs some really big
comments on why this should never ever be used for a real filesystem
(real as in user mountable), and probably add an assert for that
invariant somewhere.  Please also update Documentation/filesystems/Locking
and Documentation/filesystems/vfs.txt for it.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-08 Thread Michael K. Edwards

On 3/7/07, Linus Torvalds <[EMAIL PROTECTED]> wrote:

No, I just checked, and Intel's own optimization manual makes it clear
that you should be careful. They talk about performance penalties due to
resource constraints - which makes tons of sense with a core that is good
at handling its own resources and could quite possibly use those resources
better to actually execute the loads and stores deeper down the
instruction pipeline.


Certainly you should be careful -- and that usually means leaving it
up to the compiler.  But hinting to the compiler can help; there may
be an analogue of the (un)likely macros waiting to be implemented for
loop prefetch.  And while out-of-order execution and fancy hardware
prefetch streams greatly reduce the need for explicit prefetch in
general code, there's no substitute for the cache _bypassing_
instructions when trying to avoid excessive cache eviction in DDoS
situations.

For instance, if I wind up working on a splay-tree variant of Robert
Olsson's trie/hash work, I'll try to measure the effect of using SSE2
non-temporal stores to write half-open connections to the leaves of
the tree.  That may give some additional improvement in the ability to
keep servicing real load during a SYN flood.


So it's not just 3DNow! making AMD look bad, or Intel would obviously
suggest people use it out of the wazoo ;)


Intel puts a lot of effort into educating compiler writers about the
optimal prefetch insertion strategies for particular cache
architectures.  At the same time, they put out the orange cones to
warn people off of hand-tuning prefetch placement using
micro-benchmarks.  People did that when 3DNow! first came out, with
predictable consequences.


> XScale gets it right.

Blah. XScale isn't even an OoO CPU, *of*course* it needs prefetching.
Calling that "getting it right" is ludicrous. If anything, it gets things
so wrong that prefetching is *required* for good performance.


That's not an accident.  Hardware prefetch units cost a lot in power
consumption.  Omitting the hardware prefetch unit and drastically
simplifying the pipeline is how they got a design whose clock they
could crank into the stratosphere and still run on battery power.  And
in the network processor space, they can bolt a myriad of on-chip
microengines and still have some prayer of accurately simulating the
patterns of internal bus cycles.  Errors in simulation can still be
fixed up with prefetch instruction placement to put memory accesses
from the XScale core into phases where the data path processors aren't
working so hard.

Moreover, because they're embedded targets and rarely have to run
third-party binaries originally compiled for older cores, it didn't
really cost them anything to say, "Sorry, this chip's performance is
going to suck if your compiler's prefetch insertion isn't properly
tuned."  The _only_ cost is a slightly less dense instruction stream.
That's not trivial but it's manageable; you budget for it, and the
increase in I-cache power consumption is more than made up for by the
reduction in erroneous data prefetches (hardware prefetch gets it
wrong a substantial fraction of the time).


I'm talking about real CPU's with real memory pipelines that already do
prefetching in hardware. The better the core is, the less the prefetch
helps (and often the more it hurts in comparison to how much it helps).


The more sophisticated the core is, the less software prefetch
instructions help.  But more sophisticated isn't always better; it
depends on your target applications.


But if you mean "doesn't try to fill the TLB on data prefetches", then
yes, that's generally the right thing to do.


AOL.


> (Oddly, Prescott seems to have initiated a page table walk on DTLB miss
> during software prefetch -- just one of many weird Prescott flaws.)

Netburst in general is *very* happy to do speculative TLB fills, I think.


Design by micro-benchmark.  :-)  They set out to push the headline MHz
and the real memory bandwidth to the limit in Prescott, and they
succeeded (data at
http://www.digit-life.com/articles2/rmma/rmma-p4.html).  At a horrible
cost in power per clock, and no gain in real application performance.
So NetBurst died a horrible death, and now we have "Intel Core" -- P6
warmed over, with caches sized such that for most applications the
second core ought to be used solely to soak up control path overheads.

Windows runs better on dual-core machines because the NT kernel will
happily eat an entire core doing memory bookkeeping.  Linux could take
a hint here and use the second core largely for interrupt handling and
force-priming the L2 cache on task switch.  (Prefetch instructions
aren't much use here, precisely because they give up on DTLB miss.)
Any kernel code paths that are known to stall a lot because of
usually-cold-cache access patterns (TCP connection establishment, for
instance) can also be punted over to the second core.  If you're
feeling industrious, use non-temporal memory accesses 

Re: [patch] epoll use a single inode ...

2007-03-08 Thread Linus Torvalds


On Thu, 8 Mar 2007, Eric Dumazet wrote:
 
 Signed-off-by: Eric Dumazet [EMAIL PROTECTED]
 CC: Christoph Hellwig [EMAIL PROTECTED]
 CC: Linus Torvalds [EMAIL PROTECTED]

Acked-by: Linus Torvalds [EMAIL PROTECTED]

Except you should fix the subject line when you send it out to Andrew ;)

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-08 Thread Valdis . Kletnieks
On Thu, 08 Mar 2007 08:24:04 +0100, Eric Dumazet said:

 But what is the cost of the conditional branch you added in prefetch(x) ?
 
 if (!x) return;
 
 (correctly predicted or not, but do powerPC have a BTB ?)
 
 About the NULL 'potential problem', maybe we could use a dummy nil (but 
 mapped) object, and use its address in lists, ie compare for nil instead of 
 NULL. This would avoid :
 
 - The conditional test in some prefetch() implementations
 - The potential TLB problem with the NULL value.

You avoid those two, but add the very high likelyhood that a statement of the
form 'if (!x) {}' will creep back in and bug out.  I doubt that changing
the form of a basic C idiom to save a few cycles is worth it, especially when
the (!x) form can be tested without a memory fetch, but (x != nil) may require
fetching 'nil' (remember - the x86 is a very popular chipset, but register
starved - if the loop is any size at all, 'nil' may require a reload each
time around, costing you the win you thought you had


pgpWRgqXbzEAw.pgp
Description: PGP signature


Re: [patch] epoll use a single inode ...

2007-03-08 Thread Bob Copeland

+   its done only when the path is needed.). Real filesystems probably



+   dont want to use it, because their dentries are present in global


Pedantry: it's and don't?

-Bob
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-08 Thread Kyle Moffett

On Mar 08, 2007, at 02:24:04, Eric Dumazet wrote:

Kyle Moffett a écrit :
Prefetching is also fairly critical on a Power4 or G5 PowerPC  
system as they have a long memory latency; an L2-cache miss can  
cost 200+ cycles.  On such systems the dcbt prefetch instruction  
brings in a single 128-byte cacheline and has no serializing  
effects whatsoever, making it ideal for use in a linked-list- 
traversal inner loop.


OK, 200 cycles...

But what is the cost of the conditional branch you added in prefetch 
(x) ?


if (!x) return;

(correctly predicted or not, but do powerPC have a BTB ?)


Well, PowerPC dcbt does prefetch() correctly, it doesn't ever raise  
exceptions, doesn't have any side effects, takes only enough CPU to  
decode the address, and is ignored if it would have to do anything  
other than load the cacheline from RAM.  Prefetch streams are halted  
when they reach the end of a page boundary (no trapping to the MMU)  
and if the TLB entry isn't present then they would asynchronously  
abort.  This is a suitable prefetch implementation for G5:


  #define prefetch(x) __asm__ __volatile__ (dcbt 0, %0::r(x));

On GCC4+ the __builtin_prefetch built-in functions is probably  
mildly better tuned for most architectures:


  #define PREFETCH_RD 0
  #define PREFETCH_WR 1
  #define PREFETCH_LOCALITY_NONE 0
  #define PREFETCH_LOCALITY_LOW  1
  #define PREFETCH_LOCALITY_MED  2
  #define PREFETCH_LOCALITY_HIGH 3
  #define prefetch(x, args...) __builtin_prefetch(x ,##args)

The pertinent GCC docs:
This function is used to minimize cache-miss latency by moving data  
into a cache before it is accessed. You can insert calls to  
__builtin_prefetch into code for which you know addresses of data  
in memory that is likely to be accessed soon. If the target  
supports them, data prefetch instructions will be generated. If the  
prefetch is done early enough before the access then the data will  
be in the cache by the time it is accessed.


The value of addr is the address of the memory to prefetch. There  
are two optional arguments, rw and locality. The value of rw is a  
compile-time constant one or zero; one means that the prefetch is  
preparing for a write to the memory address and zero, the default,  
means that the prefetch is preparing for a read. The value locality  
must be a compile-time constant integer between zero and three. A  
value of zero means that the data has no temporal locality, so it  
need not be left in the cache after the access. A value of three  
means that the data has a high degree of temporal locality and  
should be left in all levels of cache possible. Values of one and  
two mean, respectively, a low or moderate degree of temporal  
locality. The default is three.

for (i = 0; i  n; i++) {
a[i] = a[i] + b[i];
__builtin_prefetch (a[i+j], 1, 1);
__builtin_prefetch (b[i+j], 0, 1);
/* ... */
}

Data prefetch does not generate faults if addr is invalid, but the  
address expression itself must be valid. For example, a prefetch of  
p-next will not fault if p-next is not a valid address, but  
evaluation will fault if p is not a valid address.


If the target does not support data prefetch, the address  
expression is evaluated if it includes side effects but no other  
code is generated and GCC does not issue a warning.


Cheers,
Kyle Moffett

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-08 Thread Anton Blanchard

 Yeah, I'm not at all surprised. Any implementation of prefetch that 
 doesn't just turn into a no-op if the TLB entry doesn't exist (which makes 
 them weaker for *actual* prefetching) will generally have a hard time with 
 a NULL pointer. Exactly because it will try to do a totally unnecessary 
 TLB fill - and since most CPU's will not cache negative TLB entries, that 
 unnecessary TLB fill will be done over and over and over again..

Yeah this is exactly what we were seeing :)

Anton
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-08 Thread Anton Blanchard

 OK, 200 cycles...
 
 But what is the cost of the conditional branch you added in prefetch(x) ?

Much less than the tablewalk. On ppc64 a tablewalk of an address that is
not populated in the hashtable will incur 2 cacheline lookups (primary
and secondary buckets). This plus the MMU state machine overhead adds up.

Cue Linus rant about PowerPC MMU :)

Anton
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-08 Thread Anton Blanchard

Hi,

 Well, PowerPC dcbt does prefetch() correctly, it doesn't ever raise  
 exceptions, doesn't have any side effects, takes only enough CPU to  
 decode the address, and is ignored if it would have to do anything  
 other than load the cacheline from RAM.  Prefetch streams are halted  
 when they reach the end of a page boundary (no trapping to the MMU)  
 and if the TLB entry isn't present then they would asynchronously  
 abort.  

It depends on the implementation and the HID bit settings. Some do walk
the MMU hashtable if it isnt in the TLB.

Anton
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-08 Thread Michael K. Edwards

On 3/7/07, Linus Torvalds [EMAIL PROTECTED] wrote:

No, I just checked, and Intel's own optimization manual makes it clear
that you should be careful. They talk about performance penalties due to
resource constraints - which makes tons of sense with a core that is good
at handling its own resources and could quite possibly use those resources
better to actually execute the loads and stores deeper down the
instruction pipeline.


Certainly you should be careful -- and that usually means leaving it
up to the compiler.  But hinting to the compiler can help; there may
be an analogue of the (un)likely macros waiting to be implemented for
loop prefetch.  And while out-of-order execution and fancy hardware
prefetch streams greatly reduce the need for explicit prefetch in
general code, there's no substitute for the cache _bypassing_
instructions when trying to avoid excessive cache eviction in DDoS
situations.

For instance, if I wind up working on a splay-tree variant of Robert
Olsson's trie/hash work, I'll try to measure the effect of using SSE2
non-temporal stores to write half-open connections to the leaves of
the tree.  That may give some additional improvement in the ability to
keep servicing real load during a SYN flood.


So it's not just 3DNow! making AMD look bad, or Intel would obviously
suggest people use it out of the wazoo ;)


Intel puts a lot of effort into educating compiler writers about the
optimal prefetch insertion strategies for particular cache
architectures.  At the same time, they put out the orange cones to
warn people off of hand-tuning prefetch placement using
micro-benchmarks.  People did that when 3DNow! first came out, with
predictable consequences.


 XScale gets it right.

Blah. XScale isn't even an OoO CPU, *of*course* it needs prefetching.
Calling that getting it right is ludicrous. If anything, it gets things
so wrong that prefetching is *required* for good performance.


That's not an accident.  Hardware prefetch units cost a lot in power
consumption.  Omitting the hardware prefetch unit and drastically
simplifying the pipeline is how they got a design whose clock they
could crank into the stratosphere and still run on battery power.  And
in the network processor space, they can bolt a myriad of on-chip
microengines and still have some prayer of accurately simulating the
patterns of internal bus cycles.  Errors in simulation can still be
fixed up with prefetch instruction placement to put memory accesses
from the XScale core into phases where the data path processors aren't
working so hard.

Moreover, because they're embedded targets and rarely have to run
third-party binaries originally compiled for older cores, it didn't
really cost them anything to say, Sorry, this chip's performance is
going to suck if your compiler's prefetch insertion isn't properly
tuned.  The _only_ cost is a slightly less dense instruction stream.
That's not trivial but it's manageable; you budget for it, and the
increase in I-cache power consumption is more than made up for by the
reduction in erroneous data prefetches (hardware prefetch gets it
wrong a substantial fraction of the time).


I'm talking about real CPU's with real memory pipelines that already do
prefetching in hardware. The better the core is, the less the prefetch
helps (and often the more it hurts in comparison to how much it helps).


The more sophisticated the core is, the less software prefetch
instructions help.  But more sophisticated isn't always better; it
depends on your target applications.


But if you mean doesn't try to fill the TLB on data prefetches, then
yes, that's generally the right thing to do.


AOL.


 (Oddly, Prescott seems to have initiated a page table walk on DTLB miss
 during software prefetch -- just one of many weird Prescott flaws.)

Netburst in general is *very* happy to do speculative TLB fills, I think.


Design by micro-benchmark.  :-)  They set out to push the headline MHz
and the real memory bandwidth to the limit in Prescott, and they
succeeded (data at
http://www.digit-life.com/articles2/rmma/rmma-p4.html).  At a horrible
cost in power per clock, and no gain in real application performance.
So NetBurst died a horrible death, and now we have Intel Core -- P6
warmed over, with caches sized such that for most applications the
second core ought to be used solely to soak up control path overheads.

Windows runs better on dual-core machines because the NT kernel will
happily eat an entire core doing memory bookkeeping.  Linux could take
a hint here and use the second core largely for interrupt handling and
force-priming the L2 cache on task switch.  (Prefetch instructions
aren't much use here, precisely because they give up on DTLB miss.)
Any kernel code paths that are known to stall a lot because of
usually-cold-cache access patterns (TCP connection establishment, for
instance) can also be punted over to the second core.  If you're
feeling industrious, use non-temporal memory accesses judiciously in

Re: [patch] epoll use a single inode ...

2007-03-08 Thread Christoph Hellwig
This patch needs a lot more documentation.  It needs some really big
comments on why this should never ever be used for a real filesystem
(real as in user mountable), and probably add an assert for that
invariant somewhere.  Please also update Documentation/filesystems/Locking
and Documentation/filesystems/vfs.txt for it.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-08 Thread Eric Dumazet
On Thursday 08 March 2007 09:56, Christoph Hellwig wrote:
 This patch needs a lot more documentation.  It needs some really big
 comments on why this should never ever be used for a real filesystem
 (real as in user mountable), and probably add an assert for that
 invariant somewhere.  Please also update Documentation/filesystems/Locking
 and Documentation/filesystems/vfs.txt for it.

Thank you for your feedback Christoph

Here is the version I was about to push to -mm, so if nobody complains, I ask 
Andrew to push it to mm so that it can reach 2.6.22 target

[PATCH] Delay the dentry name generation on sockets and pipes.

1) Introduces a new method in 'struct dentry_operations'. This method called 
d_dname() might be called from d_path() to build a pathname 
for special filesystems. It is called without locks.

Future patches (if we succeed in having one common dentry for all pipes) may 
need to change prototype of this method, but we now use :
char *d_dname(struct dentry *dentry, char *buffer, int buflen)


2) Use this new method for sockets : No more sprintf() at socket creation. 
This is delayed up to the moment someone does an access to /proc/pid/fd/...

3) Use this new method for pipes : No more sprintf() at pipe creation. This is 
delayed up to the moment someone does an access to /proc/pid/fd/...

A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a 
*nice* speedup on my Pentium(M) 1.6 Ghz :

3.090 s instead of 3.450 s

Signed-off-by: Eric Dumazet [EMAIL PROTECTED]
CC: Christoph Hellwig [EMAIL PROTECTED]
CC: Linus Torvalds [EMAIL PROTECTED]
CC: Davide Libenzi davidel@xmailserver.org
CC: Al Viro [EMAIL PROTECTED]

 Documentation/filesystems/Locking |2 ++
 Documentation/filesystems/vfs.txt |   12 +++-
 fs/dcache.c   |3 +++
 fs/pipe.c |   12 +---
 include/linux/dcache.h|1 +
 net/socket.c  |   13 ++---
 6 files changed, 36 insertions(+), 7 deletions(-)
--- linux-2.6.21-rc3/include/linux/dcache.h 2007-03-07 17:23:55.0 
+0100
+++ linux-2.6.21-rc3-ed/include/linux/dcache.h  2007-03-08 10:14:38.0 
+0100
@@ -133,6 +133,7 @@ struct dentry_operations {
int (*d_delete)(struct dentry *);
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
+   char * (*d_dname)(struct dentry *, char *, int);
 };
 
 /* the dentry parameter passed to d_hash and d_compare is the parent
--- linux-2.6.21-rc3/Documentation/filesystems/vfs.txt  2007-03-08 
10:14:38.0 +0100
+++ linux-2.6.21-rc3-ed/Documentation/filesystems/vfs.txt   2007-03-08 
10:34:34.0 +0100
@@ -827,7 +827,7 @@ This describes how a filesystem can over
 operations. Dentries and the dcache are the domain of the VFS and the
 individual filesystem implementations. Device drivers have no business
 here. These methods may be set to NULL, as they are either optional or
-the VFS uses a default. As of kernel 2.6.13, the following members are
+the VFS uses a default. As of kernel 2.6.22, the following members are
 defined:
 
 struct dentry_operations {
@@ -837,6 +837,7 @@ struct dentry_operations {
int (*d_delete)(struct dentry *);
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
+   char * (*d_dname)(struct dentry *, char *, int);
 };
 
   d_revalidate: called when the VFS needs to revalidate a dentry. This
@@ -859,6 +860,15 @@ struct dentry_operations {
VFS calls iput(). If you define this method, you must call
iput() yourself
 
+  d_dname: called when the pathname of a dentry should be generated.
+   Usefull for some pseudo filesystems (sockfs, pipefs, ...) to delay
+   pathname generation. (Instead of doing it when dentry is created,
+   its done only when the path is needed.). Real filesystems probably
+   dont want to use it, because their dentries are present in global
+   dcache hash, so their hash should be an invariant. As no lock is
+   held, d_dname() should not try to modify the dentry itself, unless
+   appropriate SMP safety is used.
+
 Each dentry has a pointer to its parent dentry, as well as a hash list
 of child dentries. Child dentries are basically like files in a
 directory.
--- linux-2.6.21-rc3/Documentation/filesystems/Locking  2007-03-08 
10:29:04.0 +0100
+++ linux-2.6.21-rc3-ed/Documentation/filesystems/Locking   2007-03-08 
10:29:04.0 +0100
@@ -15,6 +15,7 @@ prototypes:
int (*d_delete)(struct dentry *);
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
+   char * (*d_dname)((struct dentry *dentry, char *buffer, int buflen);
 
 locking rules:
none have BKL
@@ -25,6 +26,7 @@ d_compare:no  yes no  
no 
 d_delete:  yes no  yes no
 d_release: no  

Re: [patch] epoll use a single inode ...

2007-03-08 Thread Christoph Hellwig
On Thu, Mar 08, 2007 at 10:42:21AM +0100, Eric Dumazet wrote:
 @@ -1823,6 +1823,9 @@ char * d_path(struct dentry *dentry, str
   struct vfsmount *rootmnt;
   struct dentry *root;
  
 + if (dentry-d_op  dentry-d_op-d_dname)
 + return (dentry-d_op-d_dname)(dentry, buf, buflen);

Please don't put braces around the function pointer call.  Also I
think we really want a comment here why we need this call.

 +static char * pipefs_dname(struct dentry *dentry, char *buffer, int buflen)

normally this would be:


static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)

 +static char * sockfs_dname(struct dentry *dentry, char *buffer, int buflen)

same here.

 - char name[32];
  
 - this.len = sprintf(name, [%lu], SOCK_INODE(sock)-i_ino);
 - this.name = name;
 + this.len = 0;
 + this.name = ;
   this.hash = 0;

Can you add a helper to init this one?

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-08 Thread Eric Dumazet
Thanks again for your feedback Christoph :)

I think I addressed all your remarks.

Thank you

[PATCH] Delay the dentry name generation on sockets and pipes.

1) Introduces a new method in 'struct dentry_operations'. This method called 
d_dname() might be called from d_path() to build a pathname 
for special filesystems. It is called without locks.

Future patches (if we succeed in having one common dentry for all pipes) may 
need to change prototype of this method, but we now use :
char *d_dname(struct dentry *dentry, char *buffer, int buflen)


2) Use this new method for sockets : No more sprintf() at socket creation. 
This is delayed up to the moment someone does an access to /proc/pid/fd/...

3) Use this new method for pipes : No more sprintf() at pipe creation. This is 
delayed up to the moment someone does an access to /proc/pid/fd/...

A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a 
*nice* speedup on my Pentium(M) 1.6 Ghz :

3.090 s instead of 3.450 s

Signed-off-by: Eric Dumazet [EMAIL PROTECTED]
CC: Christoph Hellwig [EMAIL PROTECTED]
CC: Linus Torvalds [EMAIL PROTECTED]
CC: Davide Libenzi davidel@xmailserver.org
CC: Al Viro [EMAIL PROTECTED]

 Documentation/filesystems/Locking |2 ++
 Documentation/filesystems/vfs.txt |   12 +++-
 fs/dcache.c   |7 +++
 fs/pipe.c |   15 +--
 include/linux/dcache.h|1 +
 net/socket.c  |   17 ++---
 6 files changed, 40 insertions(+), 14 deletions(-)

--- linux-2.6.21-rc3/include/linux/dcache.h 2007-03-07 17:23:55.0 
+0100
+++ linux-2.6.21-rc3-ed/include/linux/dcache.h  2007-03-08 11:57:41.0 
+0100
@@ -133,6 +133,7 @@ struct dentry_operations {
int (*d_delete)(struct dentry *);
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
+   char *(*d_dname)(struct dentry *, char *, int);
 };
 
 /* the dentry parameter passed to d_hash and d_compare is the parent
--- linux-2.6.21-rc3/Documentation/filesystems/vfs.txt  2007-03-08 
10:14:38.0 +0100
+++ linux-2.6.21-rc3-ed/Documentation/filesystems/vfs.txt   2007-03-08 
10:34:34.0 +0100
@@ -827,7 +827,7 @@ This describes how a filesystem can over
 operations. Dentries and the dcache are the domain of the VFS and the
 individual filesystem implementations. Device drivers have no business
 here. These methods may be set to NULL, as they are either optional or
-the VFS uses a default. As of kernel 2.6.13, the following members are
+the VFS uses a default. As of kernel 2.6.22, the following members are
 defined:
 
 struct dentry_operations {
@@ -837,6 +837,7 @@ struct dentry_operations {
int (*d_delete)(struct dentry *);
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
+   char *(*d_dname)(struct dentry *, char *, int);
 };
 
   d_revalidate: called when the VFS needs to revalidate a dentry. This
@@ -859,6 +860,15 @@ struct dentry_operations {
VFS calls iput(). If you define this method, you must call
iput() yourself
 
+  d_dname: called when the pathname of a dentry should be generated.
+   Usefull for some pseudo filesystems (sockfs, pipefs, ...) to delay
+   pathname generation. (Instead of doing it when dentry is created,
+   its done only when the path is needed.). Real filesystems probably
+   dont want to use it, because their dentries are present in global
+   dcache hash, so their hash should be an invariant. As no lock is
+   held, d_dname() should not try to modify the dentry itself, unless
+   appropriate SMP safety is used.
+
 Each dentry has a pointer to its parent dentry, as well as a hash list
 of child dentries. Child dentries are basically like files in a
 directory.
--- linux-2.6.21-rc3/Documentation/filesystems/Locking  2007-03-08 
10:29:04.0 +0100
+++ linux-2.6.21-rc3-ed/Documentation/filesystems/Locking   2007-03-08 
10:29:04.0 +0100
@@ -15,6 +15,7 @@ prototypes:
int (*d_delete)(struct dentry *);
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
+   char *(*d_dname)((struct dentry *dentry, char *buffer, int buflen);
 
 locking rules:
none have BKL
@@ -25,6 +26,7 @@ d_compare:no  yes no  
no 
 d_delete:  yes no  yes no
 d_release: no  no  no  yes
 d_iput:no  no  no  yes
+d_dname:   no  no  no  no
 
 --- inode_operations --- 
 prototypes:
--- linux-2.6.21-rc3/fs/dcache.c2007-03-07 17:23:55.0 +0100
+++ linux-2.6.21-rc3-ed/fs/dcache.c 2007-03-08 11:57:41.0 +0100
@@ -1823,6 +1823,13 @@ char 

Re: [patch] epoll use a single inode ...

2007-03-08 Thread Christoph Hellwig

I'm sorry about complaining again, but..

  
 + /*
 +  * Some filesystems want to provide dentry's pathname themselves,
 +  * instead of pre-building names at dentry creation.
 +  */

It's not _some_ filesystems.  If real filesystem did this we'd be in
horrible trouble.  It's synthetic, non-mountable filesystem.

Except for this the patch looks fine to me, thanks Eric!
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-07 Thread Eric Dumazet

Kyle Moffett a écrit :


Prefetching is also fairly critical on a Power4 or G5 PowerPC system as 
they have a long memory latency; an L2-cache miss can cost 200+ cycles.  
On such systems the "dcbt" prefetch instruction brings in a single 
128-byte cacheline and has no serializing effects whatsoever, making it 
ideal for use in a linked-list-traversal inner loop.


OK, 200 cycles...

But what is the cost of the conditional branch you added in prefetch(x) ?

if (!x) return;

(correctly predicted or not, but do powerPC have a BTB ?)

About the NULL 'potential problem', maybe we could use a dummy nil (but 
mapped) object, and use its address in lists, ie compare for  instead of 
NULL. This would avoid :


- The conditional test in some prefetch() implementations
- The potential TLB problem with the NULL value.




-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-07 Thread Linus Torvalds


On Wed, 7 Mar 2007, Michael K. Edwards wrote:
> 
> People's prejudices against prefetch instructions are sometimes
> traceable to the 3DNow! prefetch(w) botch, which some processors
> "support" as no-ops and others are too aggressive about (Opteron
> prefetches are reputed to be "strong", i. e., not dropped on DTLB
> miss).

No, I just checked, and Intel's own optimization manual makes it clear 
that you should be careful. They talk about performance penalties due to 
resource constraints - which makes tons of sense with a core that is good 
at handling its own resources and could quite possibly use those resources 
better to actually execute the loads and stores deeper down the 
instruction pipeline.

So it's not just 3DNow! making AMD look bad, or Intel would obviously 
suggest people use it out of the wazoo ;)

> XScale gets it right.

Blah. XScale isn't even an OoO CPU, *of*course* it needs prefetching. 
Calling that "getting it right" is ludicrous. If anything, it gets things 
so wrong that prefetching is *required* for good performance.

I'm talking about real CPU's with real memory pipelines that already do 
prefetching in hardware. The better the core is, the less the prefetch 
helps (and often the more it hurts in comparison to how much it helps).

But if you mean "doesn't try to fill the TLB on data prefetches", then 
yes, that's generally the right thing to do.

> (Oddly, Prescott seems to have initiated a page table walk on DTLB miss 
> during software prefetch -- just one of many weird Prescott flaws.)  

Netburst in general is *very* happy to do speculative TLB fills, I think.

> I'm guessing Pentium M and its descendants (Core Solo and Duo) get it 
> right but I'm having a hell of a time finding out for sure.  Can any of 
> the x86 experts answer this?

I just suspect that the upside for Core 2 Due is likely fairly low. The L2 
cache is good, the memory re-ordering is working.. I doubt "prefetch" 
helps in generic code that much for things like linked list following, you 
should probably limit it to code that has *known* access patterns and you 
know it's not going to be in the cache.

(In other words, I bet prefetching can help a lot with MMX/media kind of 
code, I doubt it's a huge win for "for_each_entry()")

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-07 Thread Kyle Moffett

On Mar 07, 2007, at 20:25:14, Michael K. Edwards wrote:

On 3/7/07, Linus Torvalds <[EMAIL PROTECTED]> wrote

In general, using software prefetching is just a stupid idea, unless

 - the prefetch really is very strict (ie for a linked list you do  
exactly the above kinds of things to make sure that you don't try  
to prefetch the non-existent end entry)

AND
 - the CPU is stupid (in-order in particular).

I think Intel even suggests in their optimization manuals to *not*  
do software prefetching, because hw can usually simply do better  
without it.


Not the XScale -- it performs quite poorly without prefetch, as  
people who have run ARMv5-optimized binaries on it can testify.


The Intel XScale(r) core prefetch load instruction is a true  
prefetch instruction because the load destination is the data or  
mini-data cache and not a register. Compilers for processors which  
have data caches, but do not support prefetch, sometimes use a load  
instruction to preload the data cache. This technique has the  
disadvantages of using a register to load data and requiring  
additional registers for
subsequent preloads and thus increasing register pressure. By  
contrast, the prefetch can be used to reduce register pressure  
instead of increasing it.


The prefetch load is a hint instruction and does not guarantee that  
the data will be loaded. Whenever the load would cause a fault or a  
table walk, then the processor will ignore the prefetch  
instruction, the fault or table walk, and continue processing the  
next instruction. This is particularly advantageous in the case  
where a linked list or recursive data structure is terminated by a  
NULL pointer. Prefetching the NULL pointer will not fault program  
flow.


Prefetching is also fairly critical on a Power4 or G5 PowerPC system  
as they have a long memory latency; an L2-cache miss can cost 200+  
cycles.  On such systems the "dcbt" prefetch instruction brings in a  
single 128-byte cacheline and has no serializing effects whatsoever,  
making it ideal for use in a linked-list-traversal inner loop.


Cheers,
Kyle Moffett

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-07 Thread Michael K. Edwards

On 3/7/07, Linus Torvalds <[EMAIL PROTECTED]> wrote

Yeah, I'm not at all surprised. Any implementation of "prefetch" that
doesn't just turn into a no-op if the TLB entry doesn't exist (which makes
them weaker for *actual* prefetching) will generally have a hard time with
a NULL pointer. Exactly because it will try to do a totally unnecessary
TLB fill - and since most CPU's will not cache negative TLB entries, that
unnecessary TLB fill will be done over and over and over again..


Data prefetch instructions should indeed avoid page table walks.
(Instruction prefetch mechanisms often do induce table walks on ITLB
miss.)  Not just because of the null pointer case, but because it's
quite normal to run off the end of an array in a loop with an embedded
prefetch instruction.  If you have an extra instruction issue unit
that shares the same DTLB, and you know you will really want that
data, you can sometimes use it to force DTLB preloads by doing an
actual data fetch from the foreseeable page.  This is potentially one
of the best uses of chip multi-threading on an architecture like Sun's
Niagara.

(I don't think Intel's hyper-threading works for this purpose; the
DTLB is shared but the entries are marked as owned by one thread or
the other.  HT can be used for L2 cache prefetching, although the
results so far seem to be mixed:
http://www.cgo.org/cgo2004/papers/02_80_Kim_D_REVISED.pdf)


In general, using software prefetching is just a stupid idea, unless

 - the prefetch really is very strict (ie for a linked list you do exactly
   the above kinds of things to make sure that you don't try to prefetch
   the non-existent end entry)
AND
 - the CPU is stupid (in-order in particular).

I think Intel even suggests in their optimization manuals to *not* do
software prefetching, because hw can usually simply do better without it.


Not the XScale -- it performs quite poorly without prefetch, as people
who have run ARMv5-optimized binaries on it can testify.  From the
XScale Core Developer's Manual:


The Intel XScale(r) core has a true prefetch load instruction (PLD).
The purpose of this instruction is to preload data into the data and
mini-data caches. Data prefetching allows hiding of memory transfer
latency while the processor continues to execute instructions. The
prefetch is important to compiler and assembly code because judicious
use of the prefetch instruction can enormously improve throughput
performance of the core. Data prefetch can be applied not only to
loops but also to any data references within a block of code. Prefetch
also applies to data writing when the memory type is enabled as write
allocate

The Intel XScale(r) core prefetch load instruction is a true prefetch
instruction because the load destination is the data or mini-data
cache and not a register. Compilers for processors which have data
caches, but do not support prefetch, sometimes use a load instruction
to preload the data cache. This technique has the disadvantages of
using a register to load data and requiring additional registers for
subsequent preloads and thus increasing register pressure. By
contrast, the prefetch can be used to reduce register pressure instead
of increasing it.

The prefetch load is a hint instruction and does not guarantee that
the data will be loaded. Whenever the load would cause a fault or a
table walk, then the processor will ignore the prefetch instruction,
the fault or table walk, and continue processing the next instruction.
This is particularly advantageous in the case where a linked list or
recursive data structure is terminated by a NULL pointer. Prefetching
the NULL pointer will not fault program flow.


People's prejudices against prefetch instructions are sometimes
traceable to the 3DNow! prefetch(w) botch, which some processors
"support" as no-ops and others are too aggressive about (Opteron
prefetches are reputed to be "strong", i. e., not dropped on DTLB
miss).  XScale gets it right.  So do most Pentium 4's using the SSE
prefetches, according to the IA-32 optimization manual.  (Oddly,
Prescott seems to have initiated a page table walk on DTLB miss during
software prefetch -- just one of many weird Prescott flaws.)  I'm
guessing Pentium M and its descendants (Core Solo and Duo) get it
right but I'm having a hell of a time finding out for sure.  Can any
of the x86 experts answer this?

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-07 Thread Linus Torvalds


On Wed, 7 Mar 2007, Anton Blanchard wrote:
> 
> Funny you mention this. We found some noticeable ppc64 regressions when
> moving the dcache to standard list macros and had to do this to fix it
> up:
> 
> static inline void prefetch(const void *x)
> {
> if (unlikely(!x))
> return;
> 
> __asm__ __volatile__ ("dcbt 0,%0" : : "r" (x));
> }
> 
> Urgh :)

Yeah, I'm not at all surprised. Any implementation of "prefetch" that 
doesn't just turn into a no-op if the TLB entry doesn't exist (which makes 
them weaker for *actual* prefetching) will generally have a hard time with 
a NULL pointer. Exactly because it will try to do a totally unnecessary 
TLB fill - and since most CPU's will not cache negative TLB entries, that 
unnecessary TLB fill will be done over and over and over again..

In general, using software prefetching is just a stupid idea, unless

 - the prefetch really is very strict (ie for a linked list you do exactly 
   the above kinds of things to make sure that you don't try to prefetch 
   the non-existent end entry)
AND
 - the CPU is stupid (in-order in particular).

I think Intel even suggests in their optimization manuals to *not* do 
software prefetching, because hw can usually simply do better without it.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-07 Thread Anton Blanchard

Hi,

> Well, I hope a prefetch(NULL) is OK because we are doing millions of
> them (see include/linux/list.h :) )

Funny you mention this. We found some noticeable ppc64 regressions when
moving the dcache to standard list macros and had to do this to fix it
up:

static inline void prefetch(const void *x)
{
if (unlikely(!x))
return;

__asm__ __volatile__ ("dcbt 0,%0" : : "r" (x));
}

Urgh :)

Anton
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-07 Thread Linus Torvalds


On Wed, 7 Mar 2007, Eric Dumazet wrote:
> 
> I was thinking about being able to cache the name into the dentry, do you 
> think it's worth the pain ? (its not SMP safe for example...)

Actually, it *can* be SMP-safe, if you do it right. Something like

len = dentry->d_name.len;

if (!len) {
len = snprintf(dentry->d_iname,
DNAME_INLINE_LEN_MIN,
"pipe:[%lu]",
dentry->d_inode->i_ino);
smp_wmb();
dentry->d_name.len = len;
}
if (len >= buflen)
len = buflen-1;
memcpy(buffer, dentry->d_iname, len);
buffer[len] = 0;
return buffer;

should work, although it depends on the fact that our snprintf() 
implementation should be "stable" (ie if snprintf() modifies the buffer 
temporarily as it goes along, that would break, but I think our 
vsnprintf is good in that respect).

So you could have two different CPU's doing the snprintf() on the same 
buffer at the same time (and assigning the length at the same time), but 
since they'll write the same thing, you don't really care.

It's a bit subtle, though. And probably not really worth it.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-07 Thread Eric Dumazet

> Not only might memcpy() do a "prefetch for read" on the source for some
> architectures (which in turn may end up being slow for an address that
> isn't in the TLB, like NULL), but you depend on a very much internal

Well, I hope a prefetch(NULL) is OK because we are doing millions of them (see 
include/linux/list.h :) )

> detail, since it *could* have been using something like
>
>   memcpy(dname, name->name, name->len+1);
>

Yes very true, I will change that and push to Andrew for mm testing

I was thinking about being able to cache the name into the dentry, do you 
think it's worth the pain ? (its not SMP safe for example...)

static char * pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
{
if (!dentry->d_iname[0])
snprintf(dentry->d_iname,
DNAME_INLINE_LEN_MIN,
"pipe:[%lu]",
dentry->d_inode->i_ino);
strlcpy(buffer, dentry->d_iname, buflen);
return buffer;
}
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-07 Thread Linus Torvalds


On Wed, 7 Mar 2007, Eric Dumazet wrote:
> 
> OK no problem here is the patch without messing f_path.mnt 

All right. I like it. Definitely worth putting into -mm, or just 
re-sending to me after 2.6.21 is out (I'll forget all about it otherwise).

I have one more worry, namely this::

-   char name[32];
-   this.len = sprintf(name, "[%lu]", SOCK_INODE(sock)->i_ino);
-   this.name = name;
+   this.len = 0;
+   this.name = NULL;

I think that's fine, and it *happens* to work, but it happens to work just 
because then d_alloc() will do:

memcpy(dname, name->name, name->len);
dname[name->len] = 0;

and passing in NULL to memcpy() is generally ok when len is 0.

HOWEVER. 

Not only might memcpy() do a "prefetch for read" on the source for some 
architectures (which in turn may end up being slow for an address that 
isn't in the TLB, like NULL), but you depend on a very much internal 
detail, since it *could* have been using something like

memcpy(dname, name->name, name->len+1);

instead, and expected to get the final '\0' character from the source 
string.

So I would actually much prefer it to be written as

this.len = 0;
this.name = "";

just because it's safer.

But other than that small detail, I think this is not only an 
optimization, it's an actual cleanup, and we migth some day want to use 
something like this for some other things too (eg maybe this kind of 
approach is usable for /proc/ entries too, to avoid instantiating 
them).

As to avoiding the mntget(), I'm not violently opposed to it, but I do 
think that it's a totally unrelated matter, so even if it's decided it's 
worth it, it should be done as a separate patch regardless.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-07 Thread Eric Dumazet
On Wednesday 07 March 2007 18:45, Linus Torvalds wrote:
> On Wed, 7 Mar 2007, Eric Dumazet wrote:
> > sockets already uses file->private_data.
> >
> > But calls to read()/write() (not send()/recv()) still need to go through
> > the dentry, before entering socket land.
>
> Sure. The dentry and the inode need to *exist*, but they can be one single
> static dentry/inode per "file descriptor type".
>
> We always pass in the "struct file *" to read/write too, since we need it
> anyway for things like file control information (eg "is it a nonblocking
> read or write" kinds of things).
>
> So I'm not suggesting a NULL dentry/inode, I'm suggesting a single static
> one per type.
>
> And yeah, it may be harder than it looks. Some things "know" that all the
> relevant info is in the inode, so they just pass in the inode. In the pipe
> layer, for example, you'd need to change free_pipe_info() and
> alloc_pipe_info() to pass in the file descriptor instead, same goes for
> pipe_release(). But the "struct file *" is always available, it's just
> that since the code was originally written to have all the info in the
> inode, some of the code isn't set up to use it or pass it on..
>
> But your patch is independent of that, and looks fine. Except I don't like
> this part:
>
> -   file->f_path.mnt = mntget(sock_mnt);
> +   file->f_path.mnt = NULL;
>
> since I'd be much happer with always having f_path.mnt available, the same
> way we should always have f_path.dentry there.

Yes, but mntget()/mntput() are protected against NULL.
I was quite happy to remove two locked operations :)
I didnt found a way to crash (yet) my patched machine :)

>
> (Btw, your patch is *not* going to work with the file->f_private_data
> approach, because d_path() is not passed down the "file *" thing. So we'd
> need to do that, and that's more intrusive (it can be NULL, since for
> things like cwd/pwd we don't have a "struct file").

I tried this path today and failed...
Too many changes to do (nameidata) to propagate a 'struct file *' 
appropriately...

>
> But I like your patch as a totally independent thing. "It just makes
> sense".
>
> (Apart from the f_path.mnt thing, which I think was something else ;)

OK no problem here is the patch without messing f_path.mnt 

(benchmark results not really different on my little machine, SMP kernel but 
one CPU only... maybe because lock suffix is changed by a nop)


[PATCH] Delay the dentry name generation on sockets and pipes.

1) Introduces a new method in 'struct dentry_operations'. This method called 
d_dname() might be called from d_path() to be able to provide a dentry name 
for special filesystems. It is called without locks.

Future patches (if we succeed in having one common dentry for all pipes) may 
need to change prototype of this method, but we now use :
char *d_dname(struct dentry *dentry, char *buffer, int buflen)


2) Use this new method for sockets : No more sprintf() at socket creation. 
This is delayed up to the moment someone does an access to /proc/pid/fd/...

3) Use this new method for pipes : No more sprintf() at pipe creation. This is 
delayed up to the moment someone does an access to /proc/pid/fd/...

A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a 
*nice* speedup on my Pentium(M) 1.6 Ghz :

3.090 s instead of 3.450 s

Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>
 fs/dcache.c|3 +++
 fs/pipe.c  |   12 +---
 include/linux/dcache.h |1 +
 net/socket.c   |   13 ++---
 4 files changed, 23 insertions(+), 6 deletions(-)
--- linux-2.6.21-rc3/include/linux/dcache.h 2007-03-07 17:23:55.0 
+0100
+++ linux-2.6.21-rc3-ed/include/linux/dcache.h  2007-03-07 17:27:39.0 
+0100
@@ -133,6 +133,7 @@ struct dentry_operations {
int (*d_delete)(struct dentry *);
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
+   char * (*d_dname)(struct dentry *, char *, int);
 };
 
 /* the dentry parameter passed to d_hash and d_compare is the parent
--- linux-2.6.21-rc3/fs/dcache.c2007-03-07 17:23:55.0 +0100
+++ linux-2.6.21-rc3-ed/fs/dcache.c 2007-03-07 17:28:46.0 +0100
@@ -1823,6 +1823,9 @@ char * d_path(struct dentry *dentry, str
struct vfsmount *rootmnt;
struct dentry *root;
 
+   if (dentry->d_op && dentry->d_op->d_dname)
+   return (dentry->d_op->d_dname)(dentry, buf, buflen);
+
read_lock(>fs->lock);
rootmnt = mntget(current->fs->rootmnt);
root = dget(current->fs->root);
--- linux-2.6.21-rc3/fs/pipe.c  2007-03-07 17:42:36.0 +0100
+++ linux-2.6.21-rc3-ed/fs/pipe.c   2007-03-07 18:54:33.0 +0100
@@ -841,8 +841,15 @@ static int pipefs_delete_dentry(struct d
return 0;
 }
 
+static char * pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+   snprintf(buffer, buflen, "pipe:[%lu]", dentry->d_inode->i_ino);
+   

Re: [patch] epoll use a single inode ...

2007-03-07 Thread Linus Torvalds


On Wed, 7 Mar 2007, Eric Dumazet wrote:
> 
> sockets already uses file->private_data.
> 
> But calls to read()/write() (not send()/recv()) still need to go through the 
> dentry, before entering socket land.

Sure. The dentry and the inode need to *exist*, but they can be one single 
static dentry/inode per "file descriptor type".

We always pass in the "struct file *" to read/write too, since we need it 
anyway for things like file control information (eg "is it a nonblocking 
read or write" kinds of things).

So I'm not suggesting a NULL dentry/inode, I'm suggesting a single static 
one per type.

And yeah, it may be harder than it looks. Some things "know" that all the 
relevant info is in the inode, so they just pass in the inode. In the pipe 
layer, for example, you'd need to change free_pipe_info() and 
alloc_pipe_info() to pass in the file descriptor instead, same goes for 
pipe_release(). But the "struct file *" is always available, it's just 
that since the code was originally written to have all the info in the 
inode, some of the code isn't set up to use it or pass it on..

But your patch is independent of that, and looks fine. Except I don't like 
this part:

-   file->f_path.mnt = mntget(sock_mnt);
+   file->f_path.mnt = NULL;

since I'd be much happer with always having f_path.mnt available, the same 
way we should always have f_path.dentry there.

(Btw, your patch is *not* going to work with the file->f_private_data 
approach, because d_path() is not passed down the "file *" thing. So we'd 
need to do that, and that's more intrusive (it can be NULL, since for 
things like cwd/pwd we don't have a "struct file").

But I like your patch as a totally independent thing. "It just makes 
sense". 

(Apart from the f_path.mnt thing, which I think was something else ;)

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-07 Thread Eric Dumazet
(resending with a more convenient attachment)

Please find enclosed the following patch, to prepare this path.

[PATCH] Delay the dentry name generation on sockets and pipes.

1) Introduces a new method in 'struct dentry_operations'. This method called 
d_dname() might be called from d_path() to be able to provide a dentry name 
for special filesystems. It is called without locks.

Future patches (if we succeed in having one common dentry for all pipes) may 
need to change prototype of this method, but we now use :
char *d_dname(struct dentry *dentry, char *buffer, int buflen)


2) Use this new method for sockets : No more sprintf() at socket creation. 
This is delayed up to the moment someone does an access to /proc/pid/fd/...
We also avoid mntput()/mntget() on sock_mnt

3) Use this new method for pipes : No more sprintf() at pipe creation. This is 
delayed up to the moment someone does an access to /proc/pid/fd/...
We also avoid mntput()/mntget() on pipe_mnt

A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a 
*nice* speedup on my Pentium(M) 1.6 Ghz :

3.090 s instead of 3.450 s

Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>

 fs/dcache.c            |    3 +++
 fs/pipe.c              |   16 +++-
 include/linux/dcache.h |    1 +
 net/socket.c           |   15 +++
 4 files changed, 26 insertions(+), 9 deletions(-)
--- linux-2.6.21-rc3/include/linux/dcache.h 2007-03-07 17:23:55.0 
+0100
+++ linux-2.6.21-rc3-ed/include/linux/dcache.h  2007-03-07 17:27:39.0 
+0100
@@ -133,6 +133,7 @@ struct dentry_operations {
int (*d_delete)(struct dentry *);
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
+   char * (*d_dname)(struct dentry *, char *, int);
 };
 
 /* the dentry parameter passed to d_hash and d_compare is the parent
--- linux-2.6.21-rc3/fs/dcache.c2007-03-07 17:23:55.0 +0100
+++ linux-2.6.21-rc3-ed/fs/dcache.c 2007-03-07 17:28:46.0 +0100
@@ -1823,6 +1823,9 @@ char * d_path(struct dentry *dentry, str
struct vfsmount *rootmnt;
struct dentry *root;
 
+   if (dentry->d_op && dentry->d_op->d_dname)
+   return (dentry->d_op->d_dname)(dentry, buf, buflen);
+
read_lock(>fs->lock);
rootmnt = mntget(current->fs->rootmnt);
root = dget(current->fs->root);
--- linux-2.6.21-rc3/fs/pipe.c  2007-03-07 17:42:36.0 +0100
+++ linux-2.6.21-rc3-ed/fs/pipe.c   2007-03-07 18:01:40.0 +0100
@@ -841,8 +841,15 @@ static int pipefs_delete_dentry(struct d
return 0;
 }
 
+static char * pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+   snprintf(buffer, buflen, "pipe:[%lu]", dentry->d_inode->i_ino);
+   return buffer;
+}
+
 static struct dentry_operations pipefs_dentry_operations = {
.d_delete   = pipefs_delete_dentry,
+   .d_dname= pipefs_dname,
 };
 
 static struct inode * get_pipe_inode(void)
@@ -888,7 +895,6 @@ struct file *create_write_pipe(void)
struct inode *inode;
struct file *f;
struct dentry *dentry;
-   char name[32];
struct qstr this;
 
f = get_empty_filp();
@@ -899,8 +905,8 @@ struct file *create_write_pipe(void)
if (!inode)
goto err_file;
 
-   this.len = sprintf(name, "[%lu]", inode->i_ino);
-   this.name = name;
+   this.len = 0;
+   this.name = NULL;
this.hash = 0;
err = -ENOMEM;
dentry = d_alloc(pipe_mnt->mnt_sb->s_root, );
@@ -915,7 +921,7 @@ struct file *create_write_pipe(void)
 */
dentry->d_flags &= ~DCACHE_UNHASHED;
d_instantiate(dentry, inode);
-   f->f_path.mnt = mntget(pipe_mnt);
+   f->f_path.mnt = NULL;
f->f_path.dentry = dentry;
f->f_mapping = inode->i_mapping;
 
@@ -949,7 +955,7 @@ struct file *create_read_pipe(struct fil
return ERR_PTR(-ENFILE);
 
/* Grab pipe from the writer */
-   f->f_path.mnt = mntget(wrf->f_path.mnt);
+   f->f_path.mnt = NULL;
f->f_path.dentry = dget(wrf->f_path.dentry);
f->f_mapping = wrf->f_path.dentry->d_inode->i_mapping;
 
--- linux-2.6.21-rc3/net/socket.c   2007-03-07 17:37:56.0 +0100
+++ linux-2.6.21-rc3-ed/net/socket.c2007-03-07 17:54:53.0 +0100
@@ -314,8 +314,16 @@ static int sockfs_delete_dentry(struct d
dentry->d_flags |= DCACHE_UNHASHED;
return 0;
 }
+
+static char * sockfs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+   snprintf(buffer, buflen, "socket:[%lu]", dentry->d_inode->i_ino);
+   return buffer;
+}
+
 static struct dentry_operations sockfs_dentry_operations = {
.d_delete = sockfs_delete_dentry,
+   .d_dname  = sockfs_dname,
 };
 
 /*
@@ -356,10 +364,9 @@ static int sock_alloc_fd(struct file **f
 static int sock_attach_fd(struct socket *sock, struct file *file)
 {
struct qstr this;
-   char 

Re: [patch] epoll use a single inode ...

2007-03-07 Thread Eric Dumazet
On Wednesday 07 March 2007 18:02, Linus Torvalds wrote:
> On Wed, 7 Mar 2007, Eric Dumazet wrote:
> > I would definitly *love* saving dentries for pipes (and sockets too), but
> > how are you going to get the inode ?
>
> Don't use an inode at all.

Lovely :)

>
> > pipes()/sockets() can use read()/write()/rw_verify_area() and thus need
> > file->f_path.dentry->d_inode (so each pipe needs a separate dentry)
>
> No, at least pipes could easily just use "file->f_private_data" instead.
>
> Now, sockets really do want the inode (or it would be really really big
> changes), but pipes really just want a "struct pipe_inode_info" pointer,
> which we could hide away directly in the file descriptor itself.

sockets already uses file->private_data.

But calls to read()/write() (not send()/recv()) still need to go through the 
dentry, before entering socket land.

>
> That's what Davide already did (on my suggestion) for signalfd - there's a
> *single* inode, and the real data is in the per-fd f_private_data.


Please find enclosed the following patch, to prepare this path.

[PATCH] Delay the dentry name generation on sockets and pipes.

1) Introduces a new method in 'struct dentry_operations'. This method called 
d_dname() might be called from d_path() to be able to provide a dentry name 
for special filesystems. It is called without locks.

Future patches (if we succeed in having one common dentry for all pipes) may 
need to change prototype of this method, but we now use :
char *d_dname(struct dentry *dentry, char *buffer, int buflen)


2) Use this new method for sockets : No more sprintf() at socket creation. 
This is delayed up to the moment someone does an access to /proc/pid/fd/...
We also avoid mntput()/mntget() on sock_mnt

3) Use this new method for pipes : No more sprintf() at pipe creation. This is 
delayed up to the moment someone does an access to /proc/pid/fd/...
We also avoid mntput()/mntget() on pipe_mnt

A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a 
*nice* speedup on my Pentium(M) 1.6 Ghz :

3.090 s instead of 3.450 s

Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>

 fs/dcache.c|3 +++
 fs/pipe.c  |   16 +++-
 include/linux/dcache.h |1 +
 net/socket.c   |   15 +++
 4 files changed, 26 insertions(+), 9 deletions(-)

--- linux-2.6.21-rc3/include/linux/dcache.h	2007-03-07 17:23:55.0 +0100
+++ linux-2.6.21-rc3-ed/include/linux/dcache.h	2007-03-07 17:27:39.0 +0100
@@ -133,6 +133,7 @@ struct dentry_operations {
 	int (*d_delete)(struct dentry *);
 	void (*d_release)(struct dentry *);
 	void (*d_iput)(struct dentry *, struct inode *);
+	char * (*d_dname)(struct dentry *, char *, int);
 };
 
 /* the dentry parameter passed to d_hash and d_compare is the parent
--- linux-2.6.21-rc3/fs/dcache.c	2007-03-07 17:23:55.0 +0100
+++ linux-2.6.21-rc3-ed/fs/dcache.c	2007-03-07 17:28:46.0 +0100
@@ -1823,6 +1823,9 @@ char * d_path(struct dentry *dentry, str
 	struct vfsmount *rootmnt;
 	struct dentry *root;
 
+	if (dentry->d_op && dentry->d_op->d_dname)
+		return (dentry->d_op->d_dname)(dentry, buf, buflen);
+
 	read_lock(>fs->lock);
 	rootmnt = mntget(current->fs->rootmnt);
 	root = dget(current->fs->root);
--- linux-2.6.21-rc3/fs/pipe.c	2007-03-07 17:42:36.0 +0100
+++ linux-2.6.21-rc3-ed/fs/pipe.c	2007-03-07 18:01:40.0 +0100
@@ -841,8 +841,15 @@ static int pipefs_delete_dentry(struct d
 	return 0;
 }
 
+static char * pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+	snprintf(buffer, buflen, "pipe:[%lu]", dentry->d_inode->i_ino);
+	return buffer;
+}
+
 static struct dentry_operations pipefs_dentry_operations = {
 	.d_delete	= pipefs_delete_dentry,
+	.d_dname	= pipefs_dname,
 };
 
 static struct inode * get_pipe_inode(void)
@@ -888,7 +895,6 @@ struct file *create_write_pipe(void)
 	struct inode *inode;
 	struct file *f;
 	struct dentry *dentry;
-	char name[32];
 	struct qstr this;
 
 	f = get_empty_filp();
@@ -899,8 +905,8 @@ struct file *create_write_pipe(void)
 	if (!inode)
 		goto err_file;
 
-	this.len = sprintf(name, "[%lu]", inode->i_ino);
-	this.name = name;
+	this.len = 0;
+	this.name = NULL;
 	this.hash = 0;
 	err = -ENOMEM;
 	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, );
@@ -915,7 +921,7 @@ struct file *create_write_pipe(void)
 	 */
 	dentry->d_flags &= ~DCACHE_UNHASHED;
 	d_instantiate(dentry, inode);
-	f->f_path.mnt = mntget(pipe_mnt);
+	f->f_path.mnt = NULL;
 	f->f_path.dentry = dentry;
 	f->f_mapping = inode->i_mapping;
 
@@ -949,7 +955,7 @@ struct file *create_read_pipe(struct fil
 		return ERR_PTR(-ENFILE);
 
 	/* Grab pipe from the writer */
-	f->f_path.mnt = mntget(wrf->f_path.mnt);
+	f->f_path.mnt = NULL;
 	f->f_path.dentry = dget(wrf->f_path.dentry);
 	f->f_mapping = wrf->f_path.dentry->d_inode->i_mapping;
 
--- linux-2.6.21-rc3/net/socket.c	2007-03-07 17:37:56.0 +0100
+++ linux-2.6.21-rc3-ed/net/socket.c	2007-03-07 17:54:53.0 

Re: [patch] epoll use a single inode ...

2007-03-07 Thread Linus Torvalds


On Wed, 7 Mar 2007, Eric Dumazet wrote:
> 
> Crazy ideas : (some readers are going to kill me)

First off, as noted earlier, you don't need crazy ideas. 

But:

> 1) Use the low order bit of f_path.dentry to say : this pointer is not a
> pointer to a dentry but the inode pointer (with the low order bit set to 1)

No, we don't want to do that. There's a lot of code that wants to just 
follow the dentry/inode chain unconditionally, and we want to have people 
able to just do

i_size_read(file->f_path.dentry->d_inode);

kind of thing, without having to check anything in between. It's costly 
enough that we have a chain of pointers, it's much *worse* if we then have 
to make conditionals etc (and forgetting is an oops).

> 2) file->f_path.dentry set to NULL for this special files (so that we dont
> need to dput() and cache line ping pong the common dentry each time we
> __fput() a pipe/socket.

I agree abot the ping pong, but I suspect it's a lot less than the current 
pingpong on dcache_lock, which is a lot hotter than a pipe-dentry counter 
would be.

If it becomes a big issue, we could just have a fixed number of dentries, 
and hash them out some way (to dilute the ping-ponging). So making things 
NULL would just be horrible for everybody, since we have lots of generic 
code that just looks up the dentry and inode, and we don't want to make 
that worse.

(I'm sure using "filp->f_private_data" has some downsides too, but it 
seems fairly simple at least for pipes. The biggest problem is likely that 
we currently use the mutex in the inode in "filp->f_dentry->d_inode" as a 
way to protect the inode->i_pipe pointer itself, and there is no 
equivalent locking at the "struct file *" level at all. We might have to 
make the "f_ep_lock" spinlock be unconditional)

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-07 Thread Linus Torvalds


On Wed, 7 Mar 2007, Eric Dumazet wrote:
> 
> I would definitly *love* saving dentries for pipes (and sockets too), but how
> are you going to get the inode ?

Don't use an inode at all.

> pipes()/sockets() can use read()/write()/rw_verify_area() and thus need
> file->f_path.dentry->d_inode (so each pipe needs a separate dentry)

No, at least pipes could easily just use "file->f_private_data" instead.

Now, sockets really do want the inode (or it would be really really big 
changes), but pipes really just want a "struct pipe_inode_info" pointer, 
which we could hide away directly in the file descriptor itself.

That's what Davide already did (on my suggestion) for signalfd - there's a 
*single* inode, and the real data is in the per-fd f_private_data.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-07 Thread Christoph Hellwig
On Wed, Mar 07, 2007 at 08:16:14AM +0100, Eric Dumazet wrote:
> Crazy ideas : (some readers are going to kill me)
> 
> 1) Use the low order bit of f_path.dentry to say : this pointer is not a 
> pointer to a dentry but the inode pointer (with the low order bit set to 1)
> 
> OR
> 
> 2) file->f_path.dentry set to NULL for this special files (so that we dont 
> need to dput() and cache line ping pong the common dentry each time we 
> __fput() a pipe/socket.

No way on either one.  f_path.dentry always beeing there is an assumption
we make all over the place, and changing that would be a big regression
for code qualityand reliability all over the place.

Face it folks, memory is generally cheap, and we're not going to uglify
huge amounts of code to shave of a little bit.

[and that is only in reply to this one, the single dentry optimizations
 for epoll and friends are perfecltly fine from the high level view]

> Same trick could be used for file->f_path.mnt, because there is a big SMP 
> cache line ping/pong to maintain a mnt_count on pipe/sockets mountpoint 
> while these file systems cannot be un-mounted)

Same thing as above.  We might do a hack to not refcount these vfsmounts,
but we definitively want to keep the invariant of f_path.mnt never
beeing NULL.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-07 Thread Christoph Hellwig
On Wed, Mar 07, 2007 at 08:16:14AM +0100, Eric Dumazet wrote:
 Crazy ideas : (some readers are going to kill me)
 
 1) Use the low order bit of f_path.dentry to say : this pointer is not a 
 pointer to a dentry but the inode pointer (with the low order bit set to 1)
 
 OR
 
 2) file-f_path.dentry set to NULL for this special files (so that we dont 
 need to dput() and cache line ping pong the common dentry each time we 
 __fput() a pipe/socket.

No way on either one.  f_path.dentry always beeing there is an assumption
we make all over the place, and changing that would be a big regression
for code qualityand reliability all over the place.

Face it folks, memory is generally cheap, and we're not going to uglify
huge amounts of code to shave of a little bit.

[and that is only in reply to this one, the single dentry optimizations
 for epoll and friends are perfecltly fine from the high level view]

 Same trick could be used for file-f_path.mnt, because there is a big SMP 
 cache line ping/pong to maintain a mnt_count on pipe/sockets mountpoint 
 while these file systems cannot be un-mounted)

Same thing as above.  We might do a hack to not refcount these vfsmounts,
but we definitively want to keep the invariant of f_path.mnt never
beeing NULL.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-07 Thread Linus Torvalds


On Wed, 7 Mar 2007, Eric Dumazet wrote:
 
 I would definitly *love* saving dentries for pipes (and sockets too), but how
 are you going to get the inode ?

Don't use an inode at all.

 pipes()/sockets() can use read()/write()/rw_verify_area() and thus need
 file-f_path.dentry-d_inode (so each pipe needs a separate dentry)

No, at least pipes could easily just use file-f_private_data instead.

Now, sockets really do want the inode (or it would be really really big 
changes), but pipes really just want a struct pipe_inode_info pointer, 
which we could hide away directly in the file descriptor itself.

That's what Davide already did (on my suggestion) for signalfd - there's a 
*single* inode, and the real data is in the per-fd f_private_data.

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-07 Thread Linus Torvalds


On Wed, 7 Mar 2007, Eric Dumazet wrote:
 
 Crazy ideas : (some readers are going to kill me)

First off, as noted earlier, you don't need crazy ideas. 

But:

 1) Use the low order bit of f_path.dentry to say : this pointer is not a
 pointer to a dentry but the inode pointer (with the low order bit set to 1)

No, we don't want to do that. There's a lot of code that wants to just 
follow the dentry/inode chain unconditionally, and we want to have people 
able to just do

i_size_read(file-f_path.dentry-d_inode);

kind of thing, without having to check anything in between. It's costly 
enough that we have a chain of pointers, it's much *worse* if we then have 
to make conditionals etc (and forgetting is an oops).

 2) file-f_path.dentry set to NULL for this special files (so that we dont
 need to dput() and cache line ping pong the common dentry each time we
 __fput() a pipe/socket.

I agree abot the ping pong, but I suspect it's a lot less than the current 
pingpong on dcache_lock, which is a lot hotter than a pipe-dentry counter 
would be.

If it becomes a big issue, we could just have a fixed number of dentries, 
and hash them out some way (to dilute the ping-ponging). So making things 
NULL would just be horrible for everybody, since we have lots of generic 
code that just looks up the dentry and inode, and we don't want to make 
that worse.

(I'm sure using filp-f_private_data has some downsides too, but it 
seems fairly simple at least for pipes. The biggest problem is likely that 
we currently use the mutex in the inode in filp-f_dentry-d_inode as a 
way to protect the inode-i_pipe pointer itself, and there is no 
equivalent locking at the struct file * level at all. We might have to 
make the f_ep_lock spinlock be unconditional)

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-07 Thread Eric Dumazet
On Wednesday 07 March 2007 18:02, Linus Torvalds wrote:
 On Wed, 7 Mar 2007, Eric Dumazet wrote:
  I would definitly *love* saving dentries for pipes (and sockets too), but
  how are you going to get the inode ?

 Don't use an inode at all.

Lovely :)


  pipes()/sockets() can use read()/write()/rw_verify_area() and thus need
  file-f_path.dentry-d_inode (so each pipe needs a separate dentry)

 No, at least pipes could easily just use file-f_private_data instead.

 Now, sockets really do want the inode (or it would be really really big
 changes), but pipes really just want a struct pipe_inode_info pointer,
 which we could hide away directly in the file descriptor itself.

sockets already uses file-private_data.

But calls to read()/write() (not send()/recv()) still need to go through the 
dentry, before entering socket land.


 That's what Davide already did (on my suggestion) for signalfd - there's a
 *single* inode, and the real data is in the per-fd f_private_data.


Please find enclosed the following patch, to prepare this path.

[PATCH] Delay the dentry name generation on sockets and pipes.

1) Introduces a new method in 'struct dentry_operations'. This method called 
d_dname() might be called from d_path() to be able to provide a dentry name 
for special filesystems. It is called without locks.

Future patches (if we succeed in having one common dentry for all pipes) may 
need to change prototype of this method, but we now use :
char *d_dname(struct dentry *dentry, char *buffer, int buflen)


2) Use this new method for sockets : No more sprintf() at socket creation. 
This is delayed up to the moment someone does an access to /proc/pid/fd/...
We also avoid mntput()/mntget() on sock_mnt

3) Use this new method for pipes : No more sprintf() at pipe creation. This is 
delayed up to the moment someone does an access to /proc/pid/fd/...
We also avoid mntput()/mntget() on pipe_mnt

A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a 
*nice* speedup on my Pentium(M) 1.6 Ghz :

3.090 s instead of 3.450 s

Signed-off-by: Eric Dumazet [EMAIL PROTECTED]

 fs/dcache.c|3 +++
 fs/pipe.c  |   16 +++-
 include/linux/dcache.h |1 +
 net/socket.c   |   15 +++
 4 files changed, 26 insertions(+), 9 deletions(-)

--- linux-2.6.21-rc3/include/linux/dcache.h	2007-03-07 17:23:55.0 +0100
+++ linux-2.6.21-rc3-ed/include/linux/dcache.h	2007-03-07 17:27:39.0 +0100
@@ -133,6 +133,7 @@ struct dentry_operations {
 	int (*d_delete)(struct dentry *);
 	void (*d_release)(struct dentry *);
 	void (*d_iput)(struct dentry *, struct inode *);
+	char * (*d_dname)(struct dentry *, char *, int);
 };
 
 /* the dentry parameter passed to d_hash and d_compare is the parent
--- linux-2.6.21-rc3/fs/dcache.c	2007-03-07 17:23:55.0 +0100
+++ linux-2.6.21-rc3-ed/fs/dcache.c	2007-03-07 17:28:46.0 +0100
@@ -1823,6 +1823,9 @@ char * d_path(struct dentry *dentry, str
 	struct vfsmount *rootmnt;
 	struct dentry *root;
 
+	if (dentry-d_op  dentry-d_op-d_dname)
+		return (dentry-d_op-d_dname)(dentry, buf, buflen);
+
 	read_lock(current-fs-lock);
 	rootmnt = mntget(current-fs-rootmnt);
 	root = dget(current-fs-root);
--- linux-2.6.21-rc3/fs/pipe.c	2007-03-07 17:42:36.0 +0100
+++ linux-2.6.21-rc3-ed/fs/pipe.c	2007-03-07 18:01:40.0 +0100
@@ -841,8 +841,15 @@ static int pipefs_delete_dentry(struct d
 	return 0;
 }
 
+static char * pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+	snprintf(buffer, buflen, pipe:[%lu], dentry-d_inode-i_ino);
+	return buffer;
+}
+
 static struct dentry_operations pipefs_dentry_operations = {
 	.d_delete	= pipefs_delete_dentry,
+	.d_dname	= pipefs_dname,
 };
 
 static struct inode * get_pipe_inode(void)
@@ -888,7 +895,6 @@ struct file *create_write_pipe(void)
 	struct inode *inode;
 	struct file *f;
 	struct dentry *dentry;
-	char name[32];
 	struct qstr this;
 
 	f = get_empty_filp();
@@ -899,8 +905,8 @@ struct file *create_write_pipe(void)
 	if (!inode)
 		goto err_file;
 
-	this.len = sprintf(name, [%lu], inode-i_ino);
-	this.name = name;
+	this.len = 0;
+	this.name = NULL;
 	this.hash = 0;
 	err = -ENOMEM;
 	dentry = d_alloc(pipe_mnt-mnt_sb-s_root, this);
@@ -915,7 +921,7 @@ struct file *create_write_pipe(void)
 	 */
 	dentry-d_flags = ~DCACHE_UNHASHED;
 	d_instantiate(dentry, inode);
-	f-f_path.mnt = mntget(pipe_mnt);
+	f-f_path.mnt = NULL;
 	f-f_path.dentry = dentry;
 	f-f_mapping = inode-i_mapping;
 
@@ -949,7 +955,7 @@ struct file *create_read_pipe(struct fil
 		return ERR_PTR(-ENFILE);
 
 	/* Grab pipe from the writer */
-	f-f_path.mnt = mntget(wrf-f_path.mnt);
+	f-f_path.mnt = NULL;
 	f-f_path.dentry = dget(wrf-f_path.dentry);
 	f-f_mapping = wrf-f_path.dentry-d_inode-i_mapping;
 
--- linux-2.6.21-rc3/net/socket.c	2007-03-07 17:37:56.0 +0100
+++ linux-2.6.21-rc3-ed/net/socket.c	2007-03-07 17:54:53.0 +0100
@@ -314,8 +314,16 @@ static int 

Re: [patch] epoll use a single inode ...

2007-03-07 Thread Eric Dumazet
(resending with a more convenient attachment)

Please find enclosed the following patch, to prepare this path.

[PATCH] Delay the dentry name generation on sockets and pipes.

1) Introduces a new method in 'struct dentry_operations'. This method called 
d_dname() might be called from d_path() to be able to provide a dentry name 
for special filesystems. It is called without locks.

Future patches (if we succeed in having one common dentry for all pipes) may 
need to change prototype of this method, but we now use :
char *d_dname(struct dentry *dentry, char *buffer, int buflen)


2) Use this new method for sockets : No more sprintf() at socket creation. 
This is delayed up to the moment someone does an access to /proc/pid/fd/...
We also avoid mntput()/mntget() on sock_mnt

3) Use this new method for pipes : No more sprintf() at pipe creation. This is 
delayed up to the moment someone does an access to /proc/pid/fd/...
We also avoid mntput()/mntget() on pipe_mnt

A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a 
*nice* speedup on my Pentium(M) 1.6 Ghz :

3.090 s instead of 3.450 s

Signed-off-by: Eric Dumazet [EMAIL PROTECTED]

 fs/dcache.c            |    3 +++
 fs/pipe.c              |   16 +++-
 include/linux/dcache.h |    1 +
 net/socket.c           |   15 +++
 4 files changed, 26 insertions(+), 9 deletions(-)
--- linux-2.6.21-rc3/include/linux/dcache.h 2007-03-07 17:23:55.0 
+0100
+++ linux-2.6.21-rc3-ed/include/linux/dcache.h  2007-03-07 17:27:39.0 
+0100
@@ -133,6 +133,7 @@ struct dentry_operations {
int (*d_delete)(struct dentry *);
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
+   char * (*d_dname)(struct dentry *, char *, int);
 };
 
 /* the dentry parameter passed to d_hash and d_compare is the parent
--- linux-2.6.21-rc3/fs/dcache.c2007-03-07 17:23:55.0 +0100
+++ linux-2.6.21-rc3-ed/fs/dcache.c 2007-03-07 17:28:46.0 +0100
@@ -1823,6 +1823,9 @@ char * d_path(struct dentry *dentry, str
struct vfsmount *rootmnt;
struct dentry *root;
 
+   if (dentry-d_op  dentry-d_op-d_dname)
+   return (dentry-d_op-d_dname)(dentry, buf, buflen);
+
read_lock(current-fs-lock);
rootmnt = mntget(current-fs-rootmnt);
root = dget(current-fs-root);
--- linux-2.6.21-rc3/fs/pipe.c  2007-03-07 17:42:36.0 +0100
+++ linux-2.6.21-rc3-ed/fs/pipe.c   2007-03-07 18:01:40.0 +0100
@@ -841,8 +841,15 @@ static int pipefs_delete_dentry(struct d
return 0;
 }
 
+static char * pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+   snprintf(buffer, buflen, pipe:[%lu], dentry-d_inode-i_ino);
+   return buffer;
+}
+
 static struct dentry_operations pipefs_dentry_operations = {
.d_delete   = pipefs_delete_dentry,
+   .d_dname= pipefs_dname,
 };
 
 static struct inode * get_pipe_inode(void)
@@ -888,7 +895,6 @@ struct file *create_write_pipe(void)
struct inode *inode;
struct file *f;
struct dentry *dentry;
-   char name[32];
struct qstr this;
 
f = get_empty_filp();
@@ -899,8 +905,8 @@ struct file *create_write_pipe(void)
if (!inode)
goto err_file;
 
-   this.len = sprintf(name, [%lu], inode-i_ino);
-   this.name = name;
+   this.len = 0;
+   this.name = NULL;
this.hash = 0;
err = -ENOMEM;
dentry = d_alloc(pipe_mnt-mnt_sb-s_root, this);
@@ -915,7 +921,7 @@ struct file *create_write_pipe(void)
 */
dentry-d_flags = ~DCACHE_UNHASHED;
d_instantiate(dentry, inode);
-   f-f_path.mnt = mntget(pipe_mnt);
+   f-f_path.mnt = NULL;
f-f_path.dentry = dentry;
f-f_mapping = inode-i_mapping;
 
@@ -949,7 +955,7 @@ struct file *create_read_pipe(struct fil
return ERR_PTR(-ENFILE);
 
/* Grab pipe from the writer */
-   f-f_path.mnt = mntget(wrf-f_path.mnt);
+   f-f_path.mnt = NULL;
f-f_path.dentry = dget(wrf-f_path.dentry);
f-f_mapping = wrf-f_path.dentry-d_inode-i_mapping;
 
--- linux-2.6.21-rc3/net/socket.c   2007-03-07 17:37:56.0 +0100
+++ linux-2.6.21-rc3-ed/net/socket.c2007-03-07 17:54:53.0 +0100
@@ -314,8 +314,16 @@ static int sockfs_delete_dentry(struct d
dentry-d_flags |= DCACHE_UNHASHED;
return 0;
 }
+
+static char * sockfs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+   snprintf(buffer, buflen, socket:[%lu], dentry-d_inode-i_ino);
+   return buffer;
+}
+
 static struct dentry_operations sockfs_dentry_operations = {
.d_delete = sockfs_delete_dentry,
+   .d_dname  = sockfs_dname,
 };
 
 /*
@@ -356,10 +364,9 @@ static int sock_alloc_fd(struct file **f
 static int sock_attach_fd(struct socket *sock, struct file *file)
 {
struct qstr this;
-   char name[32];
 
-   this.len = 

Re: [patch] epoll use a single inode ...

2007-03-07 Thread Linus Torvalds


On Wed, 7 Mar 2007, Eric Dumazet wrote:
 
 sockets already uses file-private_data.
 
 But calls to read()/write() (not send()/recv()) still need to go through the 
 dentry, before entering socket land.

Sure. The dentry and the inode need to *exist*, but they can be one single 
static dentry/inode per file descriptor type.

We always pass in the struct file * to read/write too, since we need it 
anyway for things like file control information (eg is it a nonblocking 
read or write kinds of things).

So I'm not suggesting a NULL dentry/inode, I'm suggesting a single static 
one per type.

And yeah, it may be harder than it looks. Some things know that all the 
relevant info is in the inode, so they just pass in the inode. In the pipe 
layer, for example, you'd need to change free_pipe_info() and 
alloc_pipe_info() to pass in the file descriptor instead, same goes for 
pipe_release(). But the struct file * is always available, it's just 
that since the code was originally written to have all the info in the 
inode, some of the code isn't set up to use it or pass it on..

But your patch is independent of that, and looks fine. Except I don't like 
this part:

-   file-f_path.mnt = mntget(sock_mnt);
+   file-f_path.mnt = NULL;

since I'd be much happer with always having f_path.mnt available, the same 
way we should always have f_path.dentry there.

(Btw, your patch is *not* going to work with the file-f_private_data 
approach, because d_path() is not passed down the file * thing. So we'd 
need to do that, and that's more intrusive (it can be NULL, since for 
things like cwd/pwd we don't have a struct file).

But I like your patch as a totally independent thing. It just makes 
sense. 

(Apart from the f_path.mnt thing, which I think was something else ;)

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-07 Thread Eric Dumazet
On Wednesday 07 March 2007 18:45, Linus Torvalds wrote:
 On Wed, 7 Mar 2007, Eric Dumazet wrote:
  sockets already uses file-private_data.
 
  But calls to read()/write() (not send()/recv()) still need to go through
  the dentry, before entering socket land.

 Sure. The dentry and the inode need to *exist*, but they can be one single
 static dentry/inode per file descriptor type.

 We always pass in the struct file * to read/write too, since we need it
 anyway for things like file control information (eg is it a nonblocking
 read or write kinds of things).

 So I'm not suggesting a NULL dentry/inode, I'm suggesting a single static
 one per type.

 And yeah, it may be harder than it looks. Some things know that all the
 relevant info is in the inode, so they just pass in the inode. In the pipe
 layer, for example, you'd need to change free_pipe_info() and
 alloc_pipe_info() to pass in the file descriptor instead, same goes for
 pipe_release(). But the struct file * is always available, it's just
 that since the code was originally written to have all the info in the
 inode, some of the code isn't set up to use it or pass it on..

 But your patch is independent of that, and looks fine. Except I don't like
 this part:

 -   file-f_path.mnt = mntget(sock_mnt);
 +   file-f_path.mnt = NULL;

 since I'd be much happer with always having f_path.mnt available, the same
 way we should always have f_path.dentry there.

Yes, but mntget()/mntput() are protected against NULL.
I was quite happy to remove two locked operations :)
I didnt found a way to crash (yet) my patched machine :)


 (Btw, your patch is *not* going to work with the file-f_private_data
 approach, because d_path() is not passed down the file * thing. So we'd
 need to do that, and that's more intrusive (it can be NULL, since for
 things like cwd/pwd we don't have a struct file).

I tried this path today and failed...
Too many changes to do (nameidata) to propagate a 'struct file *' 
appropriately...


 But I like your patch as a totally independent thing. It just makes
 sense.

 (Apart from the f_path.mnt thing, which I think was something else ;)

OK no problem here is the patch without messing f_path.mnt 

(benchmark results not really different on my little machine, SMP kernel but 
one CPU only... maybe because lock suffix is changed by a nop)


[PATCH] Delay the dentry name generation on sockets and pipes.

1) Introduces a new method in 'struct dentry_operations'. This method called 
d_dname() might be called from d_path() to be able to provide a dentry name 
for special filesystems. It is called without locks.

Future patches (if we succeed in having one common dentry for all pipes) may 
need to change prototype of this method, but we now use :
char *d_dname(struct dentry *dentry, char *buffer, int buflen)


2) Use this new method for sockets : No more sprintf() at socket creation. 
This is delayed up to the moment someone does an access to /proc/pid/fd/...

3) Use this new method for pipes : No more sprintf() at pipe creation. This is 
delayed up to the moment someone does an access to /proc/pid/fd/...

A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a 
*nice* speedup on my Pentium(M) 1.6 Ghz :

3.090 s instead of 3.450 s

Signed-off-by: Eric Dumazet [EMAIL PROTECTED]
 fs/dcache.c|3 +++
 fs/pipe.c  |   12 +---
 include/linux/dcache.h |1 +
 net/socket.c   |   13 ++---
 4 files changed, 23 insertions(+), 6 deletions(-)
--- linux-2.6.21-rc3/include/linux/dcache.h 2007-03-07 17:23:55.0 
+0100
+++ linux-2.6.21-rc3-ed/include/linux/dcache.h  2007-03-07 17:27:39.0 
+0100
@@ -133,6 +133,7 @@ struct dentry_operations {
int (*d_delete)(struct dentry *);
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
+   char * (*d_dname)(struct dentry *, char *, int);
 };
 
 /* the dentry parameter passed to d_hash and d_compare is the parent
--- linux-2.6.21-rc3/fs/dcache.c2007-03-07 17:23:55.0 +0100
+++ linux-2.6.21-rc3-ed/fs/dcache.c 2007-03-07 17:28:46.0 +0100
@@ -1823,6 +1823,9 @@ char * d_path(struct dentry *dentry, str
struct vfsmount *rootmnt;
struct dentry *root;
 
+   if (dentry-d_op  dentry-d_op-d_dname)
+   return (dentry-d_op-d_dname)(dentry, buf, buflen);
+
read_lock(current-fs-lock);
rootmnt = mntget(current-fs-rootmnt);
root = dget(current-fs-root);
--- linux-2.6.21-rc3/fs/pipe.c  2007-03-07 17:42:36.0 +0100
+++ linux-2.6.21-rc3-ed/fs/pipe.c   2007-03-07 18:54:33.0 +0100
@@ -841,8 +841,15 @@ static int pipefs_delete_dentry(struct d
return 0;
 }
 
+static char * pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+   snprintf(buffer, buflen, pipe:[%lu], dentry-d_inode-i_ino);
+   return buffer;
+}
+
 static struct dentry_operations 

Re: [patch] epoll use a single inode ...

2007-03-07 Thread Linus Torvalds


On Wed, 7 Mar 2007, Eric Dumazet wrote:
 
 OK no problem here is the patch without messing f_path.mnt 

All right. I like it. Definitely worth putting into -mm, or just 
re-sending to me after 2.6.21 is out (I'll forget all about it otherwise).

I have one more worry, namely this::

-   char name[32];
-   this.len = sprintf(name, [%lu], SOCK_INODE(sock)-i_ino);
-   this.name = name;
+   this.len = 0;
+   this.name = NULL;

I think that's fine, and it *happens* to work, but it happens to work just 
because then d_alloc() will do:

memcpy(dname, name-name, name-len);
dname[name-len] = 0;

and passing in NULL to memcpy() is generally ok when len is 0.

HOWEVER. 

Not only might memcpy() do a prefetch for read on the source for some 
architectures (which in turn may end up being slow for an address that 
isn't in the TLB, like NULL), but you depend on a very much internal 
detail, since it *could* have been using something like

memcpy(dname, name-name, name-len+1);

instead, and expected to get the final '\0' character from the source 
string.

So I would actually much prefer it to be written as

this.len = 0;
this.name = ;

just because it's safer.

But other than that small detail, I think this is not only an 
optimization, it's an actual cleanup, and we migth some day want to use 
something like this for some other things too (eg maybe this kind of 
approach is usable for /proc/pid entries too, to avoid instantiating 
them).

As to avoiding the mntget(), I'm not violently opposed to it, but I do 
think that it's a totally unrelated matter, so even if it's decided it's 
worth it, it should be done as a separate patch regardless.

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-07 Thread Eric Dumazet

 Not only might memcpy() do a prefetch for read on the source for some
 architectures (which in turn may end up being slow for an address that
 isn't in the TLB, like NULL), but you depend on a very much internal

Well, I hope a prefetch(NULL) is OK because we are doing millions of them (see 
include/linux/list.h :) )

 detail, since it *could* have been using something like

   memcpy(dname, name-name, name-len+1);


Yes very true, I will change that and push to Andrew for mm testing

I was thinking about being able to cache the name into the dentry, do you 
think it's worth the pain ? (its not SMP safe for example...)

static char * pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
{
if (!dentry-d_iname[0])
snprintf(dentry-d_iname,
DNAME_INLINE_LEN_MIN,
pipe:[%lu],
dentry-d_inode-i_ino);
strlcpy(buffer, dentry-d_iname, buflen);
return buffer;
}
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-07 Thread Linus Torvalds


On Wed, 7 Mar 2007, Eric Dumazet wrote:
 
 I was thinking about being able to cache the name into the dentry, do you 
 think it's worth the pain ? (its not SMP safe for example...)

Actually, it *can* be SMP-safe, if you do it right. Something like

len = dentry-d_name.len;

if (!len) {
len = snprintf(dentry-d_iname,
DNAME_INLINE_LEN_MIN,
pipe:[%lu],
dentry-d_inode-i_ino);
smp_wmb();
dentry-d_name.len = len;
}
if (len = buflen)
len = buflen-1;
memcpy(buffer, dentry-d_iname, len);
buffer[len] = 0;
return buffer;

should work, although it depends on the fact that our snprintf() 
implementation should be stable (ie if snprintf() modifies the buffer 
temporarily as it goes along, that would break, but I think our 
vsnprintf is good in that respect).

So you could have two different CPU's doing the snprintf() on the same 
buffer at the same time (and assigning the length at the same time), but 
since they'll write the same thing, you don't really care.

It's a bit subtle, though. And probably not really worth it.

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-07 Thread Anton Blanchard

Hi,

 Well, I hope a prefetch(NULL) is OK because we are doing millions of
 them (see include/linux/list.h :) )

Funny you mention this. We found some noticeable ppc64 regressions when
moving the dcache to standard list macros and had to do this to fix it
up:

static inline void prefetch(const void *x)
{
if (unlikely(!x))
return;

__asm__ __volatile__ (dcbt 0,%0 : : r (x));
}

Urgh :)

Anton
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-07 Thread Linus Torvalds


On Wed, 7 Mar 2007, Anton Blanchard wrote:
 
 Funny you mention this. We found some noticeable ppc64 regressions when
 moving the dcache to standard list macros and had to do this to fix it
 up:
 
 static inline void prefetch(const void *x)
 {
 if (unlikely(!x))
 return;
 
 __asm__ __volatile__ (dcbt 0,%0 : : r (x));
 }
 
 Urgh :)

Yeah, I'm not at all surprised. Any implementation of prefetch that 
doesn't just turn into a no-op if the TLB entry doesn't exist (which makes 
them weaker for *actual* prefetching) will generally have a hard time with 
a NULL pointer. Exactly because it will try to do a totally unnecessary 
TLB fill - and since most CPU's will not cache negative TLB entries, that 
unnecessary TLB fill will be done over and over and over again..

In general, using software prefetching is just a stupid idea, unless

 - the prefetch really is very strict (ie for a linked list you do exactly 
   the above kinds of things to make sure that you don't try to prefetch 
   the non-existent end entry)
AND
 - the CPU is stupid (in-order in particular).

I think Intel even suggests in their optimization manuals to *not* do 
software prefetching, because hw can usually simply do better without it.

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-07 Thread Michael K. Edwards

On 3/7/07, Linus Torvalds [EMAIL PROTECTED] wrote

Yeah, I'm not at all surprised. Any implementation of prefetch that
doesn't just turn into a no-op if the TLB entry doesn't exist (which makes
them weaker for *actual* prefetching) will generally have a hard time with
a NULL pointer. Exactly because it will try to do a totally unnecessary
TLB fill - and since most CPU's will not cache negative TLB entries, that
unnecessary TLB fill will be done over and over and over again..


Data prefetch instructions should indeed avoid page table walks.
(Instruction prefetch mechanisms often do induce table walks on ITLB
miss.)  Not just because of the null pointer case, but because it's
quite normal to run off the end of an array in a loop with an embedded
prefetch instruction.  If you have an extra instruction issue unit
that shares the same DTLB, and you know you will really want that
data, you can sometimes use it to force DTLB preloads by doing an
actual data fetch from the foreseeable page.  This is potentially one
of the best uses of chip multi-threading on an architecture like Sun's
Niagara.

(I don't think Intel's hyper-threading works for this purpose; the
DTLB is shared but the entries are marked as owned by one thread or
the other.  HT can be used for L2 cache prefetching, although the
results so far seem to be mixed:
http://www.cgo.org/cgo2004/papers/02_80_Kim_D_REVISED.pdf)


In general, using software prefetching is just a stupid idea, unless

 - the prefetch really is very strict (ie for a linked list you do exactly
   the above kinds of things to make sure that you don't try to prefetch
   the non-existent end entry)
AND
 - the CPU is stupid (in-order in particular).

I think Intel even suggests in their optimization manuals to *not* do
software prefetching, because hw can usually simply do better without it.


Not the XScale -- it performs quite poorly without prefetch, as people
who have run ARMv5-optimized binaries on it can testify.  From the
XScale Core Developer's Manual:

quote
The Intel XScale(r) core has a true prefetch load instruction (PLD).
The purpose of this instruction is to preload data into the data and
mini-data caches. Data prefetching allows hiding of memory transfer
latency while the processor continues to execute instructions. The
prefetch is important to compiler and assembly code because judicious
use of the prefetch instruction can enormously improve throughput
performance of the core. Data prefetch can be applied not only to
loops but also to any data references within a block of code. Prefetch
also applies to data writing when the memory type is enabled as write
allocate

The Intel XScale(r) core prefetch load instruction is a true prefetch
instruction because the load destination is the data or mini-data
cache and not a register. Compilers for processors which have data
caches, but do not support prefetch, sometimes use a load instruction
to preload the data cache. This technique has the disadvantages of
using a register to load data and requiring additional registers for
subsequent preloads and thus increasing register pressure. By
contrast, the prefetch can be used to reduce register pressure instead
of increasing it.

The prefetch load is a hint instruction and does not guarantee that
the data will be loaded. Whenever the load would cause a fault or a
table walk, then the processor will ignore the prefetch instruction,
the fault or table walk, and continue processing the next instruction.
This is particularly advantageous in the case where a linked list or
recursive data structure is terminated by a NULL pointer. Prefetching
the NULL pointer will not fault program flow.
/quote

People's prejudices against prefetch instructions are sometimes
traceable to the 3DNow! prefetch(w) botch, which some processors
support as no-ops and others are too aggressive about (Opteron
prefetches are reputed to be strong, i. e., not dropped on DTLB
miss).  XScale gets it right.  So do most Pentium 4's using the SSE
prefetches, according to the IA-32 optimization manual.  (Oddly,
Prescott seems to have initiated a page table walk on DTLB miss during
software prefetch -- just one of many weird Prescott flaws.)  I'm
guessing Pentium M and its descendants (Core Solo and Duo) get it
right but I'm having a hell of a time finding out for sure.  Can any
of the x86 experts answer this?

Cheers,
- Michael
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-07 Thread Kyle Moffett

On Mar 07, 2007, at 20:25:14, Michael K. Edwards wrote:

On 3/7/07, Linus Torvalds [EMAIL PROTECTED] wrote

In general, using software prefetching is just a stupid idea, unless

 - the prefetch really is very strict (ie for a linked list you do  
exactly the above kinds of things to make sure that you don't try  
to prefetch the non-existent end entry)

AND
 - the CPU is stupid (in-order in particular).

I think Intel even suggests in their optimization manuals to *not*  
do software prefetching, because hw can usually simply do better  
without it.


Not the XScale -- it performs quite poorly without prefetch, as  
people who have run ARMv5-optimized binaries on it can testify.


The Intel XScale(r) core prefetch load instruction is a true  
prefetch instruction because the load destination is the data or  
mini-data cache and not a register. Compilers for processors which  
have data caches, but do not support prefetch, sometimes use a load  
instruction to preload the data cache. This technique has the  
disadvantages of using a register to load data and requiring  
additional registers for
subsequent preloads and thus increasing register pressure. By  
contrast, the prefetch can be used to reduce register pressure  
instead of increasing it.


The prefetch load is a hint instruction and does not guarantee that  
the data will be loaded. Whenever the load would cause a fault or a  
table walk, then the processor will ignore the prefetch  
instruction, the fault or table walk, and continue processing the  
next instruction. This is particularly advantageous in the case  
where a linked list or recursive data structure is terminated by a  
NULL pointer. Prefetching the NULL pointer will not fault program  
flow.


Prefetching is also fairly critical on a Power4 or G5 PowerPC system  
as they have a long memory latency; an L2-cache miss can cost 200+  
cycles.  On such systems the dcbt prefetch instruction brings in a  
single 128-byte cacheline and has no serializing effects whatsoever,  
making it ideal for use in a linked-list-traversal inner loop.


Cheers,
Kyle Moffett

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-07 Thread Linus Torvalds


On Wed, 7 Mar 2007, Michael K. Edwards wrote:
 
 People's prejudices against prefetch instructions are sometimes
 traceable to the 3DNow! prefetch(w) botch, which some processors
 support as no-ops and others are too aggressive about (Opteron
 prefetches are reputed to be strong, i. e., not dropped on DTLB
 miss).

No, I just checked, and Intel's own optimization manual makes it clear 
that you should be careful. They talk about performance penalties due to 
resource constraints - which makes tons of sense with a core that is good 
at handling its own resources and could quite possibly use those resources 
better to actually execute the loads and stores deeper down the 
instruction pipeline.

So it's not just 3DNow! making AMD look bad, or Intel would obviously 
suggest people use it out of the wazoo ;)

 XScale gets it right.

Blah. XScale isn't even an OoO CPU, *of*course* it needs prefetching. 
Calling that getting it right is ludicrous. If anything, it gets things 
so wrong that prefetching is *required* for good performance.

I'm talking about real CPU's with real memory pipelines that already do 
prefetching in hardware. The better the core is, the less the prefetch 
helps (and often the more it hurts in comparison to how much it helps).

But if you mean doesn't try to fill the TLB on data prefetches, then 
yes, that's generally the right thing to do.

 (Oddly, Prescott seems to have initiated a page table walk on DTLB miss 
 during software prefetch -- just one of many weird Prescott flaws.)  

Netburst in general is *very* happy to do speculative TLB fills, I think.

 I'm guessing Pentium M and its descendants (Core Solo and Duo) get it 
 right but I'm having a hell of a time finding out for sure.  Can any of 
 the x86 experts answer this?

I just suspect that the upside for Core 2 Due is likely fairly low. The L2 
cache is good, the memory re-ordering is working.. I doubt prefetch 
helps in generic code that much for things like linked list following, you 
should probably limit it to code that has *known* access patterns and you 
know it's not going to be in the cache.

(In other words, I bet prefetching can help a lot with MMX/media kind of 
code, I doubt it's a huge win for for_each_entry())

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-07 Thread Eric Dumazet

Kyle Moffett a écrit :


Prefetching is also fairly critical on a Power4 or G5 PowerPC system as 
they have a long memory latency; an L2-cache miss can cost 200+ cycles.  
On such systems the dcbt prefetch instruction brings in a single 
128-byte cacheline and has no serializing effects whatsoever, making it 
ideal for use in a linked-list-traversal inner loop.


OK, 200 cycles...

But what is the cost of the conditional branch you added in prefetch(x) ?

if (!x) return;

(correctly predicted or not, but do powerPC have a BTB ?)

About the NULL 'potential problem', maybe we could use a dummy nil (but 
mapped) object, and use its address in lists, ie compare for nil instead of 
NULL. This would avoid :


- The conditional test in some prefetch() implementations
- The potential TLB problem with the NULL value.




-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-06 Thread Eric Dumazet

Eric Dumazet a écrit :


I would definitly *love* saving dentries for pipes (and sockets too), 
but how are you going to get the inode ?


pipes()/sockets() can use read()/write()/rw_verify_area() and thus need 
file->f_path.dentry->d_inode (so each pipe needs a separate dentry)


Are you suggesting adding a new "struct file_operations" member to get 
the inode ?

Or re-intoducing an inode pointer in struct file ?


Crazy ideas : (some readers are going to kill me)

1) Use the low order bit of f_path.dentry to say : this pointer is not a 
pointer to a dentry but the inode pointer (with the low order bit set to 1)


OR

2) file->f_path.dentry set to NULL for this special files (so that we dont 
need to dput() and cache line ping pong the common dentry each time we 
__fput() a pipe/socket.


Same trick could be used for file->f_path.mnt, because there is a big SMP 
cache line ping/pong to maintain a mnt_count on pipe/sockets mountpoint while 
these file systems cannot be un-mounted)


If dentry is NULL, we get the inode pointer from an overlay of struct 
file_ra_statef_ra; (because for this special files readahead is unused)


This adds some conditional branches of course, but being able to save ram and 
better use cpu caches might be worth them.



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-06 Thread Davide Libenzi
On Wed, 7 Mar 2007, Eric Dumazet wrote:

> I would definitly *love* saving dentries for pipes (and sockets too), but how
> are you going to get the inode ?

I was not planning to touch anything but epoll, signalfd and timerfd 
files.


> pipes()/sockets() can use read()/write()/rw_verify_area() and thus need
> file->f_path.dentry->d_inode (so each pipe needs a separate dentry)

Currently, they use a single inode, and multiple dentries (to give the 
name of the class). But this could be changed to a single dentry like 
Linus was suggesting. I'll wait for Al's reply before doing anything.
Memory saving can be something, on top of the already big one of avoiding 
code duplication.



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-06 Thread Eric Dumazet

Linus Torvalds a écrit :


I assume that the *only* reason for having multiple dentries is really 
just the output in /proc//fd/, right? Or is there any other reason to 
have separate dentries for these pseudo-files?


It's a bit sad to waste that much memory (and time) on something like 
that. I bet that the dentry setup is a noticeable part of the whole 
sigfd()/timerfd() setup. It's likely also a big part of any memory 
footprint if you have lots of them.


So how about just doing:
 - do a single dentry
 - make a "struct file_operations" member function that prints out the 
   name of the thing in /proc//fd/, and which *defaults* to just 
   doing the d_path() on the dentry, but special filesystems like this 
   could do something else (like print out a fake inode number from the 
   "file->f_private_data" information)


There seems to really be no downsides to that approach. No existing 
filesystem will even notice (they'll all have NULL in the new f_op 
member), and it would allow pipes etc to be sped up and use less memory.




I would definitly *love* saving dentries for pipes (and sockets too), but how 
are you going to get the inode ?


pipes()/sockets() can use read()/write()/rw_verify_area() and thus need 
file->f_path.dentry->d_inode (so each pipe needs a separate dentry)


Are you suggesting adding a new "struct file_operations" member to get the 
inode ?
Or re-intoducing an inode pointer in struct file ?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-06 Thread Davide Libenzi
On Tue, 6 Mar 2007, Linus Torvalds wrote:

> 
> 
> On Tue, 6 Mar 2007, Davide Libenzi wrote:
> > 
> > I'm OK with everything that avoid code duplication due to those fake 
> > inodes. The change can be localized inside the existing API, so it doesn't 
> > really affect me externally.
> 
> Can you try with the first patch version that doesn't do anything special 
> at all, and just uses a single dentry.
> 
> Yeah, the dentry name will be identical, and so you would see something 
> like "7 -> signalfd:signalfd" when you do "ls -l /proc//fd/" on a 
> task that has such a special file descriptor (with no way to tell 
> different timerfd's and signalfd's apart), but I think it's better to 
> start off simple than to overdesign things.
> 
> And trying to tell them apart sounds a bit like overdesign, if only 
> because I really don't see why anybody would really *care*. So it's a 
> timer for poll/select/epoll - why care about anything else?

The code would not change/shrink much with the single dentry. We'd save 
memory, and we'd lose the capability of seeing aino:[CLASS].
Both ways are fine with me.



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-06 Thread Linus Torvalds


On Tue, 6 Mar 2007, Davide Libenzi wrote:
> 
> I'm OK with everything that avoid code duplication due to those fake 
> inodes. The change can be localized inside the existing API, so it doesn't 
> really affect me externally.

Can you try with the first patch version that doesn't do anything special 
at all, and just uses a single dentry.

Yeah, the dentry name will be identical, and so you would see something 
like "7 -> signalfd:signalfd" when you do "ls -l /proc//fd/" on a 
task that has such a special file descriptor (with no way to tell 
different timerfd's and signalfd's apart), but I think it's better to 
start off simple than to overdesign things.

And trying to tell them apart sounds a bit like overdesign, if only 
because I really don't see why anybody would really *care*. So it's a 
timer for poll/select/epoll - why care about anything else?

If it really really turns out that people care, we know how we can do it. 
We'd hook into "proc_fd_link()" and we'd allow a per-file mntget/dget that 
we could use to let special filesystems create fake entries on demand. So 
it's not impossible, it's just likely simply not needed.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-06 Thread Davide Libenzi
On Tue, 6 Mar 2007, Linus Torvalds wrote:

> 
> [ Al Viro added to Cc: as the arbiter of good taste in the VFS layer. He 
>   has veto powers even over my proposals ;^]
> 
> On Tue, 6 Mar 2007, Davide Libenzi wrote:
> > 
> > I currently have the dentry to carry the name of the file* class it is 
> > linked to. I'd prefer to keep it that way, unless there are huge factors 
> > against.
> 
> I assume that the *only* reason for having multiple dentries is really 
> just the output in /proc//fd/, right? Or is there any other reason to 
> have separate dentries for these pseudo-files?
> 
> It's a bit sad to waste that much memory (and time) on something like 
> that. I bet that the dentry setup is a noticeable part of the whole 
> sigfd()/timerfd() setup. It's likely also a big part of any memory 
> footprint if you have lots of them.
> 
> So how about just doing:
>  - do a single dentry
>  - make a "struct file_operations" member function that prints out the 
>name of the thing in /proc//fd/, and which *defaults* to just 
>doing the d_path() on the dentry, but special filesystems like this 
>could do something else (like print out a fake inode number from the 
>"file->f_private_data" information)
> 
> There seems to really be no downsides to that approach. No existing 
> filesystem will even notice (they'll all have NULL in the new f_op 
> member), and it would allow pipes etc to be sped up and use less memory.

I'm OK with everything that avoid code duplication due to those fake 
inodes. The change can be localized inside the existing API, so it doesn't 
really affect me externally.



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-06 Thread Linus Torvalds

[ Al Viro added to Cc: as the arbiter of good taste in the VFS layer. He 
  has veto powers even over my proposals ;^]

On Tue, 6 Mar 2007, Davide Libenzi wrote:
> 
> I currently have the dentry to carry the name of the file* class it is 
> linked to. I'd prefer to keep it that way, unless there are huge factors 
> against.

I assume that the *only* reason for having multiple dentries is really 
just the output in /proc//fd/, right? Or is there any other reason to 
have separate dentries for these pseudo-files?

It's a bit sad to waste that much memory (and time) on something like 
that. I bet that the dentry setup is a noticeable part of the whole 
sigfd()/timerfd() setup. It's likely also a big part of any memory 
footprint if you have lots of them.

So how about just doing:
 - do a single dentry
 - make a "struct file_operations" member function that prints out the 
   name of the thing in /proc//fd/, and which *defaults* to just 
   doing the d_path() on the dentry, but special filesystems like this 
   could do something else (like print out a fake inode number from the 
   "file->f_private_data" information)

There seems to really be no downsides to that approach. No existing 
filesystem will even notice (they'll all have NULL in the new f_op 
member), and it would allow pipes etc to be sped up and use less memory.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-06 Thread Davide Libenzi
On Tue, 6 Mar 2007, Avi Kivity wrote:

> Right.  For kvmfs this isn't a problem as there's a 1:1 relationship 
> between synthetic inodes and dentries.  Perhaps d_alloc_anon() could be 
> extended to avoid the scan if it's a problem.

I currently have the dentry to carry the name of the file* class it is 
linked to. I'd prefer to keep it that way, unless there are huge factors 
against.



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-06 Thread Davide Libenzi
On Tue, 6 Mar 2007, Avi Kivity wrote:

 Right.  For kvmfs this isn't a problem as there's a 1:1 relationship 
 between synthetic inodes and dentries.  Perhaps d_alloc_anon() could be 
 extended to avoid the scan if it's a problem.

I currently have the dentry to carry the name of the file* class it is 
linked to. I'd prefer to keep it that way, unless there are huge factors 
against.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-06 Thread Linus Torvalds

[ Al Viro added to Cc: as the arbiter of good taste in the VFS layer. He 
  has veto powers even over my proposals ;^]

On Tue, 6 Mar 2007, Davide Libenzi wrote:
 
 I currently have the dentry to carry the name of the file* class it is 
 linked to. I'd prefer to keep it that way, unless there are huge factors 
 against.

I assume that the *only* reason for having multiple dentries is really 
just the output in /proc/pid/fd/, right? Or is there any other reason to 
have separate dentries for these pseudo-files?

It's a bit sad to waste that much memory (and time) on something like 
that. I bet that the dentry setup is a noticeable part of the whole 
sigfd()/timerfd() setup. It's likely also a big part of any memory 
footprint if you have lots of them.

So how about just doing:
 - do a single dentry
 - make a struct file_operations member function that prints out the 
   name of the thing in /proc/pid/fd/, and which *defaults* to just 
   doing the d_path() on the dentry, but special filesystems like this 
   could do something else (like print out a fake inode number from the 
   file-f_private_data information)

There seems to really be no downsides to that approach. No existing 
filesystem will even notice (they'll all have NULL in the new f_op 
member), and it would allow pipes etc to be sped up and use less memory.

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-06 Thread Davide Libenzi
On Tue, 6 Mar 2007, Linus Torvalds wrote:

 
 [ Al Viro added to Cc: as the arbiter of good taste in the VFS layer. He 
   has veto powers even over my proposals ;^]
 
 On Tue, 6 Mar 2007, Davide Libenzi wrote:
  
  I currently have the dentry to carry the name of the file* class it is 
  linked to. I'd prefer to keep it that way, unless there are huge factors 
  against.
 
 I assume that the *only* reason for having multiple dentries is really 
 just the output in /proc/pid/fd/, right? Or is there any other reason to 
 have separate dentries for these pseudo-files?
 
 It's a bit sad to waste that much memory (and time) on something like 
 that. I bet that the dentry setup is a noticeable part of the whole 
 sigfd()/timerfd() setup. It's likely also a big part of any memory 
 footprint if you have lots of them.
 
 So how about just doing:
  - do a single dentry
  - make a struct file_operations member function that prints out the 
name of the thing in /proc/pid/fd/, and which *defaults* to just 
doing the d_path() on the dentry, but special filesystems like this 
could do something else (like print out a fake inode number from the 
file-f_private_data information)
 
 There seems to really be no downsides to that approach. No existing 
 filesystem will even notice (they'll all have NULL in the new f_op 
 member), and it would allow pipes etc to be sped up and use less memory.

I'm OK with everything that avoid code duplication due to those fake 
inodes. The change can be localized inside the existing API, so it doesn't 
really affect me externally.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-06 Thread Linus Torvalds


On Tue, 6 Mar 2007, Davide Libenzi wrote:
 
 I'm OK with everything that avoid code duplication due to those fake 
 inodes. The change can be localized inside the existing API, so it doesn't 
 really affect me externally.

Can you try with the first patch version that doesn't do anything special 
at all, and just uses a single dentry.

Yeah, the dentry name will be identical, and so you would see something 
like 7 - signalfd:signalfd when you do ls -l /proc/pid/fd/ on a 
task that has such a special file descriptor (with no way to tell 
different timerfd's and signalfd's apart), but I think it's better to 
start off simple than to overdesign things.

And trying to tell them apart sounds a bit like overdesign, if only 
because I really don't see why anybody would really *care*. So it's a 
timer for poll/select/epoll - why care about anything else?

If it really really turns out that people care, we know how we can do it. 
We'd hook into proc_fd_link() and we'd allow a per-file mntget/dget that 
we could use to let special filesystems create fake entries on demand. So 
it's not impossible, it's just likely simply not needed.

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-06 Thread Davide Libenzi
On Tue, 6 Mar 2007, Linus Torvalds wrote:

 
 
 On Tue, 6 Mar 2007, Davide Libenzi wrote:
  
  I'm OK with everything that avoid code duplication due to those fake 
  inodes. The change can be localized inside the existing API, so it doesn't 
  really affect me externally.
 
 Can you try with the first patch version that doesn't do anything special 
 at all, and just uses a single dentry.
 
 Yeah, the dentry name will be identical, and so you would see something 
 like 7 - signalfd:signalfd when you do ls -l /proc/pid/fd/ on a 
 task that has such a special file descriptor (with no way to tell 
 different timerfd's and signalfd's apart), but I think it's better to 
 start off simple than to overdesign things.
 
 And trying to tell them apart sounds a bit like overdesign, if only 
 because I really don't see why anybody would really *care*. So it's a 
 timer for poll/select/epoll - why care about anything else?

The code would not change/shrink much with the single dentry. We'd save 
memory, and we'd lose the capability of seeing aino:[CLASS].
Both ways are fine with me.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-06 Thread Eric Dumazet

Linus Torvalds a écrit :


I assume that the *only* reason for having multiple dentries is really 
just the output in /proc/pid/fd/, right? Or is there any other reason to 
have separate dentries for these pseudo-files?


It's a bit sad to waste that much memory (and time) on something like 
that. I bet that the dentry setup is a noticeable part of the whole 
sigfd()/timerfd() setup. It's likely also a big part of any memory 
footprint if you have lots of them.


So how about just doing:
 - do a single dentry
 - make a struct file_operations member function that prints out the 
   name of the thing in /proc/pid/fd/, and which *defaults* to just 
   doing the d_path() on the dentry, but special filesystems like this 
   could do something else (like print out a fake inode number from the 
   file-f_private_data information)


There seems to really be no downsides to that approach. No existing 
filesystem will even notice (they'll all have NULL in the new f_op 
member), and it would allow pipes etc to be sped up and use less memory.




I would definitly *love* saving dentries for pipes (and sockets too), but how 
are you going to get the inode ?


pipes()/sockets() can use read()/write()/rw_verify_area() and thus need 
file-f_path.dentry-d_inode (so each pipe needs a separate dentry)


Are you suggesting adding a new struct file_operations member to get the 
inode ?
Or re-intoducing an inode pointer in struct file ?

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-06 Thread Davide Libenzi
On Wed, 7 Mar 2007, Eric Dumazet wrote:

 I would definitly *love* saving dentries for pipes (and sockets too), but how
 are you going to get the inode ?

I was not planning to touch anything but epoll, signalfd and timerfd 
files.


 pipes()/sockets() can use read()/write()/rw_verify_area() and thus need
 file-f_path.dentry-d_inode (so each pipe needs a separate dentry)

Currently, they use a single inode, and multiple dentries (to give the 
name of the class). But this could be changed to a single dentry like 
Linus was suggesting. I'll wait for Al's reply before doing anything.
Memory saving can be something, on top of the already big one of avoiding 
code duplication.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-06 Thread Eric Dumazet

Eric Dumazet a écrit :


I would definitly *love* saving dentries for pipes (and sockets too), 
but how are you going to get the inode ?


pipes()/sockets() can use read()/write()/rw_verify_area() and thus need 
file-f_path.dentry-d_inode (so each pipe needs a separate dentry)


Are you suggesting adding a new struct file_operations member to get 
the inode ?

Or re-intoducing an inode pointer in struct file ?


Crazy ideas : (some readers are going to kill me)

1) Use the low order bit of f_path.dentry to say : this pointer is not a 
pointer to a dentry but the inode pointer (with the low order bit set to 1)


OR

2) file-f_path.dentry set to NULL for this special files (so that we dont 
need to dput() and cache line ping pong the common dentry each time we 
__fput() a pipe/socket.


Same trick could be used for file-f_path.mnt, because there is a big SMP 
cache line ping/pong to maintain a mnt_count on pipe/sockets mountpoint while 
these file systems cannot be un-mounted)


If dentry is NULL, we get the inode pointer from an overlay of struct 
file_ra_statef_ra; (because for this special files readahead is unused)


This adds some conditional branches of course, but being able to save ram and 
better use cpu caches might be worth them.



-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-05 Thread Davide Libenzi
On Tue, 6 Mar 2007, Avi Kivity wrote:

> >  /* File callbacks that implement the eventpoll file behaviour */
> >  static const struct file_operations eventpoll_fops = {
> > .release= ep_eventpoll_close,
> > @@ -763,15 +767,18 @@
> >  * using the inode number.
> >  */
> > error = -ENOMEM;
> > -   sprintf(name, "[%lu]", inode->i_ino);
> > +   sprintf(name, "[%p]", ep);
> > this.name = name;
> > this.len = strlen(name);
> > -   this.hash = inode->i_ino;
> > +   this.hash = 0;
> > dentry = d_alloc(eventpoll_mnt->mnt_sb->s_root, );
> > if (!dentry)
> > goto eexit_4;
> > dentry->d_op = _dentry_operations;
> > -   d_add(dentry, inode);
> > +   /* Do not publish this dentry inside the global dentry hash table */
> > +   dentry->d_flags &= ~DCACHE_UNHASHED;
> > +   d_instantiate(dentry, inode);
> > +
> >   
> 
> I've used d_alloc_anon() in similar code in kvmfs.  You lose the ep 
> pointer in the name, but as a kernel address it isn't particularly 
> useful to userspace.

That ends up calling two times the find alias for an unhashed dentry, no?
And this linearly scans the inode's dentry list, no?



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-05 Thread Eric Dumazet

Davide Libenzi a écrit :

 * using the inode number.
 */
error = -ENOMEM;
-   sprintf(name, "[%lu]", inode->i_ino);
+   sprintf(name, "[%p]", ep);
this.name = name;
this.len = strlen(name);


Small remark : you can avoid strlen(), since sprintf gives you the value :)

this.len = sprintf(name, "[%p]", ep);

Also, your patch description is not complete : you forgot to say that epoll 
dentries are not anymore hashed into global dcache hashtable :)


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch] epoll use a single inode ...

2007-03-05 Thread Davide Libenzi

Epoll does not keep any private data attached to its inode, so there'd be 
no need to allocate one inode per fd. For epoll, the inode is just a 
placeholder for the file operations and could be shared by all instances.
I'd like to use the same optimization even for the upcoming file-based 
objects, so if you see problems let me know.
One that Al was pointing out was that an fstat(2) over an epoll fd would 
show the same st_ino. IMO that should be fine since an fstat(2) over an 
epoll fd is not something you want to do in any case and expecting 
meaningfull results.



Signed-off-by: Davide Libenzi 


- Davide



eventpoll.c |   36 
1 file changed, 32 insertions(+), 4 deletions(-)



Index: linux-2.6.20.ep2/fs/eventpoll.c
===
--- linux-2.6.20.ep2.orig/fs/eventpoll.c2007-03-04 14:40:01.0 
-0800
+++ linux-2.6.20.ep2/fs/eventpoll.c 2007-03-05 13:03:52.0 -0800
@@ -258,6 +258,7 @@
   int maxevents, long timeout);
 static int eventpollfs_delete_dentry(struct dentry *dentry);
 static struct inode *ep_eventpoll_inode(void);
+static struct inode *ep_create_inode(void);
 static int eventpollfs_get_sb(struct file_system_type *fs_type,
  int flags, const char *dev_name,
  void *data, struct vfsmount *mnt);
@@ -279,6 +280,9 @@
 /* Virtual fs used to allocate inodes for eventpoll files */
 static struct vfsmount *eventpoll_mnt __read_mostly;
 
+/* Placeholder inode for eventpoll fds */
+static struct inode *eventpoll_inode;
+
 /* File callbacks that implement the eventpoll file behaviour */
 static const struct file_operations eventpoll_fops = {
.release= ep_eventpoll_close,
@@ -763,15 +767,18 @@
 * using the inode number.
 */
error = -ENOMEM;
-   sprintf(name, "[%lu]", inode->i_ino);
+   sprintf(name, "[%p]", ep);
this.name = name;
this.len = strlen(name);
-   this.hash = inode->i_ino;
+   this.hash = 0;
dentry = d_alloc(eventpoll_mnt->mnt_sb->s_root, );
if (!dentry)
goto eexit_4;
dentry->d_op = _dentry_operations;
-   d_add(dentry, inode);
+   /* Do not publish this dentry inside the global dentry hash table */
+   dentry->d_flags &= ~DCACHE_UNHASHED;
+   d_instantiate(dentry, inode);
+
file->f_path.mnt = mntget(eventpoll_mnt);
file->f_path.dentry = dentry;
file->f_mapping = inode->i_mapping;
@@ -1555,6 +1562,11 @@
 
 static int eventpollfs_delete_dentry(struct dentry *dentry)
 {
+   /*
+* We faked vfs to believe the dentry was hashed when we created it.
+* Now we restore the flag so that dput() will work correctly.
+*/
+   dentry->d_flags |= DCACHE_UNHASHED;
 
return 1;
 }
@@ -1562,6 +1574,17 @@
 
 static struct inode *ep_eventpoll_inode(void)
 {
+
+   return igrab(eventpoll_inode);
+}
+
+/*
+ * A single inode exist for all eventpoll files. On the contrary of pipes,
+ * eventpoll inodes has no per-instance data associated, so we can avoid
+ * the allocation of multiple of them.
+ */
+static struct inode *ep_create_inode(void)
+{
int error = -ENOMEM;
struct inode *inode = new_inode(eventpoll_mnt->mnt_sb);
 
@@ -1626,10 +1649,14 @@
 
/* Mount the above commented virtual file system */
eventpoll_mnt = kern_mount(_fs_type);
-   error = PTR_ERR(eventpoll_mnt);
if (IS_ERR(eventpoll_mnt))
goto epanic;
 
+   /* Create the single instance of inode for all eventpoll fds */
+   eventpoll_inode = ep_create_inode();
+   if (IS_ERR(eventpoll_inode))
+   goto epanic;
+
DNPRINTK(3, (KERN_INFO "[%p] eventpoll: successfully initialized.\n",
current));
return 0;
@@ -1642,6 +1669,7 @@
 static void __exit eventpoll_exit(void)
 {
/* Undo all operations done inside eventpoll_init() */
+   iput(eventpoll_inode);
unregister_filesystem(_fs_type);
mntput(eventpoll_mnt);
kmem_cache_destroy(pwq_cache);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch] epoll use a single inode ...

2007-03-05 Thread Davide Libenzi

Epoll does not keep any private data attached to its inode, so there'd be 
no need to allocate one inode per fd. For epoll, the inode is just a 
placeholder for the file operations and could be shared by all instances.
I'd like to use the same optimization even for the upcoming file-based 
objects, so if you see problems let me know.
One that Al was pointing out was that an fstat(2) over an epoll fd would 
show the same st_ino. IMO that should be fine since an fstat(2) over an 
epoll fd is not something you want to do in any case and expecting 
meaningfull results.



Signed-off-by: Davide Libenzi davidel@xmailserver.org


- Davide



eventpoll.c |   36 
1 file changed, 32 insertions(+), 4 deletions(-)



Index: linux-2.6.20.ep2/fs/eventpoll.c
===
--- linux-2.6.20.ep2.orig/fs/eventpoll.c2007-03-04 14:40:01.0 
-0800
+++ linux-2.6.20.ep2/fs/eventpoll.c 2007-03-05 13:03:52.0 -0800
@@ -258,6 +258,7 @@
   int maxevents, long timeout);
 static int eventpollfs_delete_dentry(struct dentry *dentry);
 static struct inode *ep_eventpoll_inode(void);
+static struct inode *ep_create_inode(void);
 static int eventpollfs_get_sb(struct file_system_type *fs_type,
  int flags, const char *dev_name,
  void *data, struct vfsmount *mnt);
@@ -279,6 +280,9 @@
 /* Virtual fs used to allocate inodes for eventpoll files */
 static struct vfsmount *eventpoll_mnt __read_mostly;
 
+/* Placeholder inode for eventpoll fds */
+static struct inode *eventpoll_inode;
+
 /* File callbacks that implement the eventpoll file behaviour */
 static const struct file_operations eventpoll_fops = {
.release= ep_eventpoll_close,
@@ -763,15 +767,18 @@
 * using the inode number.
 */
error = -ENOMEM;
-   sprintf(name, [%lu], inode-i_ino);
+   sprintf(name, [%p], ep);
this.name = name;
this.len = strlen(name);
-   this.hash = inode-i_ino;
+   this.hash = 0;
dentry = d_alloc(eventpoll_mnt-mnt_sb-s_root, this);
if (!dentry)
goto eexit_4;
dentry-d_op = eventpollfs_dentry_operations;
-   d_add(dentry, inode);
+   /* Do not publish this dentry inside the global dentry hash table */
+   dentry-d_flags = ~DCACHE_UNHASHED;
+   d_instantiate(dentry, inode);
+
file-f_path.mnt = mntget(eventpoll_mnt);
file-f_path.dentry = dentry;
file-f_mapping = inode-i_mapping;
@@ -1555,6 +1562,11 @@
 
 static int eventpollfs_delete_dentry(struct dentry *dentry)
 {
+   /*
+* We faked vfs to believe the dentry was hashed when we created it.
+* Now we restore the flag so that dput() will work correctly.
+*/
+   dentry-d_flags |= DCACHE_UNHASHED;
 
return 1;
 }
@@ -1562,6 +1574,17 @@
 
 static struct inode *ep_eventpoll_inode(void)
 {
+
+   return igrab(eventpoll_inode);
+}
+
+/*
+ * A single inode exist for all eventpoll files. On the contrary of pipes,
+ * eventpoll inodes has no per-instance data associated, so we can avoid
+ * the allocation of multiple of them.
+ */
+static struct inode *ep_create_inode(void)
+{
int error = -ENOMEM;
struct inode *inode = new_inode(eventpoll_mnt-mnt_sb);
 
@@ -1626,10 +1649,14 @@
 
/* Mount the above commented virtual file system */
eventpoll_mnt = kern_mount(eventpoll_fs_type);
-   error = PTR_ERR(eventpoll_mnt);
if (IS_ERR(eventpoll_mnt))
goto epanic;
 
+   /* Create the single instance of inode for all eventpoll fds */
+   eventpoll_inode = ep_create_inode();
+   if (IS_ERR(eventpoll_inode))
+   goto epanic;
+
DNPRINTK(3, (KERN_INFO [%p] eventpoll: successfully initialized.\n,
current));
return 0;
@@ -1642,6 +1669,7 @@
 static void __exit eventpoll_exit(void)
 {
/* Undo all operations done inside eventpoll_init() */
+   iput(eventpoll_inode);
unregister_filesystem(eventpoll_fs_type);
mntput(eventpoll_mnt);
kmem_cache_destroy(pwq_cache);
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-05 Thread Eric Dumazet

Davide Libenzi a écrit :

 * using the inode number.
 */
error = -ENOMEM;
-   sprintf(name, [%lu], inode-i_ino);
+   sprintf(name, [%p], ep);
this.name = name;
this.len = strlen(name);


Small remark : you can avoid strlen(), since sprintf gives you the value :)

this.len = sprintf(name, [%p], ep);

Also, your patch description is not complete : you forgot to say that epoll 
dentries are not anymore hashed into global dcache hashtable :)


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] epoll use a single inode ...

2007-03-05 Thread Davide Libenzi
On Tue, 6 Mar 2007, Avi Kivity wrote:

   /* File callbacks that implement the eventpoll file behaviour */
   static const struct file_operations eventpoll_fops = {
  .release= ep_eventpoll_close,
  @@ -763,15 +767,18 @@
   * using the inode number.
   */
  error = -ENOMEM;
  -   sprintf(name, [%lu], inode-i_ino);
  +   sprintf(name, [%p], ep);
  this.name = name;
  this.len = strlen(name);
  -   this.hash = inode-i_ino;
  +   this.hash = 0;
  dentry = d_alloc(eventpoll_mnt-mnt_sb-s_root, this);
  if (!dentry)
  goto eexit_4;
  dentry-d_op = eventpollfs_dentry_operations;
  -   d_add(dentry, inode);
  +   /* Do not publish this dentry inside the global dentry hash table */
  +   dentry-d_flags = ~DCACHE_UNHASHED;
  +   d_instantiate(dentry, inode);
  +

 
 I've used d_alloc_anon() in similar code in kvmfs.  You lose the ep 
 pointer in the name, but as a kernel address it isn't particularly 
 useful to userspace.

That ends up calling two times the find alias for an unhashed dentry, no?
And this linearly scans the inode's dentry list, no?



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/