Re: Syslets, Threadlets, generic AIO support, v6

2007-05-31 Thread Ingo Molnar

* Eric Dumazet <[EMAIL PROTECTED]> wrote:

> I tried your bench and found two problems :
> - You scan half of the bitmap
[...]
> Try to close not a 'middle fd', but a really low one (10 for example), 
> and latency is doubled.

that was intentional. I really didn't want to fabricate a worst-case 
result but something more representative: in real apps the bitmap isn't 
fully filled all the time and most of the find-bit sequences are short. 
Hence the two fds, one of which goes from the middle of the range.

> - You incorrectly divide best_delta and worst_delta by LOOPS (5)

ah, indeed, that's a bug - victim of a last minute edit :) Since the 
dividend is constant it doesn't really matter to the validity of the 
relative nature of the slowdown (which is what i was interested in), but 
you are right - i have fixed the download and have redone the numbers. 
Here are the correct results from my box:

 # ./fd-scale-bench 1000000 0
 checking the cache-hot performance of open()-ing 1000000 fds.
 num_fds: 1, best cost: 6.00 us, worst cost: 8.00 us
 num_fds: 2, best cost: 6.00 us, worst cost: 7.00 us
 ...
 num_fds: 31586, best cost: 7.00 us, worst cost: 8.00 us
 num_fds: 39483, best cost: 8.00 us, worst cost: 8.00 us
 num_fds: 49354, best cost: 7.00 us, worst cost: 9.00 us
 num_fds: 61693, best cost: 8.00 us, worst cost: 10.00 us
 num_fds: 77117, best cost: 8.00 us, worst cost: 13.00 us
 num_fds: 96397, best cost: 9.00 us, worst cost: 11.00 us
 num_fds: 120497, best cost: 10.00 us, worst cost: 14.00 us
 num_fds: 150622, best cost: 11.00 us, worst cost: 13.00 us
 num_fds: 188278, best cost: 12.00 us, worst cost: 15.00 us
 num_fds: 235348, best cost: 14.00 us, worst cost: 20.00 us
 num_fds: 294186, best cost: 16.00 us, worst cost: 22.00 us
 num_fds: 367733, best cost: 19.00 us, worst cost: 35.00 us
 num_fds: 459667, best cost: 22.00 us, worst cost: 37.00 us
 num_fds: 574584, best cost: 26.00 us, worst cost: 40.00 us
 num_fds: 718231, best cost: 31.00 us, worst cost: 62.00 us
 num_fds: 897789, best cost: 37.00 us, worst cost: 54.00 us
 num_fds: 1000000, best cost: 41.00 us, worst cost: 59.00 us

and cache-cold:

 # ./fd-scale-bench 1000000 1
 checking the cache-cold performance of open()-ing 1000000 fds.
 num_fds: 1, best cost: 24.00 us, worst cost: 32.00 us
 ...
 num_fds: 49354, best cost: 26.00 us, worst cost: 28.00 us
 num_fds: 61693, best cost: 25.00 us, worst cost: 30.00 us
 num_fds: 77117, best cost: 27.00 us, worst cost: 30.00 us
 num_fds: 96397, best cost: 27.00 us, worst cost: 31.00 us
 num_fds: 120497, best cost: 31.00 us, worst cost: 43.00 us
 num_fds: 150622, best cost: 31.00 us, worst cost: 34.00 us
 num_fds: 188278, best cost: 33.00 us, worst cost: 36.00 us
 num_fds: 235348, best cost: 35.00 us, worst cost: 42.00 us
 num_fds: 294186, best cost: 36.00 us, worst cost: 41.00 us
 num_fds: 367733, best cost: 40.00 us, worst cost: 43.00 us
 num_fds: 459667, best cost: 44.00 us, worst cost: 46.00 us
 num_fds: 574584, best cost: 48.00 us, worst cost: 65.00 us
 num_fds: 718231, best cost: 54.00 us, worst cost: 59.00 us
 num_fds: 897789, best cost: 60.00 us, worst cost: 62.00 us
 num_fds: 1000000, best cost: 65.00 us, worst cost: 68.00 us

> With a corrected bench, cache-cold numbers are > 100 us on this Intel 
> Pentium-M:
> 
> num_fds: 1000000, best cost: 120.00 us, worst cost: 131.00 us
> 
> On an Opteron x86_64 machine, results are better :)
> 
> num_fds: 1000000, best cost: 28.00 us, worst cost: 106.00 us

yeah. I quoted the full range because i was really more interested in 
our current 'limit' range (which is somewhere between 50K and 100K open 
fds) where the scanning cost becomes directly measurable, and in the 
nature of the slowdown.

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-31 Thread Eric Dumazet
On Thu, 31 May 2007 11:02:52 +0200
Ingo Molnar <[EMAIL PROTECTED]> wrote:

> 
> * Ingo Molnar <[EMAIL PROTECTED]> wrote:
> 
> > it's both a flexibility and a speedup thing as well:
> > 
> > flexibility: for libraries to be able to open files and keep them open 
> > comes up regularly. For example currently glibc is quite wasteful in a 
> > number of common networking related functions (Ulrich, please correct 
> > me if i'm wrong), which could be optimized if glibc could just keep a 
> > netlink channel fd open and could poll() it for changes and cache the 
> > results if there are no changes (or something like that).
> > 
> > speedup: i suggested O_ANY 6 years ago as a speedup to Apache - 
> > non-linear fds are cheaper to allocate/map:
> > 
> >   http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg23820.html
> > 
> > (i definitely remember having written code for that too, but i cannot 
> > find that in the archives. hm.) In theory we could avoid _all_ 
> > fd-bitmap overhead as well and use a per-process list/pool of struct 
> > file buffers plus a maximum-fd field as the 'non-linear fd allocator' 
> > (at the price of only deallocating them at process exit time).
> 
> to measure this i've written fd-scale-bench.c:
> 
>http://redhat.com/~mingo/fd-scale-patches/fd-scale-bench.c
> 
> which tests the (cache-hot or cache-cold) cost of open()-ing two fds 
> while there are N other fds already open: one is from the 'middle' of 
> the range, one is from the end of it.
> 
> Let's check our current 'extreme high end' performance with 1 million 
> fds. (which is not realistic right now but there certainly are systems 
> with over a hundred thousand open fds). Results from a fast CPU with 2MB 
> of cache:
> 
>  cache-hot:
> 
>  # ./fd-scale-bench 1000000 0
>  checking the cache-hot performance of open()-ing 1000000 fds.
>  num_fds: 1, best cost: 1.40 us, worst cost: 2.00 us
>  num_fds: 2, best cost: 1.40 us, worst cost: 1.40 us
>  num_fds: 3, best cost: 1.40 us, worst cost: 2.00 us
>  num_fds: 4, best cost: 1.40 us, worst cost: 1.40 us
>  ...
>  num_fds: 77117, best cost: 1.60 us, worst cost: 2.00 us
>  num_fds: 96397, best cost: 2.00 us, worst cost: 2.20 us
>  num_fds: 120497, best cost: 2.20 us, worst cost: 2.40 us
>  num_fds: 150622, best cost: 2.20 us, worst cost: 3.00 us
>  num_fds: 188278, best cost: 2.60 us, worst cost: 3.00 us
>  num_fds: 235348, best cost: 2.80 us, worst cost: 3.80 us
>  num_fds: 294186, best cost: 3.40 us, worst cost: 4.20 us
>  num_fds: 367733, best cost: 4.00 us, worst cost: 5.00 us
>  num_fds: 459667, best cost: 4.60 us, worst cost: 6.00 us
>  num_fds: 574584, best cost: 5.60 us, worst cost: 8.20 us
>  num_fds: 718231, best cost: 6.40 us, worst cost: 10.00 us
>  num_fds: 897789, best cost: 7.60 us, worst cost: 11.80 us
>  num_fds: 1000000, best cost: 8.20 us, worst cost: 9.60 us
> 
>  cache-cold:
> 
>  # ./fd-scale-bench 1000000 1
>  checking the performance of open()-ing 1000000 fds.
>  num_fds: 1, best cost: 4.60 us, worst cost: 7.00 us
>  num_fds: 2, best cost: 5.00 us, worst cost: 6.60 us
>  ...
>  num_fds: 77117, best cost: 5.60 us, worst cost: 7.40 us
>  num_fds: 96397, best cost: 5.60 us, worst cost: 7.40 us
>  num_fds: 120497, best cost: 6.20 us, worst cost: 6.80 us
>  num_fds: 150622, best cost: 6.40 us, worst cost: 7.60 us
>  num_fds: 188278, best cost: 6.80 us, worst cost: 9.20 us
>  num_fds: 235348, best cost: 7.20 us, worst cost: 8.80 us
>  num_fds: 294186, best cost: 8.00 us, worst cost: 9.40 us
>  num_fds: 367733, best cost: 8.80 us, worst cost: 11.60 us
>  num_fds: 459667, best cost: 9.20 us, worst cost: 12.20 us
>  num_fds: 574584, best cost: 10.00 us, worst cost: 12.40 us
>  num_fds: 718231, best cost: 11.00 us, worst cost: 13.40 us
>  num_fds: 897789, best cost: 12.80 us, worst cost: 15.80 us
>  num_fds: 1000000, best cost: 13.60 us, worst cost: 15.40 us
> 
> we are pretty good at the moment: the open() cost starts to increase at 
> around 100K open fds, both in the cache-cold and cache-hot case. (that 
> roughly corresponds to the fd bitmap falling out of the 32K L1 cache) At 
> 1 million open fds in a single process, the fd bitmap has a size of 128K.
> 
> so while it's certainly not 'urgent' to improve this, private fds are an 
> easier target for optimizations in this area, because they don't have the 
> continuity requirement anymore, so the fd bitmap is not a 'forced' 
> property of them.

Your numbers do not match mine (mine were more than two years old, so I redid 
a test before replying).

I tried your bench and found two problems:
- You scan half of the bitmap
- You incorrectly divide best_delta and worst_delta by LOOPS (5)

Try to close not a 'middle fd' but a really low one (10, for example), and 
latency is doubled.

With a corrected bench, cache-cold numbers are > 100 us on this Intel Pentium-M:

num_fds: 1000000, best cost: 120.00 us, worst cost: 131.00 us

On an Opteron x86_64 machine, results are better :)

num_fds: 1000000, best cost: 28.00 us, worst cost: 106.00 us

Re: Syslets, Threadlets, generic AIO support, v6

2007-05-31 Thread Ingo Molnar

* Albert Cahalan <[EMAIL PROTECTED]> wrote:

> Ingo Molnar writes:
> 
> >looking over the list of our new generic APIs (see further below) i
> >think there are three important things that are needed for an API to
> >become widely used:
> >
> > 1) it should solve a real problem (ha ;-), it should be intuitive to
> >humans and it should fit into existing things naturally.
> >
> > 2) it should be ubiquitous. (if it's about IO it should cover block IO,
> >network IO, timers, signals and everything) Even if it might look
> >silly in some of the cases, having complete, utter, no compromises,
> >100% coverage for everything massively helps the uptake of an API,
> >because it allows the user-space coder to pick just one paradigm
> >that is closest to his application and stick to it and only to it.
> >
> > 3) it should be end-to-end supported by glibc.
> 
> 4) At least slightly portable.
> 
> Anything supported by any similar OS is already ahead, even if it 
> isn't the perfect API of our dreams. [...]

it might have been so a few years ago but it's changing slowly but 
surely - BSD is becoming more and more irrelevant. What matters most 
to app writers these days is: "is it in most Linux distros?" - and the key 
to that is upstream kernel support and glibc support. The days of BSD 
(and UNIX) are pretty much numbered. (I'm not against standardizing APIs 
in POSIX of course - the BSDs tend to follow the Linux APIs in that area 
with a few years lag.)

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-31 Thread Jens Axboe
On Thu, May 31 2007, Ingo Molnar wrote:
> 
> * Ingo Molnar <[EMAIL PROTECTED]> wrote:
> 
> > (i definitely remember having written code for that too, but i cannot 
> > find that in the archives. hm.) In theory we could avoid _all_ 
> > fd-bitmap overhead as well and use a per-process list/pool of struct 
> > file buffers plus a maximum-fd field as the 'non-linear fd allocator' 
> > (at the price of only deallocating them at process exit time).
> 
> btw., this also allows mostly-lockless fd allocation, which would 
> probably benefit threaded apps too. (we can just recycle it from a 
> per-CPU list of cached fds for that process)

See also:

http://lkml.org/lkml/2006/6/16/144

which originates from a much simpler patch I did to fix performance
regressions in this area for the SLES10 kernel.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-31 Thread Ingo Molnar

* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> (i definitely remember having written code for that too, but i cannot 
> find that in the archives. hm.) In theory we could avoid _all_ 
> fd-bitmap overhead as well and use a per-process list/pool of struct 
> file buffers plus a maximum-fd field as the 'non-linear fd allocator' 
> (at the price of only deallocating them at process exit time).

btw., this also allows mostly-lockless fd allocation, which would 
probably benefit threaded apps too. (we can just recycle it from a 
per-CPU list of cached fds for that process)

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-31 Thread Ingo Molnar

* Eric Dumazet <[EMAIL PROTECTED]> wrote:

> > speedup: i suggested O_ANY 6 years ago as a speedup to Apache - 
> > non-linear fds are cheaper to allocate/map:
> > 
> >   http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg23820.html
> > 
> > (i definitely remember having written code for that too, but i 
> > cannot find that in the archives. hm.) In theory we could avoid 
> > _all_ fd-bitmap overhead as well and use a per-process list/pool of 
> > struct file buffers plus a maximum-fd field as the 'non-linear fd 
> > allocator' (at the price of only deallocating them at process exit 
> > time).
> 
> Only very few apps need to open more than 100,000 files.

yes. I did not list it as a primary reason for private fds; it's just a 
nice side-effect. As long as the other apps are not hurt, i see no 
problem in improving the >100K open files case.

> As these files are likely sockets, O_ANY is not a solution.

why not? It would be a natural thing to extend sys_socket() with a 
'flags' parameter and pass in O_ANY (along with any other possible fd 
parameter like O_NDELAY, which could be inherited over connect()).
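
A sketch of what that could look like; O_ANY is hypothetical and the 
flags-taking socket() variant below is made up for illustration (later 
kernels took a similar route in spirit, with SOCK_NONBLOCK/SOCK_CLOEXEC 
or'ed into the type argument):

    /* Hypothetical: a socket() variant taking fd flags, as suggested
     * above. Neither socket_flags() nor O_ANY exist. */
    int fd = socket_flags(AF_INET, SOCK_STREAM, 0, O_ANY | O_NDELAY);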

> A trick is to try to keep the first 64 handles free, so that the kernel 
> won't consume too much CPU time and cache in get_unused_fd()
> 
> http://lkml.org/lkml/2005/9/15/307

this is basically a user-space front-end cache to fd allocation - which 
duplicates data needlessly. I don't see any problem with doing this in 
the kernel. (Also, obviously 'first 64 handles' could easily break with 
certain types of apps so glibc cannot do this.)

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-31 Thread Ingo Molnar

* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> it's both a flexibility and a speedup thing as well:
> 
> flexibility: for libraries to be able to open files and keep them open 
> comes up regularly. For example currently glibc is quite wasteful in a 
> number of common networking related functions (Ulrich, please correct 
> me if i'm wrong), which could be optimized if glibc could just keep a 
> netlink channel fd open and could poll() it for changes and cache the 
> results if there are no changes (or something like that).
> 
> speedup: i suggested O_ANY 6 years ago as a speedup to Apache - 
> non-linear fds are cheaper to allocate/map:
> 
>   http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg23820.html
> 
> (i definitely remember having written code for that too, but i cannot 
> find that in the archives. hm.) In theory we could avoid _all_ 
> fd-bitmap overhead as well and use a per-process list/pool of struct 
> file buffers plus a maximum-fd field as the 'non-linear fd allocator' 
> (at the price of only deallocating them at process exit time).

to measure this i've written fd-scale-bench.c:

   http://redhat.com/~mingo/fd-scale-patches/fd-scale-bench.c

which tests the (cache-hot or cache-cold) cost of open()-ing two fds 
while there are N other fds already open: one is from the 'middle' of 
the range, one is from the end of it.
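
The benchmark source itself isn't quoted in the thread; as orientation 
only, a minimal user-space sketch of the measurement described above 
(N fds held open, then timing the open() that refills a hole in the 
middle of the range and one at the end) could look like the following. 
Everything beyond that one-line description is an assumption, not 
Ingo's actual code:

    /* Sketch: fill the fd table, punch a hole in the middle and at the
     * end, and time how long open() takes to refill each one. Raise
     * RLIMIT_NOFILE before running with large counts. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <unistd.h>

    static double now_us(void)
    {
            struct timeval tv;

            gettimeofday(&tv, NULL);
            return tv.tv_sec * 1e6 + tv.tv_usec;
    }

    static double time_reopen(int hole_fd)
    {
            double t0;

            close(hole_fd);                 /* create the hole */
            t0 = now_us();
            open("/dev/null", O_RDONLY);    /* scans up to the hole */
            return now_us() - t0;
    }

    int main(int argc, char **argv)
    {
            int num_fds = argc > 1 ? atoi(argv[1]) : 1000000;
            int last = -1, i;

            for (i = 0; i < num_fds; i++)
                    last = open("/dev/null", O_RDONLY);
            if (last < 0)
                    return 1;       /* hit RLIMIT_NOFILE */
            printf("middle: %.2f us, end: %.2f us\n",
                   time_reopen(last / 2), time_reopen(last));
            return 0;
    }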

Let's check our current 'extreme high end' performance with 1 million 
fds. (which is not realistic right now but there certainly are systems 
with over a hundred thousand open fds). Results from a fast CPU with 2MB 
of cache:

 cache-hot:

 # ./fd-scale-bench 1000000 0
 checking the cache-hot performance of open()-ing 1000000 fds.
 num_fds: 1, best cost: 1.40 us, worst cost: 2.00 us
 num_fds: 2, best cost: 1.40 us, worst cost: 1.40 us
 num_fds: 3, best cost: 1.40 us, worst cost: 2.00 us
 num_fds: 4, best cost: 1.40 us, worst cost: 1.40 us
 ...
 num_fds: 77117, best cost: 1.60 us, worst cost: 2.00 us
 num_fds: 96397, best cost: 2.00 us, worst cost: 2.20 us
 num_fds: 120497, best cost: 2.20 us, worst cost: 2.40 us
 num_fds: 150622, best cost: 2.20 us, worst cost: 3.00 us
 num_fds: 188278, best cost: 2.60 us, worst cost: 3.00 us
 num_fds: 235348, best cost: 2.80 us, worst cost: 3.80 us
 num_fds: 294186, best cost: 3.40 us, worst cost: 4.20 us
 num_fds: 367733, best cost: 4.00 us, worst cost: 5.00 us
 num_fds: 459667, best cost: 4.60 us, worst cost: 6.00 us
 num_fds: 574584, best cost: 5.60 us, worst cost: 8.20 us
 num_fds: 718231, best cost: 6.40 us, worst cost: 10.00 us
 num_fds: 897789, best cost: 7.60 us, worst cost: 11.80 us
 num_fds: 1000000, best cost: 8.20 us, worst cost: 9.60 us

 cache-cold:

 # ./fd-scale-bench 1000000 1
 checking the performance of open()-ing 1000000 fds.
 num_fds: 1, best cost: 4.60 us, worst cost: 7.00 us
 num_fds: 2, best cost: 5.00 us, worst cost: 6.60 us
 ...
 num_fds: 77117, best cost: 5.60 us, worst cost: 7.40 us
 num_fds: 96397, best cost: 5.60 us, worst cost: 7.40 us
 num_fds: 120497, best cost: 6.20 us, worst cost: 6.80 us
 num_fds: 150622, best cost: 6.40 us, worst cost: 7.60 us
 num_fds: 188278, best cost: 6.80 us, worst cost: 9.20 us
 num_fds: 235348, best cost: 7.20 us, worst cost: 8.80 us
 num_fds: 294186, best cost: 8.00 us, worst cost: 9.40 us
 num_fds: 367733, best cost: 8.80 us, worst cost: 11.60 us
 num_fds: 459667, best cost: 9.20 us, worst cost: 12.20 us
 num_fds: 574584, best cost: 10.00 us, worst cost: 12.40 us
 num_fds: 718231, best cost: 11.00 us, worst cost: 13.40 us
 num_fds: 897789, best cost: 12.80 us, worst cost: 15.80 us
 num_fds: 1000000, best cost: 13.60 us, worst cost: 15.40 us

we are pretty good at the moment: the open() cost starts to increase at 
around 100K open fds, both in the cache-cold and cache-hot case. (that 
roughly corresponds to the fd bitmap falling out of the 32K L1 cache) At 
1 million open fds in a single process, the fd bitmap has a size of 128K.
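
(Arithmetic behind the 128K figure: one bit per descriptor means 
1,000,000 fds need 1,000,000/8 bytes, roughly 122K of bitmap, which 
comes out as 128K once the fdtable is sized up to the next power of 
two.)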

so while it's certainly not 'urgent' to improve this, private fds are an 
easier target for optimizations in this area, because they don't have the 
continuity requirement anymore, so the fd bitmap is not a 'forced' 
property of them.

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-31 Thread Albert Cahalan

Ingo Molnar writes:


>looking over the list of our new generic APIs (see further below) i
>think there are three important things that are needed for an API to
>become widely used:
>
> 1) it should solve a real problem (ha ;-), it should be intuitive to
>humans and it should fit into existing things naturally.
>
> 2) it should be ubiquitous. (if it's about IO it should cover block IO,
>network IO, timers, signals and everything) Even if it might look
>silly in some of the cases, having complete, utter, no compromises,
>100% coverage for everything massively helps the uptake of an API,
>because it allows the user-space coder to pick just one paradigm
>that is closest to his application and stick to it and only to it.
>
> 3) it should be end-to-end supported by glibc.


4) At least slightly portable.

Anything supported by any similar OS is already ahead, even if it
isn't the perfect API of our dreams. This means kqueue and doors.

If it's not on any BSD or UNIX, then most app developers won't
touch it. Worse yet, it won't appear in programming books, so even
the Linux-only app programmers won't know about it.

Running ideas by the FreeBSD and OpenSolaris developers wouldn't
be a bad idea. Agreement leads to standardization, which leads to
interfaces getting used.

BTW, wrapper libraries that bury the new API under a layer of
gunk are not helpful. One might as well just use the old API.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-31 Thread Eric Dumazet
On Thu, 31 May 2007 08:13:03 +0200
Ingo Molnar <[EMAIL PROTECTED]> wrote:

> 
> * Linus Torvalds <[EMAIL PROTECTED]> wrote:
> 
> > > I agree. What would be a good interface to allocate fds in such 
> > > area? We don't want to replicate syscalls, so maybe a special new 
> > > dup function?
> > 
> > I'd do it with something like "newfd = dup2(fd, NONLINEAR_FD)" or 
> > similar, and just have NONLINEAR_FD be some magic value (for example, 
> > make it be 0x40000000 - the bit that says "private, nonlinear" in the 
> > first place).
> > 
> > But what's gotten lost in the current discussion is that we probably 
> > don't actually _need_ such a private space. I'm just saying that if 
> > the *choice* is between memory-mapped interfaces and a private 
> > fd-space, we should probably go for the latter. "Everything is a file" 
> > is the UNIX way, after all. But there's little reason to introduce 
> > private fd's otherwise.
> 
> it's both a flexibility and a speedup thing as well:
> 
> flexibility: for libraries to be able to open files and keep them open 
> comes up regularly. For example currently glibc is quite wasteful in a 
> number of common networking related functions (Ulrich, please correct me 
> if i'm wrong), which could be optimized if glibc could just keep a 
> netlink channel fd open and could poll() it for changes and cache the 
> results if there are no changes (or something like that).
> 
> speedup: i suggested O_ANY 6 years ago as a speedup to Apache - 
> non-linear fds are cheaper to allocate/map:
> 
>   http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg23820.html
> 
> (i definitely remember having written code for that too, but i cannot 
> find that in the archives. hm.) In theory we could avoid _all_ fd-bitmap 
> overhead as well and use a per-process list/pool of struct file buffers 
> plus a maximum-fd field as the 'non-linear fd allocator' (at the price 
> of only deallocating them at process exit time).

Only very few apps need to open more than 100,000 files.

As these files are likely sockets, O_ANY is not a solution.

A trick is to try to keep the first 64 handles free, so that the kernel won't
consume too much CPU time and cache in get_unused_fd()

http://lkml.org/lkml/2005/9/15/307

This trick is portable (not Linux-centric).
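
A minimal sketch of that trick, assuming the long-lived descriptors are 
simply moved above the reserved range with F_DUPFD (the helper name is 
made up for illustration):

    /* Move a long-lived fd above the first 64 slots so the kernel's
     * get_unused_fd() scan for short-lived fds stays short. */
    #include <fcntl.h>
    #include <unistd.h>

    static int move_fd_high(int fd)
    {
            /* F_DUPFD returns the lowest free descriptor >= 64 */
            int newfd = fcntl(fd, F_DUPFD, 64);

            if (newfd < 0)
                    return fd;      /* keep the original on failure */
            close(fd);
            return newfd;
    }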

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-31 Thread Ingo Molnar

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> > I agree. What would be a good interface to allocate fds in such 
> > area? We don't want to replicate syscalls, so maybe a special new 
> > dup function?
> 
> I'd do it with something like "newfd = dup2(fd, NONLINEAR_FD)" or 
> similar, and just have NONLINEAR_FD be some magic value (for example, 
make it be 0x40000000 - the bit that says "private, nonlinear" in the 
> first place).
> 
> But what's gotten lost in the current discussion is that we probably 
> don't actually _need_ such a private space. I'm just saying that if 
> the *choice* is between memory-mapped interfaces and a private 
> fd-space, we should probably go for the latter. "Everything is a file" 
> is the UNIX way, after all. But there's little reason to introduce 
> private fd's otherwise.

it's both a flexibility and a speedup thing as well:

flexibility: for libraries to be able to open files and keep them open 
comes up regularly. For example currently glibc is quite wasteful in a 
number of common networking related functions (Ulrich, please correct me 
if i'm wrong), which could be optimized if glibc could just keep a 
netlink channel fd open and could poll() it for changes and cache the 
results if there are no changes (or something like that).

speedup: i suggested O_ANY 6 years ago as a speedup to Apache - 
non-linear fds are cheaper to allocate/map:

  http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg23820.html

(i definitely remember having written code for that too, but i cannot 
find that in the archives. hm.) In theory we could avoid _all_ fd-bitmap 
overhead as well and use a per-process list/pool of struct file buffers 
plus a maximum-fd field as the 'non-linear fd allocator' (at the price 
of only deallocating them at process exit time).
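
No code for that allocator appears in the thread; here is a minimal 
sketch of the idea as described (a slot pool plus a maximum-fd 
watermark, with freed slots recycled through a free list instead of a 
bitmap scan), where every name below is illustrative:

    struct file;                    /* opaque stand-in */

    struct nl_fd_pool {
            struct file **slots;    /* slot i backs non-linear fd i */
            int *free_stack;        /* recycled slot indices */
            int nr_free;
            int max_fd;             /* high-water mark, grows only */
    };

    static int nl_fd_alloc(struct nl_fd_pool *p)
    {
            if (p->nr_free)
                    return p->free_stack[--p->nr_free];  /* O(1) reuse */
            return p->max_fd++;     /* O(1) bump, no bitmap to scan */
    }

    static void nl_fd_free(struct nl_fd_pool *p, int fd)
    {
            p->slots[fd] = NULL;    /* slot memory stays until exit */
            p->free_stack[p->nr_free++] = fd;
    }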

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread William Lee Irwin III
On Wed, May 30, 2007 at 01:00:30PM -0700, Linus Torvalds wrote:
>> Which *could* be something as simple as saying "bit 30 in the file 
>> descriptor specifies a separate fd space" along with some flags to make 
>> open and friends return those separate fd's. That makes them useless for 
>> "select()" (which assumes a flat address space, of course), but would be 
>> useful for just about anything else.

On Wed, May 30, 2007 at 05:27:15PM -0500, Matt Mackall wrote:
> Or.. we could have a method of swizzling in and out an entire FD
> array, similar to UML's trick for swizzling MMs.

I like that notion even better than randomization. I think it should
happen. I like SKAS, too, of course.


-- wli
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Matt Mackall
On Wed, May 30, 2007 at 01:00:30PM -0700, Linus Torvalds wrote:
> Which *could* be something as simple as saying "bit 30 in the file 
> descriptor specifies a separate fd space" along with some flags to make 
> open and friends return those separate fd's. That makes them useless for 
> "select()" (which assumes a flat address space, of course), but would be 
> useful for just about anything else.

Or.. we could have a method of swizzling in and out an entire FD
array, similar to UML's trick for swizzling MMs.

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread William Lee Irwin III
On Wed, May 30, 2007 at 02:27:52PM -0700, Linus Torvalds wrote:
> Well, don't think of it as a special case at all: think of bit 30 as a 
> "the user asked for a non-linear fd".
> In fact, to make it effective, I'd suggest literally scrambling the low 
> bits (using, for example, some silly per-boot xor value to to actually 
> generate the "true" index - the equivalent of a really stupid randomizer). 
> That way you'd have the legacy "linear" space, and a separate "non-linear 
> space" where people simply *cannot* make assumptions about contiguous fd 
> allocations. There's no special case there - it's just an extension which 
> explicitly allows us to say "if you do that, your fd's won't be allocated 
> the traditional way any more, but you *can* mix the traditional and the 
> non-linear allocation".

One could always stuff a seed or per-cpu seeds in the files_struct and
use a PRNG. The only trick would be cacheline bounces and/or space
consumption of seeds. Another possibility would be bitreversed
contiguity or otherwise a bit permutation of some contiguous range,
modulo (of course) the high bit used to tag the randomized range.
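
A minimal sketch of the per-boot xor variant Linus describes, with bit 
30 tagging the non-linear space; the seed name is illustrative:

    #define FD_NONLINEAR 0x40000000u        /* bit 30: non-linear fd */

    /* picked once per boot, restricted to the low 30 bits */
    static unsigned int fd_xor_seed;

    static inline int index_to_fd(unsigned int idx)
    {
            return (int)((idx ^ fd_xor_seed) | FD_NONLINEAR);
    }

    static inline unsigned int fd_to_index(int fd)
    {
            return ((unsigned int)fd & ~FD_NONLINEAR) ^ fd_xor_seed;
    }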

With "truly" random/sparse fd numbers it may be meaningful to use a
different data structure from a bitmap to track them in-kernel, though
xor and other easily-computed mappings to/from contiguous ranges won't
need such in earnest.


-- wli
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread David M. Lloyd
On Wed, 30 May 2007 14:27:52 -0700 (PDT)
Linus Torvalds <[EMAIL PROTECTED]> wrote:

> Well, don't think of it as a special case at all: think of bit 30 as
> a "the user asked for a non-linear fd".

If the sole point is to protect an fd from being closed or operated on
outside of a certain context, why not just provide the ability to
"protect" an fd to prevent its use.  Maybe a pair of syscalls like
"fdprotect" and "fdunprotect" that take an fd and an integer key.
Protected fds would return EBADF or something if accessed.  The same
integer key must be provided to fdunprotect in order to gain access
to it again.  Then glibc or valgrind or whatever would just unprotect
the fd before operating on it.
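
A hedged usage sketch of the proposed (non-existent) pair:

    fdprotect(fd, key);     /* from here, access to fd fails with EBADF */
    /* ... code that blindly close()s every fd can no longer hit it ... */
    fdunprotect(fd, key);   /* the same key restores normal access */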

- DML
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Eric Dumazet

Davide Libenzi wrote:

> On Wed, 30 May 2007, Linus Torvalds wrote:
> 
> > > And then the semantics: should these descriptors show up in
> > > /proc/self/fd?  Are there separate directories for each namespace?  Do
> > > they count against the rlimit?
> > 
> > Oh, absolutely. They'd be real fd's in every way. People could use them 
> > 100% equivalently (and concurrently) with the traditional ones. The whole, 
> > and the _only_ point, would be that it breaks the legacy guarantees of a 
> > dense fd space.
> > 
> > Most apps don't actually *need* that dense fd space in any case. But by 
> > defaulting to it, we wouldn't break those (few) apps that actually depend 
> > on it.
> 
> I agree. What would be a good interface to allocate fds in such area? We 
> don't want to replicate syscalls, so maybe a special new dup function?

If the deal is to be able to get faster open()/socket()/pipe()/... calls by 
not finding the first 0 bit in a huge bitmap, a better way would be to have a 
flag in struct task, reset to 0 at exec time.

A new syscall would say: "This process is OK to receive *random* fds."




Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Davide Libenzi
On Wed, 30 May 2007, Ulrich Drepper wrote:

> You also have to be aware that open() is just one piece of the puzzle.
> What about socket()?  I've cursed this interface many times before and
> now it's biting you: there is no parameter to pass a flag.  What about
> transferring file descriptors via Unix domain sockets?  How can I decide
> the transferred descriptor should be in the private namespace?

Well, we can't just replicate/change every system call that creates a file 
descriptor. So I'm for something like:

int sys_fdup(int fd, int flags);

So you basically create your fds with their native/existing system calls, 
and then you dup/move them into the preferred fd space.
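
Usage might then look like this (fdup() and the flag name are hypothetical;
this cannot run against today's kernels, it only illustrates the idea):

#include <sys/socket.h>
#include <unistd.h>

#define FDUP_NONLINEAR	0x1	/* illustrative flag name */

int fdup(int fd, int flags);	/* proposed; does not exist today */

int main(void)
{
	int sock = socket(AF_INET, SOCK_STREAM, 0);	/* legacy, dense fd */
	int priv = fdup(sock, FDUP_NONLINEAR);		/* non-dense alias */

	close(sock);		/* keep only the private descriptor */
	/* ... library uses priv, never touching the dense space ... */
	close(priv);
	return 0;
}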



- Davide




Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Linus Torvalds


On Wed, 30 May 2007, Davide Libenzi wrote:
> 
> I agree. What would be a good interface to allocate fds in such area? We 
> don't want to replicate syscalls, so maybe a special new dup function?

I'd do it with something like "newfd = dup2(fd, NONLINEAR_FD)" or similar, 
and just have NONLINEAR_FD be some magic value (for example, make it be 
0x40000000 - the bit that says "private, nonlinear" in the first place).
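
Under that scheme the call site could be as small as this (NONLINEAR_FD is
the hypothetical magic value; on today's kernels dup2() would of course just
treat it as an out-of-range fd number):

#include <fcntl.h>
#include <unistd.h>

#define NONLINEAR_FD	0x40000000	/* hypothetical "private, nonlinear" tag */

/* Move an already-open fd into the proposed non-linear space. */
static int make_nonlinear(int fd)
{
	int priv = dup2(fd, NONLINEAR_FD);	/* kernel picks a scrambled slot */

	if (priv >= 0)
		close(fd);			/* drop the dense alias */
	return priv;
}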

But what's gotten lost in the current discussion is that we probably don't 
actually _need_ such a private space. I'm just saying that if the *choice* 
is between memory-mapped interfaces and a private fd-space, we should 
probably go for the latter. "Everything is a file" is the UNIX way, after 
all. But there's little reason to introduce private fd's otherwise.

Linus


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Eric Dumazet

Linus Torvalds wrote:

> On Wed, 30 May 2007, Eric Dumazet wrote:
> > >
> > > No, Davide, the problem is that some applications depend on getting
> > > _specific_ file descriptors.
> >
> > Fix the application, and not add kernel bloat?
>
> No. The application is _correct_. It's how file descriptors are defined to
> work.

> > Then you can also exclude multi-threading, since a thread (even not inside
> > glibc) can also use socket()/pipe()/open()/whatever and take the zero file
> > descriptor as well.
>
> Totally different. That's an application internal issue. It does *not*
> mean that we can break existing standards.
>
> > The only hardcoded things in Unix are fds 0, 1 and 2.
>
> Wrong. I already gave an example of real code that just didn't bother to
> keep track of which fd's it had open, and closed them all. Partly, in
> fact, because you can't even _know_ which fd's you have open when somebody
> else just execve's you.

If someone really cares, /proc/self/fd can help. But one shouldn't care at all.

Among the things a process can do before exec()ing another process, file
descriptors outside of 0, 1 and 2 are the most obvious, but you also have
alarm(), or stupid rlimits.

> You can call it buggy, but the fact is, if you do, you're SIMPLY WRONG.
>
> You cannot just change years and years of coding practice, and standard
> documentations. The behaviour of file descriptors is a fact. Ignoring that
> fact because you don't like it is naïve and simply not realistic.

I want to change nothing. The current situation is fine and well documented,
thank you.

If a program does "for (i = 0; i < NR_OPEN; i++) close(i);", this
*will*/*should* work as intended: close all file descriptors from 0 to
NR_OPEN. Big deal.

But you won't find this in a program:

	FILE *fp = fopen("somefile", "r");
	for (i = 0; i < NR_OPEN; i++)
		close(i);
	while (fgets(buff, sizeof(buff), fp)) {
	}

You and/or others want to add fd namespaces and other hacks.

I saw suspicious examples in this thread; I am still waiting for a real one
that justifies all this stuff.

After file descriptor separation, I guess we'll need memory space separation
as well, signal separation (SIGALRM comes to mind), uid/gid separation, cpu
time separation, and so on... setrlimit() layered for every shared lib.



Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Jeremy Fitzhardinge
Linus Torvalds wrote:
> Side note: it might not even be a "close-on-exec by default" thing: it 
> might well be *always* close-on-exec.
>
> That COE is pretty horrid to do, we need to scan a bitmap of those things 
> on each exec. So it might be totally sensible to just declare that the 
> non-linear fd's would simply always be "local", and never bleed across an 
> execve.

Hm, I wouldn't limit the mechanism prematurely.  Using Valgrind as an
example of an alternate user of this mechanism, it would be useful to
use a pipe to transmit out-of-band information from an exec-er to an
exec-ee process.  At the moment there's a lot of mucking around with
execve() to transmit enough information from the parent valgrind to its
successor.

J


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Linus Torvalds


On Wed, 30 May 2007, Linus Torvalds wrote:
> 
> Sure. I think there are things we can do (like make the non-linear fd's 
> appear somewhere else, and make them close-on-exec by default etc).

Side note: it might not even be a "close-on-exec by default" thing: it 
might well be *always* close-on-exec.

That COE is pretty horrid to do, we need to scan a bitmap of those things 
on each exec. So it might be totally sensible to just declare that the 
non-linear fd's would simply always be "local", and never bleed across an 
execve.

Linus


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Ulrich Drepper

Linus Torvalds wrote:
> Well, don't think of it as a special case at all: think of bit 30 as a 
> "the user asked for a non-linear fd".

This sounds easy but doesn't really solve all the issues.  Let me repeat
your example and the solution currently in use:

problem: application wants to close all file descriptors except a select
few, cleaning up what is currently open.  It doesn't know all the
descriptors that are open.  Maybe all this in preparation for an exec call.

Today the best method to do this is to readdir() /proc/self/fd and
exclude the descriptors on the whitelist.
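
That idiom, roughly (a sketch with error handling elided; note it suffers
exactly the problem described next -- it cannot tell library-owned
descriptors from application ones):

#include <dirent.h>
#include <stdlib.h>
#include <unistd.h>

/* Close every open descriptor except those in keep[]. */
static void close_all_except(const int *keep, size_t nkeep)
{
	DIR *d = opendir("/proc/self/fd");
	struct dirent *e;

	if (!d)
		return;
	while ((e = readdir(d)) != NULL) {
		size_t i;
		int fd;

		if (e->d_name[0] == '.')
			continue;		/* "." and ".." */
		fd = atoi(e->d_name);
		if (fd == dirfd(d))
			continue;		/* not the directory itself */
		for (i = 0; i < nkeep; i++)
			if (keep[i] == fd)
				break;
		if (i == nkeep)
			close(fd);
	}
	closedir(d);
}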

If the special, non-sequential descriptors are also listed in that
directory the runtimes still cannot use them since they are visible.

If you go ahead with this, then at the very least add a flag which
causes the descriptor to not show up in /proc/*/fd.


You also have to be aware that open() is just one piece of the puzzle.
What about socket()?  I've cursed this interface many times before and
now it's biting you: there is no parameter to pass a flag.  What about
transferring file descriptors via Unix domain sockets?  How can I decide
the transferred descriptor should be in the private namespace?

There are likely many many more problems and cornercases like this.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Davide Libenzi
On Wed, 30 May 2007, Linus Torvalds wrote:

> > And then the semantics: should these descriptors show up in
> > /proc/self/fd?  Are there separate directories for each namespace?  Do
> > they count against the rlimit?
> 
> Oh, absolutely. They'd be real fd's in every way. People could use them 
> 100% equivalently (and concurrently) with the traditional ones. The whole, 
> and the _only_ point, would be that it breaks the legacy guarantees of a 
> dense fd space.
> 
> Most apps don't actually *need* that dense fd space in any case. But by 
> defaulting to it, we wouldn't break those (few) apps that actually depend 
> on it.

I agree. What would be a good interface to allocate fds in such area? We 
don't want to replicate syscalls, so maybe a special new dup function?



- Davide




Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Linus Torvalds


On Wed, 30 May 2007, Jeremy Fitzhardinge wrote:
> 
> Some programs - legitimately, I think - scan /proc/self/fd to close
> everything.  The question is whether the glibc-private fds should appear
> there.  And something like a "close-on-fork" flag might be useful,
> though I guess glibc can keep track of its own fds closely enough to not
> need something like that.

Sure. I think there are things we can do (like make the non-linear fd's 
appear somewhere else, and make them close-on-exec by default etc).

And it's not like it's necessarily the only way to do things at all. 

I just threw it out as a possible solution - and one that is almost 
certainly *superior* to trying to work around the fd thing with some 
shared memory area which has tons of much more serious problems of its own 
(*).

Linus

(*) Ranging from: specialized-only interfaces, inability to pass it 
around, lack of any abstraction interfaces, and almost impossible to 
debug. The security implications of kernel and user space sharing 
read-write access to some shared area are also legion!


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Davide Libenzi
On Wed, 30 May 2007, Ulrich Drepper wrote:

> 
> Linus Torvalds wrote:
> > for (i = 0; i < NR_OPEN; i++)
> > close(i);
> > 
> > to clean up all file descriptors before doing something new. And yes, I 
> > think it was bash that used to *literally* do something like that a long 
> > time ago.
> 
> Indeed.  It was not only bash, though, I fixed probably a dozen
> applications.  But even the new and better solution (readdir of
> /proc/self/fd) does not prevent the problem of closing descriptors the
> system might still need and the application doesn't know about.

Please, do not drop me out of the Cc list. If you have a valid point, you 
should be able to carry it forward regardless, no?



- Davide




Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Jeremy Fitzhardinge
Ulrich Drepper wrote:
> I don't like special cases.  For me things better come in quantities 0,
> 1, and unlimited (well, reasonably high limit).  Otherwise, who gets to
> use that special namespace?  The C library is not the only body of code
> which would want to use descriptors.

Valgrind could certainly make use of it.  It currently reserves a set of
fds "high enough", and tries hard to hide them from apps, but
/proc/self/fd makes it intractable in general (there was only so much
simulation I was willing to do in Valgrind).

J


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Jeremy Fitzhardinge
Linus Torvalds wrote:
> Which *could* be something as simple as saying "bit 30 in the file 
> descriptor specifies a separate fd space" along with some flags to make 
> open and friends return those separate fd's. That makes them useless for 
> "select()" (which assumes a flat address space, of course), but would be 
> useful for just about anything else.
>   

Some programs - legitimately, I think - scan /proc/self/fd to close
everything.  The question is whether the glibc-private fds should appear
there.  And something like a "close-on-fork" flag might be useful,
though I guess glibc can keep track of its own fds closely enough to not
need something like that.

J


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Davide Libenzi
On Wed, 30 May 2007, Eric Dumazet wrote:

> > So library routines *must not* open file descriptors in the normal space.
> > 
> > (The same is true of real applications doing the equivalent of
> > 
> > for (i = 0; i < NR_OPEN; i++)
> > close(i);
> 
> Quite buggy IMHO

Looking at it now, I'd agree (although I think I have that somewhere in my 
old code too). Consider, though, that such code also appears in 
reference books like Richard Stevens' "UNIX Network Programming".



- Davide




Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Linus Torvalds


On Wed, 30 May 2007, Ulrich Drepper wrote:
> 
> I don't like special cases.  For me things better come in quantities 0,
> 1, and unlimited (well, reasonably high limit).  Otherwise, who gets to
> use that special namespace?  The C library is not the only body of code
> which would want to use descriptors.

Well, don't think of it as a special case at all: think of bit 30 as a 
"the user asked for a non-linear fd".

In fact, to make it effective, I'd suggest literally scrambling the low 
bits (using, for example, some silly per-boot xor value to actually 
generate the "true" index - the equivalent of a really stupid randomizer). 

That way you'd have the legacy "linear" space, and a separate "non-linear 
space" where people simply *cannot* make assumptions about contiguous fd 
allocations. There's no special case there - it's just an extension which 
explicitly allows us to say "if you do that, your fd's won't be allocated 
the traditional way any more, but you *can* mix the traditional and the 
non-linear allocation".

> And then the semantics: should these descriptors show up in
> /proc/self/fd?  Are there separate directories for each namespace?  Do
> they count against the rlimit?

Oh, absolutely. They'd be real fd's in every way. People could use them 
100% equivalently (and concurrently) with the traditional ones. The whole, 
and the _only_ point, would be that it breaks the legacy guarantees of a 
dense fd space.

Most apps don't actually *need* that dense fd space in any case. But by 
defaulting to it, we wouldn't break those (few) apps that actually depend 
on it.

Linus


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Ulrich Drepper

Linus Torvalds wrote:
>   for (i = 0; i < NR_OPEN; i++)
>   close(i);
> 
> to clean up all file descriptors before doing something new. And yes, I 
> think it was bash that used to *literally* do something like that a long 
> time ago.

Indeed.  It was not only bash, though, I fixed probably a dozen
applications.  But even the new and better solution (readdir of
/proc/self/fd) does not prevent the problem of closing descriptors the
system might still need and the application doesn't know about.


> Which *could* be something as simple as saying "bit 30 in the file 
> descriptor specifies a separate fd space" along with some flags to make 
> open and friends return those separate fd's.

I don't like special cases.  For me things better come in quantities 0,
1, and unlimited (well, reasonably high limit).  Otherwise, who gets to
use that special namespace?  The C library is not the only body of code
which would want to use descriptors.

And then the semantics: should these descriptors show up in
/proc/self/fd?  Are there separate directories for each namespace?  Do
they count against the rlimit?

This seems to me like a shot from the hip without thinking about other
possibilities.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Eric Dumazet

Linus Torvalds wrote:

> On Wed, 30 May 2007, Eric Dumazet wrote:
> >
> > So yes, reimplementing sendfile() should help find the last splice() bugs,
> > and as a bonus it could add non-blocking disk io (O_NONBLOCK on input
> > file -> socket)
>
> Well, to get those kinds of advantages, you'd have to use splice directly, 
> since sendfile() hasn't supported nonblocking disk IO, and the interface 
> doesn't really allow for it.

The sendfile() interface doesn't allow it, but if you open("somediskfile",
O_RDONLY | O_NONBLOCK), then a splice()-based sendfile() can perform a
non-blocking disk io (while starting an io with readahead).

I actually use this trick myself :)

(splice(disk -> pipe, NONBLOCK), splice(pipe -> worker))

non-blocking disk io + zero copy :)
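
A rough sketch of that pattern (assuming from_fd was opened with O_NONBLOCK,
to_fd is the destination socket, and a glibc that wraps splice(); error
handling abbreviated):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/*
 * Move up to len bytes from a (non-blocking) disk fd to a socket through
 * a pipe, without copying the data through user space.
 */
static ssize_t splice_sendfile(int from_fd, int to_fd, size_t len)
{
	ssize_t in, out = -1;
	int p[2];

	if (pipe(p) < 0)
		return -1;

	/* disk -> pipe: can fail with EAGAIN while readahead is in flight */
	in = splice(from_fd, NULL, p[1], NULL, len, SPLICE_F_NONBLOCK);
	if (in > 0)
		out = splice(p[0], NULL, to_fd, NULL, in, SPLICE_F_MORE);

	close(p[0]);
	close(p[1]);
	return out;
}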



> In fact, since nonblocking accesses require also some *polling* method, 
> and we don't have that for files, I suspect the best option for those 
> things is to simply mix AIO and splice(). AIO tends to be the right thing 
> for disk waits (read: short, often cached), and if we can improve AIO 
> performance for the cached accesses (which is exactly what the threadlets 
> should hopefully allow us to do), I would seriously suggest going that 
> route.
>
> But the pure "use splice to _implement_ sendfile()" thing is worth doing 
> for all the other reasons, even if nonblocking file access is not likely 
> one of them.
>
> Linus






Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Linus Torvalds


On Wed, 30 May 2007, Eric Dumazet wrote:
> > 
> > No, Davide, the problem is that some applications depend on getting
> > _specific_ file descriptors.
> 
> Fix the application, and not add kernel bloat?

No. The application is _correct_. It's how file descriptors are defined to 
work. 

> Then you can also exclude multi-threading, since a thread (even not inside
> glibc) can also use socket()/pipe()/open()/whatever and take the zero file
> descriptor as well.

Totally different. That's an application internal issue. It does *not* 
mean that we can break existing standards.

> The only hardcoded things in Unix are fds 0, 1 and 2.

Wrong. I already gave an example of real code that just didn't bother to 
keep track of which fd's it had open, and closed them all. Partly, in 
fact, because you can't even _know_ which fd's you have open when somebody 
else just execve's you.

You can call it buggy, but the fact is, if you do, you're SIMPLY WRONG. 

You cannot just change years and years of coding practice, and standard 
documentations. The behaviour of file descriptors is a fact. Ignoring that 
fact because you don't like it is naïve and simply not realistic.

Linus

Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Eric Dumazet

Linus Torvalds wrote:

> On Wed, 30 May 2007, Davide Libenzi wrote:
> > Here I think we are forgetting that glibc is userspace and there's no 
> > separation between the application code and glibc code. An application 
> > linking to glibc can break glibc in a thousand ways, independently from fds 
> > or not fds. Like complaining that glibc is broken because printf() 
> > suddenly does not work anymore ;)
>
> No, Davide, the problem is that some applications depend on getting 
> _specific_ file descriptors.

Fix the application, and not add kernel bloat?

> For example, if you do
>
> 	close(0);
> 	.. something else ..
> 	if (open("myfile", O_RDONLY) < 0)
> 		exit(1);
>
> you can (and should) depend on the open returning zero.

Then you can also exclude multi-threading, since a thread (even not inside 
glibc) can also use socket()/pipe()/open()/whatever and take the zero file 
descriptor as well.

Frankly I don't buy this fd namespace stuff.

The only hardcoded things in Unix are fds 0, 1 and 2.
People usually take care of these, or should use a Microsoft OS.

POSIX mandates that open() returns the lowest available fd.
But this obviously works only if you don't have another thread messing with 
fds, or if you don't call a library function that opens a file.

That's all.

> So library routines *must not* open file descriptors in the normal space.
>
> (The same is true of real applications doing the equivalent of
>
> 	for (i = 0; i < NR_OPEN; i++)
> 		close(i);

Quite buggy, IMHO.

This hack was there to avoid bugs coming from ancestor applications 
forking/execing a shell, at a time when one process could not open more 
than 20 files (AT&T Unix, 21 years ago).

Unix has fcntl(fd, F_SETFD, FD_CLOEXEC). A library should use this to make 
sure an fd is not propagated across exec().
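
i.e. something as small as this (trivial sketch):

#include <fcntl.h>

/* Mark a library-internal descriptor so it is closed automatically
 * across execve() instead of leaking into the new program. */
static int lib_hide_fd(int fd)
{
	return fcntl(fd, F_SETFD, FD_CLOEXEC);
}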




> to clean up all file descriptors before doing something new. And yes, I 
> think it was bash that used to *literally* do something like that a long 
> time ago.
>
> Another example of the same thing: people open file descriptors and know 
> that they'll be "dense" in the result, and then use "select()" on them.

poll() is nice. Even AT&T Unix had it 21 years ago :)

> So it's true that file descriptors can't be used randomly by the standard 
> libraries - they'd need to have some kind of separate "private space".
>
> Which *could* be something as simple as saying "bit 30 in the file 
> descriptor specifies a separate fd space" along with some flags to make 
> open and friends return those separate fd's. That makes them useless for 
> "select()" (which assumes a flat address space, of course), but would be 
> useful for just about anything else.

Please don't do that. Second-class fds.

Then what about having ten different shared libraries? Third-class fds?




Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Linus Torvalds


On Wed, 30 May 2007, Eric Dumazet wrote:
> 
> So yes, reimplementing sendfile() should help find the last splice() bugs, and as
> a bonus it could add non-blocking disk io (O_NONBLOCK on input file ->
> socket)

Well, to get those kinds of advantages, you'd have to use splice directly, 
since sendfile() hasn't supported nonblocking disk IO, and the interface 
doesn't really allow for it.

In fact, since nonblocking accesses require also some *polling* method, 
and we don't have that for files, I suspect the best option for those 
things is to simply mix AIO and splice(). AIO tends to be the right thing 
for disk waits (read: short, often cached), and if we can improve AIO 
performance for the cached accesses (which is exactly what the threadlets 
should hopefully allow us to do), I would seriously suggest going that 
route.

But the pure "use splice to _implement_ sendfile()" thing is worth doing 
for all the other reasons, even if nonblocking file access is not likely 
one of them.

Linus


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Davide Libenzi
On Wed, 30 May 2007, Linus Torvalds wrote:

> On Wed, 30 May 2007, Davide Libenzi wrote:
> > 
> > Here I think we are forgetting that glibc is userspace and there's no 
> > separation between the application code and glibc code. An application 
> > linking to glibc can break glibc in a thousand ways, independently from fds 
> > or not fds. Like complaining that glibc is broken because printf() 
> > suddenly does not work anymore ;)
> 
> No, Davide, the problem is that some applications depend on getting 
> _specific_ file descriptors.
> 
> For example, if you do
> 
>   close(0);
>   .. something else ..
>   if (open("myfile", O_RDONLY) < 0)
>   exit(1);
> 
> you can (and should) depend on the open returning zero.
> 
> So library routines *must not* open file descriptors in the normal space.
> 
> (The same is true of real applications doing the equivalent of
> 
>   for (i = 0; i < NR_OPEN; i++)
>   close(i);
> 
> to clean up all file descriptors before doing something new. And yes, I 
> think it was bash that used to *literally* do something like that a long 
> time ago.

Right. I misunderstood Uli and Ingo. I thought it was like trying to 
protect glibc from intentional application mis-behaviour.



> Another example of the same thing: people open file descriptors and know 
> that they'll be "dense" in the result, and then use "select()" on them.
> 
> So it's true that file descriptors can't be used randomly by the standard 
> libraries - they'd need to have some kind of separate "private space".
> 
> Which *could* be something as simple as saying "bit 30 in the file 
> descriptor specifies a separate fd space" along with some flags to make 
> open and friends return those separate fd's. That makes them useless for 
> "select()" (which assumes a flat address space, of course), but would be 
> useful for just about anything else.

I think it can be solved in a few ways. Yours or Ingo's (or something 
else) can work to satisfy the above "legacy" fd space expectations.



- Davide




Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Eric Dumazet

Linus Torvalds wrote:

> On Wed, 30 May 2007, Mark Lord wrote:
> >
> > I wonder how useful it would be to reimplement sendfile()
> > using splice(), either in glibc or inside the kernel itself?
>
> I'd like that, if only because right now we have two separate paths that 
> kind of do the same thing, and splice really is the only one that is 
> generic.
>
> I thought Jens even had some experimental patches for it. It might be 
> worth to "just do it" - there's some internal overhead, but on the other 
> hand, it's also likely the best way to make sure any issues get sorted 
> out.

Last time I played with splice(), I found a bug in the readahead logic, most 
probably because nobody but me had tried it before.

(corrected by Fengguang Wu in commit 9ae9d68cbf3fe0ec17c17c9ecaa2188ffb854a66)

So yes, reimplementing sendfile() should help find the last splice() bugs, and 
as a bonus it could add non-blocking disk io (O_NONBLOCK on input file -> socket).





Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Linus Torvalds


On Wed, 30 May 2007, Davide Libenzi wrote:
> 
> Here I think we are forgetting that glibc is userspace and there's no 
> separation between the application code and glibc code. An application 
> linking to glibc can break glibc in a thousand ways, independently from fds 
> or not fds. Like complaining that glibc is broken because printf() 
> suddenly does not work anymore ;)

No, Davide, the problem is that some applications depend on getting 
_specific_ file descriptors.

For example, if you do

close(0);
.. something else ..
if (open("myfile", O_RDONLY) < 0)
exit(1);

you can (and should) depend on the open returning zero.

So library routines *must not* open file descriptors in the normal space.

(The same is true of real applications doing the equivalent of

for (i = 0; i < NR_OPEN; i++)
close(i);

to clean up all file descriptors before doing something new. And yes, I 
think it was bash that used to *literally* do something like that a long 
time ago.

Another example of the same thing: people open file descriptors and know 
that they'll be "dense" in the result, and then use "select()" on them.

So it's true that file descriptors can't be used randomly by the standard 
libraries - they'd need to have some kind of separate "private space".

Which *could* be something as simple as saying "bit 30 in the file 
descriptor specifies a separate fd space" along with some flags to make 
open and friends return those separate fd's. That makes them useless for 
"select()" (which assumes a flat address space, of course), but would be 
useful for just about anything else.

Linus


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Ulrich Drepper

Davide Libenzi wrote:
> An application 
> linking to glibc can break glibc in a thousand ways, independently from fds 
> or not fds.

It's not (only/mainly) about breaking.  File descriptors are a resource
which has to be used under the control of the program.  The runtime
cannot just steal some for itself.  This indirectly leads to breaking
code.  We've seen this many times and I keep repeating the same issue
over and over again: why do we have MAP_ANON instead of keeping a file
descriptor with /dev/null open?  Why is mmap made more complicated by
allowing the file descriptor to be closed after the mmap() call is done?

Take a look at a process running your favorite shell.  Ever wonder why
there is this stray file descriptor with a high number?

$ cat /proc/3754/cmdline
bash
$ ll /proc/3754/fd/
total 0
lrwx-- 1 drepper drepper 64 2007-05-30 12:50 0 -> /dev/pts/19
lrwx-- 1 drepper drepper 64 2007-05-30 12:50 1 -> /dev/pts/19
lrwx-- 1 drepper drepper 64 2007-05-30 12:49 2 -> /dev/pts/19
lrwx-- 1 drepper drepper 64 2007-05-30 12:50 255 -> /dev/pts/19

File descriptors must be requested explicitly and cannot be implicitly
consumed.

All that and the other problem I mentioned earlier today about auxiliary
data.  File descriptors are not the ideal interface.  Elegant: yes,
ideal: no.  From physics and math you might have learned that not every
result that looks clean and beautiful is correct.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Davide Libenzi
On Wed, 30 May 2007, Ingo Molnar wrote:

> 
> * Linus Torvalds <[EMAIL PROTECTED]> wrote:
> 
> > > To echo Uli and paraphrase an ad, "it's the interface, silly."
> > 
> > THERE IS NO INTERFACE! You're just making that up, and glossing over 
> > the most important part of the whole thing!
> > 
> > If you could actually point to something specific that matches what 
> > everybody needs, and is architecture-neutral, it would be a different 
> > issue. As is, you're just saying "memory-mapped interfaces" without 
> > actually going into enough detail to show HOW MUCH IT SUCKS.
> > 
> > There really are very few programs that would use them. [...]
> 
> looking over the list of our new generic APIs (see further below) i 
> think there are three important things that are needed for an API to 
> become widely used:
> 
>  1) it should solve a real problem (ha ;-), it should be intuitive to 
> humans and it should fit into existing things naturally.
> 
>  2) it should be ubiquitous. (if it's about IO it should cover block IO,
> network IO, timers, signals and everything) Even if it might look
> silly in some of the cases, having complete, utter, no compromises,
> 100% coverage for everything massively helps the uptake of an API, 
> because it allows the user-space coder to pick just one paradigm 
> that is closest to his application and stick to it and only to it.
> 
>  3) it should be end-to-end supported by glibc.
> 
> our failed API attempts so far were:
> 
>  - sendfile(). This API mainly failed on #2. It partly failed on #1 too.
>    (couldn't be used in certain types of scenarios so was unintuitive.)
>    splice() fixes this almost completely.
> 
>  - KAIO. It fails on #2 and #3.
> 
> our more successful new APIs:
> 
>  - futexes. After some hiccups they form the base of all modern 
>    user-space locking.
> 
>  - splice. (a bit too early to tell but it's looking good so far. Would
>    be nice if someone did a brute-force memcpy() based vmsplice to user
>    memory, just to make usage fully symmetric.)
> 
> partially successful, not yet failed new APIs:
> 
>  - epoll. It currently fails at #2 (v2.6.22 mostly fills the gaps but
>    not completely). Despite the non-complete coverage of event domains a
>    good number of apps are using it, and in particular a couple really
>    'high end' apps with massive amounts of event sources - which apps 
>    would have no chance with poll, select or threads.
> 
>  - inotify. It's being used quite happily on the desktop, despite some
>    of its limitations. (Possibly integratable into epoll?)

I think, as Linus pointed out (as I did a few months ago), that there's 
confusion about the term "Unification" or "Single Interface".
Unification is not about fetching all the data coming from many diverse 
sources through a single interface. That is just broken, because each 
data source wants a different data structure to be reported. 
This is ABI-hell 101. Unification is the ability to uniformly wait for 
readiness, and then fetch data with source-dependent collectors (read(2), 
io_getevents(2), ...). That way you have ABI isolation for each single 
data source, and no monster structures trying to blob together wildly 
diverse data formats.
AFAIK, inotify works with select/poll/epoll as is.
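
A sketch of that shape -- one uniform readiness wait, then a per-source
collector (the struct source bookkeeping is illustrative, and it assumes
the AIO context can somehow be made pollable):

#include <sys/epoll.h>
#include <time.h>
#include <unistd.h>
#include <libaio.h>

enum src_type { SRC_SOCKET, SRC_AIO };

struct source {				/* illustrative bookkeeping */
	enum src_type type;
	int fd;
	io_context_t aio_ctx;		/* only meaningful for SRC_AIO */
};

static void wait_and_collect(int epfd)
{
	struct epoll_event ev[64];
	int i, n = epoll_wait(epfd, ev, 64, -1);	/* uniform wait */

	for (i = 0; i < n; i++) {
		struct source *src = ev[i].data.ptr;
		struct timespec ts = { 0, 0 };
		struct io_event done[16];
		char buf[4096];

		switch (src->type) {	/* source-specific collector */
		case SRC_SOCKET:
			read(src->fd, buf, sizeof(buf));
			break;
		case SRC_AIO:
			io_getevents(src->aio_ctx, 0, 16, done, &ts);
			break;
		}
	}
}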



- Davide




Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Davide Libenzi
On Wed, 30 May 2007, Ingo Molnar wrote:

> yeah - this is a fundamental design question for Linus i guess :-) glibc 
> (and other infrastructure libraries) have a fundamental problem: they 
> cannot (and do not) presently use persistent file descriptors to make 
> use of kernel functionality, due to ABI side-effects. [applications can 
> dup into an fd used by glibc, applications can close it - shells close 
> fds blindly for example, etc.] Today glibc simply cannot open a file 
> descriptor and keep it open while application code is running due to 
> these problems.

Here I think we are forgetting that glibc is userspace and there's no 
separation between the application code and glibc code. An application 
linking to glibc can break glibc in a thousand ways, independently from fds 
or not fds. Like complaining that glibc is broken because printf() 
suddenly does not work anymore ;)

#include <stdio.h>
#include <unistd.h>

int main(void) {
	close(fileno(stdout));
	printf("Whiskey Tango Foxtrot?\n");
	return 0;
}



- Davide




Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Jens Axboe
On Wed, May 30 2007, Linus Torvalds wrote:
> 
> 
> On Wed, 30 May 2007, Mark Lord wrote:
> >
> > I wonder how useful it would be to reimplement sendfile()
> > using splice(), either in glibc or inside the kernel itself?
> 
> I'd like that, if only because right now we have two separate paths that 
> kind of do the same thing, and splice really is the only one that is 
> generic.
> 
> I thought Jens even had some experimental patches for it. It might be 
> worth to "just do it" - there's some internal overhead, but on the other 
> hand, it's also likely the best way to make sure any issues get sorted 
> out.

I do, this is a one-year-old patch that does that:

http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=f8f550e027fd07ad8d87110178803dc63b544d89

I'll update it, test, and submit for 2.6.23.

-- 
Jens Axboe



Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Linus Torvalds


On Wed, 30 May 2007, Mark Lord wrote:
>
> I wonder how useful it would be to reimplement sendfile()
> using splice(), either in glibc or inside the kernel itself?

I'd like that, if only because right now we have two separate paths that 
kind of do the same thing, and splice really is the only one that is 
generic.

I thought Jens even had some experimental patches for it. It might be 
worth to "just do it" - there's some internal overhead, but on the other 
hand, it's also likely the best way to make sure any issues get sorted 
out.

Linus


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Jens Axboe
On Wed, May 30 2007, Mark Lord wrote:
> Ingo Molnar wrote:
> >
> > - sendfile(). This API mainly failed on #2. It partly failed on #1 too.
> >   (couldnt be used in certain types of scenarios so was unintuitive.)
> >   splice() fixes this almost completely.
> >
> > - KAIO. It fails on #2 and #3.
> 
> I wonder how useful it would be to reimplement sendfile()
> using splice(), either in glibc or inside the kernel itself?
> 
> sendfile() does get used a fair bit, but I really doubt that anyone
> outside of a handful of people on this list actually uses splice().

It's indeed the plan, I even have a git branch for it. Just never took the
time to actually finish it.

http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=splice-sendfile

-- 
Jens Axboe



Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Mark Lord

Ingo Molnar wrote:
>
>  - sendfile(). This API mainly failed on #2. It partly failed on #1 too.
>    (couldn't be used in certain types of scenarios so was unintuitive.)
>    splice() fixes this almost completely.
>
>  - KAIO. It fails on #2 and #3.


I wonder how useful it would be to reimplement sendfile()
using splice(), either in glibc or inside the kernel itself?

sendfile() does get used a fair bit, but I really doubt that anyone
outside of a handful of people on this list actually uses splice().

Cheers


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Jens Axboe
On Wed, May 30 2007, Ingo Molnar wrote:
>  - splice. (a bit too early to tell but it's looking good so far. Would
>    be nice if someone did a brute-force memcpy() based vmsplice to user
>    memory, just to make usage fully symmetric.)

Heh, I actually agree, at least then the interface is complete! We can
always replace it with something more clever, should someone feel so
inclined. Here's a rough patch to do that, it's totally untested (but it
compiles). sparse will warn about the __user removal, though. I'm sure
viro would shoot me dead on the spot, should he see this...

diff --git a/fs/splice.c b/fs/splice.c
index 12f2828..5023c01 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -657,9 +657,9 @@ out_ret:
  * key here is the 'actor' worker passed in that actually moves the data
  * to the wanted destination. See pipe_to_file/pipe_to_sendpage above.
  */
-ssize_t __splice_from_pipe(struct pipe_inode_info *pipe,
-  struct file *out, loff_t *ppos, size_t len,
-  unsigned int flags, splice_actor *actor)
+ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, void *actor_priv,
+  loff_t *ppos, size_t len, unsigned int flags,
+  splice_actor *actor)
 {
int ret, do_wakeup, err;
struct splice_desc sd;
@@ -669,7 +669,7 @@ ssize_t __splice_from_pipe(struct pipe_inode_info *pipe,
 
sd.total_len = len;
sd.flags = flags;
-   sd.file = out;
+   sd.file = actor_priv;
sd.pos = *ppos;
 
for (;;) {
@@ -1240,28 +1240,104 @@ static int get_iovec_page_array(const struct iovec __user *iov,
return error;
 }
 
+static int pipe_to_user(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
+   struct splice_desc *sd)
+{
+   int ret;
+
+   ret = buf->ops->pin(pipe, buf);
+   if (!ret) {
+   void __user *dst = sd->userptr;
+   /*
+* use non-atomic map, can be optimized to map atomically if we
+* prefault the user memory.
+*/
+   char *src = buf->ops->map(pipe, buf, 0);
+
+   if (copy_to_user(dst, src, sd->len))
+   ret = -EFAULT;
+
+   buf->ops->unmap(pipe, buf, src);
+
+   if (!ret)
+   return sd->len;
+   }
+
+   return ret;
+}
+
+/*
+ * For lack of a better implementation, implement vmsplice() to userspace
+ * as a simple copy of the pipes pages to the user iov.
+ */
+static long vmsplice_to_user(struct file *file, const struct iovec __user *iov,
+unsigned long nr_segs, unsigned int flags)
+{
+   struct pipe_inode_info *pipe;
+   ssize_t size;
+   int error;
+   long ret;
+
+   pipe = pipe_info(file->f_path.dentry->d_inode);
+   if (!pipe)
+   return -EBADF;
+   if (!nr_segs)
+   return 0;
+
+   if (pipe->inode)
+   mutex_lock(&pipe->inode->i_mutex);
+
+   ret = 0;
+   while (nr_segs) {
+   void __user *base;
+   size_t len;
+
+   /*
+* Get user address base and length for this iovec.
+*/
+   error = get_user(base, &iov->iov_base);
+   if (unlikely(error))
+   break;
+   error = get_user(len, &iov->iov_len);
+   if (unlikely(error))
+   break;
+
+   /*
+* Sanity check this iovec. 0 read succeeds.
+*/
+   if (unlikely(!len))
+   break;
+   error = -EFAULT;
+   if (unlikely(!base))
+   break;
+
+   size = __splice_from_pipe(pipe, (void *) base, NULL, len,
+   flags, pipe_to_user);
+   if (size < 0) {
+   if (!ret)
+   ret = size;
+
+   break;
+   }
+
+   nr_segs--;
+   iov++;
+   ret += size;
+   }
+
+   if (pipe->inode)
+   mutex_unlock(&pipe->inode->i_mutex);
+
+   return ret;
+}
+
 /*
  * vmsplice splices a user address range into a pipe. It can be thought of
  * as splice-from-memory, where the regular splice is splice-from-file (or
  * to file). In both cases the output is a pipe, naturally.
- *
- * Note that vmsplice only supports splicing _from_ user memory to a pipe,
- * not the other way around. Splicing from user memory is a simple operation
- * that can be supported without any funky alignment restrictions or nasty
- * vm tricks. We simply map in the user memory and fill them into a pipe.
- * The reverse isn't quite as easy, though. There are two possible solutions
- * for that:
- *
- * - memcpy() the data internally, at which point we might as well just
- *   do a regular read() on the buffer 

Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Jens Axboe
On Wed, May 30 2007, Zach Brown wrote:
> > Yeah, it'll confuse CFQ a lot actually. The threads either need to share
> > an io context (clean approach, however will introduce locking for things
> > that were previously lockless), or CFQ needs to get better support for
> > cooperating processes.
> 
> Do let me know if I can be of any help in this.

Thanks, it should not be a lot of work though.

> > For the fio testing, we can make some improvements there. Right now you
> > don't get any concurrency of the io requests if you set eg iodepth=32,
> > as the 32 requests will be submitted as a linked chain of atoms. For io
> > saturation, that's not really what you want.
> 
> Just to be clear: I'm currently focusing on supporting sys_io_*() so I'm
> using fio's libaio engine.  I'm not testing the syslet syscall interface
> yet.

Ah ok, then there's no issue from that end!

-- 
Jens Axboe



Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Zach Brown
> due to the added syscall. (Maybe we can just get that reserved
> upstream now?)

Maybe, but we'd have to agree on the bare syslet interface that is being
supported :).

Personally, I'd like that to be the simplest thing that works for people
and I'm not convinced that the current syslet-specific syscalls are that.
Certainly not the atom interface, anyway.

+asmlinkage __attribute__((weak)) long
+sys_umem_add(unsigned long __user *uptr, unsigned long inc)
+{
+   unsigned long val, new_val;
+
+   if (get_user(val, uptr))
+   return -EFAULT;
+   /*
+* inc == 0 means 'read memory value':
+*/
+   if (!inc)
+   return val;
+
+   new_val = val + inc;
+   if (__put_user(new_val, uptr))
+   return -EFAULT;
+
+   return new_val;
+}

A syscall for *long addition* strikes me as a bit much, I have to admit.
Where do we stop?  (Where's the compat wrapper? :))

Maybe this would be fine for some wildly aggressive optimization some
number of years in the future when we have millions of syslet interface
users complaining about the cycle overhead of their syslet engines, but
it seems like we can do something much less involved in the first pass
without harming the possibility of promising to support this complex
optimization in the future.

- z


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Zach Brown
> Yeah, it'll confuse CFQ a lot actually. The threads either need to share
> an io context (clean approach, however will introduce locking for things
> that were previously lockless), or CFQ needs to get better support for
> cooperating processes.

Do let me know if I can be of any help in this.

> For the fio testing, we can make some improvements there. Right now you
> don't get any concurrency of the io requests if you set eg iodepth=32,
> as the 32 requests will be submitted as a linked chain of atoms. For io
> saturation, that's not really what you want.

Just to be clear: I'm currently focusing on supporting sys_io_*() so I'm
using fio's libaio engine.  I'm not testing the syslet syscall interface
yet.

- z


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Ingo Molnar

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> > To echo Uli and paraphrase an ad, "it's the interface, silly."
> 
> THERE IS NO INTERFACE! You're just making that up, and glossing over 
> the most important part of the whole thing!
> 
> If you could actually point to something specific that matches what 
> everybody needs, and is architecture-neutral, it would be a different 
> issue. As is, you're just saying "memory-mapped interfaces" without 
> actually going into enough detail to show HOW MUCH IT SUCKS.
> 
> There really are very few programs that would use them. [...]

looking over the list of our new generic APIs (see further below) i 
think there are three important things that are needed for an API to 
become widely used:

 1) it should solve a real problem (ha ;-), it should be intuitive to 
humans and it should fit into existing things naturally.

 2) it should be ubiquitous. (if it's about IO it should cover block IO,
network IO, timers, signals and everything) Even if it might look
silly in some of the cases, having complete, utter, no compromises,
100% coverage for everything massively helps the uptake of an API, 
because it allows the user-space coder to pick just one paradigm 
that is closest to his application and stick to it and only to it.

 3) it should be end-to-end supported by glibc.

our failed API attempts so far were:

 - sendfile(). This API mainly failed on #2. It partly failed on #1 too.
   (couldn't be used in certain types of scenarios so was unintuitive.)
   splice() fixes this almost completely.

 - KAIO. It fails on #2 and #3.

our more successful new APIs:

 - futexes. After some hiccups they form the base of all modern 
   user-space locking.

 - splice. (a bit too early to tell but it's looking good so far. Would
   be nice if someone did a brute-force memcpy() based vmsplice to user
   memory, just to make usage fully symmetric.)

partially successful, not yet failed new APIs:

 - epoll. It currently fails at #2 (v2.6.22 mostly fills the gaps but
   not completely). Despite the non-complete coverage of event domains a
   good number of apps are using it, and in particular a couple really
   'high end' apps with massive amounts of event sources - which apps 
   would have no chance with poll, select or threads.

 - inotify. It's being used quite happily on the desktop, despite some
   of its limitations. (Possibly integratable into epoll?)

Ingo


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Ulrich Drepper

Ingo Molnar wrote:
> we should perhaps enable glibc to have its separate fd namespace (or 
> 'hidden' file descriptors at the upper end of the fd space) so that it 
> can transparently listen to netlink events (or do epoll),

Something like this would only work reliably if you have actual
protection coming with it.  Also, there are still reasons why an
application might want to see, close, handle, whatever these descriptors
in a separate namespace.

I think such namespaces are a broken concept.  How many do you want to
introduce?  Plus, then you get away from the normal file descriptor
interfaces anyway.  If you'd represent these alternative namespace
descriptors with ordinary ints you gain nothing.  You'd have to use
tuples (namespace,descriptor) and then you need a whole set of new
interfaces or some sticky namespace selection which will only cause
problems (think signal delivery).


> without 
> impacting the application fd namespace - instead of ducking to a memory 
> based API as a workaround.

It's not "ducking".  Memory mapping is one of the most natural
interfaces.  Just because Unix/Linux is built around the concept of file
descriptors does not mean this is the ultimate in usability.  File
descriptors are in fact clumsy: if you have a file descriptor to read
and write data, all auxiliary data for that communication must be
transferred out-of-band (e.g, fcntl) or in very magical and hard to use
ways (recvmsg, sendmsg).  With a memory based event mechanism this
auxiliary data can be stored in memory along with the event notification.
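
For what it's worth, the shape he seems to have in mind is something like
this (all names illustrative; this is not an existing ABI):

#include <stdint.h>

/*
 * One slot of a hypothetical user-visible event ring: the notification
 * and its auxiliary data sit side by side, with no second out-of-band
 * call (fcntl, recvmsg, ...) needed to fetch the details.
 */
struct ring_event {
	uint32_t source_id;	/* which object fired */
	uint32_t events;	/* readiness bits */
	uint64_t cookie;	/* caller's opaque data */
	uint64_t aux[2];	/* e.g. byte count, error code */
};

struct event_ring {
	uint32_t head;		/* advanced by the producer (kernel) */
	uint32_t tail;		/* advanced by the consumer (user) */
	struct ring_event slots[256];
};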



> it is a serious flexibility issue that should not be ignored. The 
> unified fd space is a blessing on one hand because it's simple and 

Too simple.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Linus Torvalds


On Wed, 30 May 2007, Jeff Garzik wrote:
> 
> You snipped the key part of my response, so I'll say it again:
> 
> Event rings (a) most closely match what is going on in the hardware and (b)
> often closely match what is going on in multi-socket, event-driven software
> application.

I have rather strong counter-arguments:

 (a) yes, it's how hardware does it, but if you actually look at hardware, 
 you quickly realize that every single piece of hardware uses a 
 *different* ring interface.

 This should really tell you something. In fact, it may not be rings 
 at all, but structures with more complex formats (eg the USB 
 descriptors).

 (b) yes, event-driven software tends to use some data structures that are 
 sometimes approximated by event rings, but they all use *different* 
 software structures. There simply *is* no common "event" structure: 
 each program tends to have its own issues, it's own allocation 
 policies, and its own "ring" structures.

 They may not be rings at all. They can be priority queues/heaps or 
 other much more complex structures.

> To echo Uli and paraphrase an ad, "it's the interface, silly."

THERE IS NO INTERFACE! You're just making that up, and glossing over the 
most important part of the whole thing! 

If you could actually point to something specific that matches what 
everybody needs, and is architecture-neutral, it would be a different 
issue. As is, you're just saying "memory-mapped interfaces" without 
actually going into enough detail to show HOW MUCH IT SUCKS.

There really are very few programs that would use them. We had a trivial 
benchmark, the only function of which was to show usage, and here Ingo and 
Evgeniy are (once more) talking about bugs in that one months later.

THAT should tell you something.

Make poll/select/aio/read etc faster. THAT is where the payoffs are.

In fact, if somebody wants to look at a standard interface that could be 
speeded up, the prime thing to look at is "readdir()" (aka getdents). 
Making _that_ thing go faster and scale better and do read-ahead is likely 
to be a lot more important for performance. It was one of the bottle-necks 
for samba several years ago, and nobody has really tried to improve it.
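
For reference, this is the interface in question - a minimal raw getdents64()
loop as applications see it today (a sketch only; the buffer size and the
printing are arbitrary):

    /* Minimal raw getdents64() loop - each syscall refills the buffer
     * with as many entries as fit; this per-call pattern is what
     * read-ahead and better batching would speed up. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    struct linux_dirent64 {
            unsigned long long d_ino;
            long long          d_off;
            unsigned short     d_reclen;
            unsigned char      d_type;
            char               d_name[];
    };

    int main(void)
    {
            char buf[16384];
            int fd = open(".", O_RDONLY | O_DIRECTORY);
            long n, off;

            if (fd < 0)
                    return 1;
            while ((n = syscall(SYS_getdents64, fd, buf, sizeof(buf))) > 0) {
                    for (off = 0; off < n; ) {
                            struct linux_dirent64 *d = (void *)(buf + off);
                            printf("%s\n", d->d_name);
                            off += d->d_reclen;
                    }
            }
            close(fd);
            return 0;
    }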

And yes, that's because it's hard - people would rather make up new 
interfaces that are largely irrelevant even before they are born, than 
actually try to improve important existing ones.

Linus


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Linus Torvalds


On Wed, 30 May 2007, Ingo Molnar wrote:
>
> * Ulrich Drepper <[EMAIL PROTECTED]> wrote:
> > 
> > I'm not going to judge your tests but saying there are no significant 
> > advantages is too one-sided.  There is one huge advantage: the 
> > interface.  A memory-based interface is simply the best form.  File 
> > descriptors are a resource the runtime cannot transparently consume.
> 
> yeah - this is a fundamental design question for Linus i guess :-)

Well, quite frankly, to me, the most important part of syslets is that if 
they are done right, they introduce _no_ new interfaces at all that people 
actually use.

Over the years, we've done lots of nice "extended functionality" stuff. 
Nobody ever uses them. The only thing that gets used is the standard stuff 
that everybody else does too.

So when it comes to syslets, the most important interface will be the 
existing aio_read() etc interfaces _without_ any in-memory stuff at all, 
and everything done by the kernel to just make it look exactly like it 
used to look. And the biggest advantage is that it simplifies the internal 
kernel code, and makes us use the same code for aio and non-aio (and I 
think we have a good possibility of improving performance too, if only 
because we will get much more natural and fine-grained scheduling points!)
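
To be concrete about what "existing interfaces" means here - plain POSIX AIO
as applications already write it (a minimal sketch, error handling elided;
link with -lrt):

    /* Plain POSIX AIO, unchanged: the same binary interface whether the
     * kernel backs it with syslets/threadlets or with the old code paths. */
    #include <aio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            static char buf[4096];
            struct aiocb cb;
            const struct aiocb *list[1];
            int fd = open("/etc/hosts", O_RDONLY);

            memset(&cb, 0, sizeof(cb));
            cb.aio_fildes = fd;
            cb.aio_buf    = buf;
            cb.aio_nbytes = sizeof(buf);
            cb.aio_offset = 0;

            aio_read(&cb);                /* queue the read */
            list[0] = &cb;
            aio_suspend(list, 1, NULL);   /* wait for completion */
            printf("read %zd bytes\n", aio_return(&cb));
            return 0;
    }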

Any extended "direct syslets" use is technically _interesting_, but 
ultimately almost totally pointless. Which was why I was pushing really 
really hard for a simple interface and not being too clever or exposing 
internal designs too much. An in-memory thing tends to be the absolute 
_worst_ interface when it comes to abstraction layers and future changes.

> glibc (and other infrastructure libraries) have a fundamental problem: 
> they cannot (and do not) presently use persistent file descriptors to 
> make use of kernel functionality, due to ABI side-effects.

glibc has a more fundamental problem: the "fun" stuff is generally not 
worth it. 

For example, any AIO thing that requires glibc to be rewritten is almost 
totally uninteresting. It should work with _existing_ binaries, and 
_existing_ ABI's to be useful - since 99% of all AIO users are binary- 
only and won't recompile for some experimental library.

The whole epoll/kevent flame-wars have ignored a huge issue: almost nobody 
uses either. People still use poll and select, to such an _overwhelming_ 
degree that it almost doesn't even matter if you were to make the 
alternatives orders of magnitude faster - it wouldn't even be visible. 

> we should perhaps enable glibc to have its separate fd namespace (or 
> 'hidden' file descriptors at the upper end of the fd space) so that it 
> can transparently listen to netlink events (or do epoll), without 
> impacting the application fd namespace - instead of ducking to a memory 
> based API as a workaround.

Yeah, I don't think it would be at all wrong to have "private file 
descriptors". I'd prefer that over memory-based (for all the abstraction 
issues, and because a lot of things really *are* about file descriptors!). 

Linus


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Ingo Molnar

* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> epoll is very much capable of doing it - but why bother if 
> something more flexible than a ring can be used and the performance 
> difference is negligible? (Read my other reply in this thread for 
> further points.)

in particular i'd like to (re-)stress this point:

 Thirdly, our main problem was not the structure of epoll, our main
 problem was that event APIs were not widely available, so applications
 couldnt go to a pure event based design - they always had to handle
 certain types of event domains specially, due to lack of coverage. The
 latest epoll patches largely address that. This was a huge barrier
 against adoption of epoll.

starting with putting limits into the design by going to over-smart data 
structures like rings is just stupid. Lets fix, enhance and speed up 
what we have now (epoll) so that it becomes ubiquitous, and _then_ we 
can extend epoll to maybe fill events into rings. We should have our 
priorities right and should stop rewriting the whole world, especially 
when it comes to user APIs. Right now we have _no_ event API with 
complete coverage, and that's far more of a problem than the actual 
micro-structure of the API.

Ingo


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Ingo Molnar

* Jeff Garzik <[EMAIL PROTECTED]> wrote:

> >>You should pick up the kevent work :)
> >
> >3 months ago i verified the published kevent vs. epoll benchmark and 
> >found that benchmark to be fatally flawed. When i redid it properly 
> >kevent showed no significant advantage over epoll. Note that i did 
> >those measurements _before_ the recent round of epoll speedups. So 
> >unless someone does believable benchmarks i consider kevent an 
> >over-hyped, mis-benchmarked complication to do something that epoll 
> >is perfectly capable of doing.
> 
> You snipped the key part of my response, so I'll say it again:
> 
> Event rings (a) most closely match what is going on in the hardware 
> and (b) often closely match what is going on in multi-socket, 
> event-driven software application.

event rings are just pure data structures that describe a set of data, 
and they have advantages and disadvantages. For the record, we've 
already got direct experience with rings as software APIs: they were 
used for KAIO and they were an implementation and maintenance 
nightmare and nobody used them. Kevent might be better, but you make it 
sound as if it was a trivial design choice while it certainly isnt!

Sure, for hardware interfaces like networking cards tx and rx rings are 
the best thing but that is apples to oranges: hardware itself is about 
_limited_ physical resources, matching a _limited_ data structure like a 
ring quite well. But for software APIs, the built-in limit of rings 
makes it a baroque data structure that has a fair share of disadvantages 
in addition to its obvious advantages.

> This is not something epoll is capable of doing, at the present time.

epoll is very much capable of doing it - but why bother if something 
more flexible than a ring can be used and the performance difference is 
negligible? (Read my other reply in this thread for further points.)

but, for the record, syslets very much use a completion ring, so i'm not 
fundamentally opposed to the idea. I just think it's seriously 
over-hyped, just like most other bits of the kevent approach. (Nor do we 
have to attach this to syslets and threadlets - kevents are an 
orthogonal approach not directly related to asynchronous syscalls - 
syslets/threadlets can make use of epoll just as much as they can make 
use of kevent APIs.)

Ingo


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Evgeniy Polyakov
On Wed, May 30, 2007 at 10:54:00AM +0200, Ingo Molnar ([EMAIL PROTECTED]) wrote:
> 
> * Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:
> 
> > I did not want to start with another round of ping-pong insults :), 
> > but, Ingo, you did not show that kevent works worse. I did show that 
> > sometimes it works better. It fluctuated from 0 to a 30% win in those 
> > tests; in the results Johann Bork presented, kevent and epoll behaved 
> > the same. In results I posted earlier, I said that sometimes epoll 
> > behaved better, sometimes kevent. [...]
> 
> let me refresh your recollection:
> 
>   http://lkml.org/lkml/2007/2/25/116
> 
> where you said:
> 
>  "But note, that on my athlon64 3500 test machine kevent is about 7900
>   requests per second compared to 4000+ epoll, so expect a challenge."

You can also find in that thread that I managed to run an epoll server on 
that machine at 9k requests per second, although that was not
reproducible afterwards.

> for a long time you made much fuss about how kevent is so much better 
> and how epoll cannot perform and scale as well (you gave various 
> arguments why that is supposedly so), and some people bought into the 
> performance argument and advocated kevent due to its supposed 
> performance and scalability advantages - while now we are down to "epoll 
> and kevent are break-even"?

You just draw the picture you want to see.

Even on the kevent page I have links to other people's benchmarks, which
show how kevent behaves compared to epoll under their loads.
_My_ tests showed a kevent performance win; you tuned my (possibly
broken) epoll code and the results changed - this is a development process,
where things are not obtained out of thin air.

> in my book that is way too much of a difference, it is (best-case) a way 
> too sloppy approach to something as fundamental as Linux's basic event 
> model and design, and it is also compounded by your continued "nothing 
> happened, really, lets move on" stance. Losing trust is easy, winning it 
> back is hard. Let me reuse a phrase of yours: "expect a challenge".

Well, I do not care much about what people think I did wrong or right.
There are obviously bad and good ideas and implementations.
I might be absolutely wrong about something, but that is the process of
solving problems, which I really enjoy.

I just want there to be no personal insults; if I made any such thing,
shame on me :)

>   Ingo

-- 
Evgeniy Polyakov


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Jeff Garzik

Ingo Molnar wrote:
> * Jeff Garzik <[EMAIL PROTECTED]> wrote:
> 
> > You should pick up the kevent work :)
> 
> 3 months ago i verified the published kevent vs. epoll benchmark and 
> found that benchmark to be fatally flawed. When i redid it properly 
> kevent showed no significant advantage over epoll. Note that i did those 
> measurements _before_ the recent round of epoll speedups. So unless 
> someone does believable benchmarks i consider kevent an over-hyped, 
> mis-benchmarked complication to do something that epoll is perfectly 
> capable of doing.


You snipped the key part of my response, so I'll say it again:

Event rings (a) most closely match what is going on in the hardware and 
(b) often closely match what is going on in multi-socket, event-driven 
software application.


To echo Uli and paraphrase an ad, "it's the interface, silly."

This is not something epoll is capable of doing, at the present time.

Jeff




Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Ingo Molnar

* Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:

> On Wed, May 30, 2007 at 10:42:52AM +0200, Ingo Molnar ([EMAIL PROTECTED]) 
> wrote:
> > it is a serious flexibility issue that should not be ignored. The 
> > unified fd space is a blessing on one hand because it's simple and 
> > powerful, but it's also a curse because nested use of the fd space for 
> > libraries is currently not possible. But it should be detached from any
> > fundamental question of kevent vs. epoll. (By improving library use of
> > file descriptors we'll improve the utility of all syscalls - by ducking
> > to a memory based API we only solve that particular event based usage.)
> 
> There is another issue with file descriptors - userspace must dig into 
> the kernel each time it wants to get a new set of events, while with 
> a memory based approach it has them without doing so. After it has 
> returned from the kernel and knows that there are some events, the 
> kernel can add more of them into the ring (if there is room) and 
> userspace will process them without additional syscalls.

Firstly, this is not a fundamental property of epoll. If we wanted to, 
it would be possible to extend epoll to fill in a ring of events from 
the wakeup handler. It's an incremental add-on to epoll that should not 
impact the design. How much info to put into a single event is another 
incremental thing - for most of the high-performance cases all the 
information we need is the type of the event and the fd it occurred on. 
Currently epoll supports that minimal approach.
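
For illustration, that minimal pair is exactly what epoll hands back per
event today (ordinary epoll_wait() usage; a ring extension would fill the
very same records from the wakeup path):

    #include <sys/epoll.h>

    /* Today's minimal per-event payload: an event mask plus a user
     * cookie (typically the fd).  A kernel-filled ring of exactly these
     * records would be an incremental add-on, not a redesign. */
    void event_loop(int epfd)
    {
            struct epoll_event ev[64];
            int i, n;

            while ((n = epoll_wait(epfd, ev, 64, -1)) > 0) {
                    for (i = 0; i < n; i++) {
                            if (ev[i].events & EPOLLIN) {
                                    /* ev[i].data.fd is readable */
                            }
                    }
            }
    }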

Secondly, our current syscall overhead is below 0.1 usecs on latest 
hardware:

  dione:~/l> ./lat_syscall null
  Simple syscall: 0.0911 microseconds

so you need millions of events _per cpu_ for the syscall overhead to 
show up.

Thirdly, our main problem was not the structure of epoll, our main 
problem was that event APIs were not widely available, so applications 
couldnt go to a pure event based design - they always had to handle 
certain types of event domains specially, due to lack of coverage. The
latest epoll patches largely address that. This was a huge barrier
against adoption of epoll.

Ingo


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Evgeniy Polyakov
On Wed, May 30, 2007 at 10:42:52AM +0200, Ingo Molnar ([EMAIL PROTECTED]) wrote:
> it is a serious flexibility issue that should not be ignored. The 
> unified fd space is a blessing on one hand because it's simple and 
> powerful, but it's also a curse because nested use of the fd space for 
> libraries is currently not possible. But it should be detached from any
> fundamental question of kevent vs. epoll. (By improving library use of
> file descriptors we'll improve the utility of all syscalls - by ducking
> to a memory based API we only solve that particular event based usage.)

There is another issue with file descriptors - userspace must dig into
the kernel each time it wants to get a new set of events, while with a
memory based approach it has them without doing so. After it has returned
from the kernel and knows that there are some events, the kernel can add
more of them into the ring (if there is room) and userspace will process
them without additional syscalls.
Although syscall overhead is very small, it does exist and should not be 
ignored in the design.

> 
>   Ingo

-- 
Evgeniy Polyakov


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Ingo Molnar

* Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:

> I did not want to start with another round of ping-pong insults :), 
> but, Ingo, you did not show that kevent works worse. I did show that 
> sometimes it works better. It fluctuated from 0 to a 30% win in those 
> tests; in the results Johann Bork presented, kevent and epoll behaved 
> the same. In results I posted earlier, I said that sometimes epoll 
> behaved better, sometimes kevent. [...]

let me refresh your recollection:

  http://lkml.org/lkml/2007/2/25/116

where you said:

 "But note, that on my athlon64 3500 test machine kevent is about 7900
  requests per second compared to 4000+ epoll, so expect a challenge."

for a long time you made much fuss about how kevent is so much better 
and how epoll cannot perform and scale as well (you gave various 
arguments why that is supposedly so), and some people bought into the 
performance argument and advocated kevent due to its supposed 
performance and scalability advantages - while now we are down to "epoll 
and kevent are break-even"?

in my book that is way too much of a difference, it is (best-case) a way 
too sloppy approach to something as fundamental as Linux's basic event 
model and design, and it is also compounded by your continued "nothing 
happened, really, lets move on" stance. Losing trust is easy, winning it 
back is hard. Let me reuse a phrase of yours: "expect a challenge".

Ingo


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Ingo Molnar

* Ulrich Drepper <[EMAIL PROTECTED]> wrote:

> Ingo Molnar wrote:
> > 3 months ago i verified the published kevent vs. epoll benchmark and 
> > found that benchmark to be fatally flawed. When i redid it properly 
> > kevent showed no significant advantage over epoll.
> 
> I'm not going to judge your tests but saying there are no significant 
> advantages is too one-sided.  There is one huge advantage: the 
> interface.  A memory-based interface is simply the best form.  File 
> descriptors are a resource the runtime cannot transparently consume.

yeah - this is a fundamental design question for Linus i guess :-) glibc 
(and other infrastructure libraries) have a fundamental problem: they 
cannot (and do not) presently use persistent file descriptors to make 
use of kernel functionality, due to ABI side-effects. [applications can 
dup into an fd used by glibc, applications can close it - shells close 
fds blindly for example, etc.] Today glibc simply cannot open a file 
descriptor and keep it open while application code is running due to 
these problems.
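
A contrived sketch of the hazard (all names here are made up) - application
code can legally destroy any descriptor the library holds:

    /* Contrived illustration: nothing stops application code from
     * destroying a descriptor that glibc quietly opened for itself. */
    #include <unistd.h>

    static int glibc_private_fd = 42;   /* imagine glibc opened this */

    void application_cleanup(void)
    {
            int i;

            /* the classic "close everything" idiom silently kills the
             * library's descriptor too */
            for (i = 3; i < 1024; i++)
                    close(i);

            /* or: dup2() lands on whatever number the library got */
            dup2(1, glibc_private_fd);
    }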

we should perhaps enable glibc to have its separate fd namespace (or 
'hidden' file descriptors at the upper end of the fd space) so that it 
can transparently listen to netlink events (or do epoll), without 
impacting the application fd namespace - instead of ducking to a memory 
based API as a workaround.

it is a serious flexibility issue that should not be ignored. The 
unified fd space is a blessing on one hand because it's simple and 
powerful, but it's also a curse because nested use of the fd space for 
libraries is currently not possible. But it should be detached from any
fundamental question of kevent vs. epoll. (By improving library use of
file descriptors we'll improve the utility of all syscalls - by ducking
to a memory based API we only solve that particular event based usage.)

Ingo


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Evgeniy Polyakov
Hi Ingo, developers.

On Wed, May 30, 2007 at 09:20:55AM +0200, Ingo Molnar ([EMAIL PROTECTED]) wrote:
> 
> * Jeff Garzik <[EMAIL PROTECTED]> wrote:
> 
> > You should pick up the kevent work :)
> 
> 3 months ago i verified the published kevent vs. epoll benchmark and 
> found that benchmark to be fatally flawed. When i redid it properly 
> kevent showed no significant advantage over epoll. Note that i did those 
> measurements _before_ the recent round of epoll speedups. So unless 
> someone does believable benchmarks i consider kevent an over-hyped, 
> mis-benchmarked complication to do something that epoll is perfectly 
> capable of doing.

I did not want to start with another round of ping-pong insults :), but, 
Ingo, you did not show that kevent works worse. I did show that
sometimes it works better. It fluctuated from 0 to a 30% win in those 
tests; in the results Johann Bork presented, kevent and epoll behaved the 
same. In results I posted earlier, I said that sometimes epoll behaved 
better, sometimes kevent. What does it say? Just that, for that given 
workload, the result was the one we saw. Nothing more, nothing less.
It does not show that something is broken, and definitely not that it is,
citation 1:
 "we're heading to yet-another monolithic interface, we're heading with no
 valid reasons given if other than some handwaving."
citation 2:
 "consider kevent an over-hyped, mis-benchmarked complication to do 
 something that epoll is perfectly capable of doing."

Taking into account the other features kevent has (and what it was
designed for originally - network AIO, which is quite hard 
(if possible at all) with files and epoll; I'm not talking about syslets
as AIO, that is a different approach and likely a simpler one, and even
on its own it is already very good), it is not what people said in the 
above citations.

It looks like you took some personal offence at that, which I do not
understand. But it has nothing to do with the technical side of the problem,
so let's stop such rhetoric, concentrate on the real problem, and forget any
possible personal issues which might be raised sometimes :).

Although I closed the kevent and eventfs projects, I would gladly continue
if we can and want to make progress in that area.

Thanks.

>   Ingo

-- 
Evgeniy Polyakov


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Jens Axboe
On Tue, May 29 2007, Zach Brown wrote:

Thanks for picking this up, Zach!

>  - cfq gets confused, share io_context amongst threads?

Yeah, it'll confuse CFQ a lot actually. The threads either need to share
an io context (clean approach, however will introduce locking for things
that were previously lockless), or CFQ needs to get better support for
cooperating processes. The problem is that CFQ will wait for a dependent
IO for a given process, which may arrive from a totally unrelated
process.

For the fio testing, we can make some improvements there. Right now you
don't get any concurrency of the io requests if you set eg iodepth=32,
as the 32 requests will be submitted as a linked chain of atoms. For io
saturation, that's not really what you want.

I'll take a stab at improving both of the above.

-- 
Jens Axboe



Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Ulrich Drepper

Ingo Molnar wrote:
> 3 months ago i verified the published kevent vs. epoll benchmark and 
> found that benchmark to be fatally flawed. When i redid it properly 
> kevent showed no significant advantage over epoll.

I'm not going to judge your tests but saying there are no significant
advantages is too one-sided.  There is one huge advantage: the
interface.  A memory-based interface is simply the best form.  File
descriptors are a resource the runtime cannot transparently consume.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Ingo Molnar

* Zach Brown <[EMAIL PROTECTED]> wrote:

> > Having async request and response rings would be quite useful, and 
> > most closely match what is going on under the hood in the kernel and 
> > hardware.
> 
> Yeah, but I have lots of competing thoughts about this.

note that async request and response rings are implemented already in 
essence: that's how FIO uses syslets. The linked list of syslet atoms is 
the 'request ring' (it's just that 'ring' is not a hard-enforced data 
structure - you can use other request formats too), and the completion 
ring is the 'response ring'.

Ingo


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Ingo Molnar

* Jeff Garzik <[EMAIL PROTECTED]> wrote:

> You should pick up the kevent work :)

3 months ago i verified the published kevent vs. epoll benchmark and 
found that benchmark to be fatally flawed. When i redid it properly 
kevent showed no significant advantage over epoll. Note that i did those 
measurements _before_ the recent round of epoll speedups. So unless 
someone does believable benchmarks i consider kevent an over-hyped, 
mis-benchmarked complication to do something that epoll is perfectly 
capable of doing.

Ingo


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Linus Torvalds


On Wed, 30 May 2007, Davide Libenzi wrote:
> 
> Here I think we are forgetting that glibc is userspace and there's no 
> separation between the application code and glibc code. An application 
> linking to glibc can break glibc in thousand ways, independently from fds 
> or not fds. Like complaining that glibc is broken because printf() 
> suddenly does not work anymore ;)

No, Davide, the problem is that some applications depend on getting 
_specific_ file descriptors.

For example, if you do

	close(0);
	.. something else ..
	if (open("myfile", O_RDONLY) < 0)
		exit(1);

you can (and should) depend on the open returning zero.

So library routines *must not* open file descriptors in the normal space.

(The same is true of real applications doing the equivalent of

	for (i = 0; i < NR_OPEN; i++)
		close(i);

to clean up all file descriptors before doing something new. And yes, I 
think it was bash that used to *literally* do something like that a long 
time ago.

Another example of the same thing: people open file descriptors and know 
that they'll be dense in the result, and then use select() on them.

So it's true that file descriptors can't be used randomly by the standard 
libraries - they'd need to have some kind of separate private space.

Which *could* be something as simple as saying "bit 30 in the file 
descriptor specifies a separate fd space", along with some flags to make 
open and friends return those separate fd's. That makes them useless for 
select() (which assumes a flat address space, of course), but would be 
useful for just about anything else.
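
As a sketch of the encoding (the flag name and helpers here are hypothetical,
not a proposed API):

    /* Hypothetical encoding: fds with bit 30 set live in a separate,
     * non-dense space that libraries could use freely. */
    #define FD_NONLINEAR_BIT  (1 << 30)

    static inline int fd_is_nonlinear(int fd)
    {
            return fd & FD_NONLINEAR_BIT;
    }

    static inline int fd_nonlinear_index(int fd)
    {
            return fd & ~FD_NONLINEAR_BIT;  /* index into the private table */
    }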

Linus


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Eric Dumazet

Linus Torvalds wrote:
> 
> On Wed, 30 May 2007, Mark Lord wrote:
> > I wonder how useful it would be to reimplement sendfile()
> > using splice(), either in glibc or inside the kernel itself?
> 
> I'd like that, if only because right now we have two separate paths that 
> kind of do the same thing, and splice really is the only one that is 
> generic.
> 
> I thought Jens even had some experimental patches for it. It might be 
> worth to just do it - there's some internal overhead, but on the other 
> hand, it's also likely the best way to make sure any issues get sorted 
> out.




Last time I played with splice(), I found a bug in the readahead logic, most 
probably because nobody but me had tried it before.

(corrected by Fengguang Wu in commit 9ae9d68cbf3fe0ec17c17c9ecaa2188ffb854a66 )

So yes, reimplementing sendfile() should help to find the last splice() bugs, 
and as a bonus it could add non-blocking disk io (O_NONBLOCK on the input 
file -> socket path).





Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Davide Libenzi
On Wed, 30 May 2007, Linus Torvalds wrote:

> On Wed, 30 May 2007, Davide Libenzi wrote:
> > 
> > Here I think we are forgetting that glibc is userspace and there's no 
> > separation between the application code and glibc code. An application 
> > linking to glibc can break glibc in thousand ways, independently from fds 
> > or not fds. Like complaining that glibc is broken because printf() 
> > suddenly does not work anymore ;)
> 
> No, Davide, the problem is that some applications depend on getting 
> _specific_ file descriptors.
> 
> For example, if you do
> 
> 	close(0);
> 	.. something else ..
> 	if (open("myfile", O_RDONLY) < 0)
> 		exit(1);
> 
> you can (and should) depend on the open returning zero.
> 
> So library routines *must not* open file descriptors in the normal space.
> 
> (The same is true of real applications doing the equivalent of
> 
> 	for (i = 0; i < NR_OPEN; i++)
> 		close(i);
> 
> to clean up all file descriptors before doing something new. And yes, I 
> think it was bash that used to *literally* do something like that a long 
> time ago.

Right. I misunderstood Uli and Ingo. I thought it was like trying to 
protect glibc from intentional application mis-behaviour.



> Another example of the same thing: people open file descriptors and know 
> that they'll be dense in the result, and then use select() on them.
> 
> So it's true that file descriptors can't be used randomly by the standard 
> libraries - they'd need to have some kind of separate private space.
> 
> Which *could* be something as simple as saying "bit 30 in the file 
> descriptor specifies a separate fd space", along with some flags to make 
> open and friends return those separate fd's. That makes them useless for 
> select() (which assumes a flat address space, of course), but would be 
> useful for just about anything else.

I think it can be solved in a few ways. Yours or Ingo's (or something 
else) can work, to solve the above legacy fd space expectations.



- Davide




Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Linus Torvalds


On Wed, 30 May 2007, Eric Dumazet wrote:
> 
> So yes, reimplementing sendfile() should help to find the last splice() 
> bugs, and as a bonus it could add non-blocking disk io (O_NONBLOCK on the 
> input file -> socket path).

Well, to get those kinds of advantages, you'd have to use splice directly, 
since sendfile() hasn't supported nonblocking disk IO, and the interface 
doesn't really allow for it.

In fact, since nonblocking accesses require also some *polling* method, 
and we don't have that for files, I suspect the best option for those 
things is to simply mix AIO and splice(). AIO tends to be the right thing 
for disk waits (read: short, often cached), and if we can improve AIO 
performance for the cached accesses (which is exactly what the threadlets 
should hopefully allow us to do), I would seriously suggest going that 
route.

But the pure "use splice to _implement_ sendfile()" thing is worth doing 
for all the other reasons, even if nonblocking file access is not likely 
one of them.

Linus


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Eric Dumazet

Linus Torvalds wrote:
> 
> On Wed, 30 May 2007, Davide Libenzi wrote:
> > Here I think we are forgetting that glibc is userspace and there's no 
> > separation between the application code and glibc code. An application 
> > linking to glibc can break glibc in thousand ways, independently from fds 
> > or not fds. Like complaining that glibc is broken because printf() 
> > suddenly does not work anymore ;)
> 
> No, Davide, the problem is that some applications depend on getting 
> _specific_ file descriptors.

Fix the application, instead of adding kernel bloat?


> For example, if you do
> 
> 	close(0);
> 	.. something else ..
> 	if (open("myfile", O_RDONLY) < 0)
> 		exit(1);
> 
> you can (and should) depend on the open returning zero.


Then you can also exclude multi-threading, since a thread (even not inside 
glibc) can also use socket()/pipe()/open()/whatever and take the zero file 
descriptor as well.


Frankly I dont buy this fd namespace stuff.

The only hardcoded thing in Unix is 0, 1 and 2 fds.
People usually take care of these, or should use a Microsoft OS.

POSIX mandates that open() returns the lowest available fd.
But this obviously works only if you dont have another thread messing with 
fds, or if you dont call a library function that opens a file.


Thats all.



> So library routines *must not* open file descriptors in the normal space.
> 
> (The same is true of real applications doing the equivalent of
> 
> 	for (i = 0; i < NR_OPEN; i++)
> 		close(i);


Quite buggy IMHO

This hack was to avoid bugs coming from ancestor applications 
forking/execing a shell, at a time when one process could not open more 
than 20 files (AT&T Unix, 21 years ago).


Unix has fcntl(fd, F_SETFD, FD_CLOEXEC). A library should use this to make 
sure an fd is not propagated at exec() time.
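
i.e. something like this (standard fcntl(2) usage; the helper name is
arbitrary):

    #include <fcntl.h>

    /* Mark a library-internal fd close-on-exec so it is not propagated
     * to exec()ed children.  It does not, of course, protect the fd
     * from being close()d within the same process. */
    static int set_cloexec(int fd)
    {
            int flags = fcntl(fd, F_GETFD);

            if (flags < 0)
                    return -1;
            return fcntl(fd, F_SETFD, flags | FD_CLOEXEC);
    }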




> to clean up all file descriptors before doing something new. And yes, I 
> think it was bash that used to *literally* do something like that a long 
> time ago.
> 
> Another example of the same thing: people open file descriptors and know 
> that they'll be dense in the result, and then use select() on them.


poll() is nice. Even AT&T Unix had it 21 years ago :)



> So it's true that file descriptors can't be used randomly by the standard 
> libraries - they'd need to have some kind of separate private space.
> 
> Which *could* be something as simple as saying "bit 30 in the file 
> descriptor specifies a separate fd space", along with some flags to make 
> open and friends return those separate fd's. That makes them useless for 
> select() (which assumes a flat address space, of course), but would be 
> useful for just about anything else.




Please dont do that. Second class fds.

Then what about having ten different shared libraries ? Third class fds ?




Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Linus Torvalds


On Wed, 30 May 2007, Eric Dumazet wrote:
> > 
> > No, Davide, the problem is that some applications depend on getting
> > _specific_ file descriptors.
> 
> Fix the application, instead of adding kernel bloat?

No. The application is _correct_. It's how file descriptors are defined to 
work. 

> Then you can also exclude multi-threading, since a thread (even not inside
> glibc) can also use socket()/pipe()/open()/whatever and take the zero file
> descriptor as well.

Totally different. That's an application internal issue. It does *not* 
mean that we can break existing standards.

> The only hardcoded thing in Unix is 0, 1 and 2 fds.

Wrong. I already gave an example of real code that just didn't bother to 
keep track of which fd's it had open, and closed them all. Partly, in 
fact, because you can't even _know_ which fd's you have open when somebody 
else just execve's you.

You can call it buggy, but the fact is, if you do, you're SIMPLY WRONG. 

You cannot just change years and years of coding practice, and standard 
documentations. The behaviour of file descriptors is a fact. Ignoring that 
fact because you don't like it is naïve and simply not realistic.

Linus

Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Eric Dumazet

Linus Torvalds wrote:
> 
> On Wed, 30 May 2007, Eric Dumazet wrote:
> > So yes, reimplementing sendfile() should help to find the last splice() 
> > bugs, and as a bonus it could add non-blocking disk io (O_NONBLOCK on 
> > the input file -> socket path).
> 
> Well, to get those kinds of advantages, you'd have to use splice directly, 
> since sendfile() hasn't supported nonblocking disk IO, and the interface 
> doesn't really allow for it.


sendfile()'s interface doesnt allow it, but if you open(somediskfile, O_RDONLY 
| O_NONBLOCK), then a splice() based sendfile() can perform a non blocking 
disk io (while starting an io with readahead).

I actually use this trick myself :)

(splice(disk -> pipe, NONBLOCK), splice(pipe -> worker))

non blocking disk io, + zero copy :)
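
A sketch of that double-splice path (assuming the caller has set up the fds;
error, EAGAIN and short-transfer handling elided - a real loop must retry
until len is consumed):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* file -> pipe -> socket, zero copy; with the file opened O_NONBLOCK
     * the first splice() can return early instead of blocking on disk. */
    ssize_t sendfile_via_splice(int file_fd, int sock_fd, size_t len)
    {
            int p[2];
            ssize_t in, out = -1;

            if (pipe(p) < 0)
                    return -1;

            in = splice(file_fd, NULL, p[1], NULL, len,
                        SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
            if (in > 0)
                    out = splice(p[0], NULL, sock_fd, NULL, in,
                                 SPLICE_F_MOVE);

            close(p[0]);
            close(p[1]);
            return out;
    }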



> In fact, since nonblocking accesses require also some *polling* method, 
> and we don't have that for files, I suspect the best option for those 
> things is to simply mix AIO and splice(). AIO tends to be the right thing 
> for disk waits (read: short, often cached), and if we can improve AIO 
> performance for the cached accesses (which is exactly what the threadlets 
> should hopefully allow us to do), I would seriously suggest going that 
> route.
> 
> But the pure "use splice to _implement_ sendfile()" thing is worth doing 
> for all the other reasons, even if nonblocking file access is not likely 
> one of them.
> 
> Linus






Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Ulrich Drepper

Linus Torvalds wrote:
> 	for (i = 0; i < NR_OPEN; i++)
> 		close(i);
> 
> to clean up all file descriptors before doing something new. And yes, I 
> think it was bash that used to *literally* do something like that a long 
> time ago.

Indeed.  It was not only bash, though, I fixed probably a dozen
applications.  But even the new and better solution (readdir of
/proc/self/fd) does not prevent the problem of closing descriptors the
system might still need and the application doesn't know about.


> Which *could* be something as simple as saying "bit 30 in the file 
> descriptor specifies a separate fd space", along with some flags to make 
> open and friends return those separate fd's.

I don't like special cases.  For me things better come in quantities 0,
1, and unlimited (well, reasonable high limit).  Otherwise, who gets to
use that special namespace?  The C library is not the only body of code
which would want to use descriptors.

And then the semantics: should these descriptors show up in
/proc/self/fd?  Are there separate directories for each namespace?  Do
they count against the rlimit?

This seems to me like a shot from the hips without thinking about other
possibilities.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Linus Torvalds


On Wed, 30 May 2007, Ulrich Drepper wrote:
> 
> I don't like special cases.  For me things better come in quantities 0,
> 1, and unlimited (well, reasonable high limit).  Otherwise, who gets to
> use that special namespace?  The C library is not the only body of code
> which would want to use descriptors.

Well, don't think of it as a special case at all: think of bit 30 as a 
"the user asked for a non-linear fd" flag.

In fact, to make it effective, I'd suggest literally scrambling the low 
bits (using, for example, some silly per-boot xor value to actually 
generate the true index - the equivalent of a really stupid randomizer). 

That way you'd have the legacy linear space, and a separate non-linear 
space where people simply *cannot* make assumptions about contiguous fd 
allocations. There's no special case there - it's just an extension which 
explicitly allows us to say "if you do that, your fd's won't be allocated 
the traditional way any more", but you *can* mix the traditional and the 
non-linear allocation.
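
A sketch of what that scrambling could look like - entirely hypothetical
pseudo-helpers, not an actual patch:

    /* Hypothetical "stupid randomizer": the slot is still allocated
     * linearly in the kernel table, but the number handed to userspace
     * is scrambled with a per-boot value, so applications cannot assume
     * contiguity.  The key uses the low 30 bits only. */
    #define FD_NONLINEAR_BIT  (1 << 30)

    static unsigned int fd_scramble_key;    /* chosen once per boot */

    static inline int slot_to_user_fd(unsigned int slot)
    {
            return (int)((slot ^ fd_scramble_key) | FD_NONLINEAR_BIT);
    }

    static inline unsigned int user_fd_to_slot(int fd)
    {
            return ((unsigned int)fd & ~FD_NONLINEAR_BIT) ^ fd_scramble_key;
    }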

> And then the semantics: should these descriptors show up in
> /proc/self/fd?  Are there separate directories for each namespace?  Do
> they count against the rlimit?

Oh, absolutely. They'd be real fd's in every way. People could use them 
100% equivalently (and concurrently) with the traditional ones. The whole, 
and the _only_ point, would be that it breaks the legacy guarantees of a 
dense fd space.

Most apps don't actually *need* that dense fd space in any case. But by 
defaulting to it, we wouldn't break those (few) apps that actually depend 
on it.

Linus


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Davide Libenzi
On Wed, 30 May 2007, Eric Dumazet wrote:

> > So library routines *must not* open file descriptors in the normal space.
> > 
> > (The same is true of real applications doing the equivalent of
> > 
> > 	for (i = 0; i < NR_OPEN; i++)
> > 		close(i);
> 
> Quite buggy IMHO

Looking at it now, I'd agree (although I think I have that somewhere in my 
old code too). Consider, though, that such code is contained also in 
reference books like Richard Stevens' "UNIX Network Programming".



- Davide




Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Jeremy Fitzhardinge
Linus Torvalds wrote:
> Which *could* be something as simple as saying "bit 30 in the file 
> descriptor specifies a separate fd space", along with some flags to make 
> open and friends return those separate fd's. That makes them useless for 
> select() (which assumes a flat address space, of course), but would be 
> useful for just about anything else.

Some programs - legitimately, I think - scan /proc/self/fd to close
everything.  The question is whether the glibc-private fds should appear
there.  And something like a "close-on-fork" flag might be useful,
though I guess glibc can keep track of its own fds closely enough to not
need something like that.

J


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Jeremy Fitzhardinge
Ulrich Drepper wrote:
> I don't like special cases.  For me things better come in quantities 0,
> 1, and unlimited (well, reasonable high limit).  Otherwise, who gets to
> use that special namespace?  The C library is not the only body of code
> which would want to use descriptors.

Valgrind could certainly make use of it.  It currently reserves a set of
fds high enough, and tries hard to hide them from apps, but
/proc/self/fd makes it intractable in general (there was only so much
simulation I was willing to do in Valgrind).

J


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Davide Libenzi
On Wed, 30 May 2007, Ulrich Drepper wrote:

> Linus Torvalds wrote:
> > 	for (i = 0; i < NR_OPEN; i++)
> > 		close(i);
> > 
> > to clean up all file descriptors before doing something new. And yes, I 
> > think it was bash that used to *literally* do something like that a long 
> > time ago.
> 
> Indeed.  It was not only bash, though, I fixed probably a dozen
> applications.  But even the new and better solution (readdir of
> /proc/self/fd) does not prevent the problem of closing descriptors the
> system might still need and the application doesn't know about.

Please, do not drop me out of the Cc list. If you have a valid point, you 
should be able to carry it forward regardless, no?



- Davide




Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Linus Torvalds


On Wed, 30 May 2007, Jeremy Fitzhardinge wrote:
> 
> Some programs - legitimately, I think - scan /proc/self/fd to close
> everything.  The question is whether the glibc-private fds should appear
> there.  And something like a "close-on-fork" flag might be useful,
> though I guess glibc can keep track of its own fds closely enough to not
> need something like that.

Sure. I think there are things we can do (like make the non-linear fd's 
appear somewhere else, and make them close-on-exec by default etc).

And it's not like it's necessarily at all the only way to do things. 

I just threw it out as a possible solution - and one that is almost 
certainly *superior* to trying to work around the fd thing with some 
shared memory area which has tons of much more serious problems of its own 
(*).

Linus

(*) Ranging from: specialized-only interfaces, inability to pass it 
around, lack of any abstraction interfaces, and almost impossible to 
debug. The security implications of kernel and user space sharing 
read-write access to some shared area are also legion!


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Davide Libenzi
On Wed, 30 May 2007, Linus Torvalds wrote:

> > And then the semantics: should these descriptors show up in
> > /proc/self/fd?  Are there separate directories for each namespace?  Do
> > they count against the rlimit?
> 
> Oh, absolutely. They'd be real fd's in every way. People could use them 
> 100% equivalently (and concurrently) with the traditional ones. The whole, 
> and the _only_ point, would be that it breaks the legacy guarantees of a 
> dense fd space.
> 
> Most apps don't actually *need* that dense fd space in any case. But by 
> defaulting to it, we wouldn't break those (few) apps that actually depend 
> on it.

I agree. What would be a good interface to allocate fds in such area? We 
don't want to replicate syscalls, so maybe a special new dup function?



- Davide




Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Ulrich Drepper

Linus Torvalds wrote:
> Well, don't think of it as a special case at all: think of bit 30 as a 
> "the user asked for a non-linear fd" flag.

This sounds easy but doesn't really solve all the issues.  Let me repeat
your example and the solution currently in use:

problem: application wants to close all file descriptors except a select
few, cleaning up what is currently open.  It doesn't know all the
descriptors that are open.  Maybe all this in preparation of an exec call.

Today the best method to do this is to readdir() /proc/self/fd and
exclude the descriptors on the whitelist.
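
i.e. something like the following (a sketch; note it must skip the
descriptor the directory stream itself is using):

    #include <dirent.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Close every descriptor except those on a whitelist, via
     * /proc/self/fd. */
    void close_all_except(const int *keep, int nkeep)
    {
            DIR *d = opendir("/proc/self/fd");
            struct dirent *e;

            if (!d)
                    return;
            while ((e = readdir(d)) != NULL) {
                    int i, fd;

                    if (e->d_name[0] == '.')
                            continue;
                    fd = atoi(e->d_name);
                    if (fd == dirfd(d))     /* the stream's own fd */
                            continue;
                    for (i = 0; i < nkeep; i++)
                            if (fd == keep[i])
                                    break;
                    if (i == nkeep)
                            close(fd);
            }
            closedir(d);
    }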

If the special, non-sequential descriptors are also listed in that
directory the runtimes still cannot use them since they are visible.

If you go ahead with this, then at the very least add a flag which
causes the descriptor to not show up in /proc/*/fd.


You also have to be aware that open() is just one piece of the puzzle.
What about socket()?  I've cursed this interface many times before and
now it's biting you: there is no parameter to pass a flag.  What about
transferring file descriptors via Unix domain sockets?  How can I decide
the transferred descriptor should be in the private namespace?

There are likely many many more problems and cornercases like this.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Linus Torvalds


On Wed, 30 May 2007, Linus Torvalds wrote:
> 
> Sure. I think there are things we can do (like make the non-linear fd's 
> appear somewhere else, and make them close-on-exec by default etc).

Side note: it might not even be a "close-on-exec by default" thing: it 
might well be an *always* close-on-exec.

That COE is pretty horrid to do, since we need to scan a bitmap of those 
things on each exec. So it might be totally sensible to just declare that 
the non-linear fd's would simply always be local, and never bleed across 
an execve.

Linus


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Jeremy Fitzhardinge
Linus Torvalds wrote:
> Side note: it might not even be a "close-on-exec by default" thing: it 
> might well be an *always* close-on-exec.
> 
> That COE is pretty horrid to do, since we need to scan a bitmap of those 
> things on each exec. So it might be totally sensible to just declare that 
> the non-linear fd's would simply always be local, and never bleed across 
> an execve.

Hm, I wouldn't limit the mechanism prematurely.  Using Valgrind as an
example of an alternate user of this mechanism, it would be useful to
use a pipe to transmit out-of-band information from an exec-er to an
exec-ee process.  At the moment there's a lot of mucking around with
execve() to transmit enough information from the parent valgrind to its
successor.

J


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Eric Dumazet

Linus Torvalds wrote:
> 
> On Wed, 30 May 2007, Eric Dumazet wrote:
> > > No, Davide, the problem is that some applications depend on getting
> > > _specific_ file descriptors.
> > 
> > Fix the application, instead of adding kernel bloat?
> 
> No. The application is _correct_. It's how file descriptors are defined to 
> work. 
> 
> > Then you can also exclude multi-threading, since a thread (even not inside
> > glibc) can also use socket()/pipe()/open()/whatever and take the zero file
> > descriptor as well.
> 
> Totally different. That's an application internal issue. It does *not* 
> mean that we can break existing standards.



> > The only hardcoded thing in Unix is 0, 1 and 2 fds.
> 
> Wrong. I already gave an example of real code that just didn't bother to 
> keep track of which fd's it had open, and closed them all. Partly, in 
> fact, because you can't even _know_ which fd's you have open when somebody 
> else just execve's you.


If someone really cares, /proc/self/fd can help. But one shouldn't care at 
all.

About the things a process can do before exec()ing another process, file 
descriptors outside of 0,1,2 are the most obvious thing, but you also have 
alarm(), or stupid rlimits.




> You can call it buggy, but the fact is, if you do, you're SIMPLY WRONG. 
> 
> You cannot just change years and years of coding practice, and standard 
> documentations. The behaviour of file descriptors is a fact. Ignoring that 
> fact because you don't like it is naïve and simply not realistic.


I want to change nothing. The current situation is fine and well documented, 
thank you.

If a program does "for (i = 0; i < NR_OPEN; i++) close(i);", this 
*will*/*should* work as intended: close all file descriptors from 0 to 
NR_OPEN. Big deal.


But you wont find this in a program:

	FILE *fp = fopen("somefile", "r");
	for (i = 0; i < NR_OPEN; i++)
		close(i);
	while (fgets(buff, sizeof(buff), fp)) {
	}


You and/or others want to add fd namespaces and other hacks.

I saw suspicious examples on this thread; I am still waiting for a real one 
justifying all this stuff.

After file descriptor separation, I guess we'll need memory space separation 
as well, signal separation (SIGALRM comes to mind), uid/gid separation, cpu 
time separation, and so on... setrlimit() layered for every shared lib.





Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Linus Torvalds


On Wed, 30 May 2007, Davide Libenzi wrote:
> 
> I agree. What would be a good interface to allocate fds in such area? We 
> don't want to replicate syscalls, so maybe a special new dup function?

I'd do it with something like "newfd = dup2(fd, NONLINEAR_FD)" or similar, 
and just have NONLINEAR_FD be some magic value (for example, make it be 
0x40000000 - the bit that says "private, nonlinear" in the first place).

But what's gotten lost in the current discussion is that we probably don't 
actually _need_ such a private space. I'm just saying that if the *choice* 
is between memory-mapped interfaces and a private fd-space, we should 
probably go for the latter. "Everything is a file" is the UNIX way, after 
all. But there's little reason to introduce private fd's otherwise.

Linus


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Davide Libenzi
On Wed, 30 May 2007, Ulrich Drepper wrote:

> You also have to be aware that open() is just one piece of the puzzle.
> What about socket()?  I've cursed this interface many times before and
> now it's biting you: there is no parameter to pass a flag.  What about
> transferring file descriptors via Unix domain sockets?  How can I decide
> whether the transferred descriptor should be in the private namespace?

Well, we can't just replicate/change every system call that creates a file 
descriptor. So I'm for something like:

int sys_fdup(int fd, int flags);

So you basically create your fds with their native/existing system calls, 
and then you dup/move them into the preferred fd space.
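
A minimal userspace sketch of that flow (the fdup() wrapper, the FDUP_PRIVATE
flag and the syscall number are all hypothetical; on a real kernel the
syscall() just returns -1/ENOSYS):

#include <stdio.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

#define __NR_fdup    -1    /* hypothetical: no syscall number is assigned */
#define FDUP_PRIVATE 0x1   /* hypothetical flag: move into the private space */

static int fdup(int fd, int flags)
{
    return syscall(__NR_fdup, fd, flags);
}

int main(void)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);  /* created the native way */
    int p = fdup(s, FDUP_PRIVATE);            /* dup'ed into the private space */

    if (p < 0)
        perror("fdup");                       /* -1/ENOSYS on a real kernel */
    else
        close(s);                             /* keep only the private copy */
    return 0;
}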



- Davide




Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Eric Dumazet

Davide Libenzi wrote:

> On Wed, 30 May 2007, Linus Torvalds wrote:
>>> And then the semantics: should these descriptors show up in
>>> /proc/self/fd?  Are there separate directories for each namespace?  Do
>>> they count against the rlimit?
>>
>> Oh, absolutely. They'd be real fd's in every way. People could use them 
>> 100% equivalently (and concurrently) with the traditional ones. The whole, 
>> and the _only_ point, would be that it breaks the legacy guarantees of a 
>> dense fd space.
>>
>> Most apps don't actually *need* that dense fd space in any case. But by 
>> defaulting to it, we wouldn't break those (few) apps that actually depend 
>> on it.
>
> I agree. What would be a good interface to allocate fds in such area? We 
> don't want to replicate syscalls, so maybe a special new dup function?




If the deal is to be able to get faster open()/socket()/pipe()/... calls by 
not having to find the first zero bit in a huge bitmap, a better way would be 
to have a flag in struct task, reset to 0 at exec time.

A new syscall would say: "This process is OK to receive *random* fds."
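
A toy userspace illustration of the allocation-policy difference (all names
invented; in the proposal the flag would live in the kernel's struct task and
be checked by the fd allocator):

#include <stdbool.h>
#include <stdio.h>

#define NFDS 16

static bool used[NFDS];
static bool accepts_random_fds;  /* the proposed per-task flag,
                                    reset to 0 at exec time */
static int next_hint;            /* moving cursor: no lowest-fd guarantee */

static int alloc_fd(void)
{
    for (int i = 0; i < NFDS; i++) {
        /* relaxed mode starts at the hint instead of scanning
           from 0 for the lowest free slot */
        int fd = accepts_random_fds ? (next_hint + i) % NFDS : i;

        if (!used[fd]) {
            used[fd] = true;
            next_hint = fd + 1;
            return fd;
        }
    }
    return -1;
}

int main(void)
{
    used[0] = used[2] = true;  /* fd 1 was closed earlier */
    next_hint = 3;

    accepts_random_fds = false;
    printf("POSIX rule hands out fd %d\n", alloc_fd());    /* prints 1 */

    used[1] = false;           /* close it again */
    accepts_random_fds = true;
    printf("relaxed rule hands out fd %d\n", alloc_fd());  /* prints 3 */
    return 0;
}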




Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread David M. Lloyd
On Wed, 30 May 2007 14:27:52 -0700 (PDT)
Linus Torvalds [EMAIL PROTECTED] wrote:

> Well, don't think of it as a special case at all: think of bit 30 as
> a "the user asked for a non-linear fd".

If the sole point is to protect an fd from being closed or operated on
outside of a certain context, why not just provide the ability to
protect an fd against use?  Maybe a pair of syscalls like
fdprotect() and fdunprotect() that take an fd and an integer key.
Protected fds would return EBADF or something if accessed.  The same
integer key must be provided to fdunprotect() in order to regain access
to the fd.  Then glibc or valgrind or whatever would just unprotect
the fd before operating on it.
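
A minimal sketch of the proposed pair in use (both syscalls are hypothetical,
so the numbers below are placeholders and the calls would return -1/ENOSYS
today):

#include <sys/syscall.h>
#include <unistd.h>

#define __NR_fdprotect   -1   /* hypothetical: placeholder numbers */
#define __NR_fdunprotect -1

static int fdprotect(int fd, unsigned int key)
{
    return syscall(__NR_fdprotect, fd, key);
}

static int fdunprotect(int fd, unsigned int key)
{
    return syscall(__NR_fdunprotect, fd, key);
}

int main(void)
{
    unsigned int key = 0xdeadbeef;  /* known only to the library */
    int fd = dup(2);                /* an fd the library wants to keep */

    fdprotect(fd, key);             /* the app's close(fd) would now get EBADF */
    /* ... application code runs, unable to touch fd ... */
    fdunprotect(fd, key);           /* same key regains access */
    write(fd, "log\n", 4);
    fdprotect(fd, key);
    return 0;
}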

- DML


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread William Lee Irwin III
On Wed, May 30, 2007 at 02:27:52PM -0700, Linus Torvalds wrote:
> Well, don't think of it as a special case at all: think of bit 30 as a 
> "the user asked for a non-linear fd".
> In fact, to make it effective, I'd suggest literally scrambling the low 
> bits (using, for example, some silly per-boot xor value to actually 
> generate the true index - the equivalent of a really stupid randomizer). 
> That way you'd have the legacy linear space, and a separate non-linear 
> space where people simply *cannot* make assumptions about contiguous fd 
> allocations. There's no special case there - it's just an extension which 
> explicitly allows us to say "if you do that, your fd's won't be allocated 
> the traditional way any more", but you *can* mix the traditional and the 
> non-linear allocation.

One could always stuff a seed or per-cpu seeds in the files_struct and
use a PRNG. The only trick would be cacheline bounces and/or space
consumption of seeds. Another possibility would be bitreversed
contiguity or otherwise a bit permutation of some contiguous range,
modulo (of course) the high bit used to tag the randomized range.

With truly random/sparse fd numbers it may be meaningful to use a
different data structure from a bitmap to track them in-kernel, though
xor and other easily-computed mappings to/from contiguous ranges won't
need such in earnest.
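
As a toy illustration of the cheap invertible mappings being discussed (the
seed stands in for a per-boot random value, and bit 30 is assumed as the tag
bit per Linus's description; the kernel could then keep a dense internal
index, and a dense bitmap, behind the scrambled user-visible numbers):

#include <stdint.h>
#include <stdio.h>

#define FD_TAG  (1u << 30)   /* assumed "non-linear fd" marker bit */
#define FD_MASK (FD_TAG - 1)

static const uint32_t seed = 0x2a6d3f19;  /* stand-in for a per-boot value */

/* xor scramble: trivially invertible, needs no lookup structure */
static uint32_t fd_scramble(uint32_t index)
{
    return FD_TAG | ((index ^ seed) & FD_MASK);
}

static uint32_t fd_index(uint32_t fd)
{
    return (fd ^ seed) & FD_MASK;  /* xor is its own inverse */
}

/* bit-reversed contiguity: another cheap permutation of the range */
static uint32_t fd_bitrev(uint32_t index)
{
    uint32_t r = 0;

    for (int i = 0; i < 30; i++)
        r |= ((index >> i) & 1u) << (29 - i);
    return FD_TAG | r;
}

int main(void)
{
    for (uint32_t i = 0; i < 4; i++)
        printf("index %u -> xor fd %#x (back to %u), bitrev fd %#x\n",
               i, fd_scramble(i), fd_index(fd_scramble(i)), fd_bitrev(i));
    return 0;
}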


-- wli


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread Matt Mackall
On Wed, May 30, 2007 at 01:00:30PM -0700, Linus Torvalds wrote:
> Which *could* be something as simple as saying "bit 30 in the file 
> descriptor specifies a separate fd space", along with some flags to make 
> open and friends return those separate fd's. That makes them useless for 
> select() (which assumes a flat address space, of course), but would be 
> useful for just about anything else.

Or.. we could have a method of swizzling in and out an entire FD
array, similar to UML's trick for swizzling MMs.

-- 
Mathematics is the supreme nostalgia of our time.


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-30 Thread William Lee Irwin III
On Wed, May 30, 2007 at 01:00:30PM -0700, Linus Torvalds wrote:
> Which *could* be something as simple as saying "bit 30 in the file 
> descriptor specifies a separate fd space", along with some flags to make 
> open and friends return those separate fd's. That makes them useless for 
> select() (which assumes a flat address space, of course), but would be 
> useful for just about anything else.

On Wed, May 30, 2007 at 05:27:15PM -0500, Matt Mackall wrote:
> Or.. we could have a method of swizzling in and out an entire FD
> array, similar to UML's trick for swizzling MMs.

I like that notion even better than randomization. I think it should
happen. I like SKAS, too, of course.


-- wli

