date:20001023

Re: Topic for discussion: OS Design

2000-10-23 Thread Gábor Lénárt


On Mon, Oct 23, 2000 at 12:51:16PM -0700, [EMAIL PROTECTED] wrote:
> On Sun, 22 Oct 2000, Dwayne C . Litzenberger wrote:
> > This user also wants a
> > smooth GUI, a mouse pointer that doesn't flinch under load, 
> 
> Try andrea archangeli's VM patches.  When I use those patches X gets much
> smoother and xmms (with nice -5) never skips.  2.2 VM sucks, film at 11.

What the realtion of these patches with Rick's new VM architecture for 2.4.x ?
Will 2.4.x give similar performance you mentioned with andrea's patches ?

- Gabor
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

XFree86 detects a "GeForce DDR" on my Athlon machine. Will CONFIG_FB_RIVA be useful to me?

2000-10-23 Thread Miles Lane



Hi,

My subject line says it all.  I have an Athlon machine
with a GeForce DDR video chipset.  Is there benefit to
my compiling the kernel with CONFIG_FB_RIVA enabled?
The "Help" associated with the option mentions the
TNT series, but not the GeForce series.  Maybe they are
the same?

Miles

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

test10-pre5 -- Compile error drivers/video/video.o: In function `vesafb_set_disp': undefined references

2000-10-23 Thread Miles Lane



I am experimenting with compiling lots of stuff as modules.
I hit what is either a user error, a configuration script
bug or a symbol export bug.

ld -m elf_i386 -T /usr/src/linux/arch/i386/vmlinux.lds -e stext 
arch/i386/kernel/head.o arch/i386/kernel/init_task.o init/main.o init/version.o \
--start-group \
arch/i386/kernel/kernel.o arch/i386/mm/mm.o kernel/kernel.o mm/mm.o fs/fs.o 
ipc/ipc.o \
drivers/block/block.o drivers/char/char.o drivers/misc/misc.o 
drivers/net/net.o drivers/media/media.o drivers/parport/parport.a  
drivers/ide/idedriver.o drivers/cdrom/cdrom.a drivers/pci/pci.a 
drivers/video/video.o \
net/network.o \
/usr/src/linux/arch/i386/lib/lib.a /usr/src/linux/lib/lib.a 
/usr/src/linux/arch/i386/lib/lib.a \
--end-group \
-o vmlinux
drivers/video/video.o: In function `vesafb_set_disp':
drivers/video/video.o(.text+0x6811): undefined reference to `fbcon_cfb8'
drivers/video/video.o(.text+0x6818): undefined reference to `fbcon_cfb16'
drivers/video/video.o(.text+0x6821): undefined reference to `fbcon_cfb24'
drivers/video/video.o(.text+0x6828): undefined reference to `fbcon_cfb32'

Here is the associated part of .config:

#
# Console drivers
#
CONFIG_VGA_CONSOLE=y
CONFIG_VIDEO_SELECT=y

#
# Frame-buffer support
#
CONFIG_FB=y
CONFIG_DUMMY_CONSOLE=y
CONFIG_FB_VESA=y
CONFIG_VIDEO_SELECT=y
CONFIG_FBCON_ADVANCED=y
CONFIG_FBCON_MFB=m
CONFIG_FBCON_CFB2=m
CONFIG_FBCON_CFB4=m
CONFIG_FBCON_CFB8=m
CONFIG_FBCON_CFB16=m
CONFIG_FBCON_CFB24=m
CONFIG_FBCON_CFB32=m
CONFIG_FBCON_VGA=m
CONFIG_FBCON_FONTS=y
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Linus Torvalds

On Mon, 23 Oct 2000, Dan Kegel wrote:
> 
> kqueue lets you associate an arbitrary integer with each event
> specification; the integer is returned along with the event.
> This is very handy for, say, passing the 'this' pointer of the
> object that should handle the event.  Yes, you can simulate it
> with an array indexed by fd, but I like having it included.

I agree. I thin the ID part is the most important part of the interfaces,
because you definitely want to re-use the same functions over and over
again - and you wan tto have some way other than just the raw fd to
identify where the actual _buffers_ for that IO is, and what stage of the
state machine we're in for that fd etc etc.

Also, it's potentially important to have different "id"s for even the same
fd - in some cases you want to have the same event handle both the read
and the write part on an fd, but in other cases it might make more
conceptual sense to separate out the read handling from the write
handling, and instead of using "mask = POLLIN | POLLOUT", you'd just have
two separate events, one with POLLIN and one with POLLOUT.

This was what my "unsigned long id" was, but that is much too hard to use.
See my expanded suggestion of it just a moment ago.

(And yes, I'm sure you can do all this with kevents. I just abhor the
syntax of those things).

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Linus Torvalds

On Mon, 23 Oct 2000, Dan Kegel wrote:
> 
> 
>http://www.FreeBSD.org/cgi/man.cgi?query=kqueue&apropos=0&sektion=0&manpath=FreeBSD+5.0-current&format=html
> describes the FreeBSD kqueue interface for events:

I've actually read the BSD kevent stuff, and I think it's classic
over-design. It's not easy to see what it's all about, and the whole  tuple crap is just silly. Looks much too complicated.

I don't believe in the "library" argument at all, and I think multiple
event queues completely detract from the whole point of being simple to
use and implement.

Now, I agree that my bind_event()/get_event() has limitations, and the
biggest one is probably the event "id". It needs to be better, and it
needs to have more structure. The "id" really should be something that not
only contains the "fd", but also contains the actor function to be called,
along with some opaque data for that function.

In fact, if you take my example server, and move the "handle[id]()" call
_into_ get_events() (and make the "handle[id]()" function pointer a part
of the ID of the event), then the library argument goes away too: it
doesn't matter _who_ calls the get_event() function, because the end
result is always going to be the same regardless of whether it is called
from within a library or from a main loop: it's going to call the function
handle associated with the ID that triggered.

Basically, the main loop would boil down to

for (;;) {
static struct event ev_list[MAXEV];
get_event(ev_list, MAXEV, &tmout);
.. timeout handling here ..
}

because get_even() would end up doing all the user-mode calls too (so
"get_event()" is no longer a system call: it's a system call + a for-loop
to call all the ID handler functions that were associated with the events
that triggered).

So the "struct event" would just be:

struct event {
int fd;
unsigned long mask;
void *opaque;
void (*event_fn)(ind fd, unsigned long mask, void *opaque);
}

and there's no need for separate event queues, because the separate event
queues have been completely subsumed by the fact that every single event
has a separate event function.

So now you'd start everything off (assuming the same kind of "listen to
everything and react to it" server as in my previous example) by just
setting

bind_event(sock, POLLIN, NULL, accept_fn);

which basically creates the event inside the kernel, and will pass it to
the "__get_event()" system call through the event array, so the
get_event() library function basically looks like

int get_event(struct event *array, int maxevents, struct timeval *tv)
{
int nr = __get_event(array, maxevents, tv);
int i;
for (i = 0; i < nr; i++) {
array->event_fn(array->fd, array->mask, array->opaque);
array++;
}
return nr;
}

and tell me why you'd want to have multiple event queues any more?

(In fact, you might as well move the event array completely inside
"get_event()", because nobody would be supposed to look at the raw array
any more. So the "get_event()" interface would be even simpler: just a
timeout, nothing more. Talk about simple programming.

(This is also the ideal event programming interface - signals get racy and
hard to handle, while in the above example you can trivially just be
single-threaded. Which doesn't mean that you CANNOT be multi-threaded if
you want to: you multi-thread things by just having multiple threads that
all call "get_event()" on their own).

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread H. Peter Anvin


Followup to:  <[EMAIL PROTECTED]>
By author:Dave Zarzycki <[EMAIL PROTECTED]>
In newsgroup: linux.dev.kernel
> 
> Maybe I'm missing something, but why do you seperate out fd from the event
> structure. Why not just "int bind_event(struct event *event)"
> 
> The only thing I might have done differently is allow for multiple event
> queues per process, and the ability to add event queues to event
> queues. But these features might not be all that useful in real life.
> 

This could be useful if it doesn't screw up the implementation too
much.

Pretty much, what Linus is saying is the following:

   select()/poll() have one sizable cost, and that is to set up
   and destroy the set of events we want to trigger on.  We would
   like to amortize this cost by making it persistent.

It would definitely be useful for the user to have more than one such
"event set" installed at any one time, so that you can call different
wait_for_event() [or whatever] as appropriate.  However, if that means
we're doing lots of constructing and deconstructing in kernel space,
then we probably didn't gain much.

The other things I think we'd really like in a new interface is an
interface where you can explicitly avoid the "storming hordes" problem
-- if N servers is waiting for the same event, it should be at least
possible to tell the kernel to only wake up one (arbitrarily chosen)
of them, rather than all.

Finally, it would be somewhat nice to have a unified interface for
synchronous and asynchronous notification.  This should be quite
easily doable by adding a call event_notify(event_set,signal) that
causes real-time signal "signal" to be raised (presumably with the
event_set as the argument), when the specified event_set triggers.

-hpa

-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Dan Kegel


Dan Kegel wrote:
> [kqueue is] Pretty similar to yours, with the following additions:
> 
> Your proposal seems to only have one stream of available events per
> process.  kqueue() returns a handle to an event queue, and kevent()
> takes that handle as a first parameter.
> 
> [kqueue] uses a single call to both bind (or unbind) an array of fd's
> and pick up new events.  Probably reduces overhead.
> 
> [kqueue] allows you to watch not just sockets, but also plain files
> or directories.  (Hey, haven't we heard people talk about letting apps
> do that under Linux?  What interface does that use?)

I forgot to mention:

kqueue lets you associate an arbitrary integer with each event
specification; the integer is returned along with the event.
This is very handy for, say, passing the 'this' pointer of the
object that should handle the event.  Yes, you can simulate it
with an array indexed by fd, but I like having it included.

- Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: LMbench 2.4.0-test10pre-SMP vs. 2.2.18pre-SMP

2000-10-23 Thread Hank Leininger


On 2000-10-23, Jeff Garzik <[EMAIL PROTECTED]> wrote:

> Hardware:
> Dual P-II 400 Mhz
> 128 MB RAM
> 13GB hard drive

> First test was with 2.4.0-test10-pre3.
> Next four tests were with 2.4.0-test10-pre4.
> Final four tests were with 2.2.18-pre17.

Would it be meaningful to run two concurrent LMbench runs on SMP 2.2.18-p17
vs two concurrent runs on 2.4.0-t10-p4 ?

--
Hank Leininger <[EMAIL PROTECTED]> 
  
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Kernel 2.2.17 with RedHat 7 Problem !

2000-10-23 Thread Jakub Jelinek

On Mon, Oct 23, 2000 at 12:06:31PM +, David Wragg wrote:
> Gregory Maxwell <[EMAIL PROTECTED]> writes:
> > If 2.96 is broken, I'd appreciate it if you would describe the breakage. 
> 
> As in the RedHat 2.96?  Try compiling the following on RedHat 7.0 x86
> with "gcc -O2" and take a look at the generated code.  Nice, isn't it?
> 
> 
> #include 
> 
> void foo(void)
> {
> struct itimerval iv;
> 
> iv.it_interval.tv_sec = 0;
> iv.it_interval.tv_usec = 25;
> iv.it_value = iv.it_interval;
> 
> setitimer(ITIMER_REAL, &iv, NULL);
> }

Yes, this is a bug in the compiler (which I hope to fix today, CVS gcc is
broken as well), though the actual place which causes this to be miscompiled
is in the system headers where a restrict keyword is used on an incomplete
struct timeval forward definitions pointer and due to bug is set in the type
structure itself (at least that's my guess, need to run it under debugger
today - but if the select prototype is moved after the full struct timeval
definition, everything works correctly). Note that gcc 2.95.2 has some
restrict keyword related bugs as well (which glibc had to work around in
the headers; the bug was in 2.95.x only), it is not just 2.96.

Jakub
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Dave Zarzycki

On Mon, 23 Oct 2000, Linus Torvalds wrote:

> where you say "I want an array of pending events, and I have an array you
> can fill with up to 'maxnr' events - and if you have no events for me,
> please sleep until you get one, or until 'tmout'".
> 
> The above looks like a _really_ simple interface to me. Much simpler than
> either select() or poll(). Agreed?

Totally. I've been wanting an API like this in user-space for a long time.

One question about "int bind_event(int fd, struct event *event)"

Maybe I'm missing something, but why do you seperate out fd from the event
structure. Why not just "int bind_event(struct event *event)"

The only thing I might have done differently is allow for multiple event
queues per process, and the ability to add event queues to event
queues. But these features might not be all that useful in real life.

Hmmm... It might be cute if you could do something like this:

int num_of_resulting_events;
int r, q = open("/dev/eventq");
struct event interested_events[1024];
struct event events_that_happened[1024];

/* fill up interested_event_array */

write(q, interested_events, sizeof(interested_events));
r = read(q, events_that_happened, sizeof(events_that_happened));
num_of_resulting_events = r / sizeof(struct event);

davez

-- 
Dave Zarzycki
http://thor.sbay.org/~dave/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Dan Kegel

On Mon, 23 Oct 2000, Linus Torvalds wrote: 
> Here's a suggested "good" interface that would certainly be easy to 
> implement, and very easy to use, with none of the scalability issues that 
> many interfaces have.  ...
> It boils down to one very simple rule: dense arrays of sticky status 
> information is good. So let's design a good interface for a dense array. 
> 
> Basically, the perfect interface for events would be 
> 
> struct event { 
> unsigned long id; /* file descriptor ID the event is on */ 
> unsigned long event; /* bitmask of active events */ 
> }; 
> 
> int get_events(struct event * event_array, int maxnr, struct timeval 
>*tmout); 
>
> int bind_event(int fd, struct event *event); 

http://www.FreeBSD.org/cgi/man.cgi?query=kqueue&apropos=0&sektion=0&manpath=FreeBSD+5.0-current&format=html
describes the FreeBSD kqueue interface for events:

 struct kevent {
 uintptr_t ident;/* identifier for this event */
 short filter;   /* filter for event */
 u_short   flags;/* action flags for kqueue */
 u_int fflags;   /* filter flag value */
 intptr_t  data; /* filter data value */
 void  *udata;   /* opaque user data identifier */
 };

 int kqueue(void);

 int kevent(int kq, const struct kevent *changelist, int nchanges,
 struct kevent *eventlist, int nevents,
 const struct timespec *timeout);

Pretty similar to yours, with the following additions:

Your proposal seems to only have one stream of available events per
process.  kqueue() returns a handle to an event queue, and kevent()
takes that handle as a first parameter.

Their proposal uses a single call to both bind (or unbind) an array of fd's
and pick up new events.  Probably reduces overhead.

Their proposal allows you to watch not just sockets, but also plain files
or directories.  (Hey, haven't we heard people talk about letting apps
do that under Linux?  What interface does that use?)

> The really nice part of the above is that it's trivial to implement. It's 
> about 50 lines of code, plus some simple logic to various drivers etc to 
> actually inform about the events. The way to do this simply is to limit it 
> in very clear ways, the most notable one being simply that there is only 
> one event queue per process (or rather, per "struct files_struct" - so 
> threads would automatically share the event queue). This keeps the 
> implementation simple, but it's also what keeps the interfaces simple: no 
> queue ID's to pass around etc. 

I dislike having only one event queue per process.  Makes it hard to use
in a library.  ("But wait, *I* was going to use bind_event()!")

> Advantage: everything is O(1), except for "get_event()" which is O(n) 
> where 'n' is the number of active events that it fills in. 

This is a big win, and is exactly the payoff that Solaris gets with /dev/poll.

> Example "server": 

See  http://www.monkeys.com/kqueue/echo.c for a very similar example server
for kqueue() / kevent().

> You get the idea. Very simple, and looks like it should perform quite 
> admirably. With none of the complexities of signal handling, and none of 
> the downsides of select() or poll(). Both of the new system calls would be 
> on the order of 20-30 lines of code (along with the infrastructure to 
> extend the fasync stuff to also be able to handle events) 

Go, Linus, go!  Let's do it!  But don't go *too* simple.  I really would
like a few of the things kqueue has.

> Yes, I'm probably missing something. But this is the kind of thing I think 
> we should look at (and notice that you can _emulate_ this with poll(), so 
> you can use this kind of interface even on systems that wouldn't support 
> these kinds of events natively). 

- Dan

p.s. my collection of notes on kqueue is at 
http://www.kegel.com/c10k.html#nb.kqueue
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Alexander Viro




On Mon, 23 Oct 2000, Linus Torvalds wrote:

> Note that if there really are only 9 "nopage" routines, then it is a lot
> easier to just add the single "SetPageUptodate(page)" into those 9
> routines, and thus let the VM know of the race.

Works for me. And yes, the list of ->nopage instances that can return
success is that short:
drm_vm_shm_nopage
drm_vm_shm_nopage_lock
drm_vm_dma_nopage
sgi_graphics_nopage
via_mm_nopage
ncp_file_mmap_nopage
shm_nopage
shmzero_nopage
filemap_nopage
- sorry, it's not 9+filemap_nopage, it's 8+filemap_nopage.

However, it still leaves a window for the race: we invalidate first and
remove from pagetables later. And ClearPageUptodate() is obviously
truncate_inode_pages() work. Grr...

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds

On Tue, 24 Oct 2000, Alexander Viro wrote:
> 
> It's not the only problem, but I would feel _much_ safer if pagefault
> wouldn't rely on pagecache miss. Actually... Hey. Why don't we do the
> insertion into page tables _within_ ->nopage()?

NO!

We used to do this a LOONG time ago.

Distributing the VM information on how to properly insert a pte into the
page table into drivers is _NOT_ something I want to do again. It was a
pain to get it properly modularized, we're not going back to the horror.

Note that if there really are only 9 "nopage" routines, then it is a lot
easier to just add the single "SetPageUptodate(page)" into those 9
routines, and thus let the VM know of the race.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Alexander Viro

On Mon, 23 Oct 2000, Jordan Mendelson wrote:

> What you describe is exactly what the /dev/poll interface patch from the
> Linux scalability project does.
> 
> It creates a special device which you can open up and write
> add/remove/modify entries you wish to be notified of using the standard
> struct pollfd. Removing entries is done by setting the events in a
> struct written to the device to POLLREMOVE.

And that's an ugly crap.

* you add struct {} into the kernel API. _Always_ a bad
idea.
* you either create yet another example of "every open() gives a
new instance" kind of device or you got to introduce a broker process.
* no easy way to check the current set.
* no fscking way to use that from scripts/etc.

> You can optionally mmap() memory which the notifications are written to.
> Two ioctl() calls are provide for the initial allocation and also to
> force it to check all items in your poll() list.

* useless use of ioctl() award

> Solaris has this same interface minus the mmap()'ed memory.

Oh, yes. Solaris. Great example of good taste...

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Linus Torvalds

On Mon, 23 Oct 2000, Linus Torvalds wrote:
> 
> > What is your favourite interface then ? 
> 
> I suspect a good interface that can easily be done efficiently would
> basically be something where the user _does_ do the equivalent of a
> read-only mmap() of poll entries - and explicit and controlled
> "add_entry()" and "remove_entry()" controls, so that the kernel can
> maintain the cache without playing tricks.

Actually, forget the mmap, it's not needed.

Here's a suggested "good" interface that would certainly be easy to
implement, and very easy to use, with none of the scalability issues that
many interfaces have.

First, let's see what is so nice about "select()" and "poll()". They do
have one _huge_ advantage, which is why you want to fall back on poll()
once the RT signal interface stops working. What is that?

Basically, RT signals or any kind of event queue has a major fundamental
queuing theory problem: if you have events happening really quickly, the
events pile up, and queuing theory tells you that as you start having
queueing problems, your latency increases, which in turn tends to mean
that later events are even more likely to queue up, and you end up in a
nasty meltdown schenario where your queues get longer and longer. 

This is why RT signals suck so badly as a generic interface - clearly we
cannot keep sending RT signals forever, because we'd run out of memory
just keeping the signal queue information around.

Neither poll() nor select() have this problem: they don't get more
expensive as you have more and more events - their expense is the number
of file descriptors, not the number of events per se. In fact, both poll()
and select() tend to perform _better_ when you have pending events, as
they are both amenable to optimizations when there is no need for waiting,
and scanning the arrays can use early-out semantics.

So sticky arrays of events are good, while queues are bad. Let's take that
as one of the fundamentals.

So why do people still like RT signals? They do have one advantage, which
is that you do NOT have that silly array traversal when there is nothing
to do. Basically, the RT signals kind of approach is really good for the
cases where select() and poll() suck: no need to traverse mostly empty and
non-changing arrays all the time.

It boils down to one very simple rule: dense arrays of sticky status
information is good. So let's design a good interface for a dense array.

Basically, the perfect interface for events would be

struct event {
unsigned long id;   /* file descriptor ID the event is on */
unsigned long event;/* bitmask of active events */
};

int get_events(struct event * event_array, int maxnr, struct timeval *tmout);

where you say "I want an array of pending events, and I have an array you
can fill with up to 'maxnr' events - and if you have no events for me,
please sleep until you get one, or until 'tmout'".

The above looks like a _really_ simple interface to me. Much simpler than
either select() or poll(). Agreed?

Now, we still need to inform the kernel of what kind of events we want, ie
the "binding" of events. The most straightforward way to do that is to
just do a simple "bind_event()" system call:

int bind_event(int fd, struct event *event);

which basically says: I'm interested in the events in "event" on the file
descriptor "fd". The way to stop being interested in events is to just set
the event bitmask to zero.

Now, the perfect interface would be the above. Nothing more. Nothing
fancy, nothing complicated. Only really simple stuff. Remember the old
rule: "keep it simple, stupid".

The really nice part of the above is that it's trivial to implement. It's
about 50 lines of code, plus some simple logic to various drivers etc to
actually inform about the events. The way to do this simply is to limit it
in very clear ways, the most notable one being simply that there is only
one event queue per process (or rather, per "struct files_struct" - so
threads would automatically share the event queue). This keeps the
implementation simple, but it's also what keeps the interfaces simple: no
queue ID's to pass around etc.

Implementation is straightforward: the event queue basically consists of

 - a queue head in "struct files_struct", initially empty.

 - doing a "bind_event()" basically adds a fasync entry to the file
   structure, but rather than cause a signal, it just looks at whether the
   fasync entry is already linked into the event queue, and if it isn't
   (and the event is one of the ones in the event bitmask), it adds itself
   to the event queue.

 - get_event() just traverses the event queue and fills in the array,
   removing them from the event queue. End of story. If the event queue is
   empty, it trivially sees that in a single line of code (+ timeout
   handling)

Advantage: everything is O(1), except for "get_event()" which is O(n)
where 'n' is the number of active eve

Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Jordan Mendelson

Linus Torvalds wrote:
> 
> On Tue, 24 Oct 2000, Andi Kleen wrote:
> >
> > I don't see the problem. You have the poll table allocated in the kernel,
> > the drivers directly change it and the user mmaps it (I was not proposing
> > to let poll make a kiobuf out of the passed array)

> Th eproblem with poll() as-is is that the user doesn't really tell the
> kernel explictly when it is changing the table..

What you describe is exactly what the /dev/poll interface patch from the
Linux scalability project does.

It creates a special device which you can open up and write
add/remove/modify entries you wish to be notified of using the standard
struct pollfd. Removing entries is done by setting the events in a
struct written to the device to POLLREMOVE.

You can optionally mmap() memory which the notifications are written to.
Two ioctl() calls are provide for the initial allocation and also to
force it to check all items in your poll() list.

Solaris has this same interface minus the mmap()'ed memory.

Jordan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Alexander Viro

On Mon, 23 Oct 2000, Linus Torvalds wrote:

> 
> 
> On Mon, 23 Oct 2000, Alexander Viro wrote:
> > 
> > Oh, crap... Who introduced ->i_mmap_shared/->i_mmap separation and what
> > analysis had been done? Petr, can you reproduce the problem on -test7?
> 
> I don't think that is it - that code looks very straightforward (and is
> needed on some silly architectures that cannot easily otherwise see if
> they need to be coherent wrt user space - mainly sparc and virtual
> caches).

OK, I see where the race can happen. Yes, vmtruncate() tries to kill the
mappings. Right. However, it does that _after_ truncate_inode_pages().
And there is a window when the only lock we are holding is ->i_sem.
sync_pte in that window ==> we are fucked.

So the question being: WTF do we postpone zapping the page tables until
after the truncate_inode_pages()? The following rules might make life
simpler, AFAICS:
* as soon as ->i_size is set, no new pagetable references to
off-limits pages can appear.
* as soon as we are don with vmtruncate_list() there is no
pagetable references.
* truncate_inode_pages() never has to deal with pages refered from
pagetables.
* ->i_size can't increase until we return from vmtruncate().

It's not the only problem, but I would feel _much_ safer if pagefault
wouldn't rely on pagecache miss. Actually... Hey. Why don't we do the
insertion into page tables _within_ ->nopage()? Look: let's take the tail
of do_no_page() into helper function and just call it from the end of
every bloody ->nopage() out there. It _is_ easy: we have only 9 instances
in the tree not counting filemap_nopage(). Moreover,
do_anonymous_page() will become symmetrical to the rest of the crowd.
Comments?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

USB Printer, in 2.4.0-test9

2000-10-23 Thread Benson Chow


Strange things here.

I'm testing out 2.4.0-test9 kernel with USB, (reiserfs built in, but,
hopefully this has nothing to do with it).

Hardware is a 440FX Dual PPro200/Natoma + 82371SB PIIX3/USB.
Printer: HP DeskJet 880C USB/Parallel
Mouse: Microsoft Intellimouse with Intellieye USB
Other stuff: Belkin "Macintosh" USB hub

Symptoms:

USB Mouse appears to work totally fine.

USB Printer will print a few bytes, and suddenly print garbage
(bitmap/pcl). I tried printing out some plain text and it's MANGLING bits
- corrupting random bytes of data.  The general structure of bytes are
still there, but the resultant printout is gibberish.  (source print file
is a text file printed via "cat file >/dev/usblp0" (which is device
180,0))

I get a bunch of form feeds too but it continues to print a few characters
fine and some that are totally wrong.  It looks like it's corrupting about
5% of the characters, including some high bit 7 characters.

Any ideas what's going on, and is this repeatable by anyone else?  Bad
hardware?  This SMP box only supports one form of the MPS, and I'm not
sure how to tell the difference... I also tried using the printer w/o the
hub, and same results...

Thanks!

-bc

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: serial problems

2000-10-23 Thread Theodore Y. Ts'o

   Date:Sat, 21 Oct 2000 23:45:58 +0200
   From: octave klaba <[EMAIL PROTECTED]>

   I use the serial cart with 5.03 / 2.2.17
   Serial driver version 5.03 (2000-08-11) with MANY_PORTS SHARE_IRQ SERIAL_PCI enabled
   00:0d.0 Serial controller: Timedia Technology Co Ltd: Unknown device 7168 (rev 01)

   I realized it can crash easy a system:
   run kermit on a ttySX to an another box.
   when you get the terminal just disconnect the cable 
   and connect it on a another serial port (without any output)
   now quit kermit: your system should be crashed.

Can you actually give me some details of how your system "crashed"?  It
certainly shouldn't have.  Kermit will sometimes hang waiting for the
terminal to flush if it's enabled hardware flow control and there are
characters pending to be flushed.  ^Z will generally break out of the
waiting loop, at which point you can kill the kermit process with a kill
command.  ^C will also break you out back to the kermit prompt, at which
point a second "quit" command will generally work.  

   another bug (known bug I think) is: if you use
   a serial cable not connected, your system crashs
   on the boot.

Huh?  I have several machines without a serial cable, and it certainly
doesn't crash on boot.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Linus Torvalds

On Tue, 24 Oct 2000, Andi Kleen wrote:
> 
> I don't see the problem. You have the poll table allocated in the kernel,
> the drivers directly change it and the user mmaps it (I was not proposing
> to let poll make a kiobuf out of the passed array) 

That's _not_ how poll() works at all.

We don't _have_ a poll table in the kernel, and no way to mmap it. The
poll() tables gets created dynamically based on the stuff that the user
has set up in the table. And the user can obviously change the fd's etc in
the table directly, so in order for the caching to work you need to do
various games with page table dirty or writable bits, or at least test
dynamically whether the poll table is the same as it was before.

Sure, it's doable, and apparently Solaris does something like this. But
what _is_ the overhead of the Solaris code for small number of fd's? I bet
it really is quite noticeable. I also suspect it is very optimized toward
an unchangning poll-table.

> What is your favourite interface then ? 

I suspect a good interface that can easily be done efficiently would
basically be something where the user _does_ do the equivalent of a
read-only mmap() of poll entries - and explicit and controlled
"add_entry()" and "remove_entry()" controls, so that the kernel can
maintain the cache without playing tricks.

Basically, something like a user interface to something that looks like
the linux poll_table_page structures, with the difference being that it
doesn't have to be created and torn down all the time because the user
would explicitly ask for "add this fd" and "remove this fd" from the
table.

Th eproblem with poll() as-is is that the user doesn't really tell the
kernel explictly when it is changing the table..

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Dan Kegel

Nick Piggin ([EMAIL PROTECTED]) wrote:
> > I'm trying to write a server that handles 1 clients. On 2.4.x, 
> > the RT signal queue stuff looks like the way to achieve that. 
> 
> I would suggest you try multiple polling threads. Not only will you get 
> better SMP scalability, if you have say 16 threads, each one only has to 
> handle ~ 600 fds. 

Good point.  My code is already able to use multiple network threads,
and I have done what you suggest in the past.

But I'm interested in pushing the state of the art here,
and want to see if Linux can handle it with just a single
network thread.  (My server has enough non-network threads to
keep multiple CPUs busy, don't worry :-)

- Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Dan Kegel


David Schwartz wrote:
> > I'm trying to write a server that handles 1 clients.  On 2.4.x,
> > the RT signal queue stuff looks like the way to achieve that.
> > Unfortunately, when the RT signal queue overflows, the consensus seems
> > to be that you fall back to a big poll().   And even though the RT signal
> > queue [almost] never overflows, it certainly can, and servers have to be
> > able to handle it.
> 
> Don't let that bother you. In the case where you get a hit a significant
> fraction of the descriptors you are polling on, poll is very efficient. The
> inefficiency comes when you have to wade through 10,000 uninteresting file
> descriptors to find the one interesting one. If the poll set is rich in
> ready descriptors, there is little advantage to signal queues over poll
> itself.
> 
> In fact, if you assume the percentage of ready file descriptors (as opposed
> to the number of file descriptors) is constant, then poll is just as
> scalable (theoretically) as any other method. Under both schemes, with twice
> as many file descriptors you have to do twice as much work.

Yep, I've made similar arguments myself.  It's just that seeing
poll() take 14 milliseconds to return on a 650 MHz system is a little daunting.
I'll report again when I have results for RT signal stuff
and different percentages of idle sockets (probably 0, 1, and 10).

- Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds

On Mon, 23 Oct 2000, Alexander Viro wrote:
> 
> Oh, crap... Who introduced ->i_mmap_shared/->i_mmap separation and what
> analysis had been done? Petr, can you reproduce the problem on -test7?

I don't think that is it - that code looks very straightforward (and is
needed on some silly architectures that cannot easily otherwise see if
they need to be coherent wrt user space - mainly sparc and virtual
caches).

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

[drm:drm_release] ERROR Process 256 dead

2000-10-23 Thread Burton Windle


Sorry if this is user-error, but after about 20min of using
2.4.0-test10-pre5, my Debian Woody system dropped out of X with this
message in syslog:

[drm:drm_release] *ERROR* Process 256 dead, freeing lock for context 1

I've never seen this before; I had been using test10-pre4 for several days
without error.

Is this a kernel bugglet/error, or X?

-- 
Burton Windle   [EMAIL PROTECTED]
Linux: the "grim reaper of innocent orphaned children."
  from /usr/src/linux/init/main.c:1384

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Alexander Viro




Oh, crap... Who introduced ->i_mmap_shared/->i_mmap separation and what
analysis had been done? Petr, can you reproduce the problem on -test7?
Unfortunately, clean test would take the backport of ext2 changes
(truncate-related, happened around the same time), but IIRC -test7 was
relatively stable...


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Jordan Mendelson

Dan Kegel wrote:
> 
> Jordan Mendelson ([EMAIL PROTECTED]) wrote:
> > An implementation of /dev/poll for Linux already exists and has shown to
> > be more scalable than using RT signals under my tests. A patch for 2.2.x
> > and 2.4.x should be available at the Linux Scalability Project @
> > http://www.citi.umich.edu/projects/linux-scalability/ in the patches
> > section.
> 
> If you'll look at the page I linked to in my original post,
>   http://www.kegel.com/dkftpbench/Poller_bench.html
> you'll see that I also benchmarked /dev/poll.

The Linux /dev/poll implementation has a few "non-standard" features
such as the ability to mmap() the poll structure memory to eliminate a
memory copy.

int dpoll_fd;
unsigned char *dpoll;
struct pollfd *mmap_dpoll;

dpoll_fd = open("/dev/poll", O_RDWR, 0);
ioctl(dpoll_fd, DP_ALLOC, 1);
dpoll = mmap(0, DP_MMAP_SIZE(1), PROT_WRITE|PROT_READ,
MAP_SHARED,   dpoll_fd, 0);

dpoll = (struct pollfd *)mmap_dpoll;

Use this memory when reading and write() to add/remove and see if you
get any boost in performance. Also, I believe there is a hash table
associated with /dev/poll in the kernel patch which might slow down your
performance tests when it's first growing to resize itself.

Jordan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Getting started

2000-10-23 Thread Avery Fay


Hello,

I'm a senior in high school with a fair bit of programming experience
looking to get involved with the linux kernel. I'm currently reading a book
on OS design and last summer I hacked the IP masquerading and port
forwarding code for an internal company project so I'm not completely
unknowledgable. What's a good project to start with? I've heard that device
drivers are a fairly easy starting point. Unfortunately, all of my hardware
already works with linux. Is there a list of hardware that does not work
with linux anywhere? Can someone suggest an elegant, fairly simply device
driver that I can take a look at?

I also know a little bit of TCP/IP networking. Are there any small features
currently missing from Linux's TCP/IP code?

No need to reply directly to me, I'm subscribed.

Thanks in advance,
Avery Fay





-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

RE: Linux's implementation of poll() not scalable?

2000-10-23 Thread David Schwartz



> I'm trying to write a server that handles 1 clients.  On 2.4.x,
> the RT signal queue stuff looks like the way to achieve that.
> Unfortunately, when the RT signal queue overflows, the consensus seems
> to be that you fall back to a big poll().   And even though the RT signal
> queue [almost] never overflows, it certainly can, and servers have to be
> able to handle it.

Don't let that bother you. In the case where you get a hit a significant
fraction of the descriptors you are polling on, poll is very efficient. The
inefficiency comes when you have to wade through 10,000 uninteresting file
descriptors to find the one interesting one. If the poll set is rich in
ready descriptors, there is little advantage to signal queues over poll
itself.

In fact, if you assume the percentage of ready file descriptors (as opposed
to the number of file descriptors) is constant, then poll is just as
scalable (theoretically) as any other method. Under both schemes, with twice
as many file descriptors you have to do twice as much work.

DS

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Netscape mail sometimes frezze when i read mail stored in a vfat partition

2000-10-23 Thread Cristofer Velloso


 
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit


[1.] One line summary of the problem:

Netscape mail sometimes frezze when i read mail stored in a vfat
partition

[2.] Full description of the problem/report:

When it happen the dmesg show as a kernel bug, and netscape got Z state:
  213 tty1 Z  0:00 [netscape ]

I use netscape with nsmail linked with a Maildir of windows version of
netscape to use the same folders for linux and windows. It happen with
kernel-2.4.0test8 .. and i upgrade to test9 because i see some updates
and
file.c but in ext2 fs not in vfat fs ...


 
[3.] Keywords (i.e., modules, networking, kernel):

 vfat,netscape,file.c

[4.] Kernel version (from /proc/version):

Linux version 2.4.0-test9 (root@soft) (gcc version egcs-2.91.66
19990314/Linux (egcs-1.1.2 release)) #1 sex out 20 12:23:54 BRST 2000

[5.] Output of Oops.. message (if applicable) with symbolic information
 resolved (see Documentation/oops-tracing.txt)

kernel BUG at file.c:79!
invalid operand: 
CPU:0
EIP:0010:[]
EFLAGS: 00013296
eax: 0019   ebx:    ecx: c3e793e0   edx: 0005
esi: 0020   edi: c170d800   ebp: c0e882c0   esp: c2d89e54
ds: 0018   es: 0018   ss: 0018
Process netscape (pid: 198, stackpage=c2d89000)
Stack: c0294645 c02947c7 004f c015f868 0200 0004 c2d89eb8
0010 
   c0131f16 c170d800 0020 c0e882c0 0001 c103e1c4 
0004 
   1000  c0e52ea0 c2d89eb8 c0e882c0 c0e9d000 0200
0020 
Call Trace: [] [] [] []
[]
[] [] 
   [] [] [] [] []
[] [] [] 
Code: 0f 0b 83 c4 0c 66 8b 47 20 66 89 45 0c 89 5d 04 80 4d 18 30 

[6.] A small shell script or example program which triggers the
 problem (if possible)
Netscape 4.75

[7.] Environment

[7.1.] Software (add the output of the ver_linux script here)

Distribution: Slackware 7.1 
Netscape 4.75

Linux soft 2.4.0-test9 #1 sex out 20 12:23:54 BRST 2000 i586 unknown
Kernel modules 2.3.13
Gnu C  egcs-2.91.66
Gnu Make   3.79
Binutils   2.9.1.0.25
Linux C Library2.1.3
Dynamic linker ldd: version 1.9.9
Procps 2.0.6
Mount  2.10l
Net-tools  1.55
Kbd0.99
Sh-utils   2.0
Modules Loaded 

[7.2.] Processor information (from /proc/cpuinfo):

processor   : 0
vendor_id   : AuthenticAMD
cpu family  : 5
model   : 8
model name  : AMD-K6(tm) 3D processor
stepping: 0
cpu MHz : 300.000689
cache size  : 64 KB
fdiv_bug: no
hlt_bug : no
sep_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception   : yes
cpuid level : 1
wp  : yes
flags   : fpu vme de pse tsc msr mce cx8 sep mmx 3dnow
bogomips: 599.65

[7.3.] Module information (from /proc/modules):

lp  5308   2 (autoclean)

[7.4.] Loaded driver and hardware information (/proc/ioports,
/proc/iomem)

/proc/ioports 

-001f : dma1
0020-003f : pic1
0040-005f : timer
0060-006f : keyboard
0080-008f : dma page reg
00a0-00bf : pic2
00c0-00df : dma2
00f0-00ff : fpu
0170-0177 : ide1
01f0-01f7 : ide0
0213-0213 : isapnp read
0220-022f : soundblaster
02f8-02ff : serial(auto)
0330-0333 : MPU-401 UART
0376-0376 : ide1
03bc-03be : parport0
03c0-03df : vga+
03f6-03f6 : ide0
03f8-03ff : serial(auto)
0620-0623 : sound driver (AWE32)
0a20-0a23 : sound driver (AWE32)
0a79-0a79 : isapnp write
0cf8-0cff : PCI conf1
0e20-0e23 : sound driver (AWE32)
5c20-5c3f : Acer Laboratories Inc. [ALi] M7101 PMU
b400-b40f : Acer Laboratories Inc. [ALi] M5229 IDE
  b400-b407 : ide0
  b408-b40f : ide1
b800-b81f : Realtek Semiconductor Co., Ltd. RTL-8029(AS)
  b800-b81f : NE2000
d000-dfff : PCI Bus #01
  d800-d8ff : 3Dfx Interactive, Inc. Voodoo Banshee

/proc/iomem
-0009 : System RAM
000a-000b : Video RAM area
000c-000c7fff : Video ROM
000f-000f : System ROM
0010-07ffbfff : System RAM
  0010-002f195f : Kernel code
  002f1960-0031aae3 : Kernel data
07ffc000-07ffefff : ACPI Tables
07fff000-07ff : ACPI Non-volatile Storage
de00-dfff : PCI Bus #01
  de00-dfff : 3Dfx Interactive, Inc. Voodoo Banshee
e000-e3ff : Acer Laboratories Inc. [ALi] M1541
e5f0-e7ff : PCI Bus #01
  e600-e7ff : 3Dfx Interactive, Inc. Voodoo Banshee
- : reserved


[7.5.] PCI information ('lspci -vvv' as root)

00:00.0 Host bridge: Acer Laboratories Inc. [ALi] M1541 (rev 04)
Subsystem: Acer Laboratories Inc. [ALi] ALI M1541 Aladdin V/V+
AGP System Controller
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=slow >TAbort-
SERR- 
Capabilities: [e0] #00 []

00:01.0 PCI bridge: Acer Laboratories Inc. [ALi] M5243 (rev 04) (prog-if
00 [Normal decode])
Control: I/O+ Me

[PATCH] fs/nls/Config.in -- Again

2000-10-23 Thread Tom Rini


This is the suggestion that Petr made and was approved by the maintainer.
Can we get this in 2.4.0-test10-pre6 please?  Without this patch, if
you have CONFIG_INET turned off, you have to go through the CONFIG_NLS
stuffs.

-- 
Tom Rini (TR1265)
http://gate.crashing.org/~trini/


--- fs/nls/Config.in.orig   Thu Oct 19 12:54:09 2000
+++ fs/nls/Config.inThu Oct 19 12:54:32 2000
@@ -2,10 +2,17 @@
 # Native language support configuration
 #
 
+# smb wants NLS
+if [ "$CONFIG_SMB_FS" = "m" -o "$CONFIG_SMB_FS" = "y" ]; then
+  define_bool CONFIG_SMB_NLS y
+else
+  define_bool CONFIG_SMB_NLS n
+fi
+
 # msdos and Joliet want NLS
 if [ "$CONFIG_JOLIET" = "y" -o "$CONFIG_FAT_FS" != "n" \
-o "$CONFIG_NTFS_FS" != "n" -o "$CONFIG_NCPFS_NLS" = "y" \
-   -o "$CONFIG_SMB_FS" != "n" ]; then
+   -o "$CONFIG_SMB_NLS" = "y" ]; then
   define_bool CONFIG_NLS y
 else
   define_bool CONFIG_NLS n

Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Andi Kleen

On Mon, Oct 23, 2000 at 06:42:39PM -0700, Linus Torvalds wrote:
> 
> 
> On Tue, 24 Oct 2000, Andi Kleen wrote:
> > 
> > Also with the poll table mmap'ed via /dev/poll and the optimizations I
> > described poll could be made quite nice (I know that queued SIGIO exists,
> > but it has its drawbacks too and often you need to fallback to poll anyways) 
> 
> The problem is that your proposed optimizations would be horrible: the
> poll events themselves happen when some other process is running, so you'd
> have to do some serious VM hacking and consider the poll table to be some
> kind of direct-IO thing etc. Ugh. 

I don't see the problem. You have the poll table allocated in the kernel,
the drivers directly change it and the user mmaps it (I was not proposing
to let poll make a kiobuf out of the passed array) 

> 
> And the file->fd mapping is not a simple mapping, it'a s 1:m mapping from
> file -> potentially many [process,fd] pairs.

Yes, but in 95% of all cases it is a 1:1 mapping, so you could just
fall back to the old slow method when that isn't the case.
> 
> Doing the caching across multiple poll-calls is even worse, you'd have to
> cache where in user space people had the array etc. Not pretty.

The file -> fdnum reverse table does not depend
On every poll call you can walk the poll table and fix the pointer from file
to poll entry. That can be done cheaply during copyin for normal poll.
mmap'ed /dev/poll could probably use a lazy method (cache the pointers and
verify and walk if it the verify fails) 

> I'm sure it can be speeded up, I'm just not sure it's really worth it if
> you'd instead just have a better interface that wouldn't need this crap,
> and consider poll() as just a compatibility layer.

What is your favourite interface then ? 
ndi
/dev/poll is nice in that it can be relatively easy hacked into older programs.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Alexander Viro

On Mon, 23 Oct 2000, Linus Torvalds wrote:

> Also, the fact that Petr didn't see anything trigger in nopage() makes me
> nervous again. Even if the problem happened during read-ahead, it should
> have gotten into the address space only through nopage. Maybe there is
> some vma that isn't added to the right inode VM list - so that we end up
> missing part of the vmtruncate() stuff?
> 
> Al, any ideas?

Just one: let's slow down a bit and try to write down the rules. Close to
release or not, let's understand WTF happens in that part of VM before
deciding on the choice of band-aids/fixes. Frankly, right now I don't feel
that area - too many changes during the last month. So I'm going to sit
down and read it through. IMO it's worth the delay - I have a gut feeling
that band-aids are going to cost more.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

test10-pre5

2000-10-23 Thread Linus Torvalds



Ok, the issue with Petr is still open, in the meantime the test10-pre5
stuff fixes various other small nagging issues (notably the silly
extraneous BUG() tests that bit lots of people - sorry).

Linus

-
 - pre5:
- Mikael Pettersson: more Pentium IV cleanup.
- David Miller: non-x86 platforms missed "pte_same()".
- Russell King: NFS invalidate_inode_pages() can do bad things!
- Randy Dunlap: usb-core.c is gone - module fix
- Ben LaHaise: swapcache fixups for the new atomic pte update code
- Oleg Drokin: fix nm256_audio memory region confusion
- Randy Dunlap: USB printer fixes
- David Miller: sparc updates
- David Miller: off-by-one error in /proc socket dumper
- David Miller: restore non-local bind() behaviour.
- David Miller: wakeups on socket shutdown()
- Jeff Garzik: DEPCA net drvr fixes and CodingStyle
- Jeff Garzik: netsemi net drvr fix
- Jeff Garzik & Andrea Arkangeli: keyboard cleanup
- Jeff Garzik: VIA audio update
- Andrea Arkangeli: mxcsr initialization cleanup and fix
- Gabriel Paubert: better twd_i387_to_fxsr() emulation
- Andries Brouwer: proper error return in ext2 mkdir()

 - pre4:
- disable writing to /proc/xxx/mem. Sure, it works now, but it's still
  a security risk.
- IDE driver update (Victroy66 SouthBridge support)
- i810 rng driver cleanup
- fix sbus Makefile
- named initializers in module..
- ppoe: remove explicit initializer - it's done with initcalls.
- x86 WP bit detection: do it cleanly with exception handling
- Arnaldo Carvalho de Melo: memory leaks in drivers/media/video
- Bartlomiej Zolnierkiewicz: video init functions get __init
- David Miller: get rid of net/protocols.c - they get to initialize themselves
- David Miller: get rid of dev_mc_lock - we hold dev->xmit_lock anyway.
- Geert Uytterhoeven: Zorro (Amiga) bus support update
- David Miller: work around gcc-2.7.2 bug
- Geert Uytterhoeven: mark struct consw's "const".
- Jeff Garzik: network driver cleanups, ns558 joystick driver oops fix
- Tigran Aivazian: clean up __alloc_pages(), kill_super() and
  notify_change()
- Tigran Aivazian: move stuff from .data to .bss
- Jeff Garzik: divert.h typename cleanups
- James Simmons: mdacon using spinlocks
- Tigran Aivazian: fix BFS free block calculation
- David Miller: sparc32 works again
- Bernd Schmidt: fix undefined C code (set/use without a sequence point)
- Mikael Pettersson: nicer Pentium IV setup handling.
- Georg Acher: usb-uhci cpia oops fix
- Kanoj Sarcar: more node_data cleanups for [non]NUMA.
- Richard Henderson: alpha update to new vmalloc setup
- Ben LaHaise: atomic pte updates (don't lose dirty bit)
- David Brownell: ohci memory debugging (== use separate slabs for allocation)

 - pre3:
- update email address of Joerg Reuter
- Andries Brouwer: spelling fixes, missing atari brelse(), breada() fix
- Geert Uytterhoeven: used named initializers for "struct console".
- Carsten Paeth: ISDN capifs - iput() only once.
- Petr Vandrovec: VFAT short name generation fix
- Jeff Garzik: i810_rng cleanup, and i815 chipset added.
- Bartlomiej Zolnierkiewicz: clean up some remaining old-style Makefiles
- Dave Jones: x86 setup fixes (recognize Pentium IV etc).
- x86: do the "fast A20" setup too in setup.S
- NIIBE Yutaka: update SuperH for the global page table (vmalloc) change.
- David Miller: sparc updates (vmalloc stuff still pending)
- David Miller: CodaFS warnings and 64-bit warnings in pci_size()
- David Miller: pcnet32 - correct NULL test
- David Miller: vmlist lock -> page_table_lock clarification
- Trond Myklebust: Ouch. rpcauth_lookup_credcache() memory corruption bug
- Matthew Wilcox: file locking cleanups
- David Woodhouse: USB audio spinlock fixes
- Torben Mathiasen: tlan driver cleanups
- Randy Dunlap: Yenta: CACHE_LINE_SIZE is in dwords, not bytes.
- Randy Dunlap: more USB updates
- Kanoj Sarcar: clean up the NUMA interfaces (pg_data instead of nodes)
- "save_fpu()" was broken. Need to clear pending errors: save_init_fpu().

 - pre2:
- remember to change the kernel version ;)
- isapnp.txt bugfix
- ia64 update
- sparc update
- networking update (pppoe init, frame diverter, fix tcp_sendmsg,
  fix udp_recvmsg).
- Compile for WinChip must _not_ use "-march=i686". It's a i586.
- Randy Dunlap: more USB updates
- clarify the Firewire AIC-5800 situation. It's not supported yet.
- PCI-space decode size fix. This is needed for some (broken?) hardware
- /proc/self/maps off-by-one error
- 3c501, 3c507, cs89x0 network drivers drop unnecessary check_region
- Asahi Kasei AK4540: new codec ID. Yamaha: new PCI ID's.
- ne2k-pci net driver documentation update
- Paul Gortmaker: delete paranoia check in rtc_exit
- scsi_merge: memset

Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Linus Torvalds

On Tue, 24 Oct 2000, Andi Kleen wrote:
> 
> Also with the poll table mmap'ed via /dev/poll and the optimizations I
> described poll could be made quite nice (I know that queued SIGIO exists,
> but it has its drawbacks too and often you need to fallback to poll anyways) 

The problem is that your proposed optimizations would be horrible: the
poll events themselves happen when some other process is running, so you'd
have to do some serious VM hacking and consider the poll table to be some
kind of direct-IO thing etc. Ugh. 

And the file->fd mapping is not a simple mapping, it'a s 1:m mapping from
file -> potentially many [process,fd] pairs.

Doing the caching across multiple poll-calls is even worse, you'd have to
cache where in user space people had the array etc. Not pretty.

I'm sure it can be speeded up, I'm just not sure it's really worth it if
you'd instead just have a better interface that wouldn't need this crap,
and consider poll() as just a compatibility layer.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds

On Mon, 23 Oct 2000, Linus Torvalds wrote:
> 
> I'm starting to suspect that we leave this path as-is, and just fix the
> mapping case (and PageUptodate() can work there). That should also avoid
> the nasties.

..and even that looks like I'd have to do the quick-and-dirty case with
the race still there on SMP. Adding the page Uptodate logic to the VM
layer proper is too painful at this point - every single nopage function
would have to be updated to mark its page up-to-date, as they don't
generally do that currently.

Also, the fact that Petr didn't see anything trigger in nopage() makes me
nervous again. Even if the problem happened during read-ahead, it should
have gotten into the address space only through nopage. Maybe there is
some vma that isn't added to the right inode VM list - so that we end up
missing part of the vmtruncate() stuff?

Al, any ideas?

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Andi Kleen


On Mon, Oct 23, 2000 at 06:16:24PM -0700, Linus Torvalds wrote:
> 
> 
> On Tue, 24 Oct 2000, Andi Kleen wrote:
> > 
> > It would be possible to setup a file -> fdnum reverse table (possibly cached 
> > over poll calls, I think Solaris does that) and let the async events directly 
> > change the bits in the output buffer in O(1).
> 
> I disagree.
> 
> Let's just face it, poll() is a bad interface scalability-wise.

It is, but Linux poll is far from being a good implementation even in the
limitations of the interfaces, as the comparison with Solaris shows. It is
inherently O(n), but the constant part of that in Linux could be certainly 
improved. 

Also with the poll table mmap'ed via /dev/poll and the optimizations I
described poll could be made quite nice (I know that queued SIGIO exists,
but it has its drawbacks too and often you need to fallback to poll anyways) 



> 
> > Also the current 2.4 poll is very wasteful both in memory and cycles
> > for small numbers of fd.
> 
> Yes, we could go back to the "optimize the case of n < 8" thing. Which
> should be particularly easy with the new "poll_table_page" setup: the
> thing is more abstracted than it used to be in 2.2.

I did it for n < 40. It also put the wait queues on the stack, in that 
it is even more efficient than 2.0
I attached the patch for your reference, it is against test3 or so 
(but iirc select/poll has not changed after that). I'm not proposing it
for inclusion right now. 

It does only the easy parts of course, I think with some VFS extensions
it could be made a lot better.

BTW the linux select can be get a lower constant part of n too by simply
testing whole words of the FD_SET for being zero and skipping them quickly
(that makes it a lot cheaper when you have bigger holes in the select fd
space). The patch does that too. 


-Andi



--- linux/fs/select.c   Mon Jul 24 07:39:44 2000
+++ linux-work/fs/select.c  Sat Aug  5 17:23:18 2000
@@ -12,6 +12,9 @@
  *  24 January 2000
  * Changed sys_poll()/do_poll() to use PAGE_SIZE chunk-based allocation 
  * of fds to overcome nfds < 16390 descriptors limit (Tigran Aivazian).
+ *
+ *  July 2000
+ * Add fast poll/select (Andi Kleen)
  */
 
 #include 
@@ -24,6 +27,8 @@
 #define ROUND_UP(x,y) (((x)+(y)-1)/(y))
 #define DEFAULT_POLLMASK (POLLIN | POLLOUT | POLLRDNORM | POLLWRNORM)
 
+#define __MAX_POLL_TABLE_ENTRIES ((PAGE_SIZE - sizeof (struct poll_table_page)) / 
+sizeof (struct poll_table_entry))
+
 struct poll_table_entry {
struct file * filp;
wait_queue_t wait;
@@ -32,42 +37,43 @@
 
 struct poll_table_page {
struct poll_table_page * next;
-   struct poll_table_entry * entry;
+   int nr, max; 
struct poll_table_entry entries[0];
 };
 
-#define POLL_TABLE_FULL(table) \
-   ((unsigned long)((table)->entry+1) > PAGE_SIZE + (unsigned long)(table))
-
 /*
- * Ok, Peter made a complicated, but straightforward multiple_wait() function.
- * I have rewritten this, taking some shortcuts: This code may not be easy to
- * follow, but it should be free of race-conditions, and it's practical. If you
- * understand what I'm doing here, then you understand how the linux
- * sleep/wakeup mechanism works.
- *
- * Two very simple procedures, poll_wait() and poll_freewait() make all the
- * work.  poll_wait() is an inline-function defined in ,
- * as all select/poll functions have to call it to add an entry to the
- * poll table.
+ * Tune fast poll/select. Limit is the kernel stack.
  */
+#define FAST_SELECT_LIMIT 40
+#define FAST_POLL_LIMIT   40
+
+#define FAST_SELECT_BYTES ((FAST_SELECT_LIMIT + 7) / 8) 
+#define FAST_SELECT_ULONG \
+   ((FAST_SELECT_BYTES + sizeof(unsigned long) - 1) / sizeof(unsigned long))
+
+static inline void do_freewait(struct poll_table_page *p)
+{
+   struct poll_table_entry * entry;
+   entry = p->entries + p->nr;
+   while (p->nr > 0) {
+   p->nr--;
+   entry--;
+   remove_wait_queue(entry->wait_address,&entry->wait);
+   fput(entry->filp);
+   }
+}
 
 void poll_freewait(poll_table* pt)
 {
+   struct poll_table_page *old;
struct poll_table_page * p = pt->table;
while (p) {
-   struct poll_table_entry * entry;
-   struct poll_table_page *old;
-
-   entry = p->entry;
-   do {
-   entry--;
-   remove_wait_queue(entry->wait_address,&entry->wait);
-   fput(entry->filp);
-   } while (entry > p->entries);
old = p;
+   do_freewait(p); 
p = p->next;
-   free_page((unsigned long) old);
+   if (old->max == __MAX_POLL_TABLE_ENTRIES) { 
+   free_page((unsigned long) old);
+   } 
}
 }
 
@@ -75,7 +81,7 @@
 {
struct poll_table_page *table = p->table;
 
-   if (!table || POLL_TABLE_FULL(table)) {
+   if (!table ||

Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Nick Piggin


> I'm trying to write a server that handles 1 clients.  On 2.4.x,
> the RT signal queue stuff looks like the way to achieve that.

I would suggest you try multiple polling threads. Not only will you get
better SMP scalability, if you have say 16 threads, each one only has to
handle ~ 600 fds.

Nick.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Linus Torvalds

On Tue, 24 Oct 2000, Andi Kleen wrote:
> 
> It would be possible to setup a file -> fdnum reverse table (possibly cached 
> over poll calls, I think Solaris does that) and let the async events directly 
> change the bits in the output buffer in O(1).

I disagree.

Let's just face it, poll() is a bad interface scalability-wise.

If you want to do efficient events, you should have some other approach,
like an event queue. Yes, I know it's a dirty word because NT uses it, but
let's face it, poll() was a hack to make it easier to do something
select()-like but with the same machinery as select.

> Also the current 2.4 poll is very wasteful both in memory and cycles
> for small numbers of fd.

Yes, we could go back to the "optimize the case of n < 8" thing. Which
should be particularly easy with the new "poll_table_page" setup: the
thing is more abstracted than it used to be in 2.2.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

RAID setup

2000-10-23 Thread Anil kumar


Hi,
  Does Linux give options to choose RAID levels from  

  BIOS?


with regards,
  Anil

__
Do You Yahoo!?
Yahoo! Messenger - Talk while you surf!  It's FREE.
http://im.yahoo.com/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: make -j 2 broken?

2000-10-23 Thread David Weinehall


On Mon, Oct 23, 2000 at 05:22:22PM -0400, Wakko Warner wrote:
> > Trying to compile the current kernel (test10-pre4) with:
> > 
> > > make clean
> > > make -j 2 bzImages modules modules_install 
> > 
> > will try to install the modules before they are built...
> > This has previously been working (at least in early testX kernels).
> 
> I've done it before when I wasn't thinking, since it parallels those,
> modules_install just calls modules_install.  AFAIK, it doesn't depend on
> modules actually being compiled.
> 
> Typically, I do:
> make -j 20 dep;make -j 20 bzImage modules;make modules_install
> 
> Yes, I've heard about the paralleled dep problem (there was one, right?),
> but it hasn't effected me.

Or even better, to be sure you don't get any unpleasant surprises:

make -j 20 dep && make -j 20 bzImage modules && make modules_install


/David
  _ _
 // David Weinehall <[EMAIL PROTECTED]> /> Northern lights wander  \\
//  Project MCA Linux hacker//  Dance across the winter sky //
\>  http://www.acc.umu.se/~tao/http://www.tux.org/lkml/

Re: IDE-Floppy and devfs

2000-10-23 Thread Joshua Jore


I did report this back a month or so ago. Another work around is to have a
device node (22,0 in my case) around to bang on for a sec. After having
the kernel poke the device, the appropriate devfs node appears
automagically.

My 0.02USD.

Josh

On Sun, 22 Oct 2000, Andreas Franck wrote:

> Hi Paul, hi linux-kernel audience,
> 
> some szggestion for the upcoming 2.4 release ide-floppy driver: It
> really works
> nice without devfs, but the ide-floppy behaviour in connection with
> devfs is
> a bit strange.
> 
> If ide-floppy is compiled as a module, which is loaded (or autoloaded by
> some smart 
> devfs rule) when no disk is in the drive, NO devfs entries are created,
> so there
> is no way to access the drive. Even worse, when a disk is inserted, the
> module is
> still loaded so there is no way to access the disk! Only manual
> unloading and reloading
> of the module will do the trick.
> 
> I'd suggest something like the cdrom approach: There is always one
> device node for
> removable devices, regardless of any media present in the drive. This
> could
> be /dev/ide/host0/bus1/target1/lun0/floppy or something like that. 
> 
> Accessing this file should trigger a probe for the media (which may be
> partitioned
> in ide-floppy devices, which makes life a bit harder). By this probe,
> the 
> /dev/ide/.../lun0/disc and /dev/ide/.../lun0/part4 (or any other
> partitions) 
> might be created, if there is a medium in the drive; a symlink to the
> "right" partition
> (part4 for normal ZIP disks, AFAIK) shhould be placed in a directory of
> its own,
> for example /dev/ide/floppy/c0b1t1u0, and anotherone perhaps in
> /dev/idefloppy/floppy0
> or something.
> 
> I have already implemented the first half of this (creation of the
> floppy node which will
> trigger the partition scan when accessed), I have attached my patch for
> review here.
> Its still quite hackish, but I'm sure you can follow my ideas with what
> I explained above
> - if not, don't hesitate to ask.
> 
> ->- snip -<-
> 
> -- 
> ->>>--- Andreas Franck <<<-
> ---<<< [EMAIL PROTECTED] --->>>---
> ->>> Keep smiling! <<<-
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> Please read the FAQ at http://www.tux.org/lkml/
> 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

[OOPS] with 2.2.17+ide+ext3

2000-10-23 Thread Steven Walter



Casually browsing through my system logs, I came upon this two oopses
that happened together (logged as same second).  I don't really remember
what situation was surrounding, or even if any interruption was
experienced.  The system did totally freeze just under 30 minutes later,
however, with no oops logged at that time.  Neither SysRq nor
Ctrl+Alt+Del responded; I had to hit the reset button on the case.
Attached are the two recorded oopses as processed by ksymoops.

If it makes much difference, the oops wasn't run through ksymoops until
after a reboot.


ksymoops 2.3.4 on i586 2.2.17ext3.  Options used
 -V (default)
 -k /proc/ksyms (default)
 -l /proc/modules (default)
 -o /lib/modules/2.2.17ext3/ (default)
 -m /boot/System.map-2.2.17ext3 (default)

Warning: You did not tell me where to find symbol information.  I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc.  ksymoops -h explains the options.

Warning (compare_maps): mismatch on symbol V32U96eyeLocation  , pctel says c410b5b0, 
/lib/modules/2.2.17ext3/misc/pctel.o says c41284d4.  Ignoring 
/lib/modules/2.2.17ext3/misc/pctel.o entry
Oops: 
CPU:0
EIP:0010:[sock_poll+26/48]
EFLAGS: 00013293
eax: 9d129dbe   ebx: c12da8c0   ecx: c1f87000   edx: c12da8c0
esi:    edi: 0400   ebp: 000a   esp: c2241ee8
ds: 0018   es: 0018   ss: 0018
Process X (pid: 288, process nr: 28, stackpage=c2241000)
Stack:  0040  0020 c2c43540  c012fa5a c12da8c0 
   c1f87000  1000 0008 0020 c2c43540   
Call Trace: [do_select+274/512] [sys_select+881/1176] [sys_gettimeofday+32/148] 
[system_call+52/56] 
Code: 8b 50 08 51 50 53 8b 42 20 ff d0 83 c4 10 5b 83 c4 18 c3 8d 
Using defaults from ksymoops -t elf32-i386 -a i386

Code;   Before first symbol
 <_EIP>:
Code;   Before first symbol
   0:   8b 50 08  mov0x8(%eax),%edx
Code;  0003 Before first symbol
   3:   51push   %ecx
Code;  0004 Before first symbol
   4:   50push   %eax
Code;  0005 Before first symbol
   5:   53push   %ebx
Code;  0006 Before first symbol
   6:   8b 42 20  mov0x20(%edx),%eax
Code;  0009 Before first symbol
   9:   ff d0 call   *%eax
Code;  000b Before first symbol
   b:   83 c4 10  add$0x10,%esp
Code;  000e Before first symbol
   e:   5bpop%ebx
Code;  000f Before first symbol
   f:   83 c4 18  add$0x18,%esp
Code;  0012 Before first symbol
  12:   c3ret
Code;  0013 Before first symbol
  13:   8d 00 lea(%eax),%eax

Oops: 
CPU:0
EIP:0010:[locks_remove_posix+32/148]
EFLAGS: 00013286
eax: 9d129d12   ebx: c12da8c0   ecx: 9d129d12   edx: c12da8c0
esi: c0024e00   edi:    ebp: 9d129d82   esp: c2241d6c
ds: 0018   es: 0018   ss: 0018
Process X (pid: 288, process nr: 28, stackpage=c2241000)
Stack:  0001 9d129d12  0300 c0126a68 c07753e0 9d129dc6 
   c32f7af8 c0126a8e c07753e0 c0221da4 c0221da3 3246 c2b37660 c0221da3 
   c0221da4 c01133d3 c3def6e0 c32f7ae0 c2242000 c012596b c12da8c0 c2263400 
Call Trace: [fput+32/84] [fput+70/84] [mmput+63/72] [filp_close+83/108] 
[mm_release+16/52] [do_exit+320/672] [do_exit+201/672] Oct 22 23:29:42 hapablap 
kernel:[die+71/72] [error_table+9294/9568] [error_table+9216/9568] 
[do_page_fault+729/992] [error_table+9294/9568] [error_code+45/52] [sock_poll+26/48] 
[alloc_wait+23/152] 
Code: 8b 71 70 85 f6 74 62 90 f6 46 24 01 74 52 8b 44 24 64 39 46 

Code;   Before first symbol
 <_EIP>:
Code;   Before first symbol
   0:   8b 71 70  mov0x70(%ecx),%esi
Code;  0003 Before first symbol
   3:   85 f6 test   %esi,%esi
Code;  0005 Before first symbol
   5:   74 62 je 69 <_EIP+0x69> 0069 Before first symbol
Code;  0007 Before first symbol
   7:   90nop
Code;  0008 Before first symbol
   8:   f6 46 24 01   testb  $0x1,0x24(%esi)
Code;  000c Before first symbol
   c:   74 52 je 60 <_EIP+0x60> 0060 Before first symbol
Code;  000e Before first symbol
   e:   8b 44 24 64   mov0x64(%esp,1),%eax
Code;  0012 Before first symbol
  12:   39 46 00  cmp%eax,0x0(%esi)


2 warnings issued.  Results may not be reliable.

Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Dan Kegel

Jordan Mendelson ([EMAIL PROTECTED]) wrote:
> An implementation of /dev/poll for Linux already exists and has shown to 
> be more scalable than using RT signals under my tests. A patch for 2.2.x 
> and 2.4.x should be available at the Linux Scalability Project @ 
> http://www.citi.umich.edu/projects/linux-scalability/ in the patches 
> section. 

If you'll look at the page I linked to in my original post,
  http://www.kegel.com/dkftpbench/Poller_bench.html
you'll see that I also benchmarked /dev/poll.

When finding 1 active fd among 1, the Solaris implementation creamed the living 
snot out of the Linux one, even though the Solaris hardware was 5 or so times slower
(see the lmbench results on that page).  Here are the numbers (times in microseconds):

On a 167MHz sun4u Sparc Ultra-1 running SunOS 5.7 (Solaris 7) Generic_106541-11: 

 pipes1001000   1
select151   -   -
  poll470 6763742
 /dev/poll 61  70  92

On an idle 650 MHz dual Pentium III running Red Hat Linux 6.2 with kernel 2.2.14smp 
plus the /dev/poll patch:

 pipes1001000   1 
select 28   -   - 
  poll 23 890   11333 
 /dev/poll 19 1464264 

On the same machine as above, but with vanilla kernel 2.4.0-test10-pre4 smp: 

 pipes1001000   1 
select 52   -   - 
  poll 491184   14660 

> It works fairly well, but I was actually somewhat disappointed to find 
> that it wasn't the primary cause for the system CPU suckage for my 
> particular system. Granted, when you only have to poll a few times per 
> second, the overhead of standard poll() just isn't that bad. 

If you have to poll 100 or fewer sockets, Linux's poll is quite good.
If you have to poll 1000 or so sockets, Linux's /dev/poll works well
(I wish it were available for the current kernels).
But if you have to poll 1 or sockets on Linux, 
neither standard poll nor /dev/poll as currently implemented performs adequately.
I suspect that RT signals on Linux will do much better for N=1 than the 
current /dev/poll, and hope to benchmark that soon.

- Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds

On Mon, 23 Oct 2000, Petr Vandrovec wrote:
> 
> With ClearPageDirty() kernel locked up (but no watchdog, so probably
> some livelock) during bootup after fsck /

Actually, it turns out that even with this issue fixed, there's the more
serious issue that the page _has_ to be removed from the page cache once
we get to the point that we're freeing the whole inode (which also calls
truncate_inode_pages).

Which means that we cannot take the easy way out and say "let's delay the
freeing until everything is ok" - even if a buffer is busy due to some
unfortunate timing (so that we turn the page into anonymous buffers), we'd
better get rid of the page from the page cache.

I'm starting to suspect that we leave this path as-is, and just fix the
mapping case (and PageUptodate() can work there). That should also avoid
the nasties.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Andi Kleen

On Mon, Oct 23, 2000 at 11:14:35AM -0700, Linus Torvalds wrote:
> [ Small treatize on "scalability" included. People obviously do not
>   understand what "scalability" really means. ]
> 
> In article <[EMAIL PROTECTED]>,
> Dan Kegel  <[EMAIL PROTECTED]> wrote:
> >I ran a benchmark to see how long a call to poll() takes
> >as you increase the number of idle fd's it has to wade through.
> >I used socketpair() to generate the fd's.
> >
> >Under Solaris 7, when the number of idle sockets was increased from 
> >100 to 1, the time to check for active sockets with poll() 
> >increased by a factor of only 6.5.  That's a sublinear increase in time, 
> >pretty spiffy.
> 
> Yeah. It's pretty spiffy.
> 
> Basically, poll() is _fundamentally_ a O(n) interface. There is no way
> to avoid it - you have an array, and there simply is _no_ known
> algorithm to scan an array in faster than O(n) time. Sorry.

One problem in Linux is that it scans multiple times. At least 4 times
currently: copyin, setup of wait queues, ask for results, copyout. copyin
and copyout are relatively cheap (and could be made even cheaper with
/dev/poll), the problem are the two other passes which involve a lot
of function pointer calls which generally cause pipeline stalls in modern
CPUs.  

It would be possible to setup a file -> fdnum reverse table (possibly cached 
over poll calls, I think Solaris does that) and let the async events directly 
change the bits in the output buffer in O(1). This would save one pass.
It may also be possible to cache the wait queue setup over polls, this
would make poll much cheaper in terms of cache lines used. 

Also the current 2.4 poll is very wasteful both in memory and cycles
for small numbers of fd. I did some experiments with a poll fast path
for small ns by falling back to the 2.0 stack allocation method, and 
it decreased latency dramatically (>-30%). It also saves a lot of memory,
because all the daemons only polling for a few fds wouldn't use two 
additional pages to their stack page [patches are availble, I just didn't
want to submit them during code freeze] 

-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Kernel 2.2.17 with RedHat 7 Problem !

2000-10-23 Thread Michael Meissner


On Mon, Oct 23, 2000 at 07:48:08PM -0400, David Relson wrote:
> Horst,
> 
> What you say is correct.  Early comments on gcc-2.96 reflected preprocessor 
> changes which made it impossible to compile a kernel.  Later comments, 
> particularly David Wragg's "struct itimerval" example, show that compiler 
> optimizations is broken.
> 
> My recollection is that the behavior of "a[i] = b[i++]" is well defined, 
> i.e. in the standard.  However it's been years since I paid attention to 
> those details, so I may be wrong.

Umm, no, since

 a[i] = b[i++]

does not have a sequence, it is explicitly undefined behavior in the standard.
As I recall Bernd Schmidt recently found a number of places where the above
construct is used in the Linux kernel.

-- 
Michael Meissner, Red Hat, Inc.
PMB 198, 174 Littleton Road #3, Westford, Massachusetts 01886, USA
Work: [EMAIL PROTECTED]   phone: +1 978-486-9304
Non-work: [EMAIL PROTECTED]   fax:   +1 978-692-4482
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Jordan Mendelson


Dan Kegel wrote:
> 
> Linus Torvalds wrote:
> > Dan Kegel wrote:
> >> [ http://www.kegel.com/dkftpbench/Poller_bench.html ]
> >> [ With only one active fd and N idle ones, poll's execution time scales
> >> [ as 6N on Solaris, but as 300N on Linux. ]
> >
> > Basically, poll() is _fundamentally_ a O(n) interface. There is no way
> > to avoid it - you have an array, and there simply is _no_ known
> > algorithm to scan an array in faster than O(n) time. Sorry.
> > ...
> > Under Linux, I'm personally more worried about the performance of X etc,
> > and small poll()'s are actually common. So I would argue that the
> > Solaris scalability is going the wrong way. But as performance really
> > depends on the load, and maybe that 1 entry load is what you
> > consider "real life", you are of course free to disagree (and you'd be
> > equally right ;)

> The way I'm implementing RT signal support is by writing a userspace
> wrapper to make it look like an OO version of poll(), more or less,
> with an 'add(int fd)' method so the wrapper manages the arrays of pollfd's.
> When and if I get that working, I may move it into the kernel as an
> implementation of /dev/poll -- and then I won't need to worry about
> the RT signal queue overflowing anymore, and I won't care how scalable
> poll() is.

An implementation of /dev/poll for Linux already exists and has shown to
be more scalable than using RT signals under my tests. A patch for 2.2.x
and 2.4.x should be available at the Linux Scalability Project @ 
http://www.citi.umich.edu/projects/linux-scalability/ in the patches
section.

It works fairly well, but I was actually somewhat disappointed to find
that it wasn't the primary cause for the system CPU suckage for my
particular system. Granted, when you only have to poll a few times per
second, the overhead of standard poll() just isn't that bad.


Jordan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Topic for discussion: OS Design

2000-10-23 Thread Andi Kleen


On Mon, Oct 23, 2000 at 07:43:24PM -0400, Dennis wrote:
> At 07:19 PM 10/23/2000, Andi Kleen wrote:
> >On Mon, Oct 23, 2000 at 06:43:28PM -0400, Dennis wrote:
> > > - FreeBSD will display kernel print messages with syslogd not running, and
> > > linux will not.
> >
> >Linux will also when the console log level is set high enough (which it
> >is by default, just it is usually too low after you killed klogd).
> >Unqualified printks have level 4, so you need a level > that.
> >Some distributions also unfortunately set bogus defaults or redirect
> >the messages to specific virtual consoles.
> 
> 
> Another linux caveat. Scads of undocumented and virtually undiscoverable 
> behaviours :-)

It is not undocumented. Try reading the klogd manpage.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Topic for discussion: OS Design

2000-10-23 Thread Andre Hedrick

On Mon, 23 Oct 2000, Hacksaw wrote:

> > Another linux caveat. Scads of undocumented and virtually undiscoverable 
> > behaviours :-)
> 
> Undiscoverable? You have the source code, what more do you want? Start 
> documenting!

Oh no then they would have to publish their findings, and that is only
available in binary format or $500.00 USD and threats for a lawyer.
Regardless that the original code was GPL

I am BATING someone to answer why they are selling GPL code and making
legal gestures if you pay for the GPL code and share the GPL code.

Someone must have a harder time than me reading the rules.

Cheers,

Andre Hedrick
The Linux ATA/IDE guy

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

OOM and my .02 cents

2000-10-23 Thread Joe


Hi all, 
I read this week in kernelnotes about the OOM killer and thought
I'd share a few thoughts that I had on the subject.  I know I am
maybe considered a 'nobody' here so my opinion may count very
little, but this makes sense to me as a user so I though I'd
throw in my .02 cents.  If you like it use if not delete this
email and forget about it.

1) If it does not already do this it should probably start with
warnings like printk statements.  (I'll hope it does).

2) There should probably be a way to configure OOM (if there is
not already).  I.E either a % or in bytes before we say we are
out of memory.  This could default to something like 10% of the
remaining memory or 1% or whatever.  This would probably have 2
values.  Start warning messages at 10% start killing at 5% or
something, or one value could be based on the other value.  I'd
prefer a percent basis, cause you wont know how much memory a
system has till the sytem boots and if someone puts in kill at
32Meg and the system has 16 well, you do the math.  This could
possiblly even be configurable through the /proc interface or a
compile time thing or both. 

With these two bits of info it would be fairly easy to write a
user space program that would scan the sys logs for the OOM
warnings and pop up a message in X saying that something needs
to die.  This kind of functionality could be added into X and
then if X sees a warning in the sys logs about OOM then X can
refuse to start another program. This is assuming the X group
wanted to do so. Personally if they did not I'd do it in Gtk or
Xaw or something myself. 
 
3) This should probably be capable of being compiled as a
module, so that if someone decided to add functionality to it
for X it would not make the kernel (bzImage) grow exponentially.
 The reason I say this is that if you look at the trends in
Linux then you'll realize that everything is moving towards X. 
Even IBM's via voice has a GUI. Do you really know where Linux
will have morphed to in 10 years?  It's always better to be
flexable now then to find yourself screwed in the future. 

Even if this were a seperate patch it could possiblly in theory
be done.  

Fact is when that code gets on someones system you don't know
what there needs are or what they'll do with it.   

If you look at windows it pops up a message when you try to
start a program and you don't have the memory for it, someone or
some distribution could do this for Linux if they were so
inclined.  You know there are some people out there that are
into that kind of thing.  

4) Lastly, if this is truely an OOM killer it's hueristics,
should probably just see which progam(s) are taking up the most
memory, and which one(s) were started last.   The last program
that is taking up the most memory should then be killed.  Why?
Chances are that the last program that is started that is taking
over the most memory is probably your problem. init is # 1 in my
process table and the first 5 are kernel related dameons.  X
starts after that even when xdm/gdm are running.  If X had a
memory leak and was killed by the OOM most window mangers will
catch this and save a users settings.  If anyone looses data it
is really there fault for not saving every 5 minutes anyway. 

Lastly the other option rather than killing the program is to
restart the program.  

just my .02 cents.



__
Do You Yahoo!?
Yahoo! Messenger - Talk while you surf!  It's FREE.
http://im.yahoo.com/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Topic for discussion: OS Design

2000-10-23 Thread Hacksaw


> Another linux caveat. Scads of undocumented and virtually undiscoverable 
> behaviours :-)

Undiscoverable? You have the source code, what more do you want? Start 
documenting!


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Kernel 2.2.17 with RedHat 7 Problem !

2000-10-23 Thread David Relson

At 05:44 AM 10/23/00, Horst von Brand wrote:
>David Relson <[EMAIL PROTECTED]> said:
>
>Not just a preprocessor change.
>...
>This is true for a correct compiler (ever seen a correct piece of
>software?)  compiling strictly standard-conforming source. The kernel is
>_not_ standard-conforming, and many places are writen just like they are to
>trick the compiler into generating particular code, some places assume that
>undefined behaviour (i.e., a[i] = b[i++] and such) works in a certain way,
>that the compiler pads structures in a certain way, ...
>...
>Yes. The existing program is wrong in that it woprked by chance, not
>because it was written right.

Horst,

What you say is correct.  Early comments on gcc-2.96 reflected preprocessor 
changes which made it impossible to compile a kernel.  Later comments, 
particularly David Wragg's "struct itimerval" example, show that compiler 
optimizations is broken.

My recollection is that the behavior of "a[i] = b[i++]" is well defined, 
i.e. in the standard.  However it's been years since I paid attention to 
those details, so I may be wrong.

Anyhow, as we all know, gcc-2.96 is not ready for prime time.

David

David Relson   Osage Software Systems, Inc.
[EMAIL PROTECTED]   514 W. Keech Ave.
www.osagesoftware.com  Ann Arbor, MI 48103
voice: 734.821.8800fax: 734.821.8800

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Topic for discussion: OS Design

2000-10-23 Thread Dennis


At 07:19 PM 10/23/2000, Andi Kleen wrote:
>On Mon, Oct 23, 2000 at 06:43:28PM -0400, Dennis wrote:
> > - FreeBSD will display kernel print messages with syslogd not running, and
> > linux will not.
>
>Linux will also when the console log level is set high enough (which it
>is by default, just it is usually too low after you killed klogd).
>Unqualified printks have level 4, so you need a level > that.
>Some distributions also unfortunately set bogus defaults or redirect
>the messages to specific virtual consoles.


Another linux caveat. Scads of undocumented and virtually undiscoverable 
behaviours :-)


db

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Topic for discussion: OS Design + GPL

2000-10-23 Thread Andre Hedrick



ftp://ftp.etinc.com/pub/linux/linux22_hdlc.tgz

Hi Dennis,

Could explain to me why ET Inc is modifying GPL drivers and then
republishing the binaries as modules only?

Not that it is my sub-system, but I am not sure that my friend Don knows
of this issue.  If Don does not care then, good day.

ls hdlc/usr/hdlc/dev/modules/2.2.14
.   eepro100.o  etbwmgr.o   tulip.o
..  eepro100orig.o  ethdlc.otuliporig.o

Cheers,

Andre Hedrick
The Linux ATA/IDE guy


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: IDE-Floppy and devfs

2000-10-23 Thread Andreas Franck


Hi Paul, hi Richard, hi linux-kernel audience,

Paul Bristow wrote:
> Hi Andreas,
>
> Could you re-send the patch, as it didn't make it with your last
> message?

Sorry, my mailer crashed when I sent this message and I didn't even
notice it got sent :-) So here's what you're waitung for (at least
I hope so). I have done some further cleanups for module initialization
et al., to make it look more like other (newer) drivers I saw elsewhere
in the source tree.
 
So here it comes (attached).

Greetings,
Andreas

PS: Sorry for this extra resend to linux-kernel, I incorrectly tried
to CC linux[EMAIL PROTECTED] im my mail to Paul and Richard,
which didn't
work as I could have expected, if it wasn't that late.

--
->>>--- Andreas Franck <<<-
---<<< [EMAIL PROTECTED] --->>>---
->>> Keep smiling! <<<-

--- linux/drivers/ide/ide-floppy.c.old  Sun Oct  8 16:42:21 2000
+++ linux/drivers/ide/ide-floppy.c  Wed Oct 11 02:15:22 2000
@@ -29,6 +29,15 @@
  * Ver 0.9   Jul  4 99   Fix a bug which might have caused the number of
  *bytes requested on each interrupt to be zero.
  *Thanks to <[EMAIL PROTECTED]> for pointing this out.
+ * Ver 0.10  Oct  8 00   Fix devfs support - now a /dev/ide/.../lun0/floppy
+ *entry, four partition entries (part1..part4), and
+ *symlinks in /dev/ide/floppy are created for each 
+ *IDE floppy device.
+ *   
+ */
+
+/* 
+ * Changed to support devfs registration - 8.10.2000 Andreas Franck <[EMAIL PROTECTED]>
  */
 
 #define IDEFLOPPY_VERSION "0.9"
@@ -48,6 +57,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -236,7 +246,8 @@
idefloppy_capacity_descriptor_t capacity;   /* Last format 
capacity */
idefloppy_flexible_disk_page_t flexible_disk_page;  /* Copy of the 
flexible disk page */
int wp; /* Write protect */
-
+   devfs_handle_t de;  /* devfs entry */
+   
unsigned int flags; /* Status/Action flags */
 } idefloppy_floppy_t;
 
@@ -1526,6 +1537,21 @@
 
 }
 
+/* 
+ * Register drive with devfs
+ */
+static int idefloppy_register (ide_drive_t *drive, int major, int minor)
+{
+   idefloppy_floppy_t *floppy = drive->driver_data;
+
+   floppy->de = devfs_register (drive->de, "floppy", DEVFS_FL_DEFAULT,
+major, minor,
+S_IFBLK | S_IRUGO | S_IWUGO,
+ide_fops, NULL);
+
+   return 0;
+}
+
 /*
  * Driver initialization.
  */
@@ -1560,13 +1586,28 @@
max_sectors[major][minor + i] = 64;
}
 
+
+   idefloppy_register (drive, major, minor);
+
(void) idefloppy_get_capacity (drive);
idefloppy_add_settings(drive);
for (i = 0; i < MAX_DRIVES; ++i) {
ide_hwif_t *hwif = HWIF(drive);
 
if (drive != &hwif->drives[i]) continue;
-   hwif->gd->de_arr[i] = drive->de;
+   hwif->gd->de_arr[i] = drive->de;
+
+   /* 
+*   Stop grok_partitions from building
+*   the devfs "disc" and "part[1..n]"
+*   entries, we'll try to do it a little
+*   bit smarter so partitions exist without
+*   media. If media is inserted, this partition
+*   assumptions are corrected to reflect the actual
+*   partition settings.
+*/
+/* hwif->gd->de_arr[i] = 0; */
+
if (drive->removable)
hwif->gd->flags[i] |= GENHD_FL_REMOVABLE;
break;
@@ -1579,6 +1620,9 @@
 
if (ide_unregister_subdriver (drive))
return 1;
+   
+   devfs_unregister(floppy->de);
+
drive->driver_data = NULL;
kfree (floppy);
return 0;
@@ -1601,24 +1645,24 @@
  * IDE subdriver functions, registered with ide.c
  */
 static ide_driver_t idefloppy_driver = {
-   "ide-floppy",   /* name */
-   IDEFLOPPY_VERSION,  /* version */
-   ide_floppy, /* media */
-   0,  /* busy */
-   1,  /* supports_dma */
-   0,  /* supports_dsc_overlap */
-   idefloppy_cleanup,  /* cleanup */
-   idefloppy_do_request,   /* do_request */
-   idefloppy_end_request,  /* end_request */
-   idefloppy_ioctl,/* ioctl */
-   idefloppy_open, /* open */
-   idefloppy_release,  /* release */
-   idefloppy_media_change, /* media_change */
-   idefloppy_revalidate,   /* media_change */
-   NULL,   /* pre_reset */
-   idefloppy_capacity, /* capac

Re: Kernel 2.2.17 with RedHat 7 Problem !

2000-10-23 Thread Horst von Brand


David Relson <[EMAIL PROTECTED]> said:
> At 09:14 PM 10/22/00, Horst von Brand wrote:
> >Jurgen Kramer <[EMAIL PROTECTED]> said:
> > > You can blame it on the compiler which is included with RH7.0. It's a
> > > pre-release version of some sort. It seems that the gcc people are not
> > > happy that RH included this version with RH7.

> >It is the *kernel's* fault, as far as can be ascertained now. The compiler
> >is stricter, and implements new optimizations, for which the kernel (being
> >only ever compiled with gcc) is just unprepared.

> The problem, as I understand it, is that gcc-2.96 handles language 
> constructs slightly different than older compilers.  This is a preprocessor 
> change, not an optimization problem.

Not just a preprocessor change.

> To say "new optimizations ... kernel ... unprepared" is incorrect.  Having 
> worked with compilers (some years ago), I always took it as an article of 
> faith that the same answer(s) would be generated whether optimization was 
> turned on or not.

This is true for a correct compiler (ever seen a correct piece of
software?)  compiling strictly standard-conforming source. The kernel is
_not_ standard-conforming, and many places are writen just like they are to
trick the compiler into generating particular code, some places assume that
undefined behaviour (i.e., a[i] = b[i++] and such) works in a certain way,
that the compiler pads structures in a certain way, ...

>   Optimization should always be a way to do a task either 
> quicker (fewer instructions executing, less executing time, etc) or shorter 
> (less memory needed for the instructions).  Optimization should never, 
> never give a different result.  Having new optimizations break an executing 
> program is simply wrong.

Yes. The existing program is wrong in that it woprked by chance, not
because it was written right.
-- 
Horst von Brand [EMAIL PROTECTED]
Casilla 9G, Vin~a del Mar, Chile   +56 32 672616

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Topic for discussion: OS Design

2000-10-23 Thread Andi Kleen

On Mon, Oct 23, 2000 at 06:43:28PM -0400, Dennis wrote:
> - FreeBSD will display kernel print messages with syslogd not running, and 
> linux will not.

Linux will also when the console log level is set high enough (which it
is by default, just it is usually too low after you killed klogd).  
Unqualified printks have level 4, so you need a level > that.
Some distributions also unfortunately set bogus defaults or redirect
the messages to specific virtual consoles.

> - FreeBSD doesnt (seem) to have the buffering problem that linux does, in 
> that exceptionally long messages (like decoding a Frame Relay LMI frame 
> with 1000 elements) work just fine.

You cannot print more than the kernel log buffers size without scheduling
inbetween to let klogd eat some of the buffer.

I don't see a way how FreeBSD could do that better, except if they
found a way to store infinite data in a finite buffer (or alternatively
not store your LMI frame completely in the syslog, which would be also
as bad) It is possible that their default buffer is bigger though. 
Linux's default is 16K, which is a bit on the low side for many things. 
You can increase it of course with a simple recompile by changing the
define in kernel/printk.c

Admittedly one problem in Linux with big printks is that it kills your
interrupt latency completely. There are lower overhead alternatives though.

> -  FreeBSD will display messages immediately without a newline
> - FreeBSD messages 1) can be redirected and 2) are printed without a timestamp.

Both just like in Linux. The timestamps come from syslogd/klogd, not the 
kernel.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: MAP_NR for 2.4

2000-10-23 Thread Ralf Baechle


On Fri, Oct 20, 2000 at 12:38:57PM +0530, [EMAIL PROTECTED] wrote:

> for using MAP_NR with 2.4, i think you can use
> macro like
> 
> #define MAP_NR(addr) (((unsigned long)(addr)-PAGE_OFFSET) >>PAGE_SHIFT)

This only works for contiguous memory.

  Ralf
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [linux-lvm] 2.4.0-test10-pre4 oops with LVM

2000-10-23 Thread Ingo Oeser


On Thu, Oct 19, 2000 at 07:07:52PM -0700, Myles Uyema wrote:
> This kernel panic occurred when I attempted to dump my ext2 filesystem
> /home onto /mnt/ide/backup.  My exact dump command was:
> 
> dump -0 -u -M -f /mnt/ide/backup/home -B 2096128 /home
> 
> I believe dump managed to start reading directories before the oops
> occurred...Haven't been able to duplicate the oops on 2.4.0-test9.

Content-Description: 2.4.0-test10pre4 oops log
> Oct 19 17:58:13 uyema kernel: kernel BUG at vmscan.c:102!

Remove lines 101-104 in mm/vmscan.c of linux-2.4.0-test10-pre4.

Someone was a bit too paranoid in MM handling ;-)

MM guys and even Linus would tell you the same[1].

Regards

Ingo Oeser

[1] As they did already for other people facing this problem.
-- 
Feel the power of the penguin - run [EMAIL PROTECTED]
:x
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: LMbench 2.4.0-test10pre-SMP vs. 2.2.18pre-SMP

2000-10-23 Thread Chris Evans



On Mon, 23 Oct 2000, Jeff Garzik wrote:

> First test was with 2.4.0-test10-pre3.
> Next four tests were with 2.4.0-test10-pre4.
> Final four tests were with 2.2.18-pre17.
> 
> All are 'virgin' kernels, without any patches.

[...]

I'll take the liberty of highlighting some big changes, v2.2 vs v2.4

*Local* Communication latencies in microseconds - smaller is better
---
Host OS 2p/0K  Pipe AF UDP  RPC/   TCP  RPC/ TCP
ctxsw   UNIX UDP TCP conn
- - - -  - - - - 
rum.normn Linux 2.4.0-t 620   4563   10681   157  146
rum.normn Linux 2.2.18p 212   1856   123   106   159  237

- So we broke pipe/AF UNIX latencies

File & VM system latencies in microseconds - smaller is better
--
Host OS   0K File  10K File  MmapProtPage   
Create Delete Create Delete  Latency Fault   Fault 
- - -- -- -- --  --- -   - 
rum.normn Linux 2.4.0-t 15  1 28  3 1016 10.0K
rum.normn Linux 2.2.18p 16  1 29  2 7658 20.6K

- But gave steroids to mmap latencies

*Local* Communication bandwidths in MB/s - bigger is better
---
HostOS  Pipe AFTCP  File   Mmap  Bcopy  Bcopy  Mem Mem
 UNIX  reread reread (libc) (hand) read write
- -    -- -- -- --  -
rum.normn Linux 2.4.0-t  152  105   98151326138144  326 171
rum.normn Linux 2.2.18p  264  106   55152326137142  326 180

- Mixed fortunes here. A serious boost to TCP bandwidth but pipe bandwidth
dies a bit


Cheers
Chris


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: syslog() blocks on glibc 2.1.3 with kernel 2.2.x

2000-10-23 Thread Jesse Pollard

-  Received message begins Here  -

> 
> Jesse Pollard <[EMAIL PROTECTED]> writes:
> 
> > Don't configure syslogd to do reverse lookups.
> 
> Our syslogd has no option to disable the reverse lookups.
> 
> > You can NEVER guarantee that the reverse lookup will succeed, and
> > can be delayed several minutes for a single reply.
> 
> Not true.  The named on our loghost is authoritative for the reverse
> mappings for all of the machines which can log there.

If the server is sufficiently busy, the reply WILL be delayed. It can
even be on the same host, it just may not get around to a reply immediately.
Authority is not equivalent to reply. The advantage of local files is
that the reverse lookup doesn't depend on ANY outside agency.

> > I consider this a configuration error. I don't believe syslogd
> > should ever do a reverse lookup, since the name you are trying to
> > get may never arrive, or if arrives, it may be spoofed.
> 
> There *is* no configuration for these tools which gives the behavior
> you describe, so this is not a "configuration error".

I think it requires a recompile of the syslogd source to make that
change. I grant that there isn't a runtime option for it (I do think
there should be).

> > It's not a bug, but a security feature. NO log to syslogd should be
> > lost, since it may be related to an attack.
> 
> Historically, no other Unix system has had reliable syslogging.  It
> would require very defensive programming for syslogd, and that has
> clearly not been performed.

Some do, some don't - Cray systems will shut down if the audit daemon
disappears. Syslogd on Linux is providing the corresponding facility,
at the current time. Neither syslogd, nor audit daemon are reliable on
SGI systems, trusted solaris shuts down - I believe. The normal solaris
is unknown since I haven't had time to fully configure either it or the
audit daemon yet.

> And if this is what GNU/Linux intends, why does glibc use a SOCK_DGRAM
> socket for communication with syslogd?  By definition, such sockets
> are *unreliable*.  If syslog is supposed to be reliable, a different
> connection type must be used.

I think that is a "don't care" option on AF_UNIX connections. Required
for semantic compatability, but not used.

> 
> Your philosophy that "no syslog message should ever be lost" is not
> necessarily bad.  But it is clearly at odds with historical practice,
> the current glibc syslog() implementation, and the current syslogd
> itself.
> 
> It is true that glibc falls back to using SOCK_STREAM if the
> SOCK_DGRAM connection fails.  Does that mean GNU/Linux is expects
> syslog to be reliable eventually?  If so, then my problem is entirely
> a bug in syslogd and I will report it as such.

I think it is only reliable on the local host. If a remote syslog
facility is used, then it may be dropped if the input queue becomes too
long.

-
Jesse I Pollard, II
Email: [EMAIL PROTECTED]

Any opinions expressed are solely my own.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Topic for discussion: OS Design

2000-10-23 Thread Dennis


At 04:35 PM 10/23/2000, you wrote:
>On Mon, 23 Oct 2000, Dennis wrote:
>
> > This is typical of the "linux mentality". Why do other OSs have solutions
> > that work, yet linux's method requires special coding? If it "has to be
> > done that way", why do other OS's have solutions that dont do it that way?
> > the size of the buffer is an annoyance but not a serious problem however.
> >
>
>I'm not sure that Linux requires any special coding.
>
> > printing directly to the console (as BSD does) is useful when debugging a
> > panic, as you can trace right to the panic point. Also certain levels of
>
>BSD does not write directly to the console. Its console is a direct
>clone of Linux. I'm not sure which came first, but when you have a single
>screen-card there are not too many ways to get the character and attribute
>into screen memory. Linux allows the console to be redirected to a serial
>port. BSD does not last time I checked.


- FreeBSD will display kernel print messages with syslogd not running, and 
linux will not.
- FreeBSD doesnt (seem) to have the buffering problem that linux does, in 
that exceptionally long messages (like decoding a Frame Relay LMI frame 
with 1000 elements) work just fine.
-  FreeBSD will display messages immediately without a newline
- FreeBSD messages 1) can be redirected and 2) are printed without a timestamp.

which implies that you are wrong about just about everything.

>What? The API has remained consistent since 0.99. It's only internal
>kernel stuff that has changed. If you wrote code that worked on 0.99,
>it will still compile and work on 2.2.17.

We are talking about the kernel. What are you talking about? The external 
view is meaningless.

the device driver interface changed substantially in 2.0. The module 
interface changed, as it is changing again in 2.4. The PCI interface has 
changed and is changing again.


>The only reason to get the 'latest' version is to take advantage
>of increased functionality. This, by definition, means that something
>has changed. That's what you upgrade for. The word "unstable" is a
>misnomer.


Usually my customers want to upgrade because of some security fix or bug 
fix, not to get new features.


> > My point is that there is no "stable kernel series". 2.2.0 wasnt stable,
> > and neither was 2.2.3. Virtually all of the ethernet drivers still lock up
> > under heavy load in 2.2.17...and now with 2.4 there are more countless
> > adventures ahead
>
>Which Ethernet drivers are you having trouble with? The ones that had
>lockup problems (incidentally hardware related), now have reset code
>that runs off a watch-dog.


The eepro100 driver is an ongoing project, still with lockup problems. the 
tulip has issues as well.



> > an example of "poor planning" is that skput and skpush will panic the
> > kernel if there is no room (this can happen with multiple encapsulations)
> > The proper behaviour would be to return a NULL pointer indicating failure,
> > and then to drop the frame and issue a warning.
>
>The proper response to any resource not being available (in networking)
>is to drop the packet on the floor, smash it into little pieces, and
>don't tell anybody about it.
>
>The packet will be sent again. But, if you can't transmit a packet,
>therefore freeing a buffer, what do you do?
>
>What you do is realilize that the failure to transmit was likely
>caused by a disconnected wire. In the drivers I use, I simply pretend
>that every packet got transmitted okay. This usually involves a
>one or two-line modification to the driver.
>
>This has nothing to do with poor planning. It just has to do with
>a design decision that I didn't agree with. Somebody decided that
>network data was precious and therefore the machine should kill itself
>if necessary to get the data through.
>
>I didn't agree with this so I changed a few lines of code. You can't
>kill any of my machines by flooding them and they never lock up.
>Further, they run at 85 to 90 percent of the network physical layer
>bandwidth. My main machine is our domain name-server, it gets between
>2000 and 5000 hits per second. If it crashed, our whole LAN goes
>down. It doesn't. It runs Linux-2.2.17.

Right. So your answer is that linux is OK if you modify all of the broken 
stuff yourself. Im glad we are in agreement on that, if nothing else.
DB

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Dan Kegel

Linus Torvalds wrote:
> Dan Kegel wrote:
>> [ http://www.kegel.com/dkftpbench/Poller_bench.html ]
>> [ With only one active fd and N idle ones, poll's execution time scales
>> [ as 6N on Solaris, but as 300N on Linux. ]
>
> Basically, poll() is _fundamentally_ a O(n) interface. There is no way
> to avoid it - you have an array, and there simply is _no_ known
> algorithm to scan an array in faster than O(n) time. Sorry.
> ...
> Under Linux, I'm personally more worried about the performance of X etc,
> and small poll()'s are actually common. So I would argue that the
> Solaris scalability is going the wrong way. But as performance really
> depends on the load, and maybe that 1 entry load is what you
> consider "real life", you are of course free to disagree (and you'd be
> equally right ;)

I don't think I was being as clueless as you feared.
Solaris' poll() somehow manages to skip inactive fd's very efficiently.
But I'm happy to agree that small poll()'s are very important.
I'd prefer to never use big poll()'s.  The RT signal stuff
scales much better.  

I'm trying to write a server that handles 1 clients.  On 2.4.x,
the RT signal queue stuff looks like the way to achieve that.
Unfortunately, when the RT signal queue overflows, the consensus seems 
to be that you fall back to a big poll().   And even though the RT signal 
queue [almost] never overflows, it certainly can, and servers have to be 
able to handle it.

The way I'm implementing RT signal support is by writing a userspace
wrapper to make it look like an OO version of poll(), more or less,
with an 'add(int fd)' method so the wrapper manages the arrays of pollfd's.
When and if I get that working, I may move it into the kernel as an
implementation of /dev/poll -- and then I won't need to worry about 
the RT signal queue overflowing anymore, and I won't care how scalable
poll() is.

- Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: syslog() blocks on glibc 2.1.3 with kernel 2.2.x

2000-10-23 Thread Matti Aarnio


On Mon, Oct 23, 2000 at 04:56:26PM -0400, Patrick J. LoPresti wrote:
 
> You are effectively suggesting that named should be rewritten not to
> use the glibc syslog functions at all.  That strikes me as the worst
> suggestion so far; it would be far better for syslogd not to do name
> lookups.  But my syslogd has no option to avoid name lookups; I will
> submit a request to add one.


/etc/nsswitch.conf:
hosts: files dns

/etc/hosts:

ip.of.named.host  name.of.named.host
ip.of.other.host  name.of.other.host


Give explicite IP/name mappings for those which
you don't want to be looked via the resolver.

This is, of course, system-wide, but use it sparingly.

/Matti Aarnio
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds

On Mon, 23 Oct 2000, Petr Vandrovec wrote:
> 
> With ClearPageDirty() kernel locked up (but no watchdog, so probably
> some livelock) during bootup after fsck /.

Yeah, the way the truncate logic works right now truncate_whole_page() has
to remove the page from the inode list - otherwise truncate ends up
looping forever trying to truncate that page ;).

And there was a off-by-one error in my first untested version anyway: the
page_count() should be tested against "2", not "1", as the truncate logic
has elevated the count anyway (probably unnecessarily: once we get the
page lock nobody else can race to remove it from the page cache anyway, so
it's not as if it could go away).

> Should I try ClearPageUptodate() instead?

No, I'll have to fix the truncate logic to allow for this all.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

LMbench 2.4.0-test10pre-UP vs. 2.2.18pre-UP

2000-10-23 Thread Jeff Garzik


Hardware:
Single P-III 500 Mhz
64 MB RAM
13GB hard drive

First five tests were with 2.4.0-test10-pre4.
Final four tests were with 2.2.18-pre17.

All are 'virgin' kernels, without any patches.

-- 
Jeff Garzik| The difference between laziness and
Building 1024  | prioritization is the end result.
MandrakeSoft   |


 L M B E N C H  2 . 0   S U M M A R Y
 
 (Alpha software, do not distribute)

Basic system parameters

Host OS Description  Mhz

- - --- 
hum.normn Linux 2.4.0-t   i686-pc-linux-gnu  500
hum.normn Linux 2.4.0-t   i686-pc-linux-gnu  500
hum.normn Linux 2.4.0-t   i686-pc-linux-gnu  500
hum.normn Linux 2.4.0-t   i686-pc-linux-gnu  500
hum.normn Linux 2.4.0-t   i686-pc-linux-gnu  500
hum.normn Linux 2.2.18p   i686-pc-linux-gnu  500
hum.normn Linux 2.2.18p   i686-pc-linux-gnu  500
hum.normn Linux 2.2.18p   i686-pc-linux-gnu  500
hum.normn Linux 2.2.18p   i686-pc-linux-gnu  500

Processor, Processes - times in microseconds - smaller is better

Host OS  Mhz null null  open selct sig  sig  fork exec sh  
 call  I/O stat clos   inst hndl proc proc proc
- -      -     
hum.normn Linux 2.4.0-t  500  0.6  0.9  3.5  5.636  1.6  5.3 0.3K 1.4K 8.3K
hum.normn Linux 2.4.0-t  500  0.6  0.8  3.5  5.533  1.6  5.3 0.3K 1.4K 8.2K
hum.normn Linux 2.4.0-t  500  0.6  0.8  3.5  5.533  1.7  5.3 0.3K 1.4K 8.2K
hum.normn Linux 2.4.0-t  500  0.6  0.8  3.5  5.533  1.6  5.3 0.3K 1.5K 8.2K
hum.normn Linux 2.4.0-t  500  0.6  0.8  3.5  5.637  1.7  5.3 0.3K 1.4K 8.2K
hum.normn Linux 2.2.18p  500  0.6  0.9  3.9  5.331  1.7  2.3 0.3K 1.3K 7.8K
hum.normn Linux 2.2.18p  500  0.6  0.8  3.8  5.231  1.7  2.3 0.3K 1.3K 7.8K
hum.normn Linux 2.2.18p  500  0.6  0.9  3.6  5.131  1.7  2.3 0.3K 1.3K 7.8K
hum.normn Linux 2.2.18p  500  0.6  0.9  3.7  5.131  1.7  2.3 0.3K 1.3K 7.8K

Context switching - times in microseconds - smaller is better
-
Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
- - - -- -- -- -- --- ---
hum.normn Linux 2.4.0-t 1 13 41 14103  18 138
hum.normn Linux 2.4.0-t 1 13 41 14 96  15 140
hum.normn Linux 2.4.0-t 1 13 40 14117  21 143
hum.normn Linux 2.4.0-t 1 13 42 14 90  17 135
hum.normn Linux 2.4.0-t 1 13 41 14 94  19 143
hum.normn Linux 2.2.18p 1 13 40 13 89  16 143
hum.normn Linux 2.2.18p 0 13 41 13 89  17 143
hum.normn Linux 2.2.18p 1 13 41 14 82  18 147
hum.normn Linux 2.2.18p 0 13 40 13 82  16 145

*Local* Communication latencies in microseconds - smaller is better
---
Host OS 2p/0K  Pipe AF UDP  RPC/   TCP  RPC/ TCP
ctxsw   UNIX UDP TCP conn
- - - -  - - - - 
hum.normn Linux 2.4.0-t 1 7   12258144   123  155
hum.normn Linux 2.4.0-t 1 7   12268244   124  159
hum.normn Linux 2.4.0-t 1 7   12268144   125  159
hum.normn Linux 2.4.0-t 1 7   12268344   120  155
hum.normn Linux 2.4.0-t 1 7   12268344   123  159
hum.normn Linux 2.2.18p 1 6   13257043   106  152
hum.normn Linux 2.2.18p 0 6   13257043   105  154
hum.normn Linux 2.2.18p 1 6   13257043   106  151
hum.normn Linux 2.2.18p 0 6   14257143   107  151

File & VM system latencies in microseconds - smaller is better
--
Host OS   0K File  10K File  MmapProtPage   
Create Delete Create Delete  Latency Fault   Fault 
- - -- -- -- --  --- -   - 
hum.normn Linux 2.4.0-t 12  1 27 29  305 10.0K
hum.normn Linux 2.4.0-t 12  1 27 31  292 10.0K
hum.normn Linux 2.4.0-t 13  1 28 30  297 10.0K
hum.normn Linux 2.4.0-t 12  1 27 34  286 10.0K
hum.normn Linux

LMbench 2.4.0-test10pre-SMP vs. 2.2.18pre-SMP

2000-10-23 Thread Jeff Garzik


Hardware:
Dual P-II 400 Mhz
128 MB RAM
13GB hard drive

First test was with 2.4.0-test10-pre3.
Next four tests were with 2.4.0-test10-pre4.
Final four tests were with 2.2.18-pre17.

All are 'virgin' kernels, without any patches.

-- 
Jeff Garzik| The difference between laziness and
Building 1024  | prioritization is the end result.
MandrakeSoft   |


 L M B E N C H  2 . 0   S U M M A R Y
 
 (Alpha software, do not distribute)

Basic system parameters

Host OS Description  Mhz

- - --- 
rum.normn Linux 2.4.0-t   i686-pc-linux-gnu  401
rum.normn Linux 2.4.0-t   i686-pc-linux-gnu  401
rum.normn Linux 2.4.0-t   i686-pc-linux-gnu  401
rum.normn Linux 2.4.0-t   i686-pc-linux-gnu  401
rum.normn Linux 2.4.0-t   i686-pc-linux-gnu  401
rum.normn Linux 2.2.18p   i686-pc-linux-gnu  401
rum.normn Linux 2.2.18p   i686-pc-linux-gnu  401
rum.normn Linux 2.2.18p   i686-pc-linux-gnu  401
rum.normn Linux 2.2.18p   i686-pc-linux-gnu  401

Processor, Processes - times in microseconds - smaller is better

Host OS  Mhz null null  open selct sig  sig  fork exec sh  
 call  I/O stat clos   inst hndl proc proc proc
- -      -     
rum.normn Linux 2.4.0-t  401  0.8  1.3  6.0  8.997  2.1  4.3 0.6K 2.0K 10.K
rum.normn Linux 2.4.0-t  401  0.8  1.4  5.8  8.377  2.1  4.3 0.6K 2.1K 10.K
rum.normn Linux 2.4.0-t  401  0.8  1.4  5.8  8.474  2.1  4.3 0.6K 2.1K 10.K
rum.normn Linux 2.4.0-t  401  0.8  1.4  5.8  8.371  2.1  4.3 0.6K 2.1K 10.K
rum.normn Linux 2.4.0-t  401  0.8  1.4  5.8  8.471  2.1  4.3 0.6K 2.0K 10.K
rum.normn Linux 2.2.18p  401  0.9  1.3  5.3  7.340  2.3  3.3 0.5K 1.8K 9.8K
rum.normn Linux 2.2.18p  401  0.8  1.3  5.1  7.240  2.2  3.3 0.5K 1.8K 9.9K
rum.normn Linux 2.2.18p  401  0.9  1.3  5.3  7.440  2.3  3.3 0.5K 1.8K 9.9K
rum.normn Linux 2.2.18p  401  0.8  1.3  5.1  7.140  2.2  3.3 0.5K 1.9K 9.9K

Context switching - times in microseconds - smaller is better
-
Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
- - - -- -- -- -- --- ---
rum.normn Linux 2.4.0-t 5 21 55 23106  27 154
rum.normn Linux 2.4.0-t 5 20 55 20117  22 147
rum.normn Linux 2.4.0-t 5 20 72 21116  26 148
rum.normn Linux 2.4.0-t 5 21 55 22114  28 154
rum.normn Linux 2.4.0-t 6 21 55 21119  31 148
rum.normn Linux 2.2.18p 2 18 56 18106  22 158
rum.normn Linux 2.2.18p 2 18 52 18120  26 173
rum.normn Linux 2.2.18p 1 18 54 18110  29 173
rum.normn Linux 2.2.18p 2 18 54 18116  24 169

*Local* Communication latencies in microseconds - smaller is better
---
Host OS 2p/0K  Pipe AF UDP  RPC/   TCP  RPC/ TCP
ctxsw   UNIX UDP TCP conn
- - - -  - - - - 
rum.normn Linux 2.4.0-t 520   4261   10680   160  149
rum.normn Linux 2.4.0-t 523   4363   10582   156  146
rum.normn Linux 2.4.0-t 523   4363   10582   159  147
rum.normn Linux 2.4.0-t 524   4464   10481   156  147
rum.normn Linux 2.4.0-t 620   4563   10681   157  146
rum.normn Linux 2.2.18p 212   1856   123   106   159  237
rum.normn Linux 2.2.18p 212   2064   123   107   159  240
rum.normn Linux 2.2.18p 112   2154   123   107   160  237
rum.normn Linux 2.2.18p 212   2152   124   108   159  236

File & VM system latencies in microseconds - smaller is better
--
Host OS   0K File  10K File  MmapProtPage   
Create Delete Create Delete  Latency Fault   Fault 
- - -- -- -- --  --- -   - 
rum.normn Linux 2.4.0-t 15  1 28  3  954 10.0K
rum.normn Linux 2.4.0-t 15  1 28  3 1001 10.0K
rum.normn Linux 2.4.0-t 15  1 28  3 1022 10.0K
rum.normn Linux 2.4.0-t 15  1 28  3

Re: syslog() blocks on glibc 2.1.3 with kernel 2.2.x

2000-10-23 Thread Ricky Beam

On 23 Oct 2000, Patrick J. LoPresti wrote:
>Once the name resolution times out, you might expect things to become
>unstuck.  But they don't.

Negative.  Things have been queued.  The deadlock will only go away if the
very next message processed is the named local message.  And then it would
have to process a few more local messages so it wouldn't stall again so soon.

>> Per chance are you running the name service caching daemon (nscd)?
>
>No.

Please do.  That will reduce the ammount of traffic to the name server.

>> I'd also guess you aren't disabling fsync() for your sysylog files
>> (it's part of the syslog.conf format) -- this is a conciderable
>> drain on syslogd.
>
>I see no documentation for such an option in the syslog.conf man page.
>This is with the current Red Hat 6.2 syslogd (package
>sysklogd-1.3.31-17).

It's in the syslogd and syslog.conf man page (sysklogd-1.3.31-16):
(syslog.conf)
   Regular File
   Typically  messages are logged to real files. The file has
   to be specified with full pathname, beginning with a slash
   ``/''.

   You  may  prefix  each  entry with the minus ``-'' sign to
   omit syncing the file after every logging. Note  that  you
   might  lose information if the system crashes right behind
   a write attempt. Nevertheless this  might  give  you  back
   some  performance, especially if you run programs that use
   logging in a very verbose manner.

--Ricky

PS: as a side note, you can/will lose information even if sync is enabled.
(fsync() will not flush metadata so the file is truncated on restart.)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: The zen of kernel virtual addresses

2000-10-23 Thread Jeremy Fitzhardinge

On Sat, Oct 21, 2000 at 01:37:26PM -0600, Jonathan Corbet wrote:
> physical address
>   An address as known by the low-level hardware.  In the modern
>   world, these can be 64-bit quantities, even on 32-bit systems.
>   These are the addresses used by /dev/mem - which appears to work
>   only for low memory.

A phyical address is the address the CPU uses to talk to memory.  It is not
necessarily the same kind of address a device uses to see memory: they use
bus addresses.  The simple case is bus address == physical address, but 
there are many variations.  Systems with an IOMMU (or equiv) present devices
with a completely different view of system memory.

J

 PGP signature

Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Petr Vandrovec

On 23 Oct 00 at 14:34, Linus Torvalds wrote:
> On Mon, 23 Oct 2000, Alexander Viro wrote:
> > On Mon, 23 Oct 2000, Linus Torvalds wrote:
> > > 
> > > Nope, that just makes the race window smaller. We should check for i_size
> > > after we've gotten the page table lock and just before actually entering
> > > the page into the page tables. Otherwise we'll still race on SMP (a _very_
> > > hard window to get into, admittedly).
> > 
> > Umm... I would probably remove Uptodate upon truncate() and check _that_
> > in the place you've mentioned.
> 
> Works for me..
> 
> > > ClearPageDirty(page);
> >   ClearPageUptodate(page);
> > 
> > How about that?
> 
> Makes sense. 

With ClearPageDirty() kernel locked up (but no watchdog, so probably
some livelock) during bootup after fsck /. Unfortunately, 'PC' column
shows complete garbage - init,kswapd,kupdate and rcS in 'R', with
'PC'=current for rcS, 0xc1459f28 for init, 0xc146ffa8 for kswapd and
0xc1469fc8 for kupdate (needless to say that my kernel does not have
20MB... whee what's with PC? it now prints esp instead of eip?). Should 
I try ClearPageUptodate() instead?

Petr Vandrovec
[EMAIL PROTECTED]

P.S.: I have to go home. Continuing after 10 hours.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: syslog() blocks on glibc 2.1.3 with kernel 2.2.x

2000-10-23 Thread Patrick J. LoPresti

Ricky Beam <[EMAIL PROTECTED]> writes:

> Personally, I'd look closely at your setup to determine exactly why
> this has become a problem.  named is being blocked on writing to
> /dev/log.  This should only happen if there is sufficient _local_
> syslog traffic to fill the buffer or syslogd has too much remote
> traffic to ever read from /dev/log.

There is a lot of local traffic as well, yes.

Lots of local traffic means named eventually finds itself waiting in
line to log.  Lots of remote traffic means syslogd is trying to talk
to named a lot (to do reverse lookups).  named waiting in line +
syslogd trying to talk to named == deadlock; this is not too hard to
see.

Once the name resolution times out, you might expect things to become
unstuck.  But they don't.

Perhaps syslogd is not giving higher priority to local messages; if it
did, maybe it could recover from the deadlock.  But this would not be
a reliable solution; the only reliable solution is for syslogd to be
independent of any processes which need to talk to it.

> Per chance are you running the name service caching daemon (nscd)?

No.

> I'd also guess you aren't disabling fsync() for your sysylog files
> (it's part of the syslog.conf format) -- this is a conciderable
> drain on syslogd.

I see no documentation for such an option in the syslog.conf man page.
This is with the current Red Hat 6.2 syslogd (package
sysklogd-1.3.31-17).

 - Pat
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: syslog() blocks on glibc 2.1.3 with kernel 2.2.x

2000-10-23 Thread Ricky Beam

On 23 Oct 2000, Patrick J. LoPresti wrote:
>So I have the glibc maintainer (and others) saying that syslog
>messages should never be dropped, and you saying that named should be
>dropping its syslog messages.

No, I didn't say they "should" be dropped but merely that dropping them
would fix your problem.  Personally, I'd look closely at your setup to
determine exactly why this has become a problem.  named is being blocked
on writing to /dev/log.  This should only happen if there is sufficient
_local_ syslog traffic to fill the buffer or syslogd has too much remote
traffic to ever read from /dev/log.

Per chance are you running the name service caching daemon (nscd)?  I'd
also guess you aren't disabling fsync() for your sysylog files (it's part
of the syslog.conf format) -- this is a conciderable drain on syslogd.

--Ricky

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds

On Mon, 23 Oct 2000, Alexander Viro wrote:
> 
> On Mon, 23 Oct 2000, Linus Torvalds wrote:
> > 
> > Nope, that just makes the race window smaller. We should check for i_size
> > after we've gotten the page table lock and just before actually entering
> > the page into the page tables. Otherwise we'll still race on SMP (a _very_
> > hard window to get into, admittedly).
> 
> Umm... I would probably remove Uptodate upon truncate() and check _that_
> in the place you've mentioned.

Works for me..

> > ClearPageDirty(page);
>   ClearPageUptodate(page);
> 
> How about that?

Makes sense. 

Note that I'd actually like to hear from Petr first _without_ any added
code in the nopage() handler - the issue of having a page mapped after
nopage() that we shouldn't have mapped is a separate one, and I'd first
like to hear if the problem really goes away even if the mapping bug is
still there..

And _then_ we fix the fact that we should not allow anybody to have a page
mapped past i_size.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: VMWare and kswapd

2000-10-23 Thread Shane Shrybman


On Mon, 23 Oct 2000, Michael Rothwell wrote:

> I'm trying out the new VMWare for Linux, and noticed that if I let it
> run for a couple of days, I get this:
> 
> Oct 16 15:07:49 cartman kernel: VM: do_try_to_free_pages failed for
> kswapd...
[...snip...]

I think this a known issue with the 2.2 kernels. Take a look in the
mailing list archives for a thread with subject line:

VM: do_try_to_free_memory failed for , 2.2.17, 2.2.18pre3

starting around Oct 11 or 12.


shane

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds

On Mon, 23 Oct 2000, Petr Vandrovec wrote:
> 
> Yes. With sleep(60) no oops occur (it takes ~45 secs to exit child). 
> This signals to me: should not vmtruncate_list acquire mm->mmap_sem, 
> if it modifies page tables?

No. 

It should get the page_table lock, but that is sufficient for anybody who
_clears_ page tables (and is pretty much the same case as paging something
out of somebody elses page tables).

>I cannot find anything what prevents doing 
> vmtruncate in one task and filemap_sync in another - neither
> page_table_lock spinlock, nor mmap_sem semaphore...

filemap_sync() does hold the page table lock. 

(Which is certainly not to say that it's necessarily bug-free, but I don't
see any obvious problems off-hand).

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Alexander Viro




On Mon, 23 Oct 2000, Linus Torvalds wrote:

> On Mon, 23 Oct 2000, Alexander Viro wrote:
> > 
> > That's fine, but I'm afraid that we'll need a bit more than that. A couple of
> > obvious ones:
> > * filemap_nopage() needs the second check for ->i_size. Upon exit.
> 
> Nope, that just makes the race window smaller. We should check for i_size
> after we've gotten the page table lock and just before actually entering
> the page into the page tables. Otherwise we'll still race on SMP (a _very_
> hard window to get into, admittedly).

Umm... I would probably remove Uptodate upon truncate() and check _that_
in the place you've mentioned.

> So how about this truncate_complete_page() implementation:
> 
>   /*
>* Try to get rid of a page.. Clear it if it fails
>* for some reason. The page must be locked upon calling
>* this function.
>*
>* We remove the page from the page cache _after_ we have
>* destroyed all buffer-cache references to it. Otherwise some
>* other process might think this inode page is not in the
>* page cache and creates a buffer-cache alias to it causing
>* all sorts of fun problems ...
>*/
>   static inline void truncate_complete_page(struct page *page)
>   {
>   /* Try to get rid of buffers */
>   if (page->buffers)
>   block_flushpage(page, 0);
>   
>   spin_lock(&pagecache_lock);
>   spin_lock(&pagemap_lru_lock);
>   
>   if (page_count(page) != 1) {
>   memclear_highpage_flush(page, 0, PAGE_CACHE_SIZE);
>   } else {
>   ClearPageDirty(page);
ClearPageUptodate(page);

How about that?

>   __lru_cache_del(page);
>   __remove_inode_page(page);
>   page_cache_release(page);
>   }
>   spin_unlock(&pagemap_lru_lock);
>   spin_unlock(&pagecache_lock);
>   }
> 
> we should probably special-case the "block_flushpage()" failed case, but
> the above should do reasonable things with it (because page_count() will
> be > 1 due to buffers).

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Linus Torvalds

On Mon, 23 Oct 2000, Tobias Ringstrom wrote:

> On 23 Oct 2000, Linus Torvalds wrote:
> 
> > Either 
> > 
> >  (a) Solaris has solved the faster-than-light problem, and Sun engineers
> >  should get a Nobel price in physics or something.
> > 
> >  (b) Solaris "scales" by being optimized for 1 entries, and not
> >  speeding up sufficiently for a small number of entries.
> > 
> > You make the call. 
> 
> You will probably get the 6.5 factor because you have some (big) contant
> setup time. For example
> 
>   t = 1700 + n
> 
> This gives you an increase of 6.5 from 100 to 1, although it is of
> course is O(n). No magic there... :-)

Indeed.

NOTE! I'm not saying that this is necessarily bad. It may well be that the
setup time means that the Solaris code actually _does_ perform really well
for the 1 entry case.

I hope nobody took my rant against "scalability" to mean that I consider
the Solaris case to necessarily be bad engineering. My rant was more about
people often thinking that "scalability == performance", which it isn't.
It may be that Solaris simply wants to do 1 entries really fast and
that they actually do really well, but it is clear that if so they have a
huge performance hit for the small cases, and people should realize that.

It's basically a matter of making the proper trade-offs. I think that the
"few file descriptors" case is actually arguably a very important one. Sun
(who has long since given up on the desktop) probably have a different
tradeoff, and maybe their big constant setup-time is worth it.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds

On Mon, 23 Oct 2000, Alexander Viro wrote:
> 
> That's fine, but I'm afraid that we'll need a bit more than that. A couple of
> obvious ones:
>   * filemap_nopage() needs the second check for ->i_size. Upon exit.

Nope, that just makes the race window smaller. We should check for i_size
after we've gotten the page table lock and just before actually entering
the page into the page tables. Otherwise we'll still race on SMP (a _very_
hard window to get into, admittedly).

> Moreover, what is (area->vm_mm == current->mm) doing in the existing check?

It's for ptrace. You can do ugly things with ptrace that aren't possible
for the process itself.

>   * truncate() should zero the page out if it doesn't remove it from
> cache.

So how about this truncate_complete_page() implementation:

/*
 * Try to get rid of a page.. Clear it if it fails
 * for some reason. The page must be locked upon calling
 * this function.
 *
 * We remove the page from the page cache _after_ we have
 * destroyed all buffer-cache references to it. Otherwise some
 * other process might think this inode page is not in the
 * page cache and creates a buffer-cache alias to it causing
 * all sorts of fun problems ...
 */
static inline void truncate_complete_page(struct page *page)
{
/* Try to get rid of buffers */
if (page->buffers)
block_flushpage(page, 0);

spin_lock(&pagecache_lock);
spin_lock(&pagemap_lru_lock);

if (page_count(page) != 1) {
memclear_highpage_flush(page, 0, PAGE_CACHE_SIZE);
} else {
ClearPageDirty(page);
__lru_cache_del(page);
__remove_inode_page(page);
page_cache_release(page);
}
spin_unlock(&pagemap_lru_lock);
spin_unlock(&pagecache_lock);
}

we should probably special-case the "block_flushpage()" failed case, but
the above should do reasonable things with it (because page_count() will
be > 1 due to buffers).

The above is obviously completely and utterly untested. Petr? Willing to
give this a go?

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

linux 2.2.18-pre17: "Kernel panic: LRU list corrupted"

2000-10-23 Thread H. Peter Anvin


Hi there,

I wanted to let you know that I was trying 2.2.18-pre17 on
hera.kernel.org, a uniprocessor with an SMP motherboard.  After about six
hours, it went catatonic, responding to pings and TCP SYNs but not doing
anything that required user space.

On the console, it had multiple copies of the message:

"Kernel panic: LRU list corrupted"  [fs/buffer.c:438]

... but no register dump.

I have fallen back to 2.2.17 and it has run stably for a few days now.

-hpa

-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Petr Vandrovec

On 23 Oct 00 at 13:57, Linus Torvalds wrote:
> On Mon, 23 Oct 2000, Petr Vandrovec wrote:
> 
> > First page->mapping == NULL entry in syslog is dated 22:23:58, but
> > couple of entries was lost before (probably I should print only '.' for
> > each such page; this run there was more than 100 such pages)
> > Another question is why SIGCHLD was delivered to parent AFTER ftruncate,
> > but exit was invoked couple of seconds before - maybe it syncs
> > child address space to disk?
> 
> exit() basically does do a msync(MSASYNC), so that could be it. 

Yes. With sleep(60) no oops occur (it takes ~45 secs to exit child). 
This signals to me: should not vmtruncate_list acquire mm->mmap_sem, 
if it modifies page tables? I cannot find anything what prevents doing 
vmtruncate in one task and filemap_sync in another - neither
page_table_lock spinlock, nor mmap_sem semaphore...
Thanks,
Petr Vandrovec
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Tobias Ringstrom

On 23 Oct 2000, Linus Torvalds wrote:

> Either 
> 
>  (a) Solaris has solved the faster-than-light problem, and Sun engineers
>  should get a Nobel price in physics or something.
> 
>  (b) Solaris "scales" by being optimized for 1 entries, and not
>  speeding up sufficiently for a small number of entries.
> 
> You make the call. 

You will probably get the 6.5 factor because you have some (big) contant
setup time. For example

t = 1700 + n

This gives you an increase of 6.5 from 100 to 1, although it is of
course is O(n). No magic there... :-)

/Tobias

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: make -j 2 broken?

2000-10-23 Thread Wakko Warner


> Trying to compile the current kernel (test10-pre4) with:
> 
> > make clean
> > make -j 2 bzImages modules modules_install 
> 
> will try to install the modules before they are built...
> This has previously been working (at least in early testX kernels).

I've done it before when I wasn't thinking, since it parallels those,
modules_install just calls modules_install.  AFAIK, it doesn't depend on
modules actually being compiled.

Typically, I do:
make -j 20 dep;make -j 20 bzImage modules;make modules_install

Yes, I've heard about the paralleled dep problem (there was one, right?),
but it hasn't effected me.

-- 
 Lab tests show that use of micro$oft causes cancer in lab animals
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Alexander Viro


On Mon, 23 Oct 2000, Linus Torvalds wrote:

> On Mon, 23 Oct 2000, Alexander Viro wrote:
> 
> > On Mon, 23 Oct 2000, Linus Torvalds wrote:
> > 
> > > Al, any ideas? I have this feeling that the simplest fix is just to leave
> > > the race open, and make truncate_complete_page() just leave such a "racy"
> > > page in the page cache. It will still race, and the invalid page will
> > > still exist, but the end result should be harmless.
> > 
> > Provided that we clean it - why the hell do we want to take it out of
> > the pagecache? I don't see any fundamental reasons to prohibit pages
> > past the ->i_size being hashed.
> 
> In fact, we used to allow them.
> 
> We do want to remove it from the page cache under normal circumstances: it
> makes for much better MM behaviour (ie free pages that are truly useless).
> But I suspect that just adding a test for "page_count(page) == 1" (ie same
> as for the page cache invalidation) before freeing it should give that
> advantage, along with avoiding the problematic unlikely case...

That's fine, but I'm afraid that we'll need a bit more than that. A couple of
obvious ones:
* filemap_nopage() needs the second check for ->i_size. Upon exit.
Moreover, what is (area->vm_mm == current->mm) doing in the existing check?
* truncate() should zero the page out if it doesn't remove it from
cache.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: syslog() blocks on glibc 2.1.3 with kernel 2.2.x

2000-10-23 Thread Patrick J. LoPresti

Ricky Beam <[EMAIL PROTECTED]> writes:

> syslogd isn't the blocker.  The syslog functions in glibc being
> called by named are the problem.  Stop named from blocking on syslog
> writes and the world will be happy again.

So I have the glibc maintainer (and others) saying that syslog
messages should never be dropped, and you saying that named should be
dropping its syslog messages.

One more time, from the top:

named is calling syslog(), a glibc function.  This function *blocks*
waiting for delivery to the local syslogd, even though it is using
SOCK_DGRAM sockets.  There is no option to openlog() or syslog() to
get non-blocking behavior (the LOG_NDELAY option means something else
entirely).

You are effectively suggesting that named should be rewritten not to
use the glibc syslog functions at all.  That strikes me as the worst
suggestion so far; it would be far better for syslogd not to do name
lookups.  But my syslogd has no option to avoid name lookups; I will
submit a request to add one.

> I've gotta ask what kind of "load" can cause this to happen.

> And for the record, syslogd shouldn't be doing DNS lookups for
> things arriving via /dev/log -- that's always the local machine.

This particular syslogd also accepts messages from remote hosts.  So
when there is a lot of syslog traffic, this syslogd talks to named a
lot.  named occasionally sends messages to syslog.  Since syslog
pauses waiting for named to respond to name queries, and named blocks
waiting for syslog to consume the message, a deadlock is triggered.
True, it is not a full deadlock, because the name query times out
eventually.  But it is bad enough that the system becomes largely
non-responsive.

 - Pat
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds

On Mon, 23 Oct 2000, Petr Vandrovec wrote:
> 
> Yes. Bad news. No problem was catched in filemap_nopage, but one
> (of 57000) pages was dirty and had page->mapping == NULL... (maybe
> only one was caused that this was just after bootup, with plenty of memory)
> Maybe I should look at readahead code?

Yes, you're right, read-ahead is in fact even more likely to catch the
race.

> In case of truncate it is going irrevocable away. Accesses after truncate
> should (and sometime give you) SIGBUS...

Yes. But I'd rather have a small race where we under certain strange
circumstances forget to raise a SIGBUS than have a kernel bug that causes
oops and possible filesystem corruption. 

So I'm not just looking for a fix, I'm also looking for a SIMPLE fix, with
perhaps some major surgery later (things like making the inode semaphore a
read-write semaphore and just always synchronizing 100% on i_size - which
would fix the problem, but is not a 2.4.x thing, no way Jose).

> First page->mapping == NULL entry in syslog is dated 22:23:58, but
> couple of entries was lost before (probably I should print only '.' for
> each such page; this run there was more than 100 such pages)
> Another question is why SIGCHLD was delivered to parent AFTER ftruncate,
> but exit was invoked couple of seconds before - maybe it syncs
> child address space to disk?

exit() basically does do a msync(MSASYNC), so that could be it. 

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds

On Mon, 23 Oct 2000, Alexander Viro wrote:

> On Mon, 23 Oct 2000, Linus Torvalds wrote:
> 
> > Al, any ideas? I have this feeling that the simplest fix is just to leave
> > the race open, and make truncate_complete_page() just leave such a "racy"
> > page in the page cache. It will still race, and the invalid page will
> > still exist, but the end result should be harmless.
> 
> Provided that we clean it - why the hell do we want to take it out of
> the pagecache? I don't see any fundamental reasons to prohibit pages
> past the ->i_size being hashed.

In fact, we used to allow them.

We do want to remove it from the page cache under normal circumstances: it
makes for much better MM behaviour (ie free pages that are truly useless).
But I suspect that just adding a test for "page_count(page) == 1" (ie same
as for the page cache invalidation) before freeing it should give that
advantage, along with avoiding the problematic unlikely case...

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: VMWare and kswapd

2000-10-23 Thread Petr Vandrovec


On 23 Oct 00 at 16:07, Michael Rothwell wrote:
> I'm trying out the new VMWare for Linux, and noticed that if I let it
> run for a couple of days, I get this:
> 
> Oct 16 15:07:49 cartman kernel: VM: do_try_to_free_pages failed for
> kswapd...
> 
> ... after which things go to hell pretty fast. The previous VMWare would
> oops the kernel so badly that I would go from working away to seeign the
> BIOS screen suddenly. Unfortunately, the new problem doesn't produce any
> other data, even and oops. If I produce memory contention (running
> Netscape at the same time as VMWare, for instance), it happens sooner.
> 
> This is on a 2.2.16+USB kernel.

You may want to upgrade to 2.2.17, or newer...
 
> I know the LKML isn't VMWare tech support. I'm just wondering if anyone
> else has been getting similar results but better data so that a bug
> report can be sent to VMWare.

It is better to ask in news://news.vmware.com, in 
vmware.for-linux.{experimental,misc}. But I recommend you to buy more
memory, or downgrade window manager. If you have only 128MB of RAM, and
you are running Gnome with XF4.0, and VMware with 80MB virtual NT,
you can stress system beyond limits...
   Petr Vandrovec
   [EMAIL PROTECTED]
   
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Topic for discussion: OS Design

2000-10-23 Thread Richard B. Johnson

On Mon, 23 Oct 2000, Dennis wrote:

> This is typical of the "linux mentality". Why do other OSs have solutions 
> that work, yet linux's method requires special coding? If it "has to be 
> done that way", why do other OS's have solutions that dont do it that way? 
> the size of the buffer is an annoyance but not a serious problem however.
> 

I'm not sure that Linux requires any special coding.

> printing directly to the console (as BSD does) is useful when debugging a 
> panic, as you can trace right to the panic point. Also certain levels of 

BSD does not write directly to the console. Its console is a direct
clone of Linux. I'm not sure which came first, but when you have a single
screen-card there are not too many ways to get the character and attribute
into screen memory. Linux allows the console to be redirected to a serial
port. BSD does not last time I checked.

> >
> >Bugs that were found when changing the design of various kernel
> >procedures, have been back-ported to the stable kernel series.
> 
> I never use development kernels, what Im talking about is each major 
> release is like starting from version 1.0. By the time it stabilizes, the 
> next major puts it back to square one.

What? The API has remained consistent since 0.99. It's only internal
kernel stuff that has changed. If you wrote code that worked on 0.99,
it will still compile and work on 2.2.17.

You could not have written code for 0.99 that used mmap() and some
other stuff because it had not been developed yet. However, all the
"Unix stuff" like read/write/open/close/ioctl/lseek, etc., and their
buffered versions like fopen() from the 'C' runtime library, have had
the same API since day one. Linux was developed, from the start, to
have a POSIX compatible API. Most of that API comes from the 'C'
runtime library, the exact same API used by BSD and all the other
OS's to which the GNU library has been ported. 

The only reason to get the 'latest' version is to take advantage
of increased functionality. This, by definition, means that something
has changed. That's what you upgrade for. The word "unstable" is a
misnomer.

> My point is that there is no "stable kernel series". 2.2.0 wasnt stable, 
> and neither was 2.2.3. Virtually all of the ethernet drivers still lock up 
> under heavy load in 2.2.17...and now with 2.4 there are more countless 
> adventures ahead

Which Ethernet drivers are you having trouble with? The ones that had
lockup problems (incidentally hardware related), now have reset code
that runs off a watch-dog.

> an example of "poor planning" is that skput and skpush will panic the 
> kernel if there is no room (this can happen with multiple encapsulations) 
> The proper behaviour would be to return a NULL pointer indicating failure, 
> and then to drop the frame and issue a warning.

The proper response to any resource not being available (in networking)
is to drop the packet on the floor, smash it into little pieces, and
don't tell anybody about it.

The packet will be sent again. But, if you can't transmit a packet,
therefore freeing a buffer, what do you do?

What you do is realilize that the failure to transmit was likely
caused by a disconnected wire. In the drivers I use, I simply pretend
that every packet got transmitted okay. This usually involves a
one or two-line modification to the driver.

This has nothing to do with poor planning. It just has to do with
a design decision that I didn't agree with. Somebody decided that
network data was precious and therefore the machine should kill itself
if necessary to get the data through. 

I didn't agree with this so I changed a few lines of code. You can't
kill any of my machines by flooding them and they never lock up.
Further, they run at 85 to 90 percent of the network physical layer
bandwidth. My main machine is our domain name-server, it gets between
2000 and 5000 hits per second. If it crashed, our whole LAN goes
down. It doesn't. It runs Linux-2.2.17.

Script started on Mon Oct 23 16:12:02 2000
# rlogin boneserver
Password: 
Last login: Mon Oct 23 11:28:29 from chaos.analogic.com
Linux 2.2.17.

[3g[25;9HH[25;17HH[25;25HH[25;33HH[25;41HH[25;49HH[25;57HH[25;65HH[25;73HH

<)0
# uptime
  4:12pm  up 24 days, 22:21, 11 users,  load average: 0.81, 0.62, 0.00
# exit
logout
rlogin: connection closed.
# exit
exit

Script done on Mon Oct 23 16:12:27 2000

Those 11 users are all network servers including samba for M$
connectivity.

One of the major advantages of Linux is that if you don't like
a design decision that was made, you are free to do it over the
way you think is right.

Sometimes you can convince others that your way is better. Sometimes
not. If so, your patch makes it into the main-line kernel, if not,
you patch your own future kernels so you get to retain your
improvements. FYI, if AC did not exist, another would appear
to fill the vacuum. Don't bitch. Make some improvements and send
patches.

Cheers,
Dick Johnson

Penguin :

Re: mount: Unable to handle kernel paging request at virtual address

2000-10-23 Thread Brian Gerst


David Dyck wrote:
> 
> I am getting a repeatable oops during the boot up phase,
> with linux 2.4.0  test10-pre4
> 
> Even a simple "mount /proc" command yields an oops.
> I believe I have the latest mount program.
> 
> Unable to handle kernel paging request at virtual address 08067000
> c01f90d0
> *pde = 07f42067
> Oops: 
> CPU:0
> EIP:0010:[]
> Using defaults from ksymoops -t elf32-i386 -a i386
> EFLAGS: 00010206
> eax:    ebx:    ecx: 00a0   edx: 08067280
> esi: 08067000   edi: c7ec3d80   ebp: c7f3ffbc   esp: c7f3ff64
> ds: 0018   es: 0018   ss: 0018
> Process mount (pid: 18, stackpage=c7f3f000)
> Stack: c7f3e000 08066280 1000 c0134610 c7ec3000 08066280 1000 c7f3e000
>08066270 08066260 080662b0 c7ec3000 0009 c01349b2 08066280 c7f3ffbc
>c7f3e000 c0ed 080662b0 bb84 c7f3e000   c010906b
> Call Trace: [] [] []
> Code: f3 a5 89 c1 f3 a4 89 c8 5b 5e 5f c3 8d 74 26 00 57 56 8b 7c
> 
> >>EIP; c01f90d0 <__generic_copy_from_user+30/40>   <=
> Trace; c0134610 
> Trace; c01349b2 
> Trace; c010906b 
> Code;  c01f90d0 <__generic_copy_from_user+30/40>  <_EIP>:
> Code;  c01f90d0 <__generic_copy_from_user+30/40>0:   f3 a5 
>repz movsl %ds:(%esi),%es:(%edi)   <=
> Code;  c01f90d2 <__generic_copy_from_user+32/40>2:   89 c1 
>mov%eax,%ecx
> Code;  c01f90d4 <__generic_copy_from_user+34/40>4:   f3 a4 
>repz movsb %ds:(%esi),%es:(%edi)
> Code;  c01f90d6 <__generic_copy_from_user+36/40>6:   89 c8 
>mov%ecx,%eax
> Code;  c01f90d8 <__generic_copy_from_user+38/40>8:   5b
>pop%ebx
> Code;  c01f90d9 <__generic_copy_from_user+39/40>9:   5e
>pop%esi
> Code;  c01f90da <__generic_copy_from_user+3a/40>a:   5f
>pop%edi
> Code;  c01f90db <__generic_copy_from_user+3b/40>b:   c3
>ret
> Code;  c01f90dc <__generic_copy_from_user+3c/40>c:   8d 74 26 00   
>lea0x0(%esi,1),%esi
> Code;  c01f90e0 <__strncpy_from_user+0/30>   10:   57
>push   %edi
> Code;  c01f90e1 <__strncpy_from_user+1/30>   11:   56
>push   %esi
> Code;  c01f90e2 <__strncpy_from_user+2/30>   12:   8b 7c 00 00   mov 
>   0x0(%eax,%eax,1),%edi

This should have been trapped by the exception handling routines.  One
possible explanation is that the exception table is not sorted correctly
by the linker.  This can happen if an exception entry is made for an
address that is in another section than .text.  The exception handler
does a binary search which can be tripped up by an out of sequence
entry.

Hmm, I wonder if GCC inlined do_test_wp_bit().  That would put an
exception in the .text.init section.  Could you check to see if this
happened?

--

Brian Gerst
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

make -j 2 broken?

2000-10-23 Thread Roger Larsson


Hi,

Trying to compile the current kernel (test10-pre4) with:

> make clean
> make -j 2 bzImages modules modules_install 

will try to install the modules before they are built...
This has previously been working (at least in early testX kernels).

make --version
  GNU Make version 3.79.1, by Richard Stallman and Roland McGrath. 

/RogerL

--
Home page:
  http://www.norran.net/nra02596/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Petr Vandrovec


On 23 Oct 00 at 16:13, Alexander Viro wrote:
> On Mon, 23 Oct 2000, Linus Torvalds wrote:
> 
> > Al, any ideas? I have this feeling that the simplest fix is just to leave
> > the race open, and make truncate_complete_page() just leave such a "racy"
> > page in the page cache. It will still race, and the invalid page will
> > still exist, but the end result should be harmless.
> 
> Provided that we clean it - why the hell do we want to take it out of
> the pagecache? I don't see any fundamental reasons to prohibit pages
> past the ->i_size being hashed. Methods _must_ check for ->i_size,
> but they do it anyway. All race-prevention is based on page locks and
> ->i_sem.
> 
> Yes, filemap_nopage() should check for i_size at the very end and fail if
> the page became off-limits. But that's completely unrelated issue - it's
> mmap semantics, not pagecache one.

Yes. Bad news. No problem was catched in filemap_nopage, but one
(of 57000) pages was dirty and had page->mapping == NULL... (maybe
only one was caused that this was just after bootup, with plenty of memory)
Maybe I should look at readahead code? Although to be clear I do not 
know why. Unless there is bug in logic in test program, it should 
first dirty pages, and AFTER that it should truncate - and unmap and 
exit, without ever touching pages of mapping... My first testcases were 
with this race (and with raw devices), but then I found (by removing 
more and more code) that no race (and no raw devices) are required...
 
> The point being: we should _never_ drop ->mapping unless the page is
> irrevocably going away. We can (and probably should) drop the off-limits
> page as soon as ->count hits zero, but we should not do it before that.

In case of truncate it is going irrevocable away. Accesses after truncate
should (and sometime give you) SIGBUS...

 total   used   free sharedbuffers cached
Mem:255768  42208 213560  0496  18420
-/+ buffers/cache:  23292 232476
Swap:   530136  13200 516936

Strace of another run:

1688  22:23:41.748438 execve("./oopsdemo", ["./oopsdemo"], [/* 18 vars */]) = 0
1688  22:23:41.749058 brk(0)= 0x8049ae8
1688  22:23:41.749399 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40016000
1688  22:23:41.749641 open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file 
or directory)
1688  22:23:41.749861 open("/etc/ld.so.cache", O_RDONLY) = 4
1688  22:23:41.750011 fstat(4, {st_mode=S_IFREG|0644, st_size=43818, ...}) = 0
1688  22:23:41.750253 old_mmap(NULL, 43818, PROT_READ, MAP_PRIVATE, 4, 0) = 0x40017000
1688  22:23:41.750408 close(4)  = 0
1688  22:23:41.750538 open("/lib/libc.so.6", O_RDONLY) = 4
1688  22:23:41.750676 fstat(4, {st_mode=S_IFREG|0755, st_size=1057576, ...}) = 0
1688  22:23:41.750878 read(4, 
"\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\224\314"..., 4096) = 4096
1688  22:23:41.751173 old_mmap(NULL, 1072484, PROT_READ|PROT_EXEC, MAP_PRIVATE, 4, 0) 
= 0x40022000
1688  22:23:41.751327 mprotect(0x4011e000, 40292, PROT_NONE) = 0
1688  22:23:41.751441 old_mmap(0x4011e000, 24576, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED, 4, 0xfb000) = 0x4011e000
1688  22:23:41.751633 old_mmap(0x40124000, 15716, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x40124000
1688  22:23:41.751809 close(4)  = 0
1688  22:23:41.753291 munmap(0x40017000, 43818) = 0
1688  22:23:41.753554 getpid()  = 1688
1688  22:23:41.753785 fstat(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(4, 1), ...}) = 0
1688  22:23:41.753993 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40017000
1688  22:23:41.754162 ioctl(1, TCGETS, {B38400 opost isig icanon echo ...}) = 0
1688  22:23:41.754430 write(1, "Go\n", 3) = 3
1688  22:23:41.754727 open("ram0", O_RDWR|O_CREAT, 0600) = 4
1688  22:23:41.754949 unlink("ram0")= 0
1688  22:23:41.755103 ftruncate(4, 234881024) = 0
1688  22:23:41.755228 old_mmap(NULL, 234881024, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 
0) = 0x40128000
1688  22:23:41.755700 pipe([5, 6])  = 0
1688  22:23:41.755847 pipe([7, 8])  = 0
1688  22:23:41.756024 fork()= 1689
1689  22:23:41.756735 close(6 
1688  22:23:41.756798 close(5 
1689  22:23:41.756844 <... close resumed> ) = 0
1688  22:23:41.756894 <... close resumed> ) = 0
1689  22:23:41.756949 close(7 
1688  22:23:41.756997 close(8 
1689  22:23:41.757041 <... close resumed> ) = 0
1688  22:23:41.757089 <... close resumed> ) = 0
1689  22:23:41.757139 close(4 
1688  22:23:41.757188 write(6, "\0", 1 
1689  22:23:41.757250 <... close resumed> ) = 0
1688  22:23:41.757301 <... write resumed> ) = 1
1689  22:23:41.757355 read(5,  
1688  22:23:41.757405 read(7,  
1689  22:23:41.757450 <... read resumed> "\0", 1) = 1
1689  22:23:49.260756 write(8, "\0", 1) = 1
1689  22:23:49.436204 read(5,  
1688  22:23:49.442969 <... read resumed> "\

Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Lyle Coder

Hi,
If you have a similar machine (in terms machine configuration) for both your
solaris and linux machines... could you tell us what the difference in total
time for 100 and 1 was?  i.e... dont compare solaris with 100
descripters vs solaris with 1 descriptors, but rather
Linux 100 descripters Vs. Solaris 100 descriptors  AND
Linux 1 descriptors Vs. Solaris 1 descriptors.

That would be useful informatio... I think.

Thanks
Lyle

Re: Linux's implementation of poll() not scalable?

[ Small treatize on "scalability" included. People obviously do not
  understand what "scalability" really means. ]

In article <[EMAIL PROTECTED]>,
Dan Kegel  <[EMAIL PROTECTED]> wrote:
>I ran a benchmark to see how long a call to poll() takes
>as you increase the number of idle fd's it has to wade through.
>I used socketpair() to generate the fd's.
>
>Under Solaris 7, when the number of idle sockets was increased from 100 to 
>1, the time to check for active sockets with poll() increased by a 
>factor of only 6.5.  That's a sublinear increase in time, pretty spiffy.

Yeah. It's pretty spiffy.

Basically, poll() is _fundamentally_ a O(n) interface. There is no way
to avoid it - you have an array, and there simply is _no_ known
algorithm to scan an array in faster than O(n) time. Sorry.

(Yeah, you could parallellize it.  I know, I know.  Put one CPU on each
entry, and you can get it down to O(1).  Somehow I doubt Solaris does
that.  In fact, I'll bet you a dollar that it doesn't).

So what does this mean?

Either

(a) Solaris has solved the faster-than-light problem, and Sun engineers
 should get a Nobel price in physics or something.

(b) Solaris "scales" by being optimized for 1 entries, and not
 speeding up sufficiently for a small number of entries.

You make the call.

Basically, for poll(), perfect scalability is that poll() scales by a
factor of 100 when you go from 100 to 1 entries. Anybody who does
NOT scale by a factor of 100 is not scaling right - and claiming that
6.5 is a "good" scale factor only shows that you've bought into
marketing hype.

In short, a 6.5 scale factor STINKS. The only thing it means is that
Solaris is slow as hell on the 100 descriptor case.

>Under Linux 2.2.14 [or 2.4.0-test1-pre4], when the number of idle sockets 
>was increased from  100 to 1, the time to check for active sockets with 
>poll() increased by a factor of 493 [or 300, respectively].

So, what you're showing is that Linux actually is _closer_ to the
perfect scaling (Linux is off by a factor of 5, while Solaris is off by
a factor of 15 from the perfect scaling line, and scales down really
badly).

Now, that factor of 5 (or 3, for 2.4.0) is still bad.  I'd love to see
Linux scale perfectly (which in this case means that 1 fd's should
take exactly 100 times as long to poll() as 100 entries take).  But I
suspect that there are a few things going on, one of the main ones
probably being that the kernel data working set for 100 entries fits in
the cache or something like that.

>Please, somebody point out my mistake.  Linux can't be this bad!

I suspect we could improve Linux in this area, but I hope that I pointed
out the most fundamental mistake you did, which was thinking that
"scalability" equals "speed".  It doesn't.

Scalability really means that the effort to handle a problem grows
reasonably with the hardness of the problem. And _deviations_ from that
are indications of something being wrong.

Some people think that super-linear improvements in scalability are
signs of "goodness".  They aren't.  For example, the classical reason
for super-linear SMP improvement (with number of CPU's) that people get
so excited about really means that something is wrong on the low end.
Often the "wrongness" is lack of cache - some problems will scale better
than perfectly simply because with multiple CPU's you have more cache.

The "wrongess" is often also selecting the wrong algorithm: something
that "scales well" by just being horribly slow for the small case, and
being "less bad" for the big cases.

In the end, the notion of "scalability" is meaningless. The only
meaningful thing is how quickly something happens for the load you have.
That's something called "performance", and unlike "scalability", it
actually has real-life meaning.

Under Linux, I'm personally more worried about the performance of X etc,
and small poll()'s are actually common. So I would argue that the
Solaris scalability is going the wrong way. But as performance really
depends on the load, and maybe that 1 entry load is what you
consider "real life", you are of course free to disagree (and you'd be
equally right ;)

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
_
Get Your Private, Free E-mail from MSN Hotmail at http://www.hotma

Re: syslog() blocks on glibc 2.1.3 with kernel 2.2.x

2000-10-23 Thread Ricky Beam

On 23 Oct 2000, Patrick J. LoPresti wrote:
>Turning down the DNS timeout would affect *all* name resolution on the
>system, right?  That is not acceptable.

You should be able to set it on a per-process basis (via an ENV var.)

>As I said, I already have a workaround, which is to have named log to
>a flat file.  I agree that this is a poor workaround, and the "right
>fix" is to modify syslogd not to perform blocking operations.  My only
>quibble is that SOCK_DGRAM is an odd transport to use here, even over
>AF_UNIX.

syslogd isn't the blocker.  The syslog functions in glibc being called by
named are the problem.  Stop named from blocking on syslog writes and the
world will be happy again.  I've gotta ask what kind of "load" can cause
this to happen.

And for the record, syslogd shouldn't be doing DNS lookups for things
arriving via /dev/log -- that's always the local machine.

--Ricky

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: syslog() blocks on glibc 2.1.3 with kernel 2.2.x

2000-10-23 Thread Patrick J. LoPresti

Ricky Beam <[EMAIL PROTECTED]> writes:

> I would suggest disabling name resolution for syslog, but that's an
> ugly option.  There's no way to stop a glibc system from doing a DNS
> query for a reverse lookup.  HOWEVER, you can set the DNS timeout to
> 1 second and set the resolver options to prevent recursion (answer
> from cache only.)

Recursion has nothing to do with it; as I said, the named on this
system is itself authoritative for all of the reverse lookups.

Turning down the DNS timeout would affect *all* name resolution on the
system, right?  That is not acceptable.

As I said, I already have a workaround, which is to have named log to
a flat file.  I agree that this is a poor workaround, and the "right
fix" is to modify syslogd not to perform blocking operations.  My only
quibble is that SOCK_DGRAM is an odd transport to use here, even over
AF_UNIX.

> PS: Technically, this is not a lockup.  syslogd should eventually
> timeout waiting for the DNS query and go about it's business.  Of
> course, that may be upwards of 45 seconds -- very annoying.

Yes.  We are able to log in to the machine eventually and restart the
offending processes.  But that is little consolation to our users who
notice the hang and the fallout afterward.

 - Pat
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Alexander Viro

On Mon, 23 Oct 2000, Linus Torvalds wrote:

> Al, any ideas? I have this feeling that the simplest fix is just to leave
> the race open, and make truncate_complete_page() just leave such a "racy"
> page in the page cache. It will still race, and the invalid page will
> still exist, but the end result should be harmless.

Provided that we clean it - why the hell do we want to take it out of
the pagecache? I don't see any fundamental reasons to prohibit pages
past the ->i_size being hashed. Methods _must_ check for ->i_size,
but they do it anyway. All race-prevention is based on page locks and
->i_sem.

Yes, filemap_nopage() should check for i_size at the very end and fail if
the page became off-limits. But that's completely unrelated issue - it's
mmap semantics, not pagecache one.

The point being: we should _never_ drop ->mapping unless the page is
irrevocably going away. We can (and probably should) drop the off-limits
page as soon as ->count hits zero, but we should not do it before that.

Comments?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: getting include-files from arch//subdir

2000-10-23 Thread Amit D Chaudhary


why not 
#include 

Amit


"Heusden, Folkert van" wrote:
> 
> I need to include (in a driver) a header-file from arch//subdir. I
> could, of course,
> do something like #include "../../arch/i386/{etc}" with a couple of #ifdef's
> to get things
> working for each environment. I guess that's now the way to do it cleanly.
> What would be _the_ way to do it?
> 
> Thanks.
> 
> Folkert van Heusden.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> Please read the FAQ at http://www.tux.org/lkml/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

VMWare and kswapd

2000-10-23 Thread Michael Rothwell


I'm trying out the new VMWare for Linux, and noticed that if I let it
run for a couple of days, I get this:

Oct 16 15:07:49 cartman kernel: VM: do_try_to_free_pages failed for
kswapd...

... after which things go to hell pretty fast. The previous VMWare would
oops the kernel so badly that I would go from working away to seeign the
BIOS screen suddenly. Unfortunately, the new problem doesn't produce any
other data, even and oops. If I produce memory contention (running
Netscape at the same time as VMWare, for instance), it happens sooner.

This is on a 2.2.16+USB kernel.

I know the LKML isn't VMWare tech support. I'm just wondering if anyone
else has been getting similar results but better data so that a bug
report can be sent to VMWare.

-M
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

RE: Linux's implementation of poll() not scalable?

2000-10-23 Thread David Schwartz



> Under Solaris 7, when the number of idle sockets was increased from
> 100 to 1, the time to check for active sockets with poll()
> increased by a factor of only 6.5.  That's a sublinear increase in time,
> pretty spiffy.

Under Solaris 7, when the number of idle sockets was decreased from 10,000
to 100, the time to check for active sockets with poll() decreased by a
factor of only 6.5. Shouldn't it be more like 100? Sounds like Solaris' poll
implementation is horribly broken for the most common case.

DS

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa

2000-10-23 Thread Linus Torvalds

On Mon, 23 Oct 2000, Petr Vandrovec wrote:
> > I'll take a better look at the truncate case (I consider the invalidate
> > case closed). Do you have a simple test-program around?
> 
> Well, I cannot say simple. As I was not able to reproduce it with only
> one task, code below:

Ok, without running this I can already guess at what's up.

One process does the "truncate()", and races with another process that
does a page-in.

The truncate code does

notify_change(ATTR_SIZE);

which will basically cause a "vmtruncate(inode, newsize)".

Just before this happened, the other process gets a page fault, and does a
page_cache_read(). Now, because we are low on memory (this is why it only
shows up when you're swapping), that read will take a while. During that
time, the vmtruncate() starts to execute, and sets inode->i_size to the
new value (that would cause us to no longer accept a page fault - but we
already got past that check in the faulting process). It will then
invalidate all the inode pages > i_size.

Now, the page faulting process comes back, and puts the page into the VM
space. Never mind that it has in the meantime gotten i_mapping = NULL due
to the other process doing the truncate.

Now, if I'm right, you should be able to add something like

if (!old_page->mapping)
printk("mapping went away from under us\n");

to just before the "return old_page()" case in the success path of
filemap_nopage() (mm/filemap.c), and you should see that printk() trigger
when the bug happens.

(There are other users too that are not synchronized wrt inode size
changes and could get an access to a page past the end of the file this
way - I just think that filemap_nopage is probably the one where this is
most easily seen).

The above is obviously not a bug-fix, it's just a validation of the
theory.

Al, any ideas? I have this feeling that the simplest fix is just to leave
the race open, and make truncate_complete_page() just leave such a "racy"
page in the page cache. It will still race, and the invalid page will
still exist, but the end result should be harmless.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] cpu detection fixes for test10-pre4

2000-10-23 Thread Pavel Machek


Hi!

> >> > * include/asm-i386/elf.h:
> >> >   - make Pentium IV and other post-P6 processors use the "i686"
> >> > family name (same fix as the system_utsname.machine init fix
> >> > which went into include/asm-i386/bugs.h in test10-pre4)
> >> > 
> >> 
> >> We should never have used anything but "i386" as the utsname... sigh.
> >
> >We stil can do that: make 2.4.0 use i386 for utsname on all x86 machines
> >(except x86-64 ;-), and let people adapt.
> 
> Better yet, make it use "ia32" to avoid confusion.

No. i386 is already in use by most people. No need to change that.
Pavel
-- 
I'm [EMAIL PROTECTED] "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

1 2 >

1 - 100 of 190 matches

Mail list logo