Re: Topic for discussion: OS Design
On Mon, Oct 23, 2000 at 12:51:16PM -0700, [EMAIL PROTECTED] wrote: > On Sun, 22 Oct 2000, Dwayne C . Litzenberger wrote: > > This user also wants a > > smooth GUI, a mouse pointer that doesn't flinch under load, > > Try andrea archangeli's VM patches. When I use those patches X gets much > smoother and xmms (with nice -5) never skips. 2.2 VM sucks, film at 11. What the realtion of these patches with Rick's new VM architecture for 2.4.x ? Will 2.4.x give similar performance you mentioned with andrea's patches ? - Gabor - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
XFree86 detects a "GeForce DDR" on my Athlon machine. Will CONFIG_FB_RIVA be useful to me?
Hi, My subject line says it all. I have an Athlon machine with a GeForce DDR video chipset. Is there benefit to my compiling the kernel with CONFIG_FB_RIVA enabled? The "Help" associated with the option mentions the TNT series, but not the GeForce series. Maybe they are the same? Miles - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
test10-pre5 -- Compile error drivers/video/video.o: In function `vesafb_set_disp': undefined references
I am experimenting with compiling lots of stuff as modules. I hit what is either a user error, a configuration script bug or a symbol export bug. ld -m elf_i386 -T /usr/src/linux/arch/i386/vmlinux.lds -e stext arch/i386/kernel/head.o arch/i386/kernel/init_task.o init/main.o init/version.o \ --start-group \ arch/i386/kernel/kernel.o arch/i386/mm/mm.o kernel/kernel.o mm/mm.o fs/fs.o ipc/ipc.o \ drivers/block/block.o drivers/char/char.o drivers/misc/misc.o drivers/net/net.o drivers/media/media.o drivers/parport/parport.a drivers/ide/idedriver.o drivers/cdrom/cdrom.a drivers/pci/pci.a drivers/video/video.o \ net/network.o \ /usr/src/linux/arch/i386/lib/lib.a /usr/src/linux/lib/lib.a /usr/src/linux/arch/i386/lib/lib.a \ --end-group \ -o vmlinux drivers/video/video.o: In function `vesafb_set_disp': drivers/video/video.o(.text+0x6811): undefined reference to `fbcon_cfb8' drivers/video/video.o(.text+0x6818): undefined reference to `fbcon_cfb16' drivers/video/video.o(.text+0x6821): undefined reference to `fbcon_cfb24' drivers/video/video.o(.text+0x6828): undefined reference to `fbcon_cfb32' Here is the associated part of .config: # # Console drivers # CONFIG_VGA_CONSOLE=y CONFIG_VIDEO_SELECT=y # # Frame-buffer support # CONFIG_FB=y CONFIG_DUMMY_CONSOLE=y CONFIG_FB_VESA=y CONFIG_VIDEO_SELECT=y CONFIG_FBCON_ADVANCED=y CONFIG_FBCON_MFB=m CONFIG_FBCON_CFB2=m CONFIG_FBCON_CFB4=m CONFIG_FBCON_CFB8=m CONFIG_FBCON_CFB16=m CONFIG_FBCON_CFB24=m CONFIG_FBCON_CFB32=m CONFIG_FBCON_VGA=m CONFIG_FBCON_FONTS=y CONFIG_FONT_8x8=y CONFIG_FONT_8x16=y - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Linux's implementation of poll() not scalable?
On Mon, 23 Oct 2000, Dan Kegel wrote: > > kqueue lets you associate an arbitrary integer with each event > specification; the integer is returned along with the event. > This is very handy for, say, passing the 'this' pointer of the > object that should handle the event. Yes, you can simulate it > with an array indexed by fd, but I like having it included. I agree. I thin the ID part is the most important part of the interfaces, because you definitely want to re-use the same functions over and over again - and you wan tto have some way other than just the raw fd to identify where the actual _buffers_ for that IO is, and what stage of the state machine we're in for that fd etc etc. Also, it's potentially important to have different "id"s for even the same fd - in some cases you want to have the same event handle both the read and the write part on an fd, but in other cases it might make more conceptual sense to separate out the read handling from the write handling, and instead of using "mask = POLLIN | POLLOUT", you'd just have two separate events, one with POLLIN and one with POLLOUT. This was what my "unsigned long id" was, but that is much too hard to use. See my expanded suggestion of it just a moment ago. (And yes, I'm sure you can do all this with kevents. I just abhor the syntax of those things). Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Linux's implementation of poll() not scalable?
On Mon, 23 Oct 2000, Dan Kegel wrote: > > >http://www.FreeBSD.org/cgi/man.cgi?query=kqueue&apropos=0&sektion=0&manpath=FreeBSD+5.0-current&format=html > describes the FreeBSD kqueue interface for events: I've actually read the BSD kevent stuff, and I think it's classic over-design. It's not easy to see what it's all about, and the whole tuple crap is just silly. Looks much too complicated. I don't believe in the "library" argument at all, and I think multiple event queues completely detract from the whole point of being simple to use and implement. Now, I agree that my bind_event()/get_event() has limitations, and the biggest one is probably the event "id". It needs to be better, and it needs to have more structure. The "id" really should be something that not only contains the "fd", but also contains the actor function to be called, along with some opaque data for that function. In fact, if you take my example server, and move the "handle[id]()" call _into_ get_events() (and make the "handle[id]()" function pointer a part of the ID of the event), then the library argument goes away too: it doesn't matter _who_ calls the get_event() function, because the end result is always going to be the same regardless of whether it is called from within a library or from a main loop: it's going to call the function handle associated with the ID that triggered. Basically, the main loop would boil down to for (;;) { static struct event ev_list[MAXEV]; get_event(ev_list, MAXEV, &tmout); .. timeout handling here .. } because get_even() would end up doing all the user-mode calls too (so "get_event()" is no longer a system call: it's a system call + a for-loop to call all the ID handler functions that were associated with the events that triggered). So the "struct event" would just be: struct event { int fd; unsigned long mask; void *opaque; void (*event_fn)(ind fd, unsigned long mask, void *opaque); } and there's no need for separate event queues, because the separate event queues have been completely subsumed by the fact that every single event has a separate event function. So now you'd start everything off (assuming the same kind of "listen to everything and react to it" server as in my previous example) by just setting bind_event(sock, POLLIN, NULL, accept_fn); which basically creates the event inside the kernel, and will pass it to the "__get_event()" system call through the event array, so the get_event() library function basically looks like int get_event(struct event *array, int maxevents, struct timeval *tv) { int nr = __get_event(array, maxevents, tv); int i; for (i = 0; i < nr; i++) { array->event_fn(array->fd, array->mask, array->opaque); array++; } return nr; } and tell me why you'd want to have multiple event queues any more? (In fact, you might as well move the event array completely inside "get_event()", because nobody would be supposed to look at the raw array any more. So the "get_event()" interface would be even simpler: just a timeout, nothing more. Talk about simple programming. (This is also the ideal event programming interface - signals get racy and hard to handle, while in the above example you can trivially just be single-threaded. Which doesn't mean that you CANNOT be multi-threaded if you want to: you multi-thread things by just having multiple threads that all call "get_event()" on their own). Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Linux's implementation of poll() not scalable?
Followup to: <[EMAIL PROTECTED]> By author:Dave Zarzycki <[EMAIL PROTECTED]> In newsgroup: linux.dev.kernel > > Maybe I'm missing something, but why do you seperate out fd from the event > structure. Why not just "int bind_event(struct event *event)" > > The only thing I might have done differently is allow for multiple event > queues per process, and the ability to add event queues to event > queues. But these features might not be all that useful in real life. > This could be useful if it doesn't screw up the implementation too much. Pretty much, what Linus is saying is the following: select()/poll() have one sizable cost, and that is to set up and destroy the set of events we want to trigger on. We would like to amortize this cost by making it persistent. It would definitely be useful for the user to have more than one such "event set" installed at any one time, so that you can call different wait_for_event() [or whatever] as appropriate. However, if that means we're doing lots of constructing and deconstructing in kernel space, then we probably didn't gain much. The other things I think we'd really like in a new interface is an interface where you can explicitly avoid the "storming hordes" problem -- if N servers is waiting for the same event, it should be at least possible to tell the kernel to only wake up one (arbitrarily chosen) of them, rather than all. Finally, it would be somewhat nice to have a unified interface for synchronous and asynchronous notification. This should be quite easily doable by adding a call event_notify(event_set,signal) that causes real-time signal "signal" to be raised (presumably with the event_set as the argument), when the specified event_set triggers. -hpa -- <[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private! "Unix gives you enough rope to shoot yourself in the foot." http://www.zytor.com/~hpa/puzzle.txt - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Linux's implementation of poll() not scalable?
Dan Kegel wrote: > [kqueue is] Pretty similar to yours, with the following additions: > > Your proposal seems to only have one stream of available events per > process. kqueue() returns a handle to an event queue, and kevent() > takes that handle as a first parameter. > > [kqueue] uses a single call to both bind (or unbind) an array of fd's > and pick up new events. Probably reduces overhead. > > [kqueue] allows you to watch not just sockets, but also plain files > or directories. (Hey, haven't we heard people talk about letting apps > do that under Linux? What interface does that use?) I forgot to mention: kqueue lets you associate an arbitrary integer with each event specification; the integer is returned along with the event. This is very handy for, say, passing the 'this' pointer of the object that should handle the event. Yes, you can simulate it with an array indexed by fd, but I like having it included. - Dan - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: LMbench 2.4.0-test10pre-SMP vs. 2.2.18pre-SMP
On 2000-10-23, Jeff Garzik <[EMAIL PROTECTED]> wrote: > Hardware: > Dual P-II 400 Mhz > 128 MB RAM > 13GB hard drive > First test was with 2.4.0-test10-pre3. > Next four tests were with 2.4.0-test10-pre4. > Final four tests were with 2.2.18-pre17. Would it be meaningful to run two concurrent LMbench runs on SMP 2.2.18-p17 vs two concurrent runs on 2.4.0-t10-p4 ? -- Hank Leininger <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Kernel 2.2.17 with RedHat 7 Problem !
On Mon, Oct 23, 2000 at 12:06:31PM +, David Wragg wrote: > Gregory Maxwell <[EMAIL PROTECTED]> writes: > > If 2.96 is broken, I'd appreciate it if you would describe the breakage. > > As in the RedHat 2.96? Try compiling the following on RedHat 7.0 x86 > with "gcc -O2" and take a look at the generated code. Nice, isn't it? > > > #include > > void foo(void) > { > struct itimerval iv; > > iv.it_interval.tv_sec = 0; > iv.it_interval.tv_usec = 25; > iv.it_value = iv.it_interval; > > setitimer(ITIMER_REAL, &iv, NULL); > } Yes, this is a bug in the compiler (which I hope to fix today, CVS gcc is broken as well), though the actual place which causes this to be miscompiled is in the system headers where a restrict keyword is used on an incomplete struct timeval forward definitions pointer and due to bug is set in the type structure itself (at least that's my guess, need to run it under debugger today - but if the select prototype is moved after the full struct timeval definition, everything works correctly). Note that gcc 2.95.2 has some restrict keyword related bugs as well (which glibc had to work around in the headers; the bug was in 2.95.x only), it is not just 2.96. Jakub - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Linux's implementation of poll() not scalable?
On Mon, 23 Oct 2000, Linus Torvalds wrote: > where you say "I want an array of pending events, and I have an array you > can fill with up to 'maxnr' events - and if you have no events for me, > please sleep until you get one, or until 'tmout'". > > The above looks like a _really_ simple interface to me. Much simpler than > either select() or poll(). Agreed? Totally. I've been wanting an API like this in user-space for a long time. One question about "int bind_event(int fd, struct event *event)" Maybe I'm missing something, but why do you seperate out fd from the event structure. Why not just "int bind_event(struct event *event)" The only thing I might have done differently is allow for multiple event queues per process, and the ability to add event queues to event queues. But these features might not be all that useful in real life. Hmmm... It might be cute if you could do something like this: int num_of_resulting_events; int r, q = open("/dev/eventq"); struct event interested_events[1024]; struct event events_that_happened[1024]; /* fill up interested_event_array */ write(q, interested_events, sizeof(interested_events)); r = read(q, events_that_happened, sizeof(events_that_happened)); num_of_resulting_events = r / sizeof(struct event); davez -- Dave Zarzycki http://thor.sbay.org/~dave/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Linux's implementation of poll() not scalable?
On Mon, 23 Oct 2000, Linus Torvalds wrote: > Here's a suggested "good" interface that would certainly be easy to > implement, and very easy to use, with none of the scalability issues that > many interfaces have. ... > It boils down to one very simple rule: dense arrays of sticky status > information is good. So let's design a good interface for a dense array. > > Basically, the perfect interface for events would be > > struct event { > unsigned long id; /* file descriptor ID the event is on */ > unsigned long event; /* bitmask of active events */ > }; > > int get_events(struct event * event_array, int maxnr, struct timeval >*tmout); > > int bind_event(int fd, struct event *event); http://www.FreeBSD.org/cgi/man.cgi?query=kqueue&apropos=0&sektion=0&manpath=FreeBSD+5.0-current&format=html describes the FreeBSD kqueue interface for events: struct kevent { uintptr_t ident;/* identifier for this event */ short filter; /* filter for event */ u_short flags;/* action flags for kqueue */ u_int fflags; /* filter flag value */ intptr_t data; /* filter data value */ void *udata; /* opaque user data identifier */ }; int kqueue(void); int kevent(int kq, const struct kevent *changelist, int nchanges, struct kevent *eventlist, int nevents, const struct timespec *timeout); Pretty similar to yours, with the following additions: Your proposal seems to only have one stream of available events per process. kqueue() returns a handle to an event queue, and kevent() takes that handle as a first parameter. Their proposal uses a single call to both bind (or unbind) an array of fd's and pick up new events. Probably reduces overhead. Their proposal allows you to watch not just sockets, but also plain files or directories. (Hey, haven't we heard people talk about letting apps do that under Linux? What interface does that use?) > The really nice part of the above is that it's trivial to implement. It's > about 50 lines of code, plus some simple logic to various drivers etc to > actually inform about the events. The way to do this simply is to limit it > in very clear ways, the most notable one being simply that there is only > one event queue per process (or rather, per "struct files_struct" - so > threads would automatically share the event queue). This keeps the > implementation simple, but it's also what keeps the interfaces simple: no > queue ID's to pass around etc. I dislike having only one event queue per process. Makes it hard to use in a library. ("But wait, *I* was going to use bind_event()!") > Advantage: everything is O(1), except for "get_event()" which is O(n) > where 'n' is the number of active events that it fills in. This is a big win, and is exactly the payoff that Solaris gets with /dev/poll. > Example "server": See http://www.monkeys.com/kqueue/echo.c for a very similar example server for kqueue() / kevent(). > You get the idea. Very simple, and looks like it should perform quite > admirably. With none of the complexities of signal handling, and none of > the downsides of select() or poll(). Both of the new system calls would be > on the order of 20-30 lines of code (along with the infrastructure to > extend the fasync stuff to also be able to handle events) Go, Linus, go! Let's do it! But don't go *too* simple. I really would like a few of the things kqueue has. > Yes, I'm probably missing something. But this is the kind of thing I think > we should look at (and notice that you can _emulate_ this with poll(), so > you can use this kind of interface even on systems that wouldn't support > these kinds of events natively). - Dan p.s. my collection of notes on kqueue is at http://www.kegel.com/c10k.html#nb.kqueue - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Linus Torvalds wrote: > Note that if there really are only 9 "nopage" routines, then it is a lot > easier to just add the single "SetPageUptodate(page)" into those 9 > routines, and thus let the VM know of the race. Works for me. And yes, the list of ->nopage instances that can return success is that short: drm_vm_shm_nopage drm_vm_shm_nopage_lock drm_vm_dma_nopage sgi_graphics_nopage via_mm_nopage ncp_file_mmap_nopage shm_nopage shmzero_nopage filemap_nopage - sorry, it's not 9+filemap_nopage, it's 8+filemap_nopage. However, it still leaves a window for the race: we invalidate first and remove from pagetables later. And ClearPageUptodate() is obviously truncate_inode_pages() work. Grr... - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Tue, 24 Oct 2000, Alexander Viro wrote: > > It's not the only problem, but I would feel _much_ safer if pagefault > wouldn't rely on pagecache miss. Actually... Hey. Why don't we do the > insertion into page tables _within_ ->nopage()? NO! We used to do this a LOONG time ago. Distributing the VM information on how to properly insert a pte into the page table into drivers is _NOT_ something I want to do again. It was a pain to get it properly modularized, we're not going back to the horror. Note that if there really are only 9 "nopage" routines, then it is a lot easier to just add the single "SetPageUptodate(page)" into those 9 routines, and thus let the VM know of the race. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Linux's implementation of poll() not scalable?
On Mon, 23 Oct 2000, Jordan Mendelson wrote: > What you describe is exactly what the /dev/poll interface patch from the > Linux scalability project does. > > It creates a special device which you can open up and write > add/remove/modify entries you wish to be notified of using the standard > struct pollfd. Removing entries is done by setting the events in a > struct written to the device to POLLREMOVE. And that's an ugly crap. * you add struct {} into the kernel API. _Always_ a bad idea. * you either create yet another example of "every open() gives a new instance" kind of device or you got to introduce a broker process. * no easy way to check the current set. * no fscking way to use that from scripts/etc. > You can optionally mmap() memory which the notifications are written to. > Two ioctl() calls are provide for the initial allocation and also to > force it to check all items in your poll() list. * useless use of ioctl() award > Solaris has this same interface minus the mmap()'ed memory. Oh, yes. Solaris. Great example of good taste... - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Linux's implementation of poll() not scalable?
On Mon, 23 Oct 2000, Linus Torvalds wrote: > > > What is your favourite interface then ? > > I suspect a good interface that can easily be done efficiently would > basically be something where the user _does_ do the equivalent of a > read-only mmap() of poll entries - and explicit and controlled > "add_entry()" and "remove_entry()" controls, so that the kernel can > maintain the cache without playing tricks. Actually, forget the mmap, it's not needed. Here's a suggested "good" interface that would certainly be easy to implement, and very easy to use, with none of the scalability issues that many interfaces have. First, let's see what is so nice about "select()" and "poll()". They do have one _huge_ advantage, which is why you want to fall back on poll() once the RT signal interface stops working. What is that? Basically, RT signals or any kind of event queue has a major fundamental queuing theory problem: if you have events happening really quickly, the events pile up, and queuing theory tells you that as you start having queueing problems, your latency increases, which in turn tends to mean that later events are even more likely to queue up, and you end up in a nasty meltdown schenario where your queues get longer and longer. This is why RT signals suck so badly as a generic interface - clearly we cannot keep sending RT signals forever, because we'd run out of memory just keeping the signal queue information around. Neither poll() nor select() have this problem: they don't get more expensive as you have more and more events - their expense is the number of file descriptors, not the number of events per se. In fact, both poll() and select() tend to perform _better_ when you have pending events, as they are both amenable to optimizations when there is no need for waiting, and scanning the arrays can use early-out semantics. So sticky arrays of events are good, while queues are bad. Let's take that as one of the fundamentals. So why do people still like RT signals? They do have one advantage, which is that you do NOT have that silly array traversal when there is nothing to do. Basically, the RT signals kind of approach is really good for the cases where select() and poll() suck: no need to traverse mostly empty and non-changing arrays all the time. It boils down to one very simple rule: dense arrays of sticky status information is good. So let's design a good interface for a dense array. Basically, the perfect interface for events would be struct event { unsigned long id; /* file descriptor ID the event is on */ unsigned long event;/* bitmask of active events */ }; int get_events(struct event * event_array, int maxnr, struct timeval *tmout); where you say "I want an array of pending events, and I have an array you can fill with up to 'maxnr' events - and if you have no events for me, please sleep until you get one, or until 'tmout'". The above looks like a _really_ simple interface to me. Much simpler than either select() or poll(). Agreed? Now, we still need to inform the kernel of what kind of events we want, ie the "binding" of events. The most straightforward way to do that is to just do a simple "bind_event()" system call: int bind_event(int fd, struct event *event); which basically says: I'm interested in the events in "event" on the file descriptor "fd". The way to stop being interested in events is to just set the event bitmask to zero. Now, the perfect interface would be the above. Nothing more. Nothing fancy, nothing complicated. Only really simple stuff. Remember the old rule: "keep it simple, stupid". The really nice part of the above is that it's trivial to implement. It's about 50 lines of code, plus some simple logic to various drivers etc to actually inform about the events. The way to do this simply is to limit it in very clear ways, the most notable one being simply that there is only one event queue per process (or rather, per "struct files_struct" - so threads would automatically share the event queue). This keeps the implementation simple, but it's also what keeps the interfaces simple: no queue ID's to pass around etc. Implementation is straightforward: the event queue basically consists of - a queue head in "struct files_struct", initially empty. - doing a "bind_event()" basically adds a fasync entry to the file structure, but rather than cause a signal, it just looks at whether the fasync entry is already linked into the event queue, and if it isn't (and the event is one of the ones in the event bitmask), it adds itself to the event queue. - get_event() just traverses the event queue and fills in the array, removing them from the event queue. End of story. If the event queue is empty, it trivially sees that in a single line of code (+ timeout handling) Advantage: everything is O(1), except for "get_event()" which is O(n) where 'n' is the number of active eve
Re: Linux's implementation of poll() not scalable?
Linus Torvalds wrote: > > On Tue, 24 Oct 2000, Andi Kleen wrote: > > > > I don't see the problem. You have the poll table allocated in the kernel, > > the drivers directly change it and the user mmaps it (I was not proposing > > to let poll make a kiobuf out of the passed array) > Th eproblem with poll() as-is is that the user doesn't really tell the > kernel explictly when it is changing the table.. What you describe is exactly what the /dev/poll interface patch from the Linux scalability project does. It creates a special device which you can open up and write add/remove/modify entries you wish to be notified of using the standard struct pollfd. Removing entries is done by setting the events in a struct written to the device to POLLREMOVE. You can optionally mmap() memory which the notifications are written to. Two ioctl() calls are provide for the initial allocation and also to force it to check all items in your poll() list. Solaris has this same interface minus the mmap()'ed memory. Jordan - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Linus Torvalds wrote: > > > On Mon, 23 Oct 2000, Alexander Viro wrote: > > > > Oh, crap... Who introduced ->i_mmap_shared/->i_mmap separation and what > > analysis had been done? Petr, can you reproduce the problem on -test7? > > I don't think that is it - that code looks very straightforward (and is > needed on some silly architectures that cannot easily otherwise see if > they need to be coherent wrt user space - mainly sparc and virtual > caches). OK, I see where the race can happen. Yes, vmtruncate() tries to kill the mappings. Right. However, it does that _after_ truncate_inode_pages(). And there is a window when the only lock we are holding is ->i_sem. sync_pte in that window ==> we are fucked. So the question being: WTF do we postpone zapping the page tables until after the truncate_inode_pages()? The following rules might make life simpler, AFAICS: * as soon as ->i_size is set, no new pagetable references to off-limits pages can appear. * as soon as we are don with vmtruncate_list() there is no pagetable references. * truncate_inode_pages() never has to deal with pages refered from pagetables. * ->i_size can't increase until we return from vmtruncate(). It's not the only problem, but I would feel _much_ safer if pagefault wouldn't rely on pagecache miss. Actually... Hey. Why don't we do the insertion into page tables _within_ ->nopage()? Look: let's take the tail of do_no_page() into helper function and just call it from the end of every bloody ->nopage() out there. It _is_ easy: we have only 9 instances in the tree not counting filemap_nopage(). Moreover, do_anonymous_page() will become symmetrical to the rest of the crowd. Comments? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
USB Printer, in 2.4.0-test9
Strange things here. I'm testing out 2.4.0-test9 kernel with USB, (reiserfs built in, but, hopefully this has nothing to do with it). Hardware is a 440FX Dual PPro200/Natoma + 82371SB PIIX3/USB. Printer: HP DeskJet 880C USB/Parallel Mouse: Microsoft Intellimouse with Intellieye USB Other stuff: Belkin "Macintosh" USB hub Symptoms: USB Mouse appears to work totally fine. USB Printer will print a few bytes, and suddenly print garbage (bitmap/pcl). I tried printing out some plain text and it's MANGLING bits - corrupting random bytes of data. The general structure of bytes are still there, but the resultant printout is gibberish. (source print file is a text file printed via "cat file >/dev/usblp0" (which is device 180,0)) I get a bunch of form feeds too but it continues to print a few characters fine and some that are totally wrong. It looks like it's corrupting about 5% of the characters, including some high bit 7 characters. Any ideas what's going on, and is this repeatable by anyone else? Bad hardware? This SMP box only supports one form of the MPS, and I'm not sure how to tell the difference... I also tried using the printer w/o the hub, and same results... Thanks! -bc - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: serial problems
Date:Sat, 21 Oct 2000 23:45:58 +0200 From: octave klaba <[EMAIL PROTECTED]> I use the serial cart with 5.03 / 2.2.17 Serial driver version 5.03 (2000-08-11) with MANY_PORTS SHARE_IRQ SERIAL_PCI enabled 00:0d.0 Serial controller: Timedia Technology Co Ltd: Unknown device 7168 (rev 01) I realized it can crash easy a system: run kermit on a ttySX to an another box. when you get the terminal just disconnect the cable and connect it on a another serial port (without any output) now quit kermit: your system should be crashed. Can you actually give me some details of how your system "crashed"? It certainly shouldn't have. Kermit will sometimes hang waiting for the terminal to flush if it's enabled hardware flow control and there are characters pending to be flushed. ^Z will generally break out of the waiting loop, at which point you can kill the kermit process with a kill command. ^C will also break you out back to the kermit prompt, at which point a second "quit" command will generally work. another bug (known bug I think) is: if you use a serial cable not connected, your system crashs on the boot. Huh? I have several machines without a serial cable, and it certainly doesn't crash on boot. - Ted - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Linux's implementation of poll() not scalable?
On Tue, 24 Oct 2000, Andi Kleen wrote: > > I don't see the problem. You have the poll table allocated in the kernel, > the drivers directly change it and the user mmaps it (I was not proposing > to let poll make a kiobuf out of the passed array) That's _not_ how poll() works at all. We don't _have_ a poll table in the kernel, and no way to mmap it. The poll() tables gets created dynamically based on the stuff that the user has set up in the table. And the user can obviously change the fd's etc in the table directly, so in order for the caching to work you need to do various games with page table dirty or writable bits, or at least test dynamically whether the poll table is the same as it was before. Sure, it's doable, and apparently Solaris does something like this. But what _is_ the overhead of the Solaris code for small number of fd's? I bet it really is quite noticeable. I also suspect it is very optimized toward an unchangning poll-table. > What is your favourite interface then ? I suspect a good interface that can easily be done efficiently would basically be something where the user _does_ do the equivalent of a read-only mmap() of poll entries - and explicit and controlled "add_entry()" and "remove_entry()" controls, so that the kernel can maintain the cache without playing tricks. Basically, something like a user interface to something that looks like the linux poll_table_page structures, with the difference being that it doesn't have to be created and torn down all the time because the user would explicitly ask for "add this fd" and "remove this fd" from the table. Th eproblem with poll() as-is is that the user doesn't really tell the kernel explictly when it is changing the table.. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Linux's implementation of poll() not scalable?
Nick Piggin ([EMAIL PROTECTED]) wrote: > > I'm trying to write a server that handles 1 clients. On 2.4.x, > > the RT signal queue stuff looks like the way to achieve that. > > I would suggest you try multiple polling threads. Not only will you get > better SMP scalability, if you have say 16 threads, each one only has to > handle ~ 600 fds. Good point. My code is already able to use multiple network threads, and I have done what you suggest in the past. But I'm interested in pushing the state of the art here, and want to see if Linux can handle it with just a single network thread. (My server has enough non-network threads to keep multiple CPUs busy, don't worry :-) - Dan - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Linux's implementation of poll() not scalable?
David Schwartz wrote: > > I'm trying to write a server that handles 1 clients. On 2.4.x, > > the RT signal queue stuff looks like the way to achieve that. > > Unfortunately, when the RT signal queue overflows, the consensus seems > > to be that you fall back to a big poll(). And even though the RT signal > > queue [almost] never overflows, it certainly can, and servers have to be > > able to handle it. > > Don't let that bother you. In the case where you get a hit a significant > fraction of the descriptors you are polling on, poll is very efficient. The > inefficiency comes when you have to wade through 10,000 uninteresting file > descriptors to find the one interesting one. If the poll set is rich in > ready descriptors, there is little advantage to signal queues over poll > itself. > > In fact, if you assume the percentage of ready file descriptors (as opposed > to the number of file descriptors) is constant, then poll is just as > scalable (theoretically) as any other method. Under both schemes, with twice > as many file descriptors you have to do twice as much work. Yep, I've made similar arguments myself. It's just that seeing poll() take 14 milliseconds to return on a 650 MHz system is a little daunting. I'll report again when I have results for RT signal stuff and different percentages of idle sockets (probably 0, 1, and 10). - Dan - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Alexander Viro wrote: > > Oh, crap... Who introduced ->i_mmap_shared/->i_mmap separation and what > analysis had been done? Petr, can you reproduce the problem on -test7? I don't think that is it - that code looks very straightforward (and is needed on some silly architectures that cannot easily otherwise see if they need to be coherent wrt user space - mainly sparc and virtual caches). Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
[drm:drm_release] *ERROR* Process 256 dead
Sorry if this is user-error, but after about 20min of using 2.4.0-test10-pre5, my Debian Woody system dropped out of X with this message in syslog: [drm:drm_release] *ERROR* Process 256 dead, freeing lock for context 1 I've never seen this before; I had been using test10-pre4 for several days without error. Is this a kernel bugglet/error, or X? -- Burton Windle [EMAIL PROTECTED] Linux: the "grim reaper of innocent orphaned children." from /usr/src/linux/init/main.c:1384 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
Oh, crap... Who introduced ->i_mmap_shared/->i_mmap separation and what analysis had been done? Petr, can you reproduce the problem on -test7? Unfortunately, clean test would take the backport of ext2 changes (truncate-related, happened around the same time), but IIRC -test7 was relatively stable... - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Linux's implementation of poll() not scalable?
Dan Kegel wrote: > > Jordan Mendelson ([EMAIL PROTECTED]) wrote: > > An implementation of /dev/poll for Linux already exists and has shown to > > be more scalable than using RT signals under my tests. A patch for 2.2.x > > and 2.4.x should be available at the Linux Scalability Project @ > > http://www.citi.umich.edu/projects/linux-scalability/ in the patches > > section. > > If you'll look at the page I linked to in my original post, > http://www.kegel.com/dkftpbench/Poller_bench.html > you'll see that I also benchmarked /dev/poll. The Linux /dev/poll implementation has a few "non-standard" features such as the ability to mmap() the poll structure memory to eliminate a memory copy. int dpoll_fd; unsigned char *dpoll; struct pollfd *mmap_dpoll; dpoll_fd = open("/dev/poll", O_RDWR, 0); ioctl(dpoll_fd, DP_ALLOC, 1); dpoll = mmap(0, DP_MMAP_SIZE(1), PROT_WRITE|PROT_READ, MAP_SHARED, dpoll_fd, 0); dpoll = (struct pollfd *)mmap_dpoll; Use this memory when reading and write() to add/remove and see if you get any boost in performance. Also, I believe there is a hash table associated with /dev/poll in the kernel patch which might slow down your performance tests when it's first growing to resize itself. Jordan - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Getting started
Hello, I'm a senior in high school with a fair bit of programming experience looking to get involved with the linux kernel. I'm currently reading a book on OS design and last summer I hacked the IP masquerading and port forwarding code for an internal company project so I'm not completely unknowledgable. What's a good project to start with? I've heard that device drivers are a fairly easy starting point. Unfortunately, all of my hardware already works with linux. Is there a list of hardware that does not work with linux anywhere? Can someone suggest an elegant, fairly simply device driver that I can take a look at? I also know a little bit of TCP/IP networking. Are there any small features currently missing from Linux's TCP/IP code? No need to reply directly to me, I'm subscribed. Thanks in advance, Avery Fay - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
RE: Linux's implementation of poll() not scalable?
> I'm trying to write a server that handles 1 clients. On 2.4.x, > the RT signal queue stuff looks like the way to achieve that. > Unfortunately, when the RT signal queue overflows, the consensus seems > to be that you fall back to a big poll(). And even though the RT signal > queue [almost] never overflows, it certainly can, and servers have to be > able to handle it. Don't let that bother you. In the case where you get a hit a significant fraction of the descriptors you are polling on, poll is very efficient. The inefficiency comes when you have to wade through 10,000 uninteresting file descriptors to find the one interesting one. If the poll set is rich in ready descriptors, there is little advantage to signal queues over poll itself. In fact, if you assume the percentage of ready file descriptors (as opposed to the number of file descriptors) is constant, then poll is just as scalable (theoretically) as any other method. Under both schemes, with twice as many file descriptors you have to do twice as much work. DS - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Netscape mail sometimes frezze when i read mail stored in a vfat partition
Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit [1.] One line summary of the problem: Netscape mail sometimes frezze when i read mail stored in a vfat partition [2.] Full description of the problem/report: When it happen the dmesg show as a kernel bug, and netscape got Z state: 213 tty1 Z 0:00 [netscape ] I use netscape with nsmail linked with a Maildir of windows version of netscape to use the same folders for linux and windows. It happen with kernel-2.4.0test8 .. and i upgrade to test9 because i see some updates and file.c but in ext2 fs not in vfat fs ... [3.] Keywords (i.e., modules, networking, kernel): vfat,netscape,file.c [4.] Kernel version (from /proc/version): Linux version 2.4.0-test9 (root@soft) (gcc version egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)) #1 sex out 20 12:23:54 BRST 2000 [5.] Output of Oops.. message (if applicable) with symbolic information resolved (see Documentation/oops-tracing.txt) kernel BUG at file.c:79! invalid operand: CPU:0 EIP:0010:[] EFLAGS: 00013296 eax: 0019 ebx: ecx: c3e793e0 edx: 0005 esi: 0020 edi: c170d800 ebp: c0e882c0 esp: c2d89e54 ds: 0018 es: 0018 ss: 0018 Process netscape (pid: 198, stackpage=c2d89000) Stack: c0294645 c02947c7 004f c015f868 0200 0004 c2d89eb8 0010 c0131f16 c170d800 0020 c0e882c0 0001 c103e1c4 0004 1000 c0e52ea0 c2d89eb8 c0e882c0 c0e9d000 0200 0020 Call Trace: [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] Code: 0f 0b 83 c4 0c 66 8b 47 20 66 89 45 0c 89 5d 04 80 4d 18 30 [6.] A small shell script or example program which triggers the problem (if possible) Netscape 4.75 [7.] Environment [7.1.] Software (add the output of the ver_linux script here) Distribution: Slackware 7.1 Netscape 4.75 Linux soft 2.4.0-test9 #1 sex out 20 12:23:54 BRST 2000 i586 unknown Kernel modules 2.3.13 Gnu C egcs-2.91.66 Gnu Make 3.79 Binutils 2.9.1.0.25 Linux C Library2.1.3 Dynamic linker ldd: version 1.9.9 Procps 2.0.6 Mount 2.10l Net-tools 1.55 Kbd0.99 Sh-utils 2.0 Modules Loaded [7.2.] Processor information (from /proc/cpuinfo): processor : 0 vendor_id : AuthenticAMD cpu family : 5 model : 8 model name : AMD-K6(tm) 3D processor stepping: 0 cpu MHz : 300.000689 cache size : 64 KB fdiv_bug: no hlt_bug : no sep_bug : no f00f_bug: no coma_bug: no fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr mce cx8 sep mmx 3dnow bogomips: 599.65 [7.3.] Module information (from /proc/modules): lp 5308 2 (autoclean) [7.4.] Loaded driver and hardware information (/proc/ioports, /proc/iomem) /proc/ioports -001f : dma1 0020-003f : pic1 0040-005f : timer 0060-006f : keyboard 0080-008f : dma page reg 00a0-00bf : pic2 00c0-00df : dma2 00f0-00ff : fpu 0170-0177 : ide1 01f0-01f7 : ide0 0213-0213 : isapnp read 0220-022f : soundblaster 02f8-02ff : serial(auto) 0330-0333 : MPU-401 UART 0376-0376 : ide1 03bc-03be : parport0 03c0-03df : vga+ 03f6-03f6 : ide0 03f8-03ff : serial(auto) 0620-0623 : sound driver (AWE32) 0a20-0a23 : sound driver (AWE32) 0a79-0a79 : isapnp write 0cf8-0cff : PCI conf1 0e20-0e23 : sound driver (AWE32) 5c20-5c3f : Acer Laboratories Inc. [ALi] M7101 PMU b400-b40f : Acer Laboratories Inc. [ALi] M5229 IDE b400-b407 : ide0 b408-b40f : ide1 b800-b81f : Realtek Semiconductor Co., Ltd. RTL-8029(AS) b800-b81f : NE2000 d000-dfff : PCI Bus #01 d800-d8ff : 3Dfx Interactive, Inc. Voodoo Banshee /proc/iomem -0009 : System RAM 000a-000b : Video RAM area 000c-000c7fff : Video ROM 000f-000f : System ROM 0010-07ffbfff : System RAM 0010-002f195f : Kernel code 002f1960-0031aae3 : Kernel data 07ffc000-07ffefff : ACPI Tables 07fff000-07ff : ACPI Non-volatile Storage de00-dfff : PCI Bus #01 de00-dfff : 3Dfx Interactive, Inc. Voodoo Banshee e000-e3ff : Acer Laboratories Inc. [ALi] M1541 e5f0-e7ff : PCI Bus #01 e600-e7ff : 3Dfx Interactive, Inc. Voodoo Banshee - : reserved [7.5.] PCI information ('lspci -vvv' as root) 00:00.0 Host bridge: Acer Laboratories Inc. [ALi] M1541 (rev 04) Subsystem: Acer Laboratories Inc. [ALi] ALI M1541 Aladdin V/V+ AGP System Controller Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=slow >TAbort- SERR- Capabilities: [e0] #00 [] 00:01.0 PCI bridge: Acer Laboratories Inc. [ALi] M5243 (rev 04) (prog-if 00 [Normal decode]) Control: I/O+ Me
[PATCH] fs/nls/Config.in -- Again
This is the suggestion that Petr made and was approved by the maintainer. Can we get this in 2.4.0-test10-pre6 please? Without this patch, if you have CONFIG_INET turned off, you have to go through the CONFIG_NLS stuffs. -- Tom Rini (TR1265) http://gate.crashing.org/~trini/ --- fs/nls/Config.in.orig Thu Oct 19 12:54:09 2000 +++ fs/nls/Config.inThu Oct 19 12:54:32 2000 @@ -2,10 +2,17 @@ # Native language support configuration # +# smb wants NLS +if [ "$CONFIG_SMB_FS" = "m" -o "$CONFIG_SMB_FS" = "y" ]; then + define_bool CONFIG_SMB_NLS y +else + define_bool CONFIG_SMB_NLS n +fi + # msdos and Joliet want NLS if [ "$CONFIG_JOLIET" = "y" -o "$CONFIG_FAT_FS" != "n" \ -o "$CONFIG_NTFS_FS" != "n" -o "$CONFIG_NCPFS_NLS" = "y" \ - -o "$CONFIG_SMB_FS" != "n" ]; then + -o "$CONFIG_SMB_NLS" = "y" ]; then define_bool CONFIG_NLS y else define_bool CONFIG_NLS n
Re: Linux's implementation of poll() not scalable?
On Mon, Oct 23, 2000 at 06:42:39PM -0700, Linus Torvalds wrote: > > > On Tue, 24 Oct 2000, Andi Kleen wrote: > > > > Also with the poll table mmap'ed via /dev/poll and the optimizations I > > described poll could be made quite nice (I know that queued SIGIO exists, > > but it has its drawbacks too and often you need to fallback to poll anyways) > > The problem is that your proposed optimizations would be horrible: the > poll events themselves happen when some other process is running, so you'd > have to do some serious VM hacking and consider the poll table to be some > kind of direct-IO thing etc. Ugh. I don't see the problem. You have the poll table allocated in the kernel, the drivers directly change it and the user mmaps it (I was not proposing to let poll make a kiobuf out of the passed array) > > And the file->fd mapping is not a simple mapping, it'a s 1:m mapping from > file -> potentially many [process,fd] pairs. Yes, but in 95% of all cases it is a 1:1 mapping, so you could just fall back to the old slow method when that isn't the case. > > Doing the caching across multiple poll-calls is even worse, you'd have to > cache where in user space people had the array etc. Not pretty. The file -> fdnum reverse table does not depend On every poll call you can walk the poll table and fix the pointer from file to poll entry. That can be done cheaply during copyin for normal poll. mmap'ed /dev/poll could probably use a lazy method (cache the pointers and verify and walk if it the verify fails) > I'm sure it can be speeded up, I'm just not sure it's really worth it if > you'd instead just have a better interface that wouldn't need this crap, > and consider poll() as just a compatibility layer. What is your favourite interface then ? ndi /dev/poll is nice in that it can be relatively easy hacked into older programs. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Linus Torvalds wrote: > Also, the fact that Petr didn't see anything trigger in nopage() makes me > nervous again. Even if the problem happened during read-ahead, it should > have gotten into the address space only through nopage. Maybe there is > some vma that isn't added to the right inode VM list - so that we end up > missing part of the vmtruncate() stuff? > > Al, any ideas? Just one: let's slow down a bit and try to write down the rules. Close to release or not, let's understand WTF happens in that part of VM before deciding on the choice of band-aids/fixes. Frankly, right now I don't feel that area - too many changes during the last month. So I'm going to sit down and read it through. IMO it's worth the delay - I have a gut feeling that band-aids are going to cost more. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
test10-pre5
Ok, the issue with Petr is still open, in the meantime the test10-pre5 stuff fixes various other small nagging issues (notably the silly extraneous BUG() tests that bit lots of people - sorry). Linus - - pre5: - Mikael Pettersson: more Pentium IV cleanup. - David Miller: non-x86 platforms missed "pte_same()". - Russell King: NFS invalidate_inode_pages() can do bad things! - Randy Dunlap: usb-core.c is gone - module fix - Ben LaHaise: swapcache fixups for the new atomic pte update code - Oleg Drokin: fix nm256_audio memory region confusion - Randy Dunlap: USB printer fixes - David Miller: sparc updates - David Miller: off-by-one error in /proc socket dumper - David Miller: restore non-local bind() behaviour. - David Miller: wakeups on socket shutdown() - Jeff Garzik: DEPCA net drvr fixes and CodingStyle - Jeff Garzik: netsemi net drvr fix - Jeff Garzik & Andrea Arkangeli: keyboard cleanup - Jeff Garzik: VIA audio update - Andrea Arkangeli: mxcsr initialization cleanup and fix - Gabriel Paubert: better twd_i387_to_fxsr() emulation - Andries Brouwer: proper error return in ext2 mkdir() - pre4: - disable writing to /proc/xxx/mem. Sure, it works now, but it's still a security risk. - IDE driver update (Victroy66 SouthBridge support) - i810 rng driver cleanup - fix sbus Makefile - named initializers in module.. - ppoe: remove explicit initializer - it's done with initcalls. - x86 WP bit detection: do it cleanly with exception handling - Arnaldo Carvalho de Melo: memory leaks in drivers/media/video - Bartlomiej Zolnierkiewicz: video init functions get __init - David Miller: get rid of net/protocols.c - they get to initialize themselves - David Miller: get rid of dev_mc_lock - we hold dev->xmit_lock anyway. - Geert Uytterhoeven: Zorro (Amiga) bus support update - David Miller: work around gcc-2.7.2 bug - Geert Uytterhoeven: mark struct consw's "const". - Jeff Garzik: network driver cleanups, ns558 joystick driver oops fix - Tigran Aivazian: clean up __alloc_pages(), kill_super() and notify_change() - Tigran Aivazian: move stuff from .data to .bss - Jeff Garzik: divert.h typename cleanups - James Simmons: mdacon using spinlocks - Tigran Aivazian: fix BFS free block calculation - David Miller: sparc32 works again - Bernd Schmidt: fix undefined C code (set/use without a sequence point) - Mikael Pettersson: nicer Pentium IV setup handling. - Georg Acher: usb-uhci cpia oops fix - Kanoj Sarcar: more node_data cleanups for [non]NUMA. - Richard Henderson: alpha update to new vmalloc setup - Ben LaHaise: atomic pte updates (don't lose dirty bit) - David Brownell: ohci memory debugging (== use separate slabs for allocation) - pre3: - update email address of Joerg Reuter - Andries Brouwer: spelling fixes, missing atari brelse(), breada() fix - Geert Uytterhoeven: used named initializers for "struct console". - Carsten Paeth: ISDN capifs - iput() only once. - Petr Vandrovec: VFAT short name generation fix - Jeff Garzik: i810_rng cleanup, and i815 chipset added. - Bartlomiej Zolnierkiewicz: clean up some remaining old-style Makefiles - Dave Jones: x86 setup fixes (recognize Pentium IV etc). - x86: do the "fast A20" setup too in setup.S - NIIBE Yutaka: update SuperH for the global page table (vmalloc) change. - David Miller: sparc updates (vmalloc stuff still pending) - David Miller: CodaFS warnings and 64-bit warnings in pci_size() - David Miller: pcnet32 - correct NULL test - David Miller: vmlist lock -> page_table_lock clarification - Trond Myklebust: Ouch. rpcauth_lookup_credcache() memory corruption bug - Matthew Wilcox: file locking cleanups - David Woodhouse: USB audio spinlock fixes - Torben Mathiasen: tlan driver cleanups - Randy Dunlap: Yenta: CACHE_LINE_SIZE is in dwords, not bytes. - Randy Dunlap: more USB updates - Kanoj Sarcar: clean up the NUMA interfaces (pg_data instead of nodes) - "save_fpu()" was broken. Need to clear pending errors: save_init_fpu(). - pre2: - remember to change the kernel version ;) - isapnp.txt bugfix - ia64 update - sparc update - networking update (pppoe init, frame diverter, fix tcp_sendmsg, fix udp_recvmsg). - Compile for WinChip must _not_ use "-march=i686". It's a i586. - Randy Dunlap: more USB updates - clarify the Firewire AIC-5800 situation. It's not supported yet. - PCI-space decode size fix. This is needed for some (broken?) hardware - /proc/self/maps off-by-one error - 3c501, 3c507, cs89x0 network drivers drop unnecessary check_region - Asahi Kasei AK4540: new codec ID. Yamaha: new PCI ID's. - ne2k-pci net driver documentation update - Paul Gortmaker: delete paranoia check in rtc_exit - scsi_merge: memset
Re: Linux's implementation of poll() not scalable?
On Tue, 24 Oct 2000, Andi Kleen wrote: > > Also with the poll table mmap'ed via /dev/poll and the optimizations I > described poll could be made quite nice (I know that queued SIGIO exists, > but it has its drawbacks too and often you need to fallback to poll anyways) The problem is that your proposed optimizations would be horrible: the poll events themselves happen when some other process is running, so you'd have to do some serious VM hacking and consider the poll table to be some kind of direct-IO thing etc. Ugh. And the file->fd mapping is not a simple mapping, it'a s 1:m mapping from file -> potentially many [process,fd] pairs. Doing the caching across multiple poll-calls is even worse, you'd have to cache where in user space people had the array etc. Not pretty. I'm sure it can be speeded up, I'm just not sure it's really worth it if you'd instead just have a better interface that wouldn't need this crap, and consider poll() as just a compatibility layer. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Linus Torvalds wrote: > > I'm starting to suspect that we leave this path as-is, and just fix the > mapping case (and PageUptodate() can work there). That should also avoid > the nasties. ..and even that looks like I'd have to do the quick-and-dirty case with the race still there on SMP. Adding the page Uptodate logic to the VM layer proper is too painful at this point - every single nopage function would have to be updated to mark its page up-to-date, as they don't generally do that currently. Also, the fact that Petr didn't see anything trigger in nopage() makes me nervous again. Even if the problem happened during read-ahead, it should have gotten into the address space only through nopage. Maybe there is some vma that isn't added to the right inode VM list - so that we end up missing part of the vmtruncate() stuff? Al, any ideas? Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Linux's implementation of poll() not scalable?
On Mon, Oct 23, 2000 at 06:16:24PM -0700, Linus Torvalds wrote: > > > On Tue, 24 Oct 2000, Andi Kleen wrote: > > > > It would be possible to setup a file -> fdnum reverse table (possibly cached > > over poll calls, I think Solaris does that) and let the async events directly > > change the bits in the output buffer in O(1). > > I disagree. > > Let's just face it, poll() is a bad interface scalability-wise. It is, but Linux poll is far from being a good implementation even in the limitations of the interfaces, as the comparison with Solaris shows. It is inherently O(n), but the constant part of that in Linux could be certainly improved. Also with the poll table mmap'ed via /dev/poll and the optimizations I described poll could be made quite nice (I know that queued SIGIO exists, but it has its drawbacks too and often you need to fallback to poll anyways) > > > Also the current 2.4 poll is very wasteful both in memory and cycles > > for small numbers of fd. > > Yes, we could go back to the "optimize the case of n < 8" thing. Which > should be particularly easy with the new "poll_table_page" setup: the > thing is more abstracted than it used to be in 2.2. I did it for n < 40. It also put the wait queues on the stack, in that it is even more efficient than 2.0 I attached the patch for your reference, it is against test3 or so (but iirc select/poll has not changed after that). I'm not proposing it for inclusion right now. It does only the easy parts of course, I think with some VFS extensions it could be made a lot better. BTW the linux select can be get a lower constant part of n too by simply testing whole words of the FD_SET for being zero and skipping them quickly (that makes it a lot cheaper when you have bigger holes in the select fd space). The patch does that too. -Andi --- linux/fs/select.c Mon Jul 24 07:39:44 2000 +++ linux-work/fs/select.c Sat Aug 5 17:23:18 2000 @@ -12,6 +12,9 @@ * 24 January 2000 * Changed sys_poll()/do_poll() to use PAGE_SIZE chunk-based allocation * of fds to overcome nfds < 16390 descriptors limit (Tigran Aivazian). + * + * July 2000 + * Add fast poll/select (Andi Kleen) */ #include @@ -24,6 +27,8 @@ #define ROUND_UP(x,y) (((x)+(y)-1)/(y)) #define DEFAULT_POLLMASK (POLLIN | POLLOUT | POLLRDNORM | POLLWRNORM) +#define __MAX_POLL_TABLE_ENTRIES ((PAGE_SIZE - sizeof (struct poll_table_page)) / +sizeof (struct poll_table_entry)) + struct poll_table_entry { struct file * filp; wait_queue_t wait; @@ -32,42 +37,43 @@ struct poll_table_page { struct poll_table_page * next; - struct poll_table_entry * entry; + int nr, max; struct poll_table_entry entries[0]; }; -#define POLL_TABLE_FULL(table) \ - ((unsigned long)((table)->entry+1) > PAGE_SIZE + (unsigned long)(table)) - /* - * Ok, Peter made a complicated, but straightforward multiple_wait() function. - * I have rewritten this, taking some shortcuts: This code may not be easy to - * follow, but it should be free of race-conditions, and it's practical. If you - * understand what I'm doing here, then you understand how the linux - * sleep/wakeup mechanism works. - * - * Two very simple procedures, poll_wait() and poll_freewait() make all the - * work. poll_wait() is an inline-function defined in , - * as all select/poll functions have to call it to add an entry to the - * poll table. + * Tune fast poll/select. Limit is the kernel stack. */ +#define FAST_SELECT_LIMIT 40 +#define FAST_POLL_LIMIT 40 + +#define FAST_SELECT_BYTES ((FAST_SELECT_LIMIT + 7) / 8) +#define FAST_SELECT_ULONG \ + ((FAST_SELECT_BYTES + sizeof(unsigned long) - 1) / sizeof(unsigned long)) + +static inline void do_freewait(struct poll_table_page *p) +{ + struct poll_table_entry * entry; + entry = p->entries + p->nr; + while (p->nr > 0) { + p->nr--; + entry--; + remove_wait_queue(entry->wait_address,&entry->wait); + fput(entry->filp); + } +} void poll_freewait(poll_table* pt) { + struct poll_table_page *old; struct poll_table_page * p = pt->table; while (p) { - struct poll_table_entry * entry; - struct poll_table_page *old; - - entry = p->entry; - do { - entry--; - remove_wait_queue(entry->wait_address,&entry->wait); - fput(entry->filp); - } while (entry > p->entries); old = p; + do_freewait(p); p = p->next; - free_page((unsigned long) old); + if (old->max == __MAX_POLL_TABLE_ENTRIES) { + free_page((unsigned long) old); + } } } @@ -75,7 +81,7 @@ { struct poll_table_page *table = p->table; - if (!table || POLL_TABLE_FULL(table)) { + if (!table ||
Re: Linux's implementation of poll() not scalable?
> I'm trying to write a server that handles 1 clients. On 2.4.x, > the RT signal queue stuff looks like the way to achieve that. I would suggest you try multiple polling threads. Not only will you get better SMP scalability, if you have say 16 threads, each one only has to handle ~ 600 fds. Nick. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Linux's implementation of poll() not scalable?
On Tue, 24 Oct 2000, Andi Kleen wrote: > > It would be possible to setup a file -> fdnum reverse table (possibly cached > over poll calls, I think Solaris does that) and let the async events directly > change the bits in the output buffer in O(1). I disagree. Let's just face it, poll() is a bad interface scalability-wise. If you want to do efficient events, you should have some other approach, like an event queue. Yes, I know it's a dirty word because NT uses it, but let's face it, poll() was a hack to make it easier to do something select()-like but with the same machinery as select. > Also the current 2.4 poll is very wasteful both in memory and cycles > for small numbers of fd. Yes, we could go back to the "optimize the case of n < 8" thing. Which should be particularly easy with the new "poll_table_page" setup: the thing is more abstracted than it used to be in 2.2. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
RAID setup
Hi, Does Linux give options to choose RAID levels from BIOS? with regards, Anil __ Do You Yahoo!? Yahoo! Messenger - Talk while you surf! It's FREE. http://im.yahoo.com/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: make -j 2 broken?
On Mon, Oct 23, 2000 at 05:22:22PM -0400, Wakko Warner wrote: > > Trying to compile the current kernel (test10-pre4) with: > > > > > make clean > > > make -j 2 bzImages modules modules_install > > > > will try to install the modules before they are built... > > This has previously been working (at least in early testX kernels). > > I've done it before when I wasn't thinking, since it parallels those, > modules_install just calls modules_install. AFAIK, it doesn't depend on > modules actually being compiled. > > Typically, I do: > make -j 20 dep;make -j 20 bzImage modules;make modules_install > > Yes, I've heard about the paralleled dep problem (there was one, right?), > but it hasn't effected me. Or even better, to be sure you don't get any unpleasant surprises: make -j 20 dep && make -j 20 bzImage modules && make modules_install /David _ _ // David Weinehall <[EMAIL PROTECTED]> /> Northern lights wander \\ // Project MCA Linux hacker// Dance across the winter sky // \> http://www.acc.umu.se/~tao/http://www.tux.org/lkml/
Re: IDE-Floppy and devfs
I did report this back a month or so ago. Another work around is to have a device node (22,0 in my case) around to bang on for a sec. After having the kernel poke the device, the appropriate devfs node appears automagically. My 0.02USD. Josh On Sun, 22 Oct 2000, Andreas Franck wrote: > Hi Paul, hi linux-kernel audience, > > some szggestion for the upcoming 2.4 release ide-floppy driver: It > really works > nice without devfs, but the ide-floppy behaviour in connection with > devfs is > a bit strange. > > If ide-floppy is compiled as a module, which is loaded (or autoloaded by > some smart > devfs rule) when no disk is in the drive, NO devfs entries are created, > so there > is no way to access the drive. Even worse, when a disk is inserted, the > module is > still loaded so there is no way to access the disk! Only manual > unloading and reloading > of the module will do the trick. > > I'd suggest something like the cdrom approach: There is always one > device node for > removable devices, regardless of any media present in the drive. This > could > be /dev/ide/host0/bus1/target1/lun0/floppy or something like that. > > Accessing this file should trigger a probe for the media (which may be > partitioned > in ide-floppy devices, which makes life a bit harder). By this probe, > the > /dev/ide/.../lun0/disc and /dev/ide/.../lun0/part4 (or any other > partitions) > might be created, if there is a medium in the drive; a symlink to the > "right" partition > (part4 for normal ZIP disks, AFAIK) shhould be placed in a directory of > its own, > for example /dev/ide/floppy/c0b1t1u0, and anotherone perhaps in > /dev/idefloppy/floppy0 > or something. > > I have already implemented the first half of this (creation of the > floppy node which will > trigger the partition scan when accessed), I have attached my patch for > review here. > Its still quite hackish, but I'm sure you can follow my ideas with what > I explained above > - if not, don't hesitate to ask. > > ->- snip -<- > > -- > ->>>--- Andreas Franck <<<- > ---<<< [EMAIL PROTECTED] --->>>--- > ->>> Keep smiling! <<<- > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to [EMAIL PROTECTED] > Please read the FAQ at http://www.tux.org/lkml/ > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
[OOPS] with 2.2.17+ide+ext3
Casually browsing through my system logs, I came upon this two oopses that happened together (logged as same second). I don't really remember what situation was surrounding, or even if any interruption was experienced. The system did totally freeze just under 30 minutes later, however, with no oops logged at that time. Neither SysRq nor Ctrl+Alt+Del responded; I had to hit the reset button on the case. Attached are the two recorded oopses as processed by ksymoops. If it makes much difference, the oops wasn't run through ksymoops until after a reboot. ksymoops 2.3.4 on i586 2.2.17ext3. Options used -V (default) -k /proc/ksyms (default) -l /proc/modules (default) -o /lib/modules/2.2.17ext3/ (default) -m /boot/System.map-2.2.17ext3 (default) Warning: You did not tell me where to find symbol information. I will assume that the log matches the kernel and modules that are running right now and I'll use the default options above for symbol resolution. If the current kernel and/or modules do not match the log, you can get more accurate output by telling me the kernel version and where to find map, modules, ksyms etc. ksymoops -h explains the options. Warning (compare_maps): mismatch on symbol V32U96eyeLocation , pctel says c410b5b0, /lib/modules/2.2.17ext3/misc/pctel.o says c41284d4. Ignoring /lib/modules/2.2.17ext3/misc/pctel.o entry Oops: CPU:0 EIP:0010:[sock_poll+26/48] EFLAGS: 00013293 eax: 9d129dbe ebx: c12da8c0 ecx: c1f87000 edx: c12da8c0 esi: edi: 0400 ebp: 000a esp: c2241ee8 ds: 0018 es: 0018 ss: 0018 Process X (pid: 288, process nr: 28, stackpage=c2241000) Stack: 0040 0020 c2c43540 c012fa5a c12da8c0 c1f87000 1000 0008 0020 c2c43540 Call Trace: [do_select+274/512] [sys_select+881/1176] [sys_gettimeofday+32/148] [system_call+52/56] Code: 8b 50 08 51 50 53 8b 42 20 ff d0 83 c4 10 5b 83 c4 18 c3 8d Using defaults from ksymoops -t elf32-i386 -a i386 Code; Before first symbol <_EIP>: Code; Before first symbol 0: 8b 50 08 mov0x8(%eax),%edx Code; 0003 Before first symbol 3: 51push %ecx Code; 0004 Before first symbol 4: 50push %eax Code; 0005 Before first symbol 5: 53push %ebx Code; 0006 Before first symbol 6: 8b 42 20 mov0x20(%edx),%eax Code; 0009 Before first symbol 9: ff d0 call *%eax Code; 000b Before first symbol b: 83 c4 10 add$0x10,%esp Code; 000e Before first symbol e: 5bpop%ebx Code; 000f Before first symbol f: 83 c4 18 add$0x18,%esp Code; 0012 Before first symbol 12: c3ret Code; 0013 Before first symbol 13: 8d 00 lea(%eax),%eax Oops: CPU:0 EIP:0010:[locks_remove_posix+32/148] EFLAGS: 00013286 eax: 9d129d12 ebx: c12da8c0 ecx: 9d129d12 edx: c12da8c0 esi: c0024e00 edi: ebp: 9d129d82 esp: c2241d6c ds: 0018 es: 0018 ss: 0018 Process X (pid: 288, process nr: 28, stackpage=c2241000) Stack: 0001 9d129d12 0300 c0126a68 c07753e0 9d129dc6 c32f7af8 c0126a8e c07753e0 c0221da4 c0221da3 3246 c2b37660 c0221da3 c0221da4 c01133d3 c3def6e0 c32f7ae0 c2242000 c012596b c12da8c0 c2263400 Call Trace: [fput+32/84] [fput+70/84] [mmput+63/72] [filp_close+83/108] [mm_release+16/52] [do_exit+320/672] [do_exit+201/672] Oct 22 23:29:42 hapablap kernel:[die+71/72] [error_table+9294/9568] [error_table+9216/9568] [do_page_fault+729/992] [error_table+9294/9568] [error_code+45/52] [sock_poll+26/48] [alloc_wait+23/152] Code: 8b 71 70 85 f6 74 62 90 f6 46 24 01 74 52 8b 44 24 64 39 46 Code; Before first symbol <_EIP>: Code; Before first symbol 0: 8b 71 70 mov0x70(%ecx),%esi Code; 0003 Before first symbol 3: 85 f6 test %esi,%esi Code; 0005 Before first symbol 5: 74 62 je 69 <_EIP+0x69> 0069 Before first symbol Code; 0007 Before first symbol 7: 90nop Code; 0008 Before first symbol 8: f6 46 24 01 testb $0x1,0x24(%esi) Code; 000c Before first symbol c: 74 52 je 60 <_EIP+0x60> 0060 Before first symbol Code; 000e Before first symbol e: 8b 44 24 64 mov0x64(%esp,1),%eax Code; 0012 Before first symbol 12: 39 46 00 cmp%eax,0x0(%esi) 2 warnings issued. Results may not be reliable.
Re: Linux's implementation of poll() not scalable?
Jordan Mendelson ([EMAIL PROTECTED]) wrote: > An implementation of /dev/poll for Linux already exists and has shown to > be more scalable than using RT signals under my tests. A patch for 2.2.x > and 2.4.x should be available at the Linux Scalability Project @ > http://www.citi.umich.edu/projects/linux-scalability/ in the patches > section. If you'll look at the page I linked to in my original post, http://www.kegel.com/dkftpbench/Poller_bench.html you'll see that I also benchmarked /dev/poll. When finding 1 active fd among 1, the Solaris implementation creamed the living snot out of the Linux one, even though the Solaris hardware was 5 or so times slower (see the lmbench results on that page). Here are the numbers (times in microseconds): On a 167MHz sun4u Sparc Ultra-1 running SunOS 5.7 (Solaris 7) Generic_106541-11: pipes1001000 1 select151 - - poll470 6763742 /dev/poll 61 70 92 On an idle 650 MHz dual Pentium III running Red Hat Linux 6.2 with kernel 2.2.14smp plus the /dev/poll patch: pipes1001000 1 select 28 - - poll 23 890 11333 /dev/poll 19 1464264 On the same machine as above, but with vanilla kernel 2.4.0-test10-pre4 smp: pipes1001000 1 select 52 - - poll 491184 14660 > It works fairly well, but I was actually somewhat disappointed to find > that it wasn't the primary cause for the system CPU suckage for my > particular system. Granted, when you only have to poll a few times per > second, the overhead of standard poll() just isn't that bad. If you have to poll 100 or fewer sockets, Linux's poll is quite good. If you have to poll 1000 or so sockets, Linux's /dev/poll works well (I wish it were available for the current kernels). But if you have to poll 1 or sockets on Linux, neither standard poll nor /dev/poll as currently implemented performs adequately. I suspect that RT signals on Linux will do much better for N=1 than the current /dev/poll, and hope to benchmark that soon. - Dan - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Petr Vandrovec wrote: > > With ClearPageDirty() kernel locked up (but no watchdog, so probably > some livelock) during bootup after fsck / Actually, it turns out that even with this issue fixed, there's the more serious issue that the page _has_ to be removed from the page cache once we get to the point that we're freeing the whole inode (which also calls truncate_inode_pages). Which means that we cannot take the easy way out and say "let's delay the freeing until everything is ok" - even if a buffer is busy due to some unfortunate timing (so that we turn the page into anonymous buffers), we'd better get rid of the page from the page cache. I'm starting to suspect that we leave this path as-is, and just fix the mapping case (and PageUptodate() can work there). That should also avoid the nasties. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Linux's implementation of poll() not scalable?
On Mon, Oct 23, 2000 at 11:14:35AM -0700, Linus Torvalds wrote: > [ Small treatize on "scalability" included. People obviously do not > understand what "scalability" really means. ] > > In article <[EMAIL PROTECTED]>, > Dan Kegel <[EMAIL PROTECTED]> wrote: > >I ran a benchmark to see how long a call to poll() takes > >as you increase the number of idle fd's it has to wade through. > >I used socketpair() to generate the fd's. > > > >Under Solaris 7, when the number of idle sockets was increased from > >100 to 1, the time to check for active sockets with poll() > >increased by a factor of only 6.5. That's a sublinear increase in time, > >pretty spiffy. > > Yeah. It's pretty spiffy. > > Basically, poll() is _fundamentally_ a O(n) interface. There is no way > to avoid it - you have an array, and there simply is _no_ known > algorithm to scan an array in faster than O(n) time. Sorry. One problem in Linux is that it scans multiple times. At least 4 times currently: copyin, setup of wait queues, ask for results, copyout. copyin and copyout are relatively cheap (and could be made even cheaper with /dev/poll), the problem are the two other passes which involve a lot of function pointer calls which generally cause pipeline stalls in modern CPUs. It would be possible to setup a file -> fdnum reverse table (possibly cached over poll calls, I think Solaris does that) and let the async events directly change the bits in the output buffer in O(1). This would save one pass. It may also be possible to cache the wait queue setup over polls, this would make poll much cheaper in terms of cache lines used. Also the current 2.4 poll is very wasteful both in memory and cycles for small numbers of fd. I did some experiments with a poll fast path for small ns by falling back to the 2.0 stack allocation method, and it decreased latency dramatically (>-30%). It also saves a lot of memory, because all the daemons only polling for a few fds wouldn't use two additional pages to their stack page [patches are availble, I just didn't want to submit them during code freeze] -Andi - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Kernel 2.2.17 with RedHat 7 Problem !
On Mon, Oct 23, 2000 at 07:48:08PM -0400, David Relson wrote: > Horst, > > What you say is correct. Early comments on gcc-2.96 reflected preprocessor > changes which made it impossible to compile a kernel. Later comments, > particularly David Wragg's "struct itimerval" example, show that compiler > optimizations is broken. > > My recollection is that the behavior of "a[i] = b[i++]" is well defined, > i.e. in the standard. However it's been years since I paid attention to > those details, so I may be wrong. Umm, no, since a[i] = b[i++] does not have a sequence, it is explicitly undefined behavior in the standard. As I recall Bernd Schmidt recently found a number of places where the above construct is used in the Linux kernel. -- Michael Meissner, Red Hat, Inc. PMB 198, 174 Littleton Road #3, Westford, Massachusetts 01886, USA Work: [EMAIL PROTECTED] phone: +1 978-486-9304 Non-work: [EMAIL PROTECTED] fax: +1 978-692-4482 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Linux's implementation of poll() not scalable?
Dan Kegel wrote: > > Linus Torvalds wrote: > > Dan Kegel wrote: > >> [ http://www.kegel.com/dkftpbench/Poller_bench.html ] > >> [ With only one active fd and N idle ones, poll's execution time scales > >> [ as 6N on Solaris, but as 300N on Linux. ] > > > > Basically, poll() is _fundamentally_ a O(n) interface. There is no way > > to avoid it - you have an array, and there simply is _no_ known > > algorithm to scan an array in faster than O(n) time. Sorry. > > ... > > Under Linux, I'm personally more worried about the performance of X etc, > > and small poll()'s are actually common. So I would argue that the > > Solaris scalability is going the wrong way. But as performance really > > depends on the load, and maybe that 1 entry load is what you > > consider "real life", you are of course free to disagree (and you'd be > > equally right ;) > The way I'm implementing RT signal support is by writing a userspace > wrapper to make it look like an OO version of poll(), more or less, > with an 'add(int fd)' method so the wrapper manages the arrays of pollfd's. > When and if I get that working, I may move it into the kernel as an > implementation of /dev/poll -- and then I won't need to worry about > the RT signal queue overflowing anymore, and I won't care how scalable > poll() is. An implementation of /dev/poll for Linux already exists and has shown to be more scalable than using RT signals under my tests. A patch for 2.2.x and 2.4.x should be available at the Linux Scalability Project @ http://www.citi.umich.edu/projects/linux-scalability/ in the patches section. It works fairly well, but I was actually somewhat disappointed to find that it wasn't the primary cause for the system CPU suckage for my particular system. Granted, when you only have to poll a few times per second, the overhead of standard poll() just isn't that bad. Jordan - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Topic for discussion: OS Design
On Mon, Oct 23, 2000 at 07:43:24PM -0400, Dennis wrote: > At 07:19 PM 10/23/2000, Andi Kleen wrote: > >On Mon, Oct 23, 2000 at 06:43:28PM -0400, Dennis wrote: > > > - FreeBSD will display kernel print messages with syslogd not running, and > > > linux will not. > > > >Linux will also when the console log level is set high enough (which it > >is by default, just it is usually too low after you killed klogd). > >Unqualified printks have level 4, so you need a level > that. > >Some distributions also unfortunately set bogus defaults or redirect > >the messages to specific virtual consoles. > > > Another linux caveat. Scads of undocumented and virtually undiscoverable > behaviours :-) It is not undocumented. Try reading the klogd manpage. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Topic for discussion: OS Design
On Mon, 23 Oct 2000, Hacksaw wrote: > > Another linux caveat. Scads of undocumented and virtually undiscoverable > > behaviours :-) > > Undiscoverable? You have the source code, what more do you want? Start > documenting! Oh no then they would have to publish their findings, and that is only available in binary format or $500.00 USD and threats for a lawyer. Regardless that the original code was GPL I am BATING someone to answer why they are selling GPL code and making legal gestures if you pay for the GPL code and share the GPL code. Someone must have a harder time than me reading the rules. Cheers, Andre Hedrick The Linux ATA/IDE guy - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
OOM and my .02 cents
Hi all, I read this week in kernelnotes about the OOM killer and thought I'd share a few thoughts that I had on the subject. I know I am maybe considered a 'nobody' here so my opinion may count very little, but this makes sense to me as a user so I though I'd throw in my .02 cents. If you like it use if not delete this email and forget about it. 1) If it does not already do this it should probably start with warnings like printk statements. (I'll hope it does). 2) There should probably be a way to configure OOM (if there is not already). I.E either a % or in bytes before we say we are out of memory. This could default to something like 10% of the remaining memory or 1% or whatever. This would probably have 2 values. Start warning messages at 10% start killing at 5% or something, or one value could be based on the other value. I'd prefer a percent basis, cause you wont know how much memory a system has till the sytem boots and if someone puts in kill at 32Meg and the system has 16 well, you do the math. This could possiblly even be configurable through the /proc interface or a compile time thing or both. With these two bits of info it would be fairly easy to write a user space program that would scan the sys logs for the OOM warnings and pop up a message in X saying that something needs to die. This kind of functionality could be added into X and then if X sees a warning in the sys logs about OOM then X can refuse to start another program. This is assuming the X group wanted to do so. Personally if they did not I'd do it in Gtk or Xaw or something myself. 3) This should probably be capable of being compiled as a module, so that if someone decided to add functionality to it for X it would not make the kernel (bzImage) grow exponentially. The reason I say this is that if you look at the trends in Linux then you'll realize that everything is moving towards X. Even IBM's via voice has a GUI. Do you really know where Linux will have morphed to in 10 years? It's always better to be flexable now then to find yourself screwed in the future. Even if this were a seperate patch it could possiblly in theory be done. Fact is when that code gets on someones system you don't know what there needs are or what they'll do with it. If you look at windows it pops up a message when you try to start a program and you don't have the memory for it, someone or some distribution could do this for Linux if they were so inclined. You know there are some people out there that are into that kind of thing. 4) Lastly, if this is truely an OOM killer it's hueristics, should probably just see which progam(s) are taking up the most memory, and which one(s) were started last. The last program that is taking up the most memory should then be killed. Why? Chances are that the last program that is started that is taking over the most memory is probably your problem. init is # 1 in my process table and the first 5 are kernel related dameons. X starts after that even when xdm/gdm are running. If X had a memory leak and was killed by the OOM most window mangers will catch this and save a users settings. If anyone looses data it is really there fault for not saving every 5 minutes anyway. Lastly the other option rather than killing the program is to restart the program. just my .02 cents. __ Do You Yahoo!? Yahoo! Messenger - Talk while you surf! It's FREE. http://im.yahoo.com/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Topic for discussion: OS Design
> Another linux caveat. Scads of undocumented and virtually undiscoverable > behaviours :-) Undiscoverable? You have the source code, what more do you want? Start documenting! - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Kernel 2.2.17 with RedHat 7 Problem !
At 05:44 AM 10/23/00, Horst von Brand wrote: >David Relson <[EMAIL PROTECTED]> said: > >Not just a preprocessor change. >... >This is true for a correct compiler (ever seen a correct piece of >software?) compiling strictly standard-conforming source. The kernel is >_not_ standard-conforming, and many places are writen just like they are to >trick the compiler into generating particular code, some places assume that >undefined behaviour (i.e., a[i] = b[i++] and such) works in a certain way, >that the compiler pads structures in a certain way, ... >... >Yes. The existing program is wrong in that it woprked by chance, not >because it was written right. Horst, What you say is correct. Early comments on gcc-2.96 reflected preprocessor changes which made it impossible to compile a kernel. Later comments, particularly David Wragg's "struct itimerval" example, show that compiler optimizations is broken. My recollection is that the behavior of "a[i] = b[i++]" is well defined, i.e. in the standard. However it's been years since I paid attention to those details, so I may be wrong. Anyhow, as we all know, gcc-2.96 is not ready for prime time. David David Relson Osage Software Systems, Inc. [EMAIL PROTECTED] 514 W. Keech Ave. www.osagesoftware.com Ann Arbor, MI 48103 voice: 734.821.8800fax: 734.821.8800 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Topic for discussion: OS Design
At 07:19 PM 10/23/2000, Andi Kleen wrote: >On Mon, Oct 23, 2000 at 06:43:28PM -0400, Dennis wrote: > > - FreeBSD will display kernel print messages with syslogd not running, and > > linux will not. > >Linux will also when the console log level is set high enough (which it >is by default, just it is usually too low after you killed klogd). >Unqualified printks have level 4, so you need a level > that. >Some distributions also unfortunately set bogus defaults or redirect >the messages to specific virtual consoles. Another linux caveat. Scads of undocumented and virtually undiscoverable behaviours :-) db - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Topic for discussion: OS Design + GPL
ftp://ftp.etinc.com/pub/linux/linux22_hdlc.tgz Hi Dennis, Could explain to me why ET Inc is modifying GPL drivers and then republishing the binaries as modules only? Not that it is my sub-system, but I am not sure that my friend Don knows of this issue. If Don does not care then, good day. ls hdlc/usr/hdlc/dev/modules/2.2.14 . eepro100.o etbwmgr.o tulip.o .. eepro100orig.o ethdlc.otuliporig.o Cheers, Andre Hedrick The Linux ATA/IDE guy - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: IDE-Floppy and devfs
Hi Paul, hi Richard, hi linux-kernel audience, Paul Bristow wrote: > Hi Andreas, > > Could you re-send the patch, as it didn't make it with your last > message? Sorry, my mailer crashed when I sent this message and I didn't even notice it got sent :-) So here's what you're waitung for (at least I hope so). I have done some further cleanups for module initialization et al., to make it look more like other (newer) drivers I saw elsewhere in the source tree. So here it comes (attached). Greetings, Andreas PS: Sorry for this extra resend to linux-kernel, I incorrectly tried to CC linux[EMAIL PROTECTED] im my mail to Paul and Richard, which didn't work as I could have expected, if it wasn't that late. -- ->>>--- Andreas Franck <<<- ---<<< [EMAIL PROTECTED] --->>>--- ->>> Keep smiling! <<<- --- linux/drivers/ide/ide-floppy.c.old Sun Oct 8 16:42:21 2000 +++ linux/drivers/ide/ide-floppy.c Wed Oct 11 02:15:22 2000 @@ -29,6 +29,15 @@ * Ver 0.9 Jul 4 99 Fix a bug which might have caused the number of *bytes requested on each interrupt to be zero. *Thanks to <[EMAIL PROTECTED]> for pointing this out. + * Ver 0.10 Oct 8 00 Fix devfs support - now a /dev/ide/.../lun0/floppy + *entry, four partition entries (part1..part4), and + *symlinks in /dev/ide/floppy are created for each + *IDE floppy device. + * + */ + +/* + * Changed to support devfs registration - 8.10.2000 Andreas Franck <[EMAIL PROTECTED]> */ #define IDEFLOPPY_VERSION "0.9" @@ -48,6 +57,7 @@ #include #include #include +#include #include #include @@ -236,7 +246,8 @@ idefloppy_capacity_descriptor_t capacity; /* Last format capacity */ idefloppy_flexible_disk_page_t flexible_disk_page; /* Copy of the flexible disk page */ int wp; /* Write protect */ - + devfs_handle_t de; /* devfs entry */ + unsigned int flags; /* Status/Action flags */ } idefloppy_floppy_t; @@ -1526,6 +1537,21 @@ } +/* + * Register drive with devfs + */ +static int idefloppy_register (ide_drive_t *drive, int major, int minor) +{ + idefloppy_floppy_t *floppy = drive->driver_data; + + floppy->de = devfs_register (drive->de, "floppy", DEVFS_FL_DEFAULT, +major, minor, +S_IFBLK | S_IRUGO | S_IWUGO, +ide_fops, NULL); + + return 0; +} + /* * Driver initialization. */ @@ -1560,13 +1586,28 @@ max_sectors[major][minor + i] = 64; } + + idefloppy_register (drive, major, minor); + (void) idefloppy_get_capacity (drive); idefloppy_add_settings(drive); for (i = 0; i < MAX_DRIVES; ++i) { ide_hwif_t *hwif = HWIF(drive); if (drive != &hwif->drives[i]) continue; - hwif->gd->de_arr[i] = drive->de; + hwif->gd->de_arr[i] = drive->de; + + /* +* Stop grok_partitions from building +* the devfs "disc" and "part[1..n]" +* entries, we'll try to do it a little +* bit smarter so partitions exist without +* media. If media is inserted, this partition +* assumptions are corrected to reflect the actual +* partition settings. +*/ +/* hwif->gd->de_arr[i] = 0; */ + if (drive->removable) hwif->gd->flags[i] |= GENHD_FL_REMOVABLE; break; @@ -1579,6 +1620,9 @@ if (ide_unregister_subdriver (drive)) return 1; + + devfs_unregister(floppy->de); + drive->driver_data = NULL; kfree (floppy); return 0; @@ -1601,24 +1645,24 @@ * IDE subdriver functions, registered with ide.c */ static ide_driver_t idefloppy_driver = { - "ide-floppy", /* name */ - IDEFLOPPY_VERSION, /* version */ - ide_floppy, /* media */ - 0, /* busy */ - 1, /* supports_dma */ - 0, /* supports_dsc_overlap */ - idefloppy_cleanup, /* cleanup */ - idefloppy_do_request, /* do_request */ - idefloppy_end_request, /* end_request */ - idefloppy_ioctl,/* ioctl */ - idefloppy_open, /* open */ - idefloppy_release, /* release */ - idefloppy_media_change, /* media_change */ - idefloppy_revalidate, /* media_change */ - NULL, /* pre_reset */ - idefloppy_capacity, /* capac
Re: Kernel 2.2.17 with RedHat 7 Problem !
David Relson <[EMAIL PROTECTED]> said: > At 09:14 PM 10/22/00, Horst von Brand wrote: > >Jurgen Kramer <[EMAIL PROTECTED]> said: > > > You can blame it on the compiler which is included with RH7.0. It's a > > > pre-release version of some sort. It seems that the gcc people are not > > > happy that RH included this version with RH7. > >It is the *kernel's* fault, as far as can be ascertained now. The compiler > >is stricter, and implements new optimizations, for which the kernel (being > >only ever compiled with gcc) is just unprepared. > The problem, as I understand it, is that gcc-2.96 handles language > constructs slightly different than older compilers. This is a preprocessor > change, not an optimization problem. Not just a preprocessor change. > To say "new optimizations ... kernel ... unprepared" is incorrect. Having > worked with compilers (some years ago), I always took it as an article of > faith that the same answer(s) would be generated whether optimization was > turned on or not. This is true for a correct compiler (ever seen a correct piece of software?) compiling strictly standard-conforming source. The kernel is _not_ standard-conforming, and many places are writen just like they are to trick the compiler into generating particular code, some places assume that undefined behaviour (i.e., a[i] = b[i++] and such) works in a certain way, that the compiler pads structures in a certain way, ... > Optimization should always be a way to do a task either > quicker (fewer instructions executing, less executing time, etc) or shorter > (less memory needed for the instructions). Optimization should never, > never give a different result. Having new optimizations break an executing > program is simply wrong. Yes. The existing program is wrong in that it woprked by chance, not because it was written right. -- Horst von Brand [EMAIL PROTECTED] Casilla 9G, Vin~a del Mar, Chile +56 32 672616 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Topic for discussion: OS Design
On Mon, Oct 23, 2000 at 06:43:28PM -0400, Dennis wrote: > - FreeBSD will display kernel print messages with syslogd not running, and > linux will not. Linux will also when the console log level is set high enough (which it is by default, just it is usually too low after you killed klogd). Unqualified printks have level 4, so you need a level > that. Some distributions also unfortunately set bogus defaults or redirect the messages to specific virtual consoles. > - FreeBSD doesnt (seem) to have the buffering problem that linux does, in > that exceptionally long messages (like decoding a Frame Relay LMI frame > with 1000 elements) work just fine. You cannot print more than the kernel log buffers size without scheduling inbetween to let klogd eat some of the buffer. I don't see a way how FreeBSD could do that better, except if they found a way to store infinite data in a finite buffer (or alternatively not store your LMI frame completely in the syslog, which would be also as bad) It is possible that their default buffer is bigger though. Linux's default is 16K, which is a bit on the low side for many things. You can increase it of course with a simple recompile by changing the define in kernel/printk.c Admittedly one problem in Linux with big printks is that it kills your interrupt latency completely. There are lower overhead alternatives though. > - FreeBSD will display messages immediately without a newline > - FreeBSD messages 1) can be redirected and 2) are printed without a timestamp. Both just like in Linux. The timestamps come from syslogd/klogd, not the kernel. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: MAP_NR for 2.4
On Fri, Oct 20, 2000 at 12:38:57PM +0530, [EMAIL PROTECTED] wrote: > for using MAP_NR with 2.4, i think you can use > macro like > > #define MAP_NR(addr) (((unsigned long)(addr)-PAGE_OFFSET) >>PAGE_SHIFT) This only works for contiguous memory. Ralf - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [linux-lvm] 2.4.0-test10-pre4 oops with LVM
On Thu, Oct 19, 2000 at 07:07:52PM -0700, Myles Uyema wrote: > This kernel panic occurred when I attempted to dump my ext2 filesystem > /home onto /mnt/ide/backup. My exact dump command was: > > dump -0 -u -M -f /mnt/ide/backup/home -B 2096128 /home > > I believe dump managed to start reading directories before the oops > occurred...Haven't been able to duplicate the oops on 2.4.0-test9. Content-Description: 2.4.0-test10pre4 oops log > Oct 19 17:58:13 uyema kernel: kernel BUG at vmscan.c:102! Remove lines 101-104 in mm/vmscan.c of linux-2.4.0-test10-pre4. Someone was a bit too paranoid in MM handling ;-) MM guys and even Linus would tell you the same[1]. Regards Ingo Oeser [1] As they did already for other people facing this problem. -- Feel the power of the penguin - run [EMAIL PROTECTED] :x - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: LMbench 2.4.0-test10pre-SMP vs. 2.2.18pre-SMP
On Mon, 23 Oct 2000, Jeff Garzik wrote: > First test was with 2.4.0-test10-pre3. > Next four tests were with 2.4.0-test10-pre4. > Final four tests were with 2.2.18-pre17. > > All are 'virgin' kernels, without any patches. [...] I'll take the liberty of highlighting some big changes, v2.2 vs v2.4 *Local* Communication latencies in microseconds - smaller is better --- Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP ctxsw UNIX UDP TCP conn - - - - - - - - rum.normn Linux 2.4.0-t 620 4563 10681 157 146 rum.normn Linux 2.2.18p 212 1856 123 106 159 237 - So we broke pipe/AF UNIX latencies File & VM system latencies in microseconds - smaller is better -- Host OS 0K File 10K File MmapProtPage Create Delete Create Delete Latency Fault Fault - - -- -- -- -- --- - - rum.normn Linux 2.4.0-t 15 1 28 3 1016 10.0K rum.normn Linux 2.2.18p 16 1 29 2 7658 20.6K - But gave steroids to mmap latencies *Local* Communication bandwidths in MB/s - bigger is better --- HostOS Pipe AFTCP File Mmap Bcopy Bcopy Mem Mem UNIX reread reread (libc) (hand) read write - - -- -- -- -- - rum.normn Linux 2.4.0-t 152 105 98151326138144 326 171 rum.normn Linux 2.2.18p 264 106 55152326137142 326 180 - Mixed fortunes here. A serious boost to TCP bandwidth but pipe bandwidth dies a bit Cheers Chris - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: syslog() blocks on glibc 2.1.3 with kernel 2.2.x
- Received message begins Here - > > Jesse Pollard <[EMAIL PROTECTED]> writes: > > > Don't configure syslogd to do reverse lookups. > > Our syslogd has no option to disable the reverse lookups. > > > You can NEVER guarantee that the reverse lookup will succeed, and > > can be delayed several minutes for a single reply. > > Not true. The named on our loghost is authoritative for the reverse > mappings for all of the machines which can log there. If the server is sufficiently busy, the reply WILL be delayed. It can even be on the same host, it just may not get around to a reply immediately. Authority is not equivalent to reply. The advantage of local files is that the reverse lookup doesn't depend on ANY outside agency. > > I consider this a configuration error. I don't believe syslogd > > should ever do a reverse lookup, since the name you are trying to > > get may never arrive, or if arrives, it may be spoofed. > > There *is* no configuration for these tools which gives the behavior > you describe, so this is not a "configuration error". I think it requires a recompile of the syslogd source to make that change. I grant that there isn't a runtime option for it (I do think there should be). > > It's not a bug, but a security feature. NO log to syslogd should be > > lost, since it may be related to an attack. > > Historically, no other Unix system has had reliable syslogging. It > would require very defensive programming for syslogd, and that has > clearly not been performed. Some do, some don't - Cray systems will shut down if the audit daemon disappears. Syslogd on Linux is providing the corresponding facility, at the current time. Neither syslogd, nor audit daemon are reliable on SGI systems, trusted solaris shuts down - I believe. The normal solaris is unknown since I haven't had time to fully configure either it or the audit daemon yet. > And if this is what GNU/Linux intends, why does glibc use a SOCK_DGRAM > socket for communication with syslogd? By definition, such sockets > are *unreliable*. If syslog is supposed to be reliable, a different > connection type must be used. I think that is a "don't care" option on AF_UNIX connections. Required for semantic compatability, but not used. > > Your philosophy that "no syslog message should ever be lost" is not > necessarily bad. But it is clearly at odds with historical practice, > the current glibc syslog() implementation, and the current syslogd > itself. > > It is true that glibc falls back to using SOCK_STREAM if the > SOCK_DGRAM connection fails. Does that mean GNU/Linux is expects > syslog to be reliable eventually? If so, then my problem is entirely > a bug in syslogd and I will report it as such. I think it is only reliable on the local host. If a remote syslog facility is used, then it may be dropped if the input queue becomes too long. - Jesse I Pollard, II Email: [EMAIL PROTECTED] Any opinions expressed are solely my own. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Topic for discussion: OS Design
At 04:35 PM 10/23/2000, you wrote: >On Mon, 23 Oct 2000, Dennis wrote: > > > This is typical of the "linux mentality". Why do other OSs have solutions > > that work, yet linux's method requires special coding? If it "has to be > > done that way", why do other OS's have solutions that dont do it that way? > > the size of the buffer is an annoyance but not a serious problem however. > > > >I'm not sure that Linux requires any special coding. > > > printing directly to the console (as BSD does) is useful when debugging a > > panic, as you can trace right to the panic point. Also certain levels of > >BSD does not write directly to the console. Its console is a direct >clone of Linux. I'm not sure which came first, but when you have a single >screen-card there are not too many ways to get the character and attribute >into screen memory. Linux allows the console to be redirected to a serial >port. BSD does not last time I checked. - FreeBSD will display kernel print messages with syslogd not running, and linux will not. - FreeBSD doesnt (seem) to have the buffering problem that linux does, in that exceptionally long messages (like decoding a Frame Relay LMI frame with 1000 elements) work just fine. - FreeBSD will display messages immediately without a newline - FreeBSD messages 1) can be redirected and 2) are printed without a timestamp. which implies that you are wrong about just about everything. >What? The API has remained consistent since 0.99. It's only internal >kernel stuff that has changed. If you wrote code that worked on 0.99, >it will still compile and work on 2.2.17. We are talking about the kernel. What are you talking about? The external view is meaningless. the device driver interface changed substantially in 2.0. The module interface changed, as it is changing again in 2.4. The PCI interface has changed and is changing again. >The only reason to get the 'latest' version is to take advantage >of increased functionality. This, by definition, means that something >has changed. That's what you upgrade for. The word "unstable" is a >misnomer. Usually my customers want to upgrade because of some security fix or bug fix, not to get new features. > > My point is that there is no "stable kernel series". 2.2.0 wasnt stable, > > and neither was 2.2.3. Virtually all of the ethernet drivers still lock up > > under heavy load in 2.2.17...and now with 2.4 there are more countless > > adventures ahead > >Which Ethernet drivers are you having trouble with? The ones that had >lockup problems (incidentally hardware related), now have reset code >that runs off a watch-dog. The eepro100 driver is an ongoing project, still with lockup problems. the tulip has issues as well. > > an example of "poor planning" is that skput and skpush will panic the > > kernel if there is no room (this can happen with multiple encapsulations) > > The proper behaviour would be to return a NULL pointer indicating failure, > > and then to drop the frame and issue a warning. > >The proper response to any resource not being available (in networking) >is to drop the packet on the floor, smash it into little pieces, and >don't tell anybody about it. > >The packet will be sent again. But, if you can't transmit a packet, >therefore freeing a buffer, what do you do? > >What you do is realilize that the failure to transmit was likely >caused by a disconnected wire. In the drivers I use, I simply pretend >that every packet got transmitted okay. This usually involves a >one or two-line modification to the driver. > >This has nothing to do with poor planning. It just has to do with >a design decision that I didn't agree with. Somebody decided that >network data was precious and therefore the machine should kill itself >if necessary to get the data through. > >I didn't agree with this so I changed a few lines of code. You can't >kill any of my machines by flooding them and they never lock up. >Further, they run at 85 to 90 percent of the network physical layer >bandwidth. My main machine is our domain name-server, it gets between >2000 and 5000 hits per second. If it crashed, our whole LAN goes >down. It doesn't. It runs Linux-2.2.17. Right. So your answer is that linux is OK if you modify all of the broken stuff yourself. Im glad we are in agreement on that, if nothing else. DB - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Linux's implementation of poll() not scalable?
Linus Torvalds wrote: > Dan Kegel wrote: >> [ http://www.kegel.com/dkftpbench/Poller_bench.html ] >> [ With only one active fd and N idle ones, poll's execution time scales >> [ as 6N on Solaris, but as 300N on Linux. ] > > Basically, poll() is _fundamentally_ a O(n) interface. There is no way > to avoid it - you have an array, and there simply is _no_ known > algorithm to scan an array in faster than O(n) time. Sorry. > ... > Under Linux, I'm personally more worried about the performance of X etc, > and small poll()'s are actually common. So I would argue that the > Solaris scalability is going the wrong way. But as performance really > depends on the load, and maybe that 1 entry load is what you > consider "real life", you are of course free to disagree (and you'd be > equally right ;) I don't think I was being as clueless as you feared. Solaris' poll() somehow manages to skip inactive fd's very efficiently. But I'm happy to agree that small poll()'s are very important. I'd prefer to never use big poll()'s. The RT signal stuff scales much better. I'm trying to write a server that handles 1 clients. On 2.4.x, the RT signal queue stuff looks like the way to achieve that. Unfortunately, when the RT signal queue overflows, the consensus seems to be that you fall back to a big poll(). And even though the RT signal queue [almost] never overflows, it certainly can, and servers have to be able to handle it. The way I'm implementing RT signal support is by writing a userspace wrapper to make it look like an OO version of poll(), more or less, with an 'add(int fd)' method so the wrapper manages the arrays of pollfd's. When and if I get that working, I may move it into the kernel as an implementation of /dev/poll -- and then I won't need to worry about the RT signal queue overflowing anymore, and I won't care how scalable poll() is. - Dan - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: syslog() blocks on glibc 2.1.3 with kernel 2.2.x
On Mon, Oct 23, 2000 at 04:56:26PM -0400, Patrick J. LoPresti wrote: > You are effectively suggesting that named should be rewritten not to > use the glibc syslog functions at all. That strikes me as the worst > suggestion so far; it would be far better for syslogd not to do name > lookups. But my syslogd has no option to avoid name lookups; I will > submit a request to add one. /etc/nsswitch.conf: hosts: files dns /etc/hosts: ip.of.named.host name.of.named.host ip.of.other.host name.of.other.host Give explicite IP/name mappings for those which you don't want to be looked via the resolver. This is, of course, system-wide, but use it sparingly. /Matti Aarnio - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Petr Vandrovec wrote: > > With ClearPageDirty() kernel locked up (but no watchdog, so probably > some livelock) during bootup after fsck /. Yeah, the way the truncate logic works right now truncate_whole_page() has to remove the page from the inode list - otherwise truncate ends up looping forever trying to truncate that page ;). And there was a off-by-one error in my first untested version anyway: the page_count() should be tested against "2", not "1", as the truncate logic has elevated the count anyway (probably unnecessarily: once we get the page lock nobody else can race to remove it from the page cache anyway, so it's not as if it could go away). > Should I try ClearPageUptodate() instead? No, I'll have to fix the truncate logic to allow for this all. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
LMbench 2.4.0-test10pre-UP vs. 2.2.18pre-UP
Hardware: Single P-III 500 Mhz 64 MB RAM 13GB hard drive First five tests were with 2.4.0-test10-pre4. Final four tests were with 2.2.18-pre17. All are 'virgin' kernels, without any patches. -- Jeff Garzik| The difference between laziness and Building 1024 | prioritization is the end result. MandrakeSoft | L M B E N C H 2 . 0 S U M M A R Y (Alpha software, do not distribute) Basic system parameters Host OS Description Mhz - - --- hum.normn Linux 2.4.0-t i686-pc-linux-gnu 500 hum.normn Linux 2.4.0-t i686-pc-linux-gnu 500 hum.normn Linux 2.4.0-t i686-pc-linux-gnu 500 hum.normn Linux 2.4.0-t i686-pc-linux-gnu 500 hum.normn Linux 2.4.0-t i686-pc-linux-gnu 500 hum.normn Linux 2.2.18p i686-pc-linux-gnu 500 hum.normn Linux 2.2.18p i686-pc-linux-gnu 500 hum.normn Linux 2.2.18p i686-pc-linux-gnu 500 hum.normn Linux 2.2.18p i686-pc-linux-gnu 500 Processor, Processes - times in microseconds - smaller is better Host OS Mhz null null open selct sig sig fork exec sh call I/O stat clos inst hndl proc proc proc - - - hum.normn Linux 2.4.0-t 500 0.6 0.9 3.5 5.636 1.6 5.3 0.3K 1.4K 8.3K hum.normn Linux 2.4.0-t 500 0.6 0.8 3.5 5.533 1.6 5.3 0.3K 1.4K 8.2K hum.normn Linux 2.4.0-t 500 0.6 0.8 3.5 5.533 1.7 5.3 0.3K 1.4K 8.2K hum.normn Linux 2.4.0-t 500 0.6 0.8 3.5 5.533 1.6 5.3 0.3K 1.5K 8.2K hum.normn Linux 2.4.0-t 500 0.6 0.8 3.5 5.637 1.7 5.3 0.3K 1.4K 8.2K hum.normn Linux 2.2.18p 500 0.6 0.9 3.9 5.331 1.7 2.3 0.3K 1.3K 7.8K hum.normn Linux 2.2.18p 500 0.6 0.8 3.8 5.231 1.7 2.3 0.3K 1.3K 7.8K hum.normn Linux 2.2.18p 500 0.6 0.9 3.6 5.131 1.7 2.3 0.3K 1.3K 7.8K hum.normn Linux 2.2.18p 500 0.6 0.9 3.7 5.131 1.7 2.3 0.3K 1.3K 7.8K Context switching - times in microseconds - smaller is better - Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw - - - -- -- -- -- --- --- hum.normn Linux 2.4.0-t 1 13 41 14103 18 138 hum.normn Linux 2.4.0-t 1 13 41 14 96 15 140 hum.normn Linux 2.4.0-t 1 13 40 14117 21 143 hum.normn Linux 2.4.0-t 1 13 42 14 90 17 135 hum.normn Linux 2.4.0-t 1 13 41 14 94 19 143 hum.normn Linux 2.2.18p 1 13 40 13 89 16 143 hum.normn Linux 2.2.18p 0 13 41 13 89 17 143 hum.normn Linux 2.2.18p 1 13 41 14 82 18 147 hum.normn Linux 2.2.18p 0 13 40 13 82 16 145 *Local* Communication latencies in microseconds - smaller is better --- Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP ctxsw UNIX UDP TCP conn - - - - - - - - hum.normn Linux 2.4.0-t 1 7 12258144 123 155 hum.normn Linux 2.4.0-t 1 7 12268244 124 159 hum.normn Linux 2.4.0-t 1 7 12268144 125 159 hum.normn Linux 2.4.0-t 1 7 12268344 120 155 hum.normn Linux 2.4.0-t 1 7 12268344 123 159 hum.normn Linux 2.2.18p 1 6 13257043 106 152 hum.normn Linux 2.2.18p 0 6 13257043 105 154 hum.normn Linux 2.2.18p 1 6 13257043 106 151 hum.normn Linux 2.2.18p 0 6 14257143 107 151 File & VM system latencies in microseconds - smaller is better -- Host OS 0K File 10K File MmapProtPage Create Delete Create Delete Latency Fault Fault - - -- -- -- -- --- - - hum.normn Linux 2.4.0-t 12 1 27 29 305 10.0K hum.normn Linux 2.4.0-t 12 1 27 31 292 10.0K hum.normn Linux 2.4.0-t 13 1 28 30 297 10.0K hum.normn Linux 2.4.0-t 12 1 27 34 286 10.0K hum.normn Linux
LMbench 2.4.0-test10pre-SMP vs. 2.2.18pre-SMP
Hardware: Dual P-II 400 Mhz 128 MB RAM 13GB hard drive First test was with 2.4.0-test10-pre3. Next four tests were with 2.4.0-test10-pre4. Final four tests were with 2.2.18-pre17. All are 'virgin' kernels, without any patches. -- Jeff Garzik| The difference between laziness and Building 1024 | prioritization is the end result. MandrakeSoft | L M B E N C H 2 . 0 S U M M A R Y (Alpha software, do not distribute) Basic system parameters Host OS Description Mhz - - --- rum.normn Linux 2.4.0-t i686-pc-linux-gnu 401 rum.normn Linux 2.4.0-t i686-pc-linux-gnu 401 rum.normn Linux 2.4.0-t i686-pc-linux-gnu 401 rum.normn Linux 2.4.0-t i686-pc-linux-gnu 401 rum.normn Linux 2.4.0-t i686-pc-linux-gnu 401 rum.normn Linux 2.2.18p i686-pc-linux-gnu 401 rum.normn Linux 2.2.18p i686-pc-linux-gnu 401 rum.normn Linux 2.2.18p i686-pc-linux-gnu 401 rum.normn Linux 2.2.18p i686-pc-linux-gnu 401 Processor, Processes - times in microseconds - smaller is better Host OS Mhz null null open selct sig sig fork exec sh call I/O stat clos inst hndl proc proc proc - - - rum.normn Linux 2.4.0-t 401 0.8 1.3 6.0 8.997 2.1 4.3 0.6K 2.0K 10.K rum.normn Linux 2.4.0-t 401 0.8 1.4 5.8 8.377 2.1 4.3 0.6K 2.1K 10.K rum.normn Linux 2.4.0-t 401 0.8 1.4 5.8 8.474 2.1 4.3 0.6K 2.1K 10.K rum.normn Linux 2.4.0-t 401 0.8 1.4 5.8 8.371 2.1 4.3 0.6K 2.1K 10.K rum.normn Linux 2.4.0-t 401 0.8 1.4 5.8 8.471 2.1 4.3 0.6K 2.0K 10.K rum.normn Linux 2.2.18p 401 0.9 1.3 5.3 7.340 2.3 3.3 0.5K 1.8K 9.8K rum.normn Linux 2.2.18p 401 0.8 1.3 5.1 7.240 2.2 3.3 0.5K 1.8K 9.9K rum.normn Linux 2.2.18p 401 0.9 1.3 5.3 7.440 2.3 3.3 0.5K 1.8K 9.9K rum.normn Linux 2.2.18p 401 0.8 1.3 5.1 7.140 2.2 3.3 0.5K 1.9K 9.9K Context switching - times in microseconds - smaller is better - Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw - - - -- -- -- -- --- --- rum.normn Linux 2.4.0-t 5 21 55 23106 27 154 rum.normn Linux 2.4.0-t 5 20 55 20117 22 147 rum.normn Linux 2.4.0-t 5 20 72 21116 26 148 rum.normn Linux 2.4.0-t 5 21 55 22114 28 154 rum.normn Linux 2.4.0-t 6 21 55 21119 31 148 rum.normn Linux 2.2.18p 2 18 56 18106 22 158 rum.normn Linux 2.2.18p 2 18 52 18120 26 173 rum.normn Linux 2.2.18p 1 18 54 18110 29 173 rum.normn Linux 2.2.18p 2 18 54 18116 24 169 *Local* Communication latencies in microseconds - smaller is better --- Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP ctxsw UNIX UDP TCP conn - - - - - - - - rum.normn Linux 2.4.0-t 520 4261 10680 160 149 rum.normn Linux 2.4.0-t 523 4363 10582 156 146 rum.normn Linux 2.4.0-t 523 4363 10582 159 147 rum.normn Linux 2.4.0-t 524 4464 10481 156 147 rum.normn Linux 2.4.0-t 620 4563 10681 157 146 rum.normn Linux 2.2.18p 212 1856 123 106 159 237 rum.normn Linux 2.2.18p 212 2064 123 107 159 240 rum.normn Linux 2.2.18p 112 2154 123 107 160 237 rum.normn Linux 2.2.18p 212 2152 124 108 159 236 File & VM system latencies in microseconds - smaller is better -- Host OS 0K File 10K File MmapProtPage Create Delete Create Delete Latency Fault Fault - - -- -- -- -- --- - - rum.normn Linux 2.4.0-t 15 1 28 3 954 10.0K rum.normn Linux 2.4.0-t 15 1 28 3 1001 10.0K rum.normn Linux 2.4.0-t 15 1 28 3 1022 10.0K rum.normn Linux 2.4.0-t 15 1 28 3
Re: syslog() blocks on glibc 2.1.3 with kernel 2.2.x
On 23 Oct 2000, Patrick J. LoPresti wrote: >Once the name resolution times out, you might expect things to become >unstuck. But they don't. Negative. Things have been queued. The deadlock will only go away if the very next message processed is the named local message. And then it would have to process a few more local messages so it wouldn't stall again so soon. >> Per chance are you running the name service caching daemon (nscd)? > >No. Please do. That will reduce the ammount of traffic to the name server. >> I'd also guess you aren't disabling fsync() for your sysylog files >> (it's part of the syslog.conf format) -- this is a conciderable >> drain on syslogd. > >I see no documentation for such an option in the syslog.conf man page. >This is with the current Red Hat 6.2 syslogd (package >sysklogd-1.3.31-17). It's in the syslogd and syslog.conf man page (sysklogd-1.3.31-16): (syslog.conf) Regular File Typically messages are logged to real files. The file has to be specified with full pathname, beginning with a slash ``/''. You may prefix each entry with the minus ``-'' sign to omit syncing the file after every logging. Note that you might lose information if the system crashes right behind a write attempt. Nevertheless this might give you back some performance, especially if you run programs that use logging in a very verbose manner. --Ricky PS: as a side note, you can/will lose information even if sync is enabled. (fsync() will not flush metadata so the file is truncated on restart.) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The zen of kernel virtual addresses
On Sat, Oct 21, 2000 at 01:37:26PM -0600, Jonathan Corbet wrote: > physical address > An address as known by the low-level hardware. In the modern > world, these can be 64-bit quantities, even on 32-bit systems. > These are the addresses used by /dev/mem - which appears to work > only for low memory. A phyical address is the address the CPU uses to talk to memory. It is not necessarily the same kind of address a device uses to see memory: they use bus addresses. The simple case is bus address == physical address, but there are many variations. Systems with an IOMMU (or equiv) present devices with a completely different view of system memory. J PGP signature
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On 23 Oct 00 at 14:34, Linus Torvalds wrote: > On Mon, 23 Oct 2000, Alexander Viro wrote: > > On Mon, 23 Oct 2000, Linus Torvalds wrote: > > > > > > Nope, that just makes the race window smaller. We should check for i_size > > > after we've gotten the page table lock and just before actually entering > > > the page into the page tables. Otherwise we'll still race on SMP (a _very_ > > > hard window to get into, admittedly). > > > > Umm... I would probably remove Uptodate upon truncate() and check _that_ > > in the place you've mentioned. > > Works for me.. > > > > ClearPageDirty(page); > > ClearPageUptodate(page); > > > > How about that? > > Makes sense. With ClearPageDirty() kernel locked up (but no watchdog, so probably some livelock) during bootup after fsck /. Unfortunately, 'PC' column shows complete garbage - init,kswapd,kupdate and rcS in 'R', with 'PC'=current for rcS, 0xc1459f28 for init, 0xc146ffa8 for kswapd and 0xc1469fc8 for kupdate (needless to say that my kernel does not have 20MB... whee what's with PC? it now prints esp instead of eip?). Should I try ClearPageUptodate() instead? Petr Vandrovec [EMAIL PROTECTED] P.S.: I have to go home. Continuing after 10 hours. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: syslog() blocks on glibc 2.1.3 with kernel 2.2.x
Ricky Beam <[EMAIL PROTECTED]> writes: > Personally, I'd look closely at your setup to determine exactly why > this has become a problem. named is being blocked on writing to > /dev/log. This should only happen if there is sufficient _local_ > syslog traffic to fill the buffer or syslogd has too much remote > traffic to ever read from /dev/log. There is a lot of local traffic as well, yes. Lots of local traffic means named eventually finds itself waiting in line to log. Lots of remote traffic means syslogd is trying to talk to named a lot (to do reverse lookups). named waiting in line + syslogd trying to talk to named == deadlock; this is not too hard to see. Once the name resolution times out, you might expect things to become unstuck. But they don't. Perhaps syslogd is not giving higher priority to local messages; if it did, maybe it could recover from the deadlock. But this would not be a reliable solution; the only reliable solution is for syslogd to be independent of any processes which need to talk to it. > Per chance are you running the name service caching daemon (nscd)? No. > I'd also guess you aren't disabling fsync() for your sysylog files > (it's part of the syslog.conf format) -- this is a conciderable > drain on syslogd. I see no documentation for such an option in the syslog.conf man page. This is with the current Red Hat 6.2 syslogd (package sysklogd-1.3.31-17). - Pat - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: syslog() blocks on glibc 2.1.3 with kernel 2.2.x
On 23 Oct 2000, Patrick J. LoPresti wrote: >So I have the glibc maintainer (and others) saying that syslog >messages should never be dropped, and you saying that named should be >dropping its syslog messages. No, I didn't say they "should" be dropped but merely that dropping them would fix your problem. Personally, I'd look closely at your setup to determine exactly why this has become a problem. named is being blocked on writing to /dev/log. This should only happen if there is sufficient _local_ syslog traffic to fill the buffer or syslogd has too much remote traffic to ever read from /dev/log. Per chance are you running the name service caching daemon (nscd)? I'd also guess you aren't disabling fsync() for your sysylog files (it's part of the syslog.conf format) -- this is a conciderable drain on syslogd. --Ricky - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Alexander Viro wrote: > > On Mon, 23 Oct 2000, Linus Torvalds wrote: > > > > Nope, that just makes the race window smaller. We should check for i_size > > after we've gotten the page table lock and just before actually entering > > the page into the page tables. Otherwise we'll still race on SMP (a _very_ > > hard window to get into, admittedly). > > Umm... I would probably remove Uptodate upon truncate() and check _that_ > in the place you've mentioned. Works for me.. > > ClearPageDirty(page); > ClearPageUptodate(page); > > How about that? Makes sense. Note that I'd actually like to hear from Petr first _without_ any added code in the nopage() handler - the issue of having a page mapped after nopage() that we shouldn't have mapped is a separate one, and I'd first like to hear if the problem really goes away even if the mapping bug is still there.. And _then_ we fix the fact that we should not allow anybody to have a page mapped past i_size. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: VMWare and kswapd
On Mon, 23 Oct 2000, Michael Rothwell wrote: > I'm trying out the new VMWare for Linux, and noticed that if I let it > run for a couple of days, I get this: > > Oct 16 15:07:49 cartman kernel: VM: do_try_to_free_pages failed for > kswapd... [...snip...] I think this a known issue with the 2.2 kernels. Take a look in the mailing list archives for a thread with subject line: VM: do_try_to_free_memory failed for , 2.2.17, 2.2.18pre3 starting around Oct 11 or 12. shane - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Petr Vandrovec wrote: > > Yes. With sleep(60) no oops occur (it takes ~45 secs to exit child). > This signals to me: should not vmtruncate_list acquire mm->mmap_sem, > if it modifies page tables? No. It should get the page_table lock, but that is sufficient for anybody who _clears_ page tables (and is pretty much the same case as paging something out of somebody elses page tables). >I cannot find anything what prevents doing > vmtruncate in one task and filemap_sync in another - neither > page_table_lock spinlock, nor mmap_sem semaphore... filemap_sync() does hold the page table lock. (Which is certainly not to say that it's necessarily bug-free, but I don't see any obvious problems off-hand). Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Linus Torvalds wrote: > On Mon, 23 Oct 2000, Alexander Viro wrote: > > > > That's fine, but I'm afraid that we'll need a bit more than that. A couple of > > obvious ones: > > * filemap_nopage() needs the second check for ->i_size. Upon exit. > > Nope, that just makes the race window smaller. We should check for i_size > after we've gotten the page table lock and just before actually entering > the page into the page tables. Otherwise we'll still race on SMP (a _very_ > hard window to get into, admittedly). Umm... I would probably remove Uptodate upon truncate() and check _that_ in the place you've mentioned. > So how about this truncate_complete_page() implementation: > > /* >* Try to get rid of a page.. Clear it if it fails >* for some reason. The page must be locked upon calling >* this function. >* >* We remove the page from the page cache _after_ we have >* destroyed all buffer-cache references to it. Otherwise some >* other process might think this inode page is not in the >* page cache and creates a buffer-cache alias to it causing >* all sorts of fun problems ... >*/ > static inline void truncate_complete_page(struct page *page) > { > /* Try to get rid of buffers */ > if (page->buffers) > block_flushpage(page, 0); > > spin_lock(&pagecache_lock); > spin_lock(&pagemap_lru_lock); > > if (page_count(page) != 1) { > memclear_highpage_flush(page, 0, PAGE_CACHE_SIZE); > } else { > ClearPageDirty(page); ClearPageUptodate(page); How about that? > __lru_cache_del(page); > __remove_inode_page(page); > page_cache_release(page); > } > spin_unlock(&pagemap_lru_lock); > spin_unlock(&pagecache_lock); > } > > we should probably special-case the "block_flushpage()" failed case, but > the above should do reasonable things with it (because page_count() will > be > 1 due to buffers). - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Linux's implementation of poll() not scalable?
On Mon, 23 Oct 2000, Tobias Ringstrom wrote: > On 23 Oct 2000, Linus Torvalds wrote: > > > Either > > > > (a) Solaris has solved the faster-than-light problem, and Sun engineers > > should get a Nobel price in physics or something. > > > > (b) Solaris "scales" by being optimized for 1 entries, and not > > speeding up sufficiently for a small number of entries. > > > > You make the call. > > You will probably get the 6.5 factor because you have some (big) contant > setup time. For example > > t = 1700 + n > > This gives you an increase of 6.5 from 100 to 1, although it is of > course is O(n). No magic there... :-) Indeed. NOTE! I'm not saying that this is necessarily bad. It may well be that the setup time means that the Solaris code actually _does_ perform really well for the 1 entry case. I hope nobody took my rant against "scalability" to mean that I consider the Solaris case to necessarily be bad engineering. My rant was more about people often thinking that "scalability == performance", which it isn't. It may be that Solaris simply wants to do 1 entries really fast and that they actually do really well, but it is clear that if so they have a huge performance hit for the small cases, and people should realize that. It's basically a matter of making the proper trade-offs. I think that the "few file descriptors" case is actually arguably a very important one. Sun (who has long since given up on the desktop) probably have a different tradeoff, and maybe their big constant setup-time is worth it. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Alexander Viro wrote: > > That's fine, but I'm afraid that we'll need a bit more than that. A couple of > obvious ones: > * filemap_nopage() needs the second check for ->i_size. Upon exit. Nope, that just makes the race window smaller. We should check for i_size after we've gotten the page table lock and just before actually entering the page into the page tables. Otherwise we'll still race on SMP (a _very_ hard window to get into, admittedly). > Moreover, what is (area->vm_mm == current->mm) doing in the existing check? It's for ptrace. You can do ugly things with ptrace that aren't possible for the process itself. > * truncate() should zero the page out if it doesn't remove it from > cache. So how about this truncate_complete_page() implementation: /* * Try to get rid of a page.. Clear it if it fails * for some reason. The page must be locked upon calling * this function. * * We remove the page from the page cache _after_ we have * destroyed all buffer-cache references to it. Otherwise some * other process might think this inode page is not in the * page cache and creates a buffer-cache alias to it causing * all sorts of fun problems ... */ static inline void truncate_complete_page(struct page *page) { /* Try to get rid of buffers */ if (page->buffers) block_flushpage(page, 0); spin_lock(&pagecache_lock); spin_lock(&pagemap_lru_lock); if (page_count(page) != 1) { memclear_highpage_flush(page, 0, PAGE_CACHE_SIZE); } else { ClearPageDirty(page); __lru_cache_del(page); __remove_inode_page(page); page_cache_release(page); } spin_unlock(&pagemap_lru_lock); spin_unlock(&pagecache_lock); } we should probably special-case the "block_flushpage()" failed case, but the above should do reasonable things with it (because page_count() will be > 1 due to buffers). The above is obviously completely and utterly untested. Petr? Willing to give this a go? Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
linux 2.2.18-pre17: "Kernel panic: LRU list corrupted"
Hi there, I wanted to let you know that I was trying 2.2.18-pre17 on hera.kernel.org, a uniprocessor with an SMP motherboard. After about six hours, it went catatonic, responding to pings and TCP SYNs but not doing anything that required user space. On the console, it had multiple copies of the message: "Kernel panic: LRU list corrupted" [fs/buffer.c:438] ... but no register dump. I have fallen back to 2.2.17 and it has run stably for a few days now. -hpa -- <[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private! "Unix gives you enough rope to shoot yourself in the foot." http://www.zytor.com/~hpa/puzzle.txt - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On 23 Oct 00 at 13:57, Linus Torvalds wrote: > On Mon, 23 Oct 2000, Petr Vandrovec wrote: > > > First page->mapping == NULL entry in syslog is dated 22:23:58, but > > couple of entries was lost before (probably I should print only '.' for > > each such page; this run there was more than 100 such pages) > > Another question is why SIGCHLD was delivered to parent AFTER ftruncate, > > but exit was invoked couple of seconds before - maybe it syncs > > child address space to disk? > > exit() basically does do a msync(MSASYNC), so that could be it. Yes. With sleep(60) no oops occur (it takes ~45 secs to exit child). This signals to me: should not vmtruncate_list acquire mm->mmap_sem, if it modifies page tables? I cannot find anything what prevents doing vmtruncate in one task and filemap_sync in another - neither page_table_lock spinlock, nor mmap_sem semaphore... Thanks, Petr Vandrovec [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Linux's implementation of poll() not scalable?
On 23 Oct 2000, Linus Torvalds wrote: > Either > > (a) Solaris has solved the faster-than-light problem, and Sun engineers > should get a Nobel price in physics or something. > > (b) Solaris "scales" by being optimized for 1 entries, and not > speeding up sufficiently for a small number of entries. > > You make the call. You will probably get the 6.5 factor because you have some (big) contant setup time. For example t = 1700 + n This gives you an increase of 6.5 from 100 to 1, although it is of course is O(n). No magic there... :-) /Tobias - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: make -j 2 broken?
> Trying to compile the current kernel (test10-pre4) with: > > > make clean > > make -j 2 bzImages modules modules_install > > will try to install the modules before they are built... > This has previously been working (at least in early testX kernels). I've done it before when I wasn't thinking, since it parallels those, modules_install just calls modules_install. AFAIK, it doesn't depend on modules actually being compiled. Typically, I do: make -j 20 dep;make -j 20 bzImage modules;make modules_install Yes, I've heard about the paralleled dep problem (there was one, right?), but it hasn't effected me. -- Lab tests show that use of micro$oft causes cancer in lab animals - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Linus Torvalds wrote: > On Mon, 23 Oct 2000, Alexander Viro wrote: > > > On Mon, 23 Oct 2000, Linus Torvalds wrote: > > > > > Al, any ideas? I have this feeling that the simplest fix is just to leave > > > the race open, and make truncate_complete_page() just leave such a "racy" > > > page in the page cache. It will still race, and the invalid page will > > > still exist, but the end result should be harmless. > > > > Provided that we clean it - why the hell do we want to take it out of > > the pagecache? I don't see any fundamental reasons to prohibit pages > > past the ->i_size being hashed. > > In fact, we used to allow them. > > We do want to remove it from the page cache under normal circumstances: it > makes for much better MM behaviour (ie free pages that are truly useless). > But I suspect that just adding a test for "page_count(page) == 1" (ie same > as for the page cache invalidation) before freeing it should give that > advantage, along with avoiding the problematic unlikely case... That's fine, but I'm afraid that we'll need a bit more than that. A couple of obvious ones: * filemap_nopage() needs the second check for ->i_size. Upon exit. Moreover, what is (area->vm_mm == current->mm) doing in the existing check? * truncate() should zero the page out if it doesn't remove it from cache. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: syslog() blocks on glibc 2.1.3 with kernel 2.2.x
Ricky Beam <[EMAIL PROTECTED]> writes: > syslogd isn't the blocker. The syslog functions in glibc being > called by named are the problem. Stop named from blocking on syslog > writes and the world will be happy again. So I have the glibc maintainer (and others) saying that syslog messages should never be dropped, and you saying that named should be dropping its syslog messages. One more time, from the top: named is calling syslog(), a glibc function. This function *blocks* waiting for delivery to the local syslogd, even though it is using SOCK_DGRAM sockets. There is no option to openlog() or syslog() to get non-blocking behavior (the LOG_NDELAY option means something else entirely). You are effectively suggesting that named should be rewritten not to use the glibc syslog functions at all. That strikes me as the worst suggestion so far; it would be far better for syslogd not to do name lookups. But my syslogd has no option to avoid name lookups; I will submit a request to add one. > I've gotta ask what kind of "load" can cause this to happen. > And for the record, syslogd shouldn't be doing DNS lookups for > things arriving via /dev/log -- that's always the local machine. This particular syslogd also accepts messages from remote hosts. So when there is a lot of syslog traffic, this syslogd talks to named a lot. named occasionally sends messages to syslog. Since syslog pauses waiting for named to respond to name queries, and named blocks waiting for syslog to consume the message, a deadlock is triggered. True, it is not a full deadlock, because the name query times out eventually. But it is bad enough that the system becomes largely non-responsive. - Pat - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Petr Vandrovec wrote: > > Yes. Bad news. No problem was catched in filemap_nopage, but one > (of 57000) pages was dirty and had page->mapping == NULL... (maybe > only one was caused that this was just after bootup, with plenty of memory) > Maybe I should look at readahead code? Yes, you're right, read-ahead is in fact even more likely to catch the race. > In case of truncate it is going irrevocable away. Accesses after truncate > should (and sometime give you) SIGBUS... Yes. But I'd rather have a small race where we under certain strange circumstances forget to raise a SIGBUS than have a kernel bug that causes oops and possible filesystem corruption. So I'm not just looking for a fix, I'm also looking for a SIMPLE fix, with perhaps some major surgery later (things like making the inode semaphore a read-write semaphore and just always synchronizing 100% on i_size - which would fix the problem, but is not a 2.4.x thing, no way Jose). > First page->mapping == NULL entry in syslog is dated 22:23:58, but > couple of entries was lost before (probably I should print only '.' for > each such page; this run there was more than 100 such pages) > Another question is why SIGCHLD was delivered to parent AFTER ftruncate, > but exit was invoked couple of seconds before - maybe it syncs > child address space to disk? exit() basically does do a msync(MSASYNC), so that could be it. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Alexander Viro wrote: > On Mon, 23 Oct 2000, Linus Torvalds wrote: > > > Al, any ideas? I have this feeling that the simplest fix is just to leave > > the race open, and make truncate_complete_page() just leave such a "racy" > > page in the page cache. It will still race, and the invalid page will > > still exist, but the end result should be harmless. > > Provided that we clean it - why the hell do we want to take it out of > the pagecache? I don't see any fundamental reasons to prohibit pages > past the ->i_size being hashed. In fact, we used to allow them. We do want to remove it from the page cache under normal circumstances: it makes for much better MM behaviour (ie free pages that are truly useless). But I suspect that just adding a test for "page_count(page) == 1" (ie same as for the page cache invalidation) before freeing it should give that advantage, along with avoiding the problematic unlikely case... Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: VMWare and kswapd
On 23 Oct 00 at 16:07, Michael Rothwell wrote: > I'm trying out the new VMWare for Linux, and noticed that if I let it > run for a couple of days, I get this: > > Oct 16 15:07:49 cartman kernel: VM: do_try_to_free_pages failed for > kswapd... > > ... after which things go to hell pretty fast. The previous VMWare would > oops the kernel so badly that I would go from working away to seeign the > BIOS screen suddenly. Unfortunately, the new problem doesn't produce any > other data, even and oops. If I produce memory contention (running > Netscape at the same time as VMWare, for instance), it happens sooner. > > This is on a 2.2.16+USB kernel. You may want to upgrade to 2.2.17, or newer... > I know the LKML isn't VMWare tech support. I'm just wondering if anyone > else has been getting similar results but better data so that a bug > report can be sent to VMWare. It is better to ask in news://news.vmware.com, in vmware.for-linux.{experimental,misc}. But I recommend you to buy more memory, or downgrade window manager. If you have only 128MB of RAM, and you are running Gnome with XF4.0, and VMware with 80MB virtual NT, you can stress system beyond limits... Petr Vandrovec [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Topic for discussion: OS Design
On Mon, 23 Oct 2000, Dennis wrote: > This is typical of the "linux mentality". Why do other OSs have solutions > that work, yet linux's method requires special coding? If it "has to be > done that way", why do other OS's have solutions that dont do it that way? > the size of the buffer is an annoyance but not a serious problem however. > I'm not sure that Linux requires any special coding. > printing directly to the console (as BSD does) is useful when debugging a > panic, as you can trace right to the panic point. Also certain levels of BSD does not write directly to the console. Its console is a direct clone of Linux. I'm not sure which came first, but when you have a single screen-card there are not too many ways to get the character and attribute into screen memory. Linux allows the console to be redirected to a serial port. BSD does not last time I checked. > > > >Bugs that were found when changing the design of various kernel > >procedures, have been back-ported to the stable kernel series. > > I never use development kernels, what Im talking about is each major > release is like starting from version 1.0. By the time it stabilizes, the > next major puts it back to square one. What? The API has remained consistent since 0.99. It's only internal kernel stuff that has changed. If you wrote code that worked on 0.99, it will still compile and work on 2.2.17. You could not have written code for 0.99 that used mmap() and some other stuff because it had not been developed yet. However, all the "Unix stuff" like read/write/open/close/ioctl/lseek, etc., and their buffered versions like fopen() from the 'C' runtime library, have had the same API since day one. Linux was developed, from the start, to have a POSIX compatible API. Most of that API comes from the 'C' runtime library, the exact same API used by BSD and all the other OS's to which the GNU library has been ported. The only reason to get the 'latest' version is to take advantage of increased functionality. This, by definition, means that something has changed. That's what you upgrade for. The word "unstable" is a misnomer. > My point is that there is no "stable kernel series". 2.2.0 wasnt stable, > and neither was 2.2.3. Virtually all of the ethernet drivers still lock up > under heavy load in 2.2.17...and now with 2.4 there are more countless > adventures ahead Which Ethernet drivers are you having trouble with? The ones that had lockup problems (incidentally hardware related), now have reset code that runs off a watch-dog. > an example of "poor planning" is that skput and skpush will panic the > kernel if there is no room (this can happen with multiple encapsulations) > The proper behaviour would be to return a NULL pointer indicating failure, > and then to drop the frame and issue a warning. The proper response to any resource not being available (in networking) is to drop the packet on the floor, smash it into little pieces, and don't tell anybody about it. The packet will be sent again. But, if you can't transmit a packet, therefore freeing a buffer, what do you do? What you do is realilize that the failure to transmit was likely caused by a disconnected wire. In the drivers I use, I simply pretend that every packet got transmitted okay. This usually involves a one or two-line modification to the driver. This has nothing to do with poor planning. It just has to do with a design decision that I didn't agree with. Somebody decided that network data was precious and therefore the machine should kill itself if necessary to get the data through. I didn't agree with this so I changed a few lines of code. You can't kill any of my machines by flooding them and they never lock up. Further, they run at 85 to 90 percent of the network physical layer bandwidth. My main machine is our domain name-server, it gets between 2000 and 5000 hits per second. If it crashed, our whole LAN goes down. It doesn't. It runs Linux-2.2.17. Script started on Mon Oct 23 16:12:02 2000 # rlogin boneserver Password: Last login: Mon Oct 23 11:28:29 from chaos.analogic.com Linux 2.2.17. [3g[25;9HH[25;17HH[25;25HH[25;33HH[25;41HH[25;49HH[25;57HH[25;65HH[25;73HH <)0 # uptime 4:12pm up 24 days, 22:21, 11 users, load average: 0.81, 0.62, 0.00 # exit logout rlogin: connection closed. # exit exit Script done on Mon Oct 23 16:12:27 2000 Those 11 users are all network servers including samba for M$ connectivity. One of the major advantages of Linux is that if you don't like a design decision that was made, you are free to do it over the way you think is right. Sometimes you can convince others that your way is better. Sometimes not. If so, your patch makes it into the main-line kernel, if not, you patch your own future kernels so you get to retain your improvements. FYI, if AC did not exist, another would appear to fill the vacuum. Don't bitch. Make some improvements and send patches. Cheers, Dick Johnson Penguin :
Re: mount: Unable to handle kernel paging request at virtual address
David Dyck wrote: > > I am getting a repeatable oops during the boot up phase, > with linux 2.4.0 test10-pre4 > > Even a simple "mount /proc" command yields an oops. > I believe I have the latest mount program. > > Unable to handle kernel paging request at virtual address 08067000 > c01f90d0 > *pde = 07f42067 > Oops: > CPU:0 > EIP:0010:[] > Using defaults from ksymoops -t elf32-i386 -a i386 > EFLAGS: 00010206 > eax: ebx: ecx: 00a0 edx: 08067280 > esi: 08067000 edi: c7ec3d80 ebp: c7f3ffbc esp: c7f3ff64 > ds: 0018 es: 0018 ss: 0018 > Process mount (pid: 18, stackpage=c7f3f000) > Stack: c7f3e000 08066280 1000 c0134610 c7ec3000 08066280 1000 c7f3e000 >08066270 08066260 080662b0 c7ec3000 0009 c01349b2 08066280 c7f3ffbc >c7f3e000 c0ed 080662b0 bb84 c7f3e000 c010906b > Call Trace: [] [] [] > Code: f3 a5 89 c1 f3 a4 89 c8 5b 5e 5f c3 8d 74 26 00 57 56 8b 7c > > >>EIP; c01f90d0 <__generic_copy_from_user+30/40> <= > Trace; c0134610 > Trace; c01349b2 > Trace; c010906b > Code; c01f90d0 <__generic_copy_from_user+30/40> <_EIP>: > Code; c01f90d0 <__generic_copy_from_user+30/40>0: f3 a5 >repz movsl %ds:(%esi),%es:(%edi) <= > Code; c01f90d2 <__generic_copy_from_user+32/40>2: 89 c1 >mov%eax,%ecx > Code; c01f90d4 <__generic_copy_from_user+34/40>4: f3 a4 >repz movsb %ds:(%esi),%es:(%edi) > Code; c01f90d6 <__generic_copy_from_user+36/40>6: 89 c8 >mov%ecx,%eax > Code; c01f90d8 <__generic_copy_from_user+38/40>8: 5b >pop%ebx > Code; c01f90d9 <__generic_copy_from_user+39/40>9: 5e >pop%esi > Code; c01f90da <__generic_copy_from_user+3a/40>a: 5f >pop%edi > Code; c01f90db <__generic_copy_from_user+3b/40>b: c3 >ret > Code; c01f90dc <__generic_copy_from_user+3c/40>c: 8d 74 26 00 >lea0x0(%esi,1),%esi > Code; c01f90e0 <__strncpy_from_user+0/30> 10: 57 >push %edi > Code; c01f90e1 <__strncpy_from_user+1/30> 11: 56 >push %esi > Code; c01f90e2 <__strncpy_from_user+2/30> 12: 8b 7c 00 00 mov > 0x0(%eax,%eax,1),%edi This should have been trapped by the exception handling routines. One possible explanation is that the exception table is not sorted correctly by the linker. This can happen if an exception entry is made for an address that is in another section than .text. The exception handler does a binary search which can be tripped up by an out of sequence entry. Hmm, I wonder if GCC inlined do_test_wp_bit(). That would put an exception in the .text.init section. Could you check to see if this happened? -- Brian Gerst - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
make -j 2 broken?
Hi, Trying to compile the current kernel (test10-pre4) with: > make clean > make -j 2 bzImages modules modules_install will try to install the modules before they are built... This has previously been working (at least in early testX kernels). make --version GNU Make version 3.79.1, by Richard Stallman and Roland McGrath. /RogerL -- Home page: http://www.norran.net/nra02596/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On 23 Oct 00 at 16:13, Alexander Viro wrote: > On Mon, 23 Oct 2000, Linus Torvalds wrote: > > > Al, any ideas? I have this feeling that the simplest fix is just to leave > > the race open, and make truncate_complete_page() just leave such a "racy" > > page in the page cache. It will still race, and the invalid page will > > still exist, but the end result should be harmless. > > Provided that we clean it - why the hell do we want to take it out of > the pagecache? I don't see any fundamental reasons to prohibit pages > past the ->i_size being hashed. Methods _must_ check for ->i_size, > but they do it anyway. All race-prevention is based on page locks and > ->i_sem. > > Yes, filemap_nopage() should check for i_size at the very end and fail if > the page became off-limits. But that's completely unrelated issue - it's > mmap semantics, not pagecache one. Yes. Bad news. No problem was catched in filemap_nopage, but one (of 57000) pages was dirty and had page->mapping == NULL... (maybe only one was caused that this was just after bootup, with plenty of memory) Maybe I should look at readahead code? Although to be clear I do not know why. Unless there is bug in logic in test program, it should first dirty pages, and AFTER that it should truncate - and unmap and exit, without ever touching pages of mapping... My first testcases were with this race (and with raw devices), but then I found (by removing more and more code) that no race (and no raw devices) are required... > The point being: we should _never_ drop ->mapping unless the page is > irrevocably going away. We can (and probably should) drop the off-limits > page as soon as ->count hits zero, but we should not do it before that. In case of truncate it is going irrevocable away. Accesses after truncate should (and sometime give you) SIGBUS... total used free sharedbuffers cached Mem:255768 42208 213560 0496 18420 -/+ buffers/cache: 23292 232476 Swap: 530136 13200 516936 Strace of another run: 1688 22:23:41.748438 execve("./oopsdemo", ["./oopsdemo"], [/* 18 vars */]) = 0 1688 22:23:41.749058 brk(0)= 0x8049ae8 1688 22:23:41.749399 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40016000 1688 22:23:41.749641 open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory) 1688 22:23:41.749861 open("/etc/ld.so.cache", O_RDONLY) = 4 1688 22:23:41.750011 fstat(4, {st_mode=S_IFREG|0644, st_size=43818, ...}) = 0 1688 22:23:41.750253 old_mmap(NULL, 43818, PROT_READ, MAP_PRIVATE, 4, 0) = 0x40017000 1688 22:23:41.750408 close(4) = 0 1688 22:23:41.750538 open("/lib/libc.so.6", O_RDONLY) = 4 1688 22:23:41.750676 fstat(4, {st_mode=S_IFREG|0755, st_size=1057576, ...}) = 0 1688 22:23:41.750878 read(4, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\224\314"..., 4096) = 4096 1688 22:23:41.751173 old_mmap(NULL, 1072484, PROT_READ|PROT_EXEC, MAP_PRIVATE, 4, 0) = 0x40022000 1688 22:23:41.751327 mprotect(0x4011e000, 40292, PROT_NONE) = 0 1688 22:23:41.751441 old_mmap(0x4011e000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 4, 0xfb000) = 0x4011e000 1688 22:23:41.751633 old_mmap(0x40124000, 15716, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x40124000 1688 22:23:41.751809 close(4) = 0 1688 22:23:41.753291 munmap(0x40017000, 43818) = 0 1688 22:23:41.753554 getpid() = 1688 1688 22:23:41.753785 fstat(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(4, 1), ...}) = 0 1688 22:23:41.753993 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40017000 1688 22:23:41.754162 ioctl(1, TCGETS, {B38400 opost isig icanon echo ...}) = 0 1688 22:23:41.754430 write(1, "Go\n", 3) = 3 1688 22:23:41.754727 open("ram0", O_RDWR|O_CREAT, 0600) = 4 1688 22:23:41.754949 unlink("ram0")= 0 1688 22:23:41.755103 ftruncate(4, 234881024) = 0 1688 22:23:41.755228 old_mmap(NULL, 234881024, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 0) = 0x40128000 1688 22:23:41.755700 pipe([5, 6]) = 0 1688 22:23:41.755847 pipe([7, 8]) = 0 1688 22:23:41.756024 fork()= 1689 1689 22:23:41.756735 close(6 1688 22:23:41.756798 close(5 1689 22:23:41.756844 <... close resumed> ) = 0 1688 22:23:41.756894 <... close resumed> ) = 0 1689 22:23:41.756949 close(7 1688 22:23:41.756997 close(8 1689 22:23:41.757041 <... close resumed> ) = 0 1688 22:23:41.757089 <... close resumed> ) = 0 1689 22:23:41.757139 close(4 1688 22:23:41.757188 write(6, "\0", 1 1689 22:23:41.757250 <... close resumed> ) = 0 1688 22:23:41.757301 <... write resumed> ) = 1 1689 22:23:41.757355 read(5, 1688 22:23:41.757405 read(7, 1689 22:23:41.757450 <... read resumed> "\0", 1) = 1 1689 22:23:49.260756 write(8, "\0", 1) = 1 1689 22:23:49.436204 read(5, 1688 22:23:49.442969 <... read resumed> "\
Re: Linux's implementation of poll() not scalable?
Hi, If you have a similar machine (in terms machine configuration) for both your solaris and linux machines... could you tell us what the difference in total time for 100 and 1 was? i.e... dont compare solaris with 100 descripters vs solaris with 1 descriptors, but rather Linux 100 descripters Vs. Solaris 100 descriptors AND Linux 1 descriptors Vs. Solaris 1 descriptors. That would be useful informatio... I think. Thanks Lyle Re: Linux's implementation of poll() not scalable? [ Small treatize on "scalability" included. People obviously do not understand what "scalability" really means. ] In article <[EMAIL PROTECTED]>, Dan Kegel <[EMAIL PROTECTED]> wrote: >I ran a benchmark to see how long a call to poll() takes >as you increase the number of idle fd's it has to wade through. >I used socketpair() to generate the fd's. > >Under Solaris 7, when the number of idle sockets was increased from 100 to >1, the time to check for active sockets with poll() increased by a >factor of only 6.5. That's a sublinear increase in time, pretty spiffy. Yeah. It's pretty spiffy. Basically, poll() is _fundamentally_ a O(n) interface. There is no way to avoid it - you have an array, and there simply is _no_ known algorithm to scan an array in faster than O(n) time. Sorry. (Yeah, you could parallellize it. I know, I know. Put one CPU on each entry, and you can get it down to O(1). Somehow I doubt Solaris does that. In fact, I'll bet you a dollar that it doesn't). So what does this mean? Either (a) Solaris has solved the faster-than-light problem, and Sun engineers should get a Nobel price in physics or something. (b) Solaris "scales" by being optimized for 1 entries, and not speeding up sufficiently for a small number of entries. You make the call. Basically, for poll(), perfect scalability is that poll() scales by a factor of 100 when you go from 100 to 1 entries. Anybody who does NOT scale by a factor of 100 is not scaling right - and claiming that 6.5 is a "good" scale factor only shows that you've bought into marketing hype. In short, a 6.5 scale factor STINKS. The only thing it means is that Solaris is slow as hell on the 100 descriptor case. >Under Linux 2.2.14 [or 2.4.0-test1-pre4], when the number of idle sockets >was increased from 100 to 1, the time to check for active sockets with >poll() increased by a factor of 493 [or 300, respectively]. So, what you're showing is that Linux actually is _closer_ to the perfect scaling (Linux is off by a factor of 5, while Solaris is off by a factor of 15 from the perfect scaling line, and scales down really badly). Now, that factor of 5 (or 3, for 2.4.0) is still bad. I'd love to see Linux scale perfectly (which in this case means that 1 fd's should take exactly 100 times as long to poll() as 100 entries take). But I suspect that there are a few things going on, one of the main ones probably being that the kernel data working set for 100 entries fits in the cache or something like that. >Please, somebody point out my mistake. Linux can't be this bad! I suspect we could improve Linux in this area, but I hope that I pointed out the most fundamental mistake you did, which was thinking that "scalability" equals "speed". It doesn't. Scalability really means that the effort to handle a problem grows reasonably with the hardness of the problem. And _deviations_ from that are indications of something being wrong. Some people think that super-linear improvements in scalability are signs of "goodness". They aren't. For example, the classical reason for super-linear SMP improvement (with number of CPU's) that people get so excited about really means that something is wrong on the low end. Often the "wrongness" is lack of cache - some problems will scale better than perfectly simply because with multiple CPU's you have more cache. The "wrongess" is often also selecting the wrong algorithm: something that "scales well" by just being horribly slow for the small case, and being "less bad" for the big cases. In the end, the notion of "scalability" is meaningless. The only meaningful thing is how quickly something happens for the load you have. That's something called "performance", and unlike "scalability", it actually has real-life meaning. Under Linux, I'm personally more worried about the performance of X etc, and small poll()'s are actually common. So I would argue that the Solaris scalability is going the wrong way. But as performance really depends on the load, and maybe that 1 entry load is what you consider "real life", you are of course free to disagree (and you'd be equally right ;) Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/ _ Get Your Private, Free E-mail from MSN Hotmail at http://www.hotma
Re: syslog() blocks on glibc 2.1.3 with kernel 2.2.x
On 23 Oct 2000, Patrick J. LoPresti wrote: >Turning down the DNS timeout would affect *all* name resolution on the >system, right? That is not acceptable. You should be able to set it on a per-process basis (via an ENV var.) >As I said, I already have a workaround, which is to have named log to >a flat file. I agree that this is a poor workaround, and the "right >fix" is to modify syslogd not to perform blocking operations. My only >quibble is that SOCK_DGRAM is an odd transport to use here, even over >AF_UNIX. syslogd isn't the blocker. The syslog functions in glibc being called by named are the problem. Stop named from blocking on syslog writes and the world will be happy again. I've gotta ask what kind of "load" can cause this to happen. And for the record, syslogd shouldn't be doing DNS lookups for things arriving via /dev/log -- that's always the local machine. --Ricky - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: syslog() blocks on glibc 2.1.3 with kernel 2.2.x
Ricky Beam <[EMAIL PROTECTED]> writes: > I would suggest disabling name resolution for syslog, but that's an > ugly option. There's no way to stop a glibc system from doing a DNS > query for a reverse lookup. HOWEVER, you can set the DNS timeout to > 1 second and set the resolver options to prevent recursion (answer > from cache only.) Recursion has nothing to do with it; as I said, the named on this system is itself authoritative for all of the reverse lookups. Turning down the DNS timeout would affect *all* name resolution on the system, right? That is not acceptable. As I said, I already have a workaround, which is to have named log to a flat file. I agree that this is a poor workaround, and the "right fix" is to modify syslogd not to perform blocking operations. My only quibble is that SOCK_DGRAM is an odd transport to use here, even over AF_UNIX. > PS: Technically, this is not a lockup. syslogd should eventually > timeout waiting for the DNS query and go about it's business. Of > course, that may be upwards of 45 seconds -- very annoying. Yes. We are able to log in to the machine eventually and restart the offending processes. But that is little consolation to our users who notice the hang and the fallout afterward. - Pat - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Linus Torvalds wrote: > Al, any ideas? I have this feeling that the simplest fix is just to leave > the race open, and make truncate_complete_page() just leave such a "racy" > page in the page cache. It will still race, and the invalid page will > still exist, but the end result should be harmless. Provided that we clean it - why the hell do we want to take it out of the pagecache? I don't see any fundamental reasons to prohibit pages past the ->i_size being hashed. Methods _must_ check for ->i_size, but they do it anyway. All race-prevention is based on page locks and ->i_sem. Yes, filemap_nopage() should check for i_size at the very end and fail if the page became off-limits. But that's completely unrelated issue - it's mmap semantics, not pagecache one. The point being: we should _never_ drop ->mapping unless the page is irrevocably going away. We can (and probably should) drop the off-limits page as soon as ->count hits zero, but we should not do it before that. Comments? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: getting include-files from arch//subdir
why not #include Amit "Heusden, Folkert van" wrote: > > I need to include (in a driver) a header-file from arch//subdir. I > could, of course, > do something like #include "../../arch/i386/{etc}" with a couple of #ifdef's > to get things > working for each environment. I guess that's now the way to do it cleanly. > What would be _the_ way to do it? > > Thanks. > > Folkert van Heusden. > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to [EMAIL PROTECTED] > Please read the FAQ at http://www.tux.org/lkml/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
VMWare and kswapd
I'm trying out the new VMWare for Linux, and noticed that if I let it run for a couple of days, I get this: Oct 16 15:07:49 cartman kernel: VM: do_try_to_free_pages failed for kswapd... ... after which things go to hell pretty fast. The previous VMWare would oops the kernel so badly that I would go from working away to seeign the BIOS screen suddenly. Unfortunately, the new problem doesn't produce any other data, even and oops. If I produce memory contention (running Netscape at the same time as VMWare, for instance), it happens sooner. This is on a 2.2.16+USB kernel. I know the LKML isn't VMWare tech support. I'm just wondering if anyone else has been getting similar results but better data so that a bug report can be sent to VMWare. -M - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
RE: Linux's implementation of poll() not scalable?
> Under Solaris 7, when the number of idle sockets was increased from > 100 to 1, the time to check for active sockets with poll() > increased by a factor of only 6.5. That's a sublinear increase in time, > pretty spiffy. Under Solaris 7, when the number of idle sockets was decreased from 10,000 to 100, the time to check for active sockets with poll() decreased by a factor of only 6.5. Shouldn't it be more like 100? Sounds like Solaris' poll implementation is horribly broken for the most common case. DS - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
On Mon, 23 Oct 2000, Petr Vandrovec wrote: > > I'll take a better look at the truncate case (I consider the invalidate > > case closed). Do you have a simple test-program around? > > Well, I cannot say simple. As I was not able to reproduce it with only > one task, code below: Ok, without running this I can already guess at what's up. One process does the "truncate()", and races with another process that does a page-in. The truncate code does notify_change(ATTR_SIZE); which will basically cause a "vmtruncate(inode, newsize)". Just before this happened, the other process gets a page fault, and does a page_cache_read(). Now, because we are low on memory (this is why it only shows up when you're swapping), that read will take a while. During that time, the vmtruncate() starts to execute, and sets inode->i_size to the new value (that would cause us to no longer accept a page fault - but we already got past that check in the faulting process). It will then invalidate all the inode pages > i_size. Now, the page faulting process comes back, and puts the page into the VM space. Never mind that it has in the meantime gotten i_mapping = NULL due to the other process doing the truncate. Now, if I'm right, you should be able to add something like if (!old_page->mapping) printk("mapping went away from under us\n"); to just before the "return old_page()" case in the success path of filemap_nopage() (mm/filemap.c), and you should see that printk() trigger when the bug happens. (There are other users too that are not synchronized wrt inode size changes and could get an access to a page past the end of the file this way - I just think that filemap_nopage is probably the one where this is most easily seen). The above is obviously not a bug-fix, it's just a validation of the theory. Al, any ideas? I have this feeling that the simplest fix is just to leave the race open, and make truncate_complete_page() just leave such a "racy" page in the page cache. It will still race, and the invalid page will still exist, but the end result should be harmless. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] cpu detection fixes for test10-pre4
Hi! > >> > * include/asm-i386/elf.h: > >> > - make Pentium IV and other post-P6 processors use the "i686" > >> > family name (same fix as the system_utsname.machine init fix > >> > which went into include/asm-i386/bugs.h in test10-pre4) > >> > > >> > >> We should never have used anything but "i386" as the utsname... sigh. > > > >We stil can do that: make 2.4.0 use i386 for utsname on all x86 machines > >(except x86-64 ;-), and let people adapt. > > Better yet, make it use "ia32" to avoid confusion. No. i386 is already in use by most people. No need to change that. Pavel -- I'm [EMAIL PROTECTED] "In my country we have almost anarchy and I don't care." Panos Katsaloulis describing me w.r.t. patents at [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/