[RFC] Task Ornaments
PROPOSAL

I have written a patch (appended) that gives a task structure a list of arbitrary "ornaments". Any number of ornaments can then be added without increasing the size of the task_struct. Each ornament has an operations table, much as inodes and address spaces do. These operations include a destructor and routines for destroying and duplicating ornaments under signal, exit, execve and fork/clone conditions.

For instance, the frame buffer drivers use three fields in the task structure to gain notification of signals that are blocked from being delivered to a process:

	- notifier
	- notifier_data
	- notifier_mask

Instead, these fields could be attached to an ornament which is only present when the FB driver is doing this. They would use the signal notification operation to perform this function. In this way, the task_struct size could actually be _reduced_ by 4 bytes.

Functions are provided for traversing the list and calling the appropriate notification routine on each ornament. Multiple traversals can be in progress at once, they can be recursive, and the list can be safely modified at the same time.

The implementation I've appended does work in userspace tests (as far as it is possible to test it there), though it has not been tried in the kernel as yet.

One optimisation that can be made is to make alloc_lock in the task_struct an rwlock_t instead of a spinlock_t. This means that the list can be searched simultaneously by several different processes.

REASON FOR IMPLEMENTATION

In the Win32 kernel module that I'm writing to help accelerate Wine, one of the things I want to do is to be able to store a Win32 handle map in a task structure (currently I attach it to an open fd). There are a number of things my code needs to be able to do:

 (1) Destroy the handle map upon exit, signal and potentially execve.

 (2) Trim the handle map during execve.

 (3) Duplicate/share the handle map of a process with a child process during forking/cloning.
 (4) A process needs to be able to add objects to the handle map of another arbitrary process that also has a handle map.

 (5) A process needs to be able to gain notification of the death of another arbitrary process.

The way I'm proposing to do this is by providing notifications from certain routines, such as do_fork(). There are a number of further considerations:

 (1) A notification service routine can sleep (eg: on kmalloc).

 (2) A service routine may modify the list (eg: remove itself).

 (3) The list must be guarded against modifications during traversal without preventing modifications from happening.

Thanks for your consideration/comments/better ways of doing it,
David
==
diff -uNr linux-2.4.1-pre10.orig/fs/exec.c linux-2.4.1-pre10/fs/exec.c
--- linux-2.4.1-pre10.orig/fs/exec.c	Fri Jan 26 15:57:12 2001
+++ linux-2.4.1-pre10/fs/exec.c	Fri Jan 26 17:28:32 2001
@@ -541,6 +541,7 @@
 	if (retval)
 		goto mmap_failed;
 	/* This is the point of no return */
+	task_ornament_notify_execve(current);
 	release_old_signals(oldsig);
 	current->sas_ss_sp = current->sas_ss_size = 0;
diff -uNr linux-2.4.1-pre10.orig/include/linux/sched.h linux-2.4.1-pre10/include/linux/sched.h
--- linux-2.4.1-pre10.orig/include/linux/sched.h	Fri Jan 26 15:57:16 2001
+++ linux-2.4.1-pre10/include/linux/sched.h	Fri Jan 26 16:08:32 2001
@@ -394,7 +394,8 @@
 	u32 parent_exec_id;
 	u32 self_exec_id;
 /* Protection of (de-)allocation: mm, files, fs, tty */
-	spinlock_t alloc_lock;
+	rwlock_t alloc_lock;
+	struct list_head ornaments;
 };
@@ -474,7 +475,7 @@
     sig:	&init_signals,				\
     pending:	{ NULL, &tsk.pending.head, {{0}}},	\
     blocked:	{{0}},					\
-    alloc_lock:	SPIN_LOCK_UNLOCKED			\
+    alloc_lock:	RW_LOCK_UNLOCKED			\
 }
@@ -859,12 +860,12 @@
 static inline void task_lock(struct task_struct *p)
 {
-	spin_lock(&p->alloc_lock);
+	write_lock(&p->alloc_lock);
 }
 static inline void task_unlock(struct task_struct *p)
 {
-	spin_unlock(&p->alloc_lock);
+	write_unlock(&p->alloc_lock);
 }
 /* write full pathname into buffer and return start of pathname */
diff -uNr linux-2.4.1-pre10.orig/include/linux/taskornament.h linux-2.4.1-pre10/include/linux/taskornament.h
--- linux-2.4.1-pre10.orig/include/linux/taskornament.h	Thu Jan  1 01:00:00 1970
+++ linux-2.4.1-pre10/include/linux/taskornament.h	Fri Jan 26 17:16:22 2001
@@ -0,0 +1,129 @@
+/* taskornament.h: task ornaments
+ *
+ * (c) Copyright 2001 David Howells ([EMAIL PROTECTED])
+ */
+
+#ifndef _H_F92CE22E_93FF_11D4_8781_C000574
Re: [PATCH] guard mm-rss with page_table_lock (241p11)
...
+	spin_lock(&mm->page_table_lock);
 	mm->rss++;
+	spin_unlock(&mm->page_table_lock);
...

Would it not be better to use some sort of atomic add/subtract/clear operation rather than a spinlock? (That would also give you fewer atomic memory access cycles.)

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: rwsem, gcc3 again
Hi Andrea: Here you go:

/usr/local/bin/gcc -D__KERNEL__ -I/inst-kernels/linux-2.4.5-pre2-aa/include -Wall -Wstrict-prototypes -O2 -fomit-frame-pointer -fno-strict-aliasing -pipe -mpreferred-stack-boundary=2 -march=i686 -DEXPORT_SYMTAB -c sys.c

sys.c: In function `sys_gethostname':
/inst-kernels/linux-2.4.5-pre2-aa/include/asm/rwsem_xchgadd.h:51: inconsistent operand constraints in an `asm'

I've lit fires underneath some of our gcc people, and they say it's definitely a bug in gcc. They're currently looking at it.

David
Re: 00_rwsem-11, 2.4.4-ac11 and gcc-3(2001-05-14)
The compiler should now be fixed in this respect, for both my stuff that's in the kernel and Andrea's desired replacement. The problem appears to have been triggered by having two input+output constraints (eg: "+r", "+m").

However, I can't test this because the head of the CVS trunk doesn't seem to have been able to build and test successfully since just before the fix was applied. (I'm going on the codesourcery builds for this.)

David
Re: __asm__
> __asm__("and 1 %%esp.%0;" : "=r" (current) : "0" (~8191UL));

This doesn't look right... Where did you get this from? Taking the one in include/asm-i386/current.h as an example:

 | __asm__(

This signifies a piece of inline assembly that the compiler must insert into its output code. __asm__ is the same as asm, but can't be disabled by command line flags.

 | "andl %%esp,%0"

%% is a macro that expands to a %. %0 is a macro that expands to the first input/output specification. So in this case, it takes the stack pointer (register %esp) and ANDs it into a register that contains 0xFFFFE000, leaving the result in that register.

Basically, a task's task_struct and a task's kernel stack occupy an 8KB block that is 8KB aligned, with the task_struct at the beginning and the stack growing from the end downwards. So you can find the task_struct by clearing the bottom 13 bits of the stack pointer value.

 | ";"

The semicolon can be used to separate assembly statements, as can the newline character escape sequence (\n).

 | : "=r" (current)

This specifies an output constraint (all of which occur after the first colon, but before the second). The '=' specifies that this is an output. The 'r' indicates that a general purpose register should be allocated such that the instruction can place the output value into it. The bit inside the brackets - 'current' - is the intended destination of the output value (normally a local variable) once the C part is returned to.

 | : "0" (~8191UL));

This specifies an input constraint (all of which occur after the second colon, but before the third). The '0' references another constraint (in this case, the first output constraint), saying that the same register or memory location should be used for both. The ~8191UL inside the brackets is a constant that should be loaded into the register allocated for the output value before using the instructions inside the asm block.

See also the gcc info pages, topic "C Extensions", subtopic "Extended Asm".
David
Re: [Re: __asm__ ]
Okay, current is a macro on i386 that expands to get_current(). This gets the task_struct for the task currently running on the CPU executing the code. It does this by masking out the bottom bits of its kernel stack pointer.

For example, assume that some running process has the following task record stored in an 8KB-aligned 8KB block:

	0xD520BFFF	+----------------+
			|                |
			|  kernel stack  |
			|                |
	0xD520B498	+----- TOS ------+  <-- stack pointer: %esp
			|                |
			|  empty space   |
			|                |
			+----------------+
			|                |
			|  task_struct   |
			|                |
	0xD520A000	+----------------+  <-- get_current()

get_current() can work out where the base of this block is because (1) the kernel stack pointer is always within it, and (2) it's aligned in memory with respect to its size:

	get_current() { return %esp & ~8191; }
	get_current() { return 0xD520B498 & 0xFFFFE000; }
	get_current() = 0xD520A000

So current->fs is a structure that holds the current task's idea of its root filesystem (chroot), current working directory (chdir) and current umask. And so current->fs->root is the task's filesystem root dentry.

David
rwsems and asm-constraint gcc bug
The bug in gcc 3.0 that stopped the inline asm constraints being interpreted properly, and thus prevented Linux from compiling, is now fixed.

David
Re: linux-2.4.3-ac14 spinlock problem?
> I was running something on my Dell dual P3 box (Optiplex GX300). My
> kernel is linux-2.4.3-ac14. I got the following message:

How often did this message occur?

> __rwsem_do_wake(): wait_list unexpectedly empty
> [4191] c5966f60 = { 0001 })
> kernel BUG at rwsem.c:99!
> invalid operand:
> CPU: 1
> EIP: 0010:[c0236b99]
> EFLAGS: 00010282
> kernel BUG at /usr/src/2.4.3-ac14/include/asm/spinlock.h:104!

> I upgraded the kernel to 2.4.5, no such problem any more.

I suspect something else corrupted the rw-semaphore structure, but that's very hard to prove unless you catch it in the act. If it happens again with any frequency, you might want to try turning on rwsem debugging.

David
Re: [RFC] Wine speedup through kernel module
> Please by no way include this patch into the official tree.

I wasn't intending to put it up for inclusion into the main kernel tree... I think it is far better that it stays as a module, especially as it doesn't require any changes to the main tree to be used; it can just be loaded when one wants to use Wine.

> It's insane due to the following:
> 1. Linux is UNIX not NT... (in terms of API)
> 2. WINE in itself is barely useful - even in fact non-existent, since
>    there is no official stable release out there.

It is actually proving very useful, at least to me... It allows me to use PVCS (a horrible Java-based version control system for Windows only) at work whilst running Linux. It also allows me to run MSDEV without having to reboot.

David Howells
Re: [RFC] Wine speedup through kernel module
Tigran Aivazian <[EMAIL PROTECTED]> wrote:
> how about the classical standard way of "getting to kernel space", i.e.
> plain system calls? Unless you really need a huge number of new system
> calls,

Just the one... this'll take a parameter specifying the Win32 call to make:

	int win32(int fd, enum win32op op, void *args, u_int *win32error)

The fd is currently used to find the Win32 handle map, which is attached to struct file->f_private on that fd. I don't know any other way of doing this at the moment (at least so that the handle map automatically gets destroyed when the owning process expires). One thing that might be useful in this regard is the ability to add callbacks to a list on the task structure, such that they are invoked when the process exits/execs (and at no other time).

> what you can do is in the module, scan for sys_ni_syscall entries in the
> sys_call_table[] table (see entry.S) and replace them with pointers to
> functions in your module's address space. Then you need to communicate
> the slot numbers thus occupied back to userspace, e.g. via a /proc file,
> which can then invoke them in a standard int 0x80 manner.

Yes... that's a good idea... I already use a /proc file to provide access to the kernel module; I can just extend that.

> Or you can hardcode the slots but that is not flexible because unoccupied
> slots may become occupied at any time in the future.

Unless I can get one reserved, though I suspect that won't happen.

> Also, if you do this - don't forget to inc/dec the module refcount in
> each of your new syscalls, for obvious reasons - if the syscall sleeps
> and the module is unloaded, it will return to "nowhere".

That is actually quite a big problem... How do you stop someone accessing the syscall whilst the module's cleanup function is running, but before it has a chance to clear the syscall slot? Does module unloading lock out other syscalls?
This may require some sort of in-the-main-kernel syscall register/unregister functionality that takes a handler pointer _and_ a module definition pointer.

David
Re: [RFC] Wine speedup through kernel module
Andi Kleen <[EMAIL PROTECTED]> wrote:
> It could use a read-only-to-clients shared memory with some locking
> tricks.

You still have to be able to emulate WaitForMultipleObjects(), which I think is quite difficult from userspace. It can perhaps be done with signals, but that then incurs costs in invoking signal handlers. And however it is done, it would require thread-safe wait queues to be implemented in userspace. The kernel's wait queues make this a lot easier, quicker and safer.

> I looked a bit over the code. Your Mutex classes do not look very SMP
> safe, have they been tested with SMP?

Look carefully... It uses the atomic bit set/clear functions to modify the state, and the wait queue carries its own lock. And yes, they have been SMP tested... I have a dual Pentium II box at home. I ran my five-thread test programs on it and they ran extremely fast without crashing (faster than Win2000 on the same box with the same test).

> Also I guess it would be better to implement Mutexes in user space as far
> as possible and only use kernel help for inter-process mutexes.

What's the difference? And how do you tell that a mutex is going to be used only between threads and not between processes? Anonymous mutexes don't actually help - think of DuplicateHandle! Plus it only greatly complicates your WaitFor* functions.

> For the object name management I did not find any limits that would
> prevent a user from allocating all memory. Are there any?

getname() implies MAXPATHLEN (it being the VFS name-to-kernel-space function).

> wineserver_read is definitely buggy, it holds spinlocks while calling
> functions that may sleep (e.g. the copy_to_user in
> wineserver_read_printf)

Good point.

> Overall it does not look too bad. It would be nice though if there was an
> easy way to extend the object management from user space, because Windows
> probably has a lot of object types Linux definitely does not want in
> kernel space.

There will be... though there aren't actually all that many "kernel" object handles... most of what are called "handles" are really just userspace object handles issued by userspace DLLs.

However... for instance, the way I'm planning on implementing the registry through this module is to have a cut-down RPC server (using the RPC mechanism available in that module) that gets invoked implicitly by RegOpenKey, et al.

David Howells
Re: [RFC] Wine speedup through kernel module
Andi Kleen <[EMAIL PROTECTED]> wrote:
> But that's not race free on SMP. Two CPUs can set the bit in parallel and
> you'll never notice. You would need at least a protecting spinlock
> between the test bit and set bit (or a cmpxchg on x86)

Are you sure? I understood that the "lock" prefix on an i386 made the instruction it guarded SMP safe. If not, I suppose I can use the xchg() macro instead.

Hold on a moment... You said "between the test bit and set bit"... this is a single CPU instruction! With the lock prefix, there should be no between. Also, a quote from asm/bitops.h:

	/*
	 * These have to be done with inline assembly: that way the
	 * bit-setting is guaranteed to be atomic. All bit operations return
	 * 0 if the bit was cleared before the operation and != 0 if it was
	 * not.
	 *
	 * bit 0 is the LSB of addr; bit 32 is the LSB of (addr+1).
	 */

Doesn't "atomic" mean SMP safe? What's the point in the lock prefix if it doesn't make things SMP safe (after all, the unadorned instructions are UP safe...)?

> That also does not help for protecting your objects, it only protects the
> wait queue.

Yes. I don't see what more needs to be done, at least on the mutex. It is either available when we come to grab it, and we find this out and grab it in a single operation, or it isn't, and we wait. Furthermore, the objects also have atomic_t reference counts, as kernel modules do, so they aren't going to go away whilst in use.

> The processes look it up in a shared name table (shmem). Yes they have to
> negotiate in this case -- it'll only be a win when they usually do not
> share.

And really horrible when they do share. Plus you still have to jump through all the same synchronisation hoops to control access to the shmem that the kernel does when governing access to its objects.

> Probably, it is a DoS.

Surely a DoS attack is only possible if there is a limit in force? Namespace DoS attacks are possible whatever the underlying system: wineserver, kernel module, or Windows itself.
Of course, when I said "number of processes (not threads) * ~1020", what I actually meant was "number of interested processes...", since any process not interested can't actually issue the Win32 calls anyway.

David Howells
Re: [RFC] Wine speedup through kernel module
Horst von Brand <[EMAIL PROTECTED]> wrote:
> Lost me there. If after releasing the mutex it is free, the release was
> successful AFAIAC. If two threads try to do it at the same time, so what?
> Releasing an already free mutex is broken, OK.

But two threads owning the mutex at the same time is much worse... Because the mutex release is done in two stages: (1) clear the state bit and (2) set the owner pointer to NULL. This is a sufficient window for a potential mutex grabber to jump in, grab the mutex, and then have stage (2) clobber the owner pointer.

Anyway, I've changed this now (though I haven't got around to testing it yet): on platforms that support cmpxchg(), it uses the owner pointer as the state, exchanging it to NULL to release the mutex and exchanging it to the owner pointer when grabbing. On platforms that don't, a spinlock is added to the WineMutex structure.

David
Re: [RFC] Wine speedup through kernel module
Dan Maas <[EMAIL PROTECTED]> wrote:
> This is done all the time -- with ioctl(). It's perfectly normal to
> create a special character device that just responds to an ioctl for each
> operation you want to perform. See eg any sound card driver...

Yes, that's how I'm doing it at the moment. However, this incurs a penalty because of the number of standard ioctl()'s the system checks for first, before passing the command to my handler. What Robert suggested was to use write() to do the deed, which I don't think is a good idea.

David Howells
Re: [RFC] Wine speedup through kernel module
Oliver Neukum <[EMAIL PROTECTED]> wrote:
> What is the difference to getting one reserved syscall and multiplexing
> it?

This is what I'd like to be able to do... that way the checks that ioctl() performs can be avoided. However, there are problems with doing this:

 (1) There's currently no definitive way to grab an unused syscall, whether a random one or a pre-determined one.

 (2) The syscall table is not exported...

 (3) Even if it was... just filling in the syscall slot from a module means that it is possible for the module to be unloaded whilst the syscall is in use.

David Howells
Re: [RFC] Wine speedup through kernel module
Oliver Neukum <[EMAIL PROTECTED]> wrote:
> So ask Linus for one. The streams group got one. Why shouldn't you?

Well, that's up to Linus... but from his email on this subject, he might well.

> Having a static syscall should be more efficient, too.

A little... otherwise it's a matter, once per process, of reading a line from the /proc file, pulling out the number (or reading it via an ioctl) and keeping it in a global variable, rather than having it as immediate data in an assembly instruction.

> It need not be.

True... but it isn't at the moment.

> You should be able to steal a solution from knfsd. It faces the same
> problem. Of course it takes a small stub that's always in the kernel, but
> so what.

Interesting idea... I'll take a look. Maybe it'll be possible to share something.

David Howells
Re: [RFC] Wine speedup through kernel module
"Albert D. Cahalan" <[EMAIL PROTECTED]> wrote:
> The system call is needed of course, since that lets Linux executables
> (perhaps ones being ported from Win32) use the new features. It also
> means that non-i386 and non-Wine code can use these services if they want
> to. You might as well also handle int 0x2e with the same code.

I suppose... provide three modules maybe: (1) the object and handle support mechanism; (2) the Win32 API (what of it I decide to implement in a kernel module) and syscall; and (3) the Native API and interrupt.

> See www.sysinternals.com for some native API documentation. In addition
> to the web site, there are articles and a book.

Been there... he says that you can't get all the API documentation.

> The native API is generally better for Linux, because it is smaller and
> less bloated than Win32.

Agreed.

> Putting Win32 in the kernel is almost like putting libc in the kernel.
> You can make exceptions for calls that need the greatest performance
> and/or don't translate well.

Also agreed, particularly when GDI calls are considered. Though this also applies to the Native API.

David Howells
Re: [RFC] Wine speedup through kernel module
David Woodhouse <[EMAIL PROTECTED]> wrote:
> [EMAIL PROTECTED] said:
> > The best solution is to have a stub which is always resident that does
> > the inc/dec of the module count. This stub can reserve the syscall
> > number as well.
>
> _Please_ don't have a stub in the kernel which is conditionally compiled
> according to the value of a _MODULE option.

I have to admit, the thought hadn't occurred to me to do that. It almost sounds like a good idea... after all, the stub can't be used if no modules can be loaded. Are you thinking, then, that the stub can't be used by compiled-in (ie non-module) code? Or am I misunderstanding what you mean?

David Howells
Re: [RFC] Wine speedup through kernel module
David Woodhouse <[EMAIL PROTECTED]> wrote:
> Unless it's _absolutely_ necessary, the kernel image (i.e. vmlinux)
> should not contain any code which is dependent on CONFIG_*_MODULE
> options. Therefore, stuff like...
>
>	#ifdef CONFIG_WIN32_MODULE
>	EXPORT_SYMBOL(my_win32_helper_func);
>	#endif
>
> ...would mean that you have to recompile the kernel and reboot to enable
> the module, rather than just compiling and loading the module.

Ah... I did misunderstand you. I thought you meant CONFIG_MODULES in general, which'd be okay - obviously, if module support is disabled, you can't load a module anyway. No, I wasn't planning to do that. I was thinking more along the lines of adding another handler into struct exec_domain, but that isn't especially generic.

I think now that I'm probably best off providing a generic pluggable syscall handler, one that is very careful to make sure the syscall can't be entered whilst the module is being unloaded.

David Howells
Re: [RFC] Wine speedup through kernel module
Richard Guenther <[EMAIL PROTECTED]> wrote:
> What's the problem with just not allowing the module to unload at all?
> What's the point in unloading a module anyways. Ok, I know - debugging,
> etc. - but for a "release" version this is not important. Also upgrading
> - but for desktop boxes (for which this wine stuff is) rebooting seems
> appropriate here, too.

It's not all that hard to do... and it does mean you can have demand-loadable and -unloadable modules. Also, I am writing/debugging it at the moment, and so I like being able to unload and reload the module.

David Howells
Re: [RFC] Wine speedup through kernel module
David Woodhouse <[EMAIL PROTECTED]> wrote:
> You don't need to add another handler. I already overloaded the lcall7
> handler by passing an extra int into it to tell it the type of call which
> is causing it to be invoked. Values which are already used are 7 for iBCS
> calls (lcall 7) and 0x27 for Solaris/x86 syscalls (lcall 0x27). Pick
> another number, and possibly reassign those two to 1 and 2 respectively
> because it's cleaner like that.

But where do I get the other argument (struct pt_regs *) from? A normal Linux syscall does not appear to have access to such a beast...

David Howells
Re: [RFC] Wine speedup through kernel module
Oliver Neukum <[EMAIL PROTECTED]> wrote:
> > I think now that I'm probably best providing a generic pluggable
> > syscall handler, one that is very careful to make sure the syscall
> > can't be entered whilst the module is being unloaded.
>
> This seems to me the best idea. However I would argue against dynamically
> allocating syscalls. Reserving numbers makes for better code and allows
> you to do autoloading.

Now there's an idea... have the kernel autoload modules based on a particular syscall number being handled in that module. Madness *grin*. You still have to have a static syscall number, though...

David Howells
Re: [RFC] Wine speedup through kernel module
David Woodhouse <[EMAIL PROTECTED]> wrote:
> The code I posted yesterday is a bit of an abuse of the personality
> mechanism, but ought to work nonetheless. Didn't you like it?

I didn't know how to get hold of a "struct pt_regs *" till someone sent me a message this morning (it was obvious really). Plus, it occurred to me that I already have this fd I can use for reference counting.

> [EMAIL PROTECTED] said:
> > thus retaining the existence of the struct file, which then retains the
> > module in memory.
>
> er.. why does it keep the module in memory?

Because "getting" the struct file from the fd increments the usage count on the struct file. The VFS automatically maintains the module count pointed to by the file_operations structure, based on the existence of the struct file.

David Howells
Re: [RFC] Wine speedup through kernel module
At the moment, the Win32 syscalls I implement require an fd to be attached to a particular proc file. This fd holds the Win32 handle map.

The VFS provides auto-refcounting on modules that provide file operations; thus the syscall stub only needs to check that the fd provided has the correct set of file operations attached and increment the refcount on the fd itself. This has the added advantage that not only will the module not go away during a syscall, but the handle map will not depart either.

There are other ways to make sure the module doesn't go away, for instance using personalities, but I still need to maintain the refcount on the handle map fd.

David Howells
[PATCH] To export the access_process_vm function
Could you please apply this little patch to export access_process_vm()?

Cheers,
David Howells
=
diff -uNr linux-2.4.0-test8-orig/kernel/ksyms.c linux-2.4.0-test8/kernel/ksyms.c
--- linux-2.4.0-test8-orig/kernel/ksyms.c	Fri Sep 15 00:04:36 2000
+++ linux-2.4.0-test8/kernel/ksyms.c	Mon Sep 11 23:39:38 2000
@@ -123,6 +123,7 @@
 EXPORT_SYMBOL(find_vma);
 EXPORT_SYMBOL(get_unmapped_area);
 EXPORT_SYMBOL(init_mm);
+EXPORT_SYMBOL(access_process_vm);
 #ifdef CONFIG_HIGHMEM
 EXPORT_SYMBOL(kmap_high);
 EXPORT_SYMBOL(kunmap_high);
Re: [RFC] Wine speedup dynamic system calls are not going to happen.
[EMAIL PROTECTED] (Linus Torvalds) wrote:
> David Howells <[EMAIL PROTECTED]> wrote:
> > (3) Even if it was... just filling in the syscall slot from a module
> > means that it is possible for the module to be unloaded whilst the
> > syscall is in use.
>
> Note that all this is not the problem. The problem is that dynamic system
> calls are not going to happen. License issues. I will not allow system
> calls to be added from modules. Because I do not think that adding a
> system call is a valid thing for a module to do. It's that easy.
>
> It's the old thing about "hooks". You must not sidestep the GPL by just
> putting a hook in place. And dynamic system calls are the ultimate hook.

What about nfsd's "nfsservctl" syscall then? This appears to be modulable... Anyway, I didn't mean using a random syscall number provided by the kernel upon loading the module, but having one number preallocated for the purpose, as is done in knfsd.

Do you think, then, that I should be putting my stuff up for inclusion in the kernel proper (say v2.5)? Or do you think I shouldn't be using a syscall at all? Or should I just be offering it as a kernel patch that has to be applied for use?

David Howells
Re: [RFC] Wine speedup through kernel module
At the moment, the Win32 syscalls I implement require an fd to be attached to a particular proc file. This fd holds the Win32 handle map. Huh? Each process needs a handle map... To avoid playing with the task structure, fork, exec, exit, signals, etc., I used an fd attached to a proc file to keep track of the handle map (it gives me reference counting, module tracking, and auto-cleanup on process death). The handle map page is just attached to the private data pointer on the struct file. It was just convenient for starting the whole thing off. There are other ways to make sure the module doesn't go away, for instance using personalities, but I still need to maintain the refcount on the handle map fd. Please don't mess with personalities. I intend to use any useful Win32 or NT-native features in Linux software. Personalities would make these calls only useful for Wine. Agreed. If a Win32 handle can be thought of as a file descriptor, then just do that. You can invent new file types and new filesystems as needed. These are available: 0x3000 003, 0x5000 005, 0x7000 007, 0x9000 011, 0xb000 013. I was trying to avoid using file/inode/dentry structs as much as possible... they aren't really necessary for most of this, and are very heavyweight. If that isn't right, add a CLONE_HANDLES flag for clone() and add an fd-like table hanging off the task struct. Go all the way, rather than leaving these features as second-class citizens.
I have thought of separating the code into separate lumps:

(1) An in-kernel resident lump, providing very basic services:
    * handle/handle-map management
    * object management
    * some object types
    * process death-knell notification
    * file-change notification
    * unicode string handling/conversion (steal/share from NTFS)
    * simple RPC mechanism for building certain services in userspace
    * Win32 syscall stub
    * int 0x2e stub
    * /proc view

(2) A partial basic Win32 implementation (modulable):
    * basic Win32 syscalls, eg: CreateMutexA/W
    * _no_ GDI calls

(3) A partial basic Native API implementation (modulable):
    * basic NT syscalls
    * again, no GDI calls

(4) A userspace registry service, communicating with registry stubs in the kernel by the simple RPC mechanism.

I'd rather not have kernel code that is _only_ good for Win32. I don't see much reason to have a nasty add-on that can't be used by regular Linux programs. Agreed. The question is, how much will Linux go for?

David Howells
Re: [RFC] Wine speedup through kernel module
"Albert D. Cahalan" [EMAIL PROTECTED] wrote: Well, if this isn't worth doing right... Patch size is just something you have to deal with. Hopefully you can get an early-2.5.x merge. I wouldn't like to bet on it. Why limit it? If the existing poll method needs extension, do it. I suppose so. On a separate issue, the more I look at the poll and select implementation, the more I'm convinced there's a leak, but I'm not sure it matters. That may be, but we deal with it all the time. The VFS code has to be well-optimized and SMP-safe. Besides, many of these Win32 handles will refer to files anyway, won't they? Run HandleEx sometime... most of the handles are generally not files, particularly for a program that uses the perfmon stuff, when it uses heaps of registry handles and section handles. * The dentry filename support does not handle separate namespaces. By this you mean...? Can you not just apply a prefix? Perhaps you could use the half-way mounted filesystems, with some way to specify the desired filesystem root. Actually, I suppose I could look at how /proc does things. Make anonymous objects just numbered files in a special directory. Well, that would be generally useful for Linux software. But hard to achieve at the moment... Not only would the VFS have to change, but all the filesystems too. Also useful: an open-read-close operation that doesn't muck around with fd allocation. Maybe hang the data off of a struct stat, so that callers can atomically get everything. Better still, just give the syscall a filename, a buffer and a buffer size. * Win32 access/share flags would have to be retained in the file struct, and the inode struct would have to maintain a list of these. OK. Problem? Not a nice thing to add. Care has been taken to decouple the file from the inode (a dentry sandwich). Also it means more maintenance work on the part of the VFS. Care would also have to be taken to avoid a reference count loop: File points to Inode points to File.
Not insurmountable though. The list would be NULL most of the time. If Linux apps start using this feature a lot, then it can be optimized. No. The list could never be NULL. Excuse me? Does Win32 not allow rename of open files? It does, but the internal object name does not change. You cannot create a new file of the old name whilst it is still open internally. Assuming that is the problem, you have a few options I think. You could let handles block rename, except that you might let root override restrictions caused by non-root users. *shiver* You could just ignore this problem and see what, if anything, breaks. I really don't like doing that. It's always an important customer who falls over it. You could lie I suppose, supplying old names as needed. Then you should stick with the current system. Doesn't that strike you as being a hack that will cause you to need _more_ code in the end? No. You'll end up writing a second VFS layer if I'm not mistaken. No I won't. I wasn't really thinking in terms of having a process handles map as a directory full of fd's, more like a flat file with a list in it. David Howells
Re: [RFC] Wine speedup through kernel module
Alexander Viro [EMAIL PROTECTED] wrote: Then these apps are non-portable to other Unices and either get fixed or get rm'd. Period. Definitely. In case you've missed that, Win32 is misdesigned crap. It's one thing to add binary compatibility _if_ it doesn't affect the native stuff, but merging that abortion into the native API Just Is Not Going To Happen. I agree with you on that. I'd quite like a Win32 API to be available, but I wouldn't like it to have too much impact on the main Linux kernel. I presume that you don't have a problem with it implemented in a module for use in accelerating Wine? The longer I'm looking at that thread the more it looks like an attempt to merge incompatible models. Last time that happened we got Missed'em'V and that's going to be _much_ worse. Mixing v7 with CP/M gave DOS. List goes on and on and there is _nothing_ but shit on it. True, the two are pretty much incompatible... but I can implement a large amount of the basic Win32 API in kernel space without affecting the rest of the kernel. I will, however, supply a couple of patches to add useful extra features that are internal to the kernel anyway, and don't affect the userspace API semantics. These include process death notification (which someone else expressed interest in for some other purpose) and some optional hooks to make it faster and easier to use my implementation. Folks, get real - if you want NT you know where to find it. Yes - written on the reverse of my cup mat *grin*. David Howells
Re: [RFC] Wine speedup through kernel module
Daniel Pittman [EMAIL PROTECTED] wrote: *blink* I confess that I can't see that much of a requirement for the ptrace stuff, but I take your word for it. :) I was under the impression that the wineserver used ptrace() to retrieve argument data to requests such as CreateFile(), but thinking about it, that can't be so, since that'd mean you couldn't strace a wine program, which you can. It must instead pass all the argument data either by socket or by shared memory (I'm not sure which offhand). It does, however, (at least, I think it does), use this for debugging support, but that's a special case. Secondly, the client (wine) just has to issue a syscall that looks exactly like a Win32 registry function; how it is handled is hidden by the kernel. Not really something that I would have thought was needed - specifically, the userspace libraries that the process was dynamically linked to would have been the place to do that, I thought. I suppose this is more of an implementation detail, but it's primarily why I added an RPC mechanism. Thirdly, registry functions should issue system handles, as is done on Windows. If system handles move into the kernel, then registry handles should do too. This also means registry change notifications can be implemented by system handles that WaitFor*() functions can deal with. That, however, is a convincing reason. And that's why I wanted a kernel interface to the registry stuff, for which the RPC mechanism was added. IIRC there are not any waitable GDI objects, precisely because under Win32 the GDI is a userspace library that does most of the magic internally and then, finally, requests that the server does something. Like sockets, the GDI isn't actually a Win32 kernel thing, it's a Win32 userspace wrapper that makes the minimalist kernel support more palatable. However, I could make a Window a waitable GDI object... implementing GetMessage() as Wait, GetAsync.
What I'd need to do is make it possible for the RPC server to create signallable handles of effectively user-defined type that can be issued by and passed through RPC calls. Hmmm... I'll have to think about that, but I'll probably have to do something along these lines to support registry change notifications. Now there's an idea... Have the local X server as an RPC server, handling wine GDI calls directly *grin*. Then I could hide all the GDI stuff behind syscall stubs in the same way. This would allow me to implement a lot of the NT Native API. Well, that's a point of view thing. The native NT API isn't anything beyond kernel32, as such. Certainly, the GDI isn't part of the core API and, in some cases, MicroSoft do release or use the NT core without the GUI system, at the least. But it is there. I think it may use WinStation handles somehow, but I don't know for sure without a bit of disassembly work. Anyhow, I am hardly an expert on the area, but... I would have avoided putting stubs into the kernel anywhere that it was possible and used the userspace dynamic Win32 libraries to do most of the work. It depends on what you want... What I'd like is basically a range of Win32 system handles that map to (or apparently map to) kernel services. I'd also like these to be available for ordinary Linux programs to play with if they want to. However, I think I might be able to see what you mean... I could implement an RPC server, say, that has one exported operation: do registry op. Then have all the registry operations mapped down by a userspace library to the RPC invocation call to a single operation on that particular server. OTOH, I would also have implemented the WaitFor* set as a wrapper over select, with the kernel module providing suitable file descriptors for adding to the set that you waited for. 
This is one of the things I was trying to avoid:
(1) select and poll do not implement the entire range of things you can do with WaitForMultipleObjects;
(2) the struct file+struct dentry+struct inode complex is incredibly heavy for most of what I want to do (though does have advantages);
(3) Linux file structures do not hold enough information to support CreateFile (access sharing interactions).

We have fairly different views on how this stuff would hang together and, in all likelihood, mine are based on incorrect assumptions. :) True... Isn't it nice when you can provide what APIs you want rather than having to make do with whatever is dictated to you. Yes... I'm implementing some of the Windows system APIs because I'd like to have that functionality available under Linux running as fast as possible, not because it's the only one worth using :-)

David Howells
Re: [RFC] Wine speedup through kernel module
Reiserfs would make a fine registry. Interesting idea... Have a Linux-style registry hive file mounted as a reiserfs filesystem. However, I'd prefer to be able to use Windows-style registry hive files if possible. (3) Linux file structures do not hold enough information to support CreateFile (access sharing interactions). Didn't IRIX add the access sharing support for Samba to use? This looks like a generally useful feature. I presume you mean SGI added it to IRIX? As for generally useful... possibly. The sys_open() and sys_unlink() functions could be modified to make use of it if it was stored on the inode. You could then use sys_CreateFileA() to actually use it fully. However, it won't work over NFS... David Howells
Re: [RFC] Wine speedup through kernel module
Waldek Hebisch [EMAIL PROTECTED] wrote: (2) the struct file+struct dentry+struct inode complex is incredibly heavy for most of what I want to do (though does have advantages); Well, all (ok most) of this complexity is to support Unix semantics. Adding structures with different semantics will (IMHO) mean a huge increase in complexity (if interactions are handled correctly) and invites bugs. I think we have a misunderstanding here... I meant that using the VFS structures for Win32 objects like mutexes, semaphores and events is massive overkill, and uses a great deal of unnecessary memory. What I did was to have a separate structure for representing Win32 objects that is kept totally separate from the rest of the Linux kernel VFS stuff. It is also a lot smaller than the three-layer VFS system, and only has the bits that are necessary for implementing Win32 object handling and waiting. Note though, in the case of a file handle, this Win32 object contains a pointer to a struct file, but this is the only way to gain access to the VFS through this system, and it still isolates the VFS effectively. (3) Linux file structures do not hold enough information to support CreateFile (access sharing interactions). Any information not in kernel structure is Wine specific anyway, so should be separate. Definitely, and I've kept it so. What I meant on the CreateFile() front is that this function takes an access bitmask and a share bitmask which interact with other CreateFile()'s and some other functions. For instance, if two separate CreateFile()'s are issued on a file then the second is rejected if its share mask excludes the first's access mask or if the first's share mask excludes the second's access mask. It gets a bit more complex as more CreateFile()'s succeed on a particular file, since the access flags are additive, and the share masks are subtractive. The point, though, is that this information must be retained to reproduce the semantics of the relevant functions.
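[Editorial illustration.] The access/share interaction described above can be sketched as a small check function. This is a hedged sketch, not Wine's or Windows' actual code: the flag values and the `open_state`/`try_open` names are invented for illustration, and for simplicity the access and share bit spaces use the same values (real Win32 uses distinct GENERIC_* and FILE_SHARE_* constants that must be mapped onto each other).

```c
/* Hypothetical flag values; the real Win32 constants differ. */
#define ACC_READ    0x1
#define ACC_WRITE   0x2
#define SHARE_READ  0x1
#define SHARE_WRITE 0x2

/* Per-file open state: access bits are additive (union of all opens),
 * share bits are subtractive (intersection of all opens). */
struct open_state {
	unsigned int access;  /* union of access bits granted so far */
	unsigned int share;   /* intersection of share bits granted so far */
	int nopens;
};

/* Try to admit a new open; returns 1 and updates state on success, 0 on
 * a sharing violation. */
static int try_open(struct open_state *st, unsigned int access,
		    unsigned int share)
{
	if (st->nopens) {
		/* The new opener's access must be covered by what the
		 * existing openers share... */
		if (access & ~st->share)
			return 0;
		/* ...and the new opener must share everything the existing
		 * openers are already doing. */
		if (st->access & ~share)
			return 0;
		st->access |= access;
		st->share &= share;
	} else {
		st->access = access;
		st->share = share;
	}
	st->nopens++;
	return 1;
}
```

With this, two read/share-read opens coexist, but a subsequent write open is rejected because the earlier openers did not grant FILE_SHARE_WRITE-style permission.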
Wine keeps it in the wineserver process, and uses RPC over sockets to get to it. Everybody sometimes gets an idea which looks very promising (and yes, I too like my ideas more than the others). However, Unix has quite a lot of mechanisms for interprocess communication: signals, shared memory, SYSV IPC semaphores and messages, sockets, pipes. A lot of kernel structures and code is just to make this work efficiently and reliably. Putting into the kernel a subsystem which to a large degree has the same functionality (but with an incompatible interface) seems like a recipe for disaster (if small, the project dies; if big, it makes its way into the kernel). Why should it be a disaster? It can just be kept as a module that is loaded to accelerate Wine. On constructive grounds: I think that nobody would object to putting a few primitives into the kernel if they are critical for Wine performance. With what API? However, I would rather believe that they already exist. As far as I understand, the main problem is Wine I/O performance. The other problem is to have efficient mutexes. I expect SYSV stuff to be reasonably efficient, and very fast mutexes may be implemented (almost) in user space using ix86 assembly. With shared memory and fast mutexes it should be possible to move I/O to the client side even in multithreading programs. That entails all sorts of other problems... On Win32, a lot of these objects are named... That means you have to have some sort of atomic-access name table in shared memory. And you also have to implement WaitFor*() without doing a tight polling loop (whilst you could use signals, that is slow). Also you have to be able to do release-on-death management (how to catch SIGKILL?). The thing about the kernel is that it has pre-existing services to solve these problems: locks, wait-queues and atomic counters.
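[Editorial illustration.] The "fast mutexes (almost) in user space" idea mentioned above can be sketched as follows. This is an assumption-laden sketch, not anyone's actual implementation: it uses GCC's `__atomic` builtins rather than the hand-written ix86 assembly the post refers to, and the contended path merely yields instead of blocking in the kernel (which is the hard part the discussion is about, and what futexes later solved).

```c
#include <sched.h>

/* A minimal user-space mutex: lock word 0 = free, 1 = held.
 * The uncontended lock/unlock path is a single atomic operation with
 * no system call -- this is the "fast" part. Contended waiters here
 * just spin-and-yield; a real design would sleep in the kernel. */
typedef struct { volatile int locked; } umutex_t;

static void umutex_lock(umutex_t *m)
{
	/* Atomically swap in 1; if the old value was 1 someone else
	 * holds the lock, so back off and retry. */
	while (__atomic_exchange_n(&m->locked, 1, __ATOMIC_ACQUIRE))
		sched_yield();
}

static void umutex_unlock(umutex_t *m)
{
	__atomic_store_n(&m->locked, 0, __ATOMIC_RELEASE);
}
```

Placed in a shared memory segment, such a lock word can synchronise multiple processes; but as the post notes, naming, WaitFor*() semantics and cleanup-on-SIGKILL still need kernel help.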
You can solve _most_ of these problems in userspace without a separate server process, but at the cost of a great deal of extra complexity and at the cost of not being able to provide the full behaviour. I think that Linux programmers rather than new syscalls would like Wine to provide the ability to handle metafiles (it happened), OLE streams or W32 widgets. So, from such a point of view, as much as possible, interfaces should be Unix-like, and W32 interfaces used only when needed. Or maybe, they'd like both to be available. The two aren't mutually exclusive. I understand that Wine has good reasons to use W32 interfaces internally, but pushing them into the Linux kernel seems bad to me. Don't use it then... it'll not be mandatory. Wine has also to support OS's that can't or won't add Win32 support in the kernel. David Howells
Re: [RFC] Wine speedup through kernel module
Bernd Eckenfels [EMAIL PROTECTED] wrote: * file-change notification this is interesting for other stuff too, i think irix has an interface for that, i think it's an ioctl? Someone did a file-change notification patch for Linux. I'm not sure exactly what became of it, but it'd be nice to be able to build on top of it. * unicode string handling/conversion (steal/share from NTFS) are you sure this needs to be done in the kernel? if you want to steal the sourcecode of the ntfs driver do it, but why would you want to share object code for that? Implementation of Win32 wide-character functions and NT Native API calls would require such things. And if I can share them with other bits of the kernel (there's more than just NTFS - smbfs, msdosfs, iso9660), then I can possibly get away with implementing less myself, and bloating the kernel less. * simple RPC mechanism for building certain services in userspace what is that? message passing? in kernel space? Not quite, though it could be used for that... It allows a client to give a server access to parts of its address space, so that the server can provide certain services. The main use I'm thinking of is having the kernelspace stubs of the registry functions transparently invoke RPC requests in a userspace registry server. (2) A partial basic Win32 implementation (modulable): * basic Win32 syscalls, eg: CreateMutexA/W * _no_ GDI calls actually kCGI is there? kCGI? What's that? I suppose GDI calls _could_ be done as RPC requests, similar to how I'm intending to do registry calls, but that would incur major overheads as MS found between NT3.51 and NT4, hence why they moved the GDI implementation from CSRSS.EXE into the kernel. However, I was thinking more along the lines of GDI being implemented in userspace by the client process, eg: wine. David Howells
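[Editorial illustration.] The shared unicode conversion helper discussed above is the kind of routine several filesystems (NTFS, smbfs, vfat's long names, iso9660's Joliet extension) each carried privately at the time. A hedged userspace sketch of the core operation, restricted to the Basic Multilingual Plane (no surrogate pairs) for brevity; the function name and interface are invented here:

```c
#include <stddef.h>
#include <stdint.h>

/* Convert UTF-16 code units (BMP only, no surrogate-pair handling) to
 * UTF-8, stopping when the output buffer would overflow.
 * Returns the number of bytes written. */
static size_t utf16_to_utf8(const uint16_t *in, size_t inlen,
			    char *out, size_t outmax)
{
	size_t o = 0;

	for (size_t i = 0; i < inlen; i++) {
		uint16_t c = in[i];

		if (c < 0x80) {			/* 1-byte sequence */
			if (o + 1 > outmax) break;
			out[o++] = (char)c;
		} else if (c < 0x800) {		/* 2-byte sequence */
			if (o + 2 > outmax) break;
			out[o++] = (char)(0xC0 | (c >> 6));
			out[o++] = (char)(0x80 | (c & 0x3F));
		} else {			/* 3-byte sequence */
			if (o + 3 > outmax) break;
			out[o++] = (char)(0xE0 | (c >> 12));
			out[o++] = (char)(0x80 | ((c >> 6) & 0x3F));
			out[o++] = (char)(0x80 | (c & 0x3F));
		}
	}
	return o;
}
```

A full shared helper would also need the reverse direction, surrogate pairs, and the case-folding tables that case-insensitive filesystems require.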
latest cut of wine server kernel module
What's new:

* It can now make use of a dedicated Linux syscall rather than an ioctl() to provide some Win32 functions.
* There is a patch for strace-4.2 included to view usage of this syscall.
* It has an RPC mechanism allowing a server process to gain controlled access to a client's VM.
* Some bug fixes.

It can be downloaded from: ftp://ftp.infradead.org/pub/people/dwh/wineservmod-2915.tar.bz2 [Thanks to David Woodhouse for this]

David Howells
Re: [RFC] Wine speedup through kernel module
"Albert D. Cahalan" [EMAIL PROTECTED] wrote: In spite of that, it should be considered. It allows this:

$ ls -log /proc/self/fd
total 0
lrwx------ 1 acahalan 64 Sep 21 09:12 0 -> /dev/pts/4
lrwx------ 1 acahalan 64 Sep 21 09:12 1 -> /dev/pts/4
lrwx------ 1 acahalan 64 Sep 21 09:12 2 -> /dev/pts/4
lrwx------ 1 acahalan 64 Sep 21 09:12 3 -> mutex:[720429]
lrwx------ 1 acahalan 64 Sep 21 09:12 4 -> event:[592]
lr-x------ 1 acahalan 64 Sep 21 09:12 5 -> /proc/14527/fd

Looks nice, I know, but it may mean file handles are indirect, in that you have: struct file -> struct dentry -> struct inode -> struct winefile -> struct file -> ... if you can't store all the extra wine attributes on the struct file. It also means that the inode or the dentry has to maintain a list of attached files, which I don't think it does at the moment. Any information not in kernel structure is Wine specific anyway, so should be separate. It goes into the kernel structure, so that it won't be Wine-specific. SGI has already done this so that Samba would interact properly with regular UNIX software and the NFS server. (one might support SGI's API for this) Hmmm... worth investigating. I take it that this was only done for IRIX, not Linux. Yep, share bits definitely belong in the kernel. How else could you properly protect against regular (clueless) Linux software? Probably do... We already have 3 ways to do file locking, so this is only 33% more. 66%... Don't forget LockFile/UnlockFile. These work very similarly to fcntl, I think, but demand mandatory locking. Now THAT would be a disaster. Linux software ought to be able to rely on having these features available. I agree, but does Linus? Modules are very bad for this. Probably, but unless the code goes into the kernel proper, patch maintenance can be a real nightmare. The greater the amount of kernel actually changed by a patch, the worse it is. Named? That means you use the VFS. Not necessarily.
For file handles, you probably should (unless you want drive-letter mappings to occur in the kernel - yuk), but the other things are effectively in separate namespaces. Don't use it then... it'll not be mandatory. Wine has also to support OS's that can't or won't add Win32 support in the kernel. You misunderstand me... _Using_ this API will not be mandatory, whether or not it exists in the kernel proper. See a pattern here? System calls are not supposed to be modular. knfsd? But I agree, really. But what is the likelihood? We shouldn't have calls that appear and disappear on a whim. True... that smacks of MS policy :-) Binary compatibility requires that system calls be available on all Linux systems. Also true. You should be able to compile the module I currently have on all Linux systems. Unfortunately, I can't test this at the moment. David Howells
Re: GCC proposal for @ asm constraint
I've been writing a kernel module and I've noticed a measurable performance drop between the same code compiled against linux-2.4.0-test7 and against test8. I disassembled the code to try and work out what was going on and I saw the following happen:

* [test8] One of the atomic memory access primitives supplied by the kernel was putting immediate data into a register outside of the inline-asm instruction group and then using the register inside.
* [test7] The immediate data was passed directly to the inline-asm instruction.

In test8, of course, this means that the compiler has to scratch up a spare register, which is totally unnecessary, as the immediate data could be attached directly to the instruction opcode as was done in test7. This had the effect of making the compiler have to push the old contents of the register into a slot on the stack (I think it held a local variable at the time), which had the further effects of using more stack memory, introducing more register rearrangement (the code ended up longer), and burning up more CPU cycles. I can't remember exactly what it was now, but I think it was either something to do with spinlocks or bitops. I'll re-investigate tonight and see if I can come back with some benchmarks/code-snippets tomorrow. David Howells
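[Editorial illustration.] The register-versus-immediate behaviour described above comes down to the inline-asm operand constraint: with a constraint like "ri" GCC may encode a constant operand directly into the instruction, whereas an "r"-only constraint forces it into a scratch register first. A hedged, x86-only demonstration (the function and its operands are invented for illustration; other architectures fall back to plain C):

```c
/* With the "ri" constraint, GCC is free to emit "addl $4, %reg" --
 * the constant rides in the opcode, no scratch register needed.
 * Changing "ri" to just "r" would force a register load first,
 * which is the test8 regression described above. */
static int add_const(int x)
{
#if defined(__i386__) || defined(__x86_64__)
	__asm__("addl %1, %0" : "+r"(x) : "ri"(4));
	return x;
#else
	return x + 4;	/* non-x86 fallback */
#endif
}
```

Compiling with -S and comparing the two constraint spellings shows the extra register shuffle directly.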
Re: [RFC] Wine speedup through kernel module
Daniel Pittman [EMAIL PROTECTED] wrote: Hrm. It would seem more sensible to me that the registry, like the GDI, live outside the kernel. This would have some cost in terms of performance, I admit, but I don't think it's significant enough to care. This is, I confess, based on my personal experience with Win32 development and may not represent the real state of affairs for all developers. Having registry access be performance critical or for large data is also against the MicroSoft best practice for software, which makes the costs lower - no multiple calls to ship large quantities of data. :) Anyway, why did you see the need for the registry stubs in this case, if I may ask? Firstly, in wine the registry is handled by the wineserver process. This means it can be shared between all the wine processes running for a particular user. So when wine wants to access the registry, it has to do an RPC call to the wineserver (sending a request across a UNIX socket and then, I think, a bit of ptrace()'ing to get at the data). This provides a hopefully more suitable RPC mechanism. Secondly, the client (wine) just has to issue a syscall that looks exactly like a Win32 registry function; how it is handled is hidden by the kernel. Thirdly, registry functions should issue system handles, as is done on Windows. If system handles move into the kernel, then registry handles should do too. This also means registry change notifications can be implemented by system handles that WaitFor*() functions can deal with. Well... won't the GDI calls be RPC anyway - specifically, across the X11 pipe? Yes, but if I implement them to go to a GDI server, which then talks X11 out the back, that's two lots of RPC calls. However, what I said about system handles and the registry, may also apply to GDI objects. In any case, I would have expected (personally, from assumptions) that the GDI would live inside the process space and pay the same cost as existing X11 applications to do its display.
Yes, that is my current plan. I can't see any sensible reason for implementing anything extra in the kernel to support it. Now, adding an X11 protocol extension to make some of the Win32 stuff work better... Now there's an idea... Have the local X server as an RPC server, handling wine GDI calls directly *grin*. Then I could hide all the GDI stuff behind syscall stubs in the same way. This would allow me to implement a lot of the NT Native API. David Howells
Re: [RFC] Wine speedup through kernel module
[EMAIL PROTECTED] (Albert D. Cahalan) wrote: In spite of that, it should be considered. It allows this:

$ ls -log /proc/self/fd
total 0
lrwx------ 1 acahalan 64 Sep 21 09:12 0 -> /dev/pts/4
lrwx------ 1 acahalan 64 Sep 21 09:12 1 -> /dev/pts/4
lrwx------ 1 acahalan 64 Sep 21 09:12 2 -> /dev/pts/4
lrwx------ 1 acahalan 64 Sep 21 09:12 3 -> mutex:[720429]
lrwx------ 1 acahalan 64 Sep 21 09:12 4 -> event:[592]
lr-x------ 1 acahalan 64 Sep 21 09:12 5 -> /proc/14527/fd

There's another minor problem with using fd's as handles... Windows expects the handle number to be a multiple of 4, and sometimes, probably only on rare occasions, uses the bottom two bits to pass extra information. I was flicking through the Native API reference last night, and came across a couple of instances. I don't know whether this ever applies to Win32, though. David Howells
Re: [RFC] Wine speedup through kernel module
"Albert D. Cahalan" [EMAIL PROTECTED] wrote:

#define HANDLE_TO_FD(x) ((x) >> 2)
#define FD_TO_HANDLE(x) ((x) << 2)

(not quite as simple as that since fd 0 is valid and handle 0 is not, but that's a very minor issue.)

I'm still not keen on the idea, though... One of the things I'm trying to avoid is having to maintain a large patch to the kernel. I've done it before with the configuration manager I wrote. At the time, there was a new version of the kernel out at least once a week, and I spent nearly all my time porting the patch to the latest kernels. I can see your point... though it could get very heavy on fd's.

There are some advantages to the fd approach:
* You could use WaitFor*() on ordinary fds. It could use the internal poll method for non-Win32 fds.
* The /proc/*/fd/ driver could be modified to support Win32 handles, probably by extending the file_operations or inode_operations struct.
* The dentry stuff has filename support.
* Object retention is managed by the VFS.
* No need for kernel based fd-handle translation.
* You could make the Win32 objects appear in the normal UNIX VFS view of the world by having a new filesystem mounted somewhere appropriate.

Also some disadvantages:
* The VFS is quite heavy on kernel memory: three layers of structure - file, dentry and inode. The inode structure being a bit of a monster.
* The dentry filename support does not handle anonymous files.
* The dentry filename support does not handle separate namespaces.
* Can't pass initialisation data to the open routine.
* Win32 access/share flags would have to be retained in the file struct, and the inode struct would have to maintain a list of these.
* A file handle's view of the filename shouldn't get changed even if the file is renamed whilst open.

And some problems either way:
* The dentry filename support does not handle wide character names or case-independent matching.
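[Editorial note.] The shift operators in the macros above appear to have been lost in transit (the original shows `((x)2)`); `<<` and `>>` are the natural reading. A hedged sketch of that mapping, additionally folding in the "fd 0 is valid but handle 0 is not" wrinkle by offsetting by one slot (the offset is an illustration, not part of the original proposal):

```c
/* fd <-> Win32 handle mapping, as sketched in the post, with an
 * extra +1 slot so that fd 0 does not map to the invalid handle 0.
 * Handles stay multiples of 4, leaving the bottom two bits clear
 * for the flag use Windows occasionally makes of them. */
#define FD_TO_HANDLE(fd)     (((fd) + 1) << 2)
#define HANDLE_TO_FD(handle) (((handle) >> 2) - 1)
```

So fd 0 becomes handle 4, fd 1 becomes handle 8, and so on; the round trip is lossless and every produced handle has its two low bits zero.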
One other thing: the idea you put forth about looking at /proc/12345/fd/ could be kept, but as /proc/12345/handles instead with a non-fd approach. David Howells
Re: Wine + kernel ?? How to do that?
[EMAIL PROTECTED] wrote: hey, I hear that wine (the Windows emulator) can be ported into the kernel to make it run faster. How can I do it? Or can anyone make a patch to add the wine code into the kernel? waiting for answer, Thanks

I've been writing one to provide all the Windows kernel objects in Linux kernel space (the speed up appears as though it should be impressive). It is, however, not entirely complete yet. You can grab a copy by CVS from the wine repository:

export CVSROOT=:pserver:[EMAIL PROTECTED]:/home/wine
cvs login (the password is cvs)
cvs -z3 checkout kernel-win32

Or you can browse it: http://cvs.winehq.com/cvsweb/kernel-win32/

The numbers are looking good: the system call latency appears to be about half that of Win2000 on the same box! (however, use this number with caution).

David
Re: Kernel is unstable
Linus Torvalds [EMAIL PROTECTED] wrote:

> Also note that the merging tests were not free, so at least under my set of normal load the non-merging code is actually _faster_ than the clever optimized merging. That was what clinched it for me: I absolutely hate to see complexity that doesn't really buy you anything noticeable.

Surely doing the merge will always take longer than not doing the merge, no matter how finely optimised the algorithm... But merging wouldn't be done very often... only on memory allocation calls. Or do you mean that use of the resultant VMA chain/tree is slower? This I find surprising, since after merging the VMA tree should be at worst as complex as it was before merging, and fairly likely simpler.

Perhaps it'd be reasonable to only do VMA merging on mmap calls and not brk calls, and let brk manually extend an existing VMA if possible.

David
Re: patch: test12-pre7 cd stuff
> Dec  8 06:05:29 agate kernel: hdc: packet command error: status=0x51 { DriveReady SeekComplete Error }
> Dec  8 06:05:29 agate kernel: hdc: packet command error: error=0x50
> Dec  8 06:05:29 agate kernel: ATAPI device hdc:
> Dec  8 06:05:29 agate kernel:   Error: Illegal request -- (Sense key=0x05)
> Dec  8 06:05:29 agate kernel:   Invalid field in command packet -- (asc=0x24, ascq=0x00)
> Dec  8 06:05:29 agate kernel:   The failed "Play Audio MSF" packet command was:
> Dec  8 06:05:29 agate kernel:   "47 00 00 00 02 00 3f 24 ff 00 00 00 "

I think I've seen that as well, though I'm using a test10-ish kernel. I assumed it was something to do with the CD being duff, though.

David
Re: [PATCH,preliminary] cleanup shm handling
> here is my first shot for cleaning up the shm handling. It did survive some basic testing but is not ready for inclusion.

Can you help me with an SHM-related problem? I'm currently writing a Win32 emulation kernel module to help speed Wine up, and I'm writing the file mapping support stuff at the moment (CreateFileMapping and MapViewOfFile). I have PE image mapping just about working (fixups, misaligned file sections and all), but I'm trying to think of a good way of doing anonymous shared mappings without having to hack the main kernel around too much (so far I've only had to add to kernel/ksyms.c).

Is there a reasonable way I could hook into the SHM system to "reserve" a chunk of shared memory of a particular size, and then a second hook by which I can "map" _part_ of that into a process's address space?

Cheers,
David

PS: to anyone else reading this, if you wrote do_generic_file_read(), it rocks!
Re: [PATCH,preliminary] cleanup shm handling
> > I'm currently writing a Win32 emulation kernel module to help speed Wine up,
>
>	fd = shm_open("xxx", ...);
>	ptr = mmap(NULL, size, ..., fd, offset);

I am doing this from within kernel space. I'd like to avoid doing the full open and mmap if possible. I was wondering if there are some shortcuts I could make use of.

David
Re: [PATCH,preliminary] cleanup shm handling
> If you want to change the vma ops table you can replace the f_ops table with your own one. SYSV ipc uses this also to be able to catch unmaps.

I'd thought of that, but that means I'd need to concoct an f_ops table of my own for every f_ops table I might have to override. All I want to do is provide a single VMA ops table (or maybe two), possibly with only a few ops in it.

Also, I can't actually go through do_mmap()... PE image sections in files do not have to be page-aligned. If they are, I can call do_mmap() a number of times (once per section), but mostly they're not (they have to be at least 512b aligned, though - DOS floppy alignment, I suspect). Plus, if I change the f_ops table, then I affect normal Linuxy processes doing mmap().

> I'm not sure how shared sections in PE Images are handled on all ... Oh, that's too much Windows for me ;-)

*grin*

David
Re: [andrea@suse.de: Re: generic rwsem [Re: Alpha process table hang]]
Linus Torvalds [EMAIL PROTECTED] wrote:

> I think Andrea is right. Although this file seems to be entirely old-fashioned and should never be used, right?

I presume you're talking about "include/asm-i386/rwsem-spin.h"... If so, Andrea is right; there is a bug in it (repeated a number of times), though why the tests succeeded, I'm not sure. The file should only be used for the 80386 and maybe early 80486s, where CMPXCHG doesn't work properly; everything above that can use the XADD implementation.

> Also, I _really_ don't see why the code is inlined at all (in the real linux/rwsem-spinlock.h). It shouldn't be. It should be a real function call, and all be done inside lib/rwsem.c inside a #ifdef CONFIG_RWSEM_GENERIC_SPINLOCK or whatever.

Andrea seems to have changed his mind on the non-inlining in the generic case. But if you want it totally non-inline, then that can be done. However, whilst developing it, I did notice that that slowed things down, hence I wanted it kept inline. I have some ideas on how to improve efficiency in that one anyway, based on a comment from Alan Cox.

> Please either set me straight, or send me a patch to remove asm-i386/rwsem-spin.h and fix up linux/rwsem-spinlock.h. Ok?

I think there are two separate issues here:

 (1) asm-i386/rwsem-spin.h is wrong, and can probably be replaced with the generic spinlock implementation without inconveniencing people much (though someone has commented that they'd want this to be inline, as cycles are precious on the slow 80386).

 (2) "fix up linux/rwsem-spinlock.h": do you want the whole generic spinlock implementation made non-inline then?

David
Re: rwsem benchmarks [Re: generic rwsem [Re: Alpha process table hang]]
> About the benchmark you wrote it looks good measure to me, thanks.

As with all benchmarks, take with one pinch of salt and two of Mindcraft :-)

David
i386 assembly timing issue
I've attached two slightly different bits of i386 assembly that achieve the same end, but in slightly different ways. Can someone tell me why Case 1 is faster than Case 2? Case 1 involves an extra CALL instruction.

 * Case 1 has a little wrapper function that saves ECX and EDX before calling rwsem_wake().
 * Case 2 merges the contents of the wrapper with the caller.

Case 1 is what's generated by the rw-semaphore inline assembly code as of 2.4.4-pre5. Case 2 looks like it ought to be a faster version of the same thing.

David

###
#
# CASE 1: registers saved in the rwsem_wake register saving stub
#
	.text
	.align	16
#
# void test_up_read(struct rw_semaphore *sem)
# {
#	up_read(sem);
# }
#
	.globl	test_up_read
	.type	test_up_read,@function
test_up_read:
	movl	4(%esp), %eax
	movl	$-1, %edx
	xadd	%edx,(%eax)
	js	test_up_read_contention
test_up_read_done:
	ret
#
# Register saving stub for rwsem_wake
#
	.globl	__rwsem_wake
__rwsem_wake:
	pushl	%edx
	pushl	%ecx
	call	rwsem_wake
	popl	%ecx
	popl	%edx
	ret
#
# Contention handler stub for up_read
#
	.section .text.lock,"ax"
test_up_read_contention:
	decl	%edx
	testl	$65535,%edx
	jnz	test_up_read_done
	call	__rwsem_wake
	jmp	test_up_read_done

###
#
# CASE 2: registers saved in the contention handler stub
#
	.text
	.align	16
#
# void test_up_read(struct rw_semaphore *sem)
# {
#	up_read(sem);
# }
#
	.globl	test_up_read
	.type	test_up_read,@function
test_up_read:
	movl	4(%esp), %eax
	movl	$-1, %edx
	xadd	%edx,(%eax)
	js	test_up_read_contention
test_up_read_done:
	ret
#
# Contention handler stub for up_read
#
	.section .text.lock,"ax"
test_up_read_contention:
	decl	%edx
	testl	$65535,%edx
	jnz	test_up_read_done
	pushl	%edx
	pushl	%ecx
	call	__rwsem_wake
	popl	%ecx
	popl	%edx
	jmp	test_up_read_done
Re: [PATCH] generic rw_semaphores, compile warnings patch
David S. Miller [EMAIL PROTECTED] wrote:

> D.W.Howells writes:
> > This patch (made against linux-2.4.4-pre4) gets rid of some warnings obtained when using the generic rwsem implementation.
>
> Have a look at pre5, this is already fixed.

Not entirely so... There's also a missing "struct rw_semaphore;" declaration in linux/rwsem.h. It needs to go in the gap below "#include <linux/wait.h>". Otherwise the declarations for the contention handling functions will give warnings about the struct being declared in the parameter list.

David
Re: light weight user level semaphores
David Woodhouse [EMAIL PROTECTED] wrote:

> [EMAIL PROTECTED] said:
> > [BTW, another solution is to truly support opaque handles to kernel objects; I believe David Howells is already working on something like this for Wine?

Yes. However, it uses a different system call set to use them. They translate to small object structures internally.

> > The poll interface can be trivially extended to support waiting on those...]

No, they aren't files. I did not want to use files because this would incur a fairly major penalty for each object:

	struct file + struct dentry + struct inode

which would mean that Win32 File objects would require two of each: one set to hold the extra Win32 attributes and one set for the actual Linux file. The way I've chosen uses somewhat less memory and should be faster.

> ISTR it wasn't quite trivial to do it that way - it would require the addition of an extra argument to the fops->poll() method.

Yes; the PulseEvent operation demands that all processes currently waiting on the event should be woken, but that no processes attaching immediately afterward get triggered. This means that the PulseEvent handler has to be able to notify all the processes currently waiting on the queue, and only those processes. I got it to do this by marking the waiter records each process links into the queue.

Oh... and WaitForMultipleObjects also has a wait-for-all option.

David
Re: light weight user level semaphores
Alon Ziv [EMAIL PROTECTED] wrote:

> Obviously... since they're handles, not FDs... [BTW, are you using Windows' idea of storing the objects in process space, in a page that's inaccessible to the app itself, and passing pointers into this page as the handles?]

No... I grab a page in kernel space and use it as an array. One problem is that if an exit occurs, I have to be able to discard all attached objects after the process's VM has been cleaned up (ie: what if it gets swapped out?). Plus, mmap can clobber existing mappings; MapViewOfFile can't.

> So what if they aren't files?

Small structures private to my Win32 module.

> I'm afraid I'm not following your logic in this; I believe most Win32 attrs can be mapped to more generic abstractions which should be able to exist at 'struct file' level.

Most... It'd mean adding extra fields into struct file (and possibly struct inode) just for the use of this module (which probably wouldn't be accepted).

> (And even if not, a Win32 file handle could just hold two pointers---

No. The extra data has to be accessible from CreateFile (potentially running in other processes), and this'd mean it'd have to go speculatively searching all Win32 handle tables currently in use.

> And breaks _completely_ with the existing scheme :-/

So what? This is for a WINE accelerator/Win32 module only. There's already been an argument over making the whole lot available as general Linux functionality, but most people said that it'd be a bad idea because it'd not be portable.

> Huh? Where did you get this? Looking at my copy of MSDN (July '00), the PulseEvent remarks more-or-less suggest an implementation like
>
>	SetEvent(e)
>	ResetEvent(e)

Consider the following:

	WAITER 1	WAITER 2	WAITER 3	WAKER
	wait-on-event	wait-on-event	wait-on-event
	sleep		sleep		sleep
							PulseEvent
							  set-event
							  wake(WAITER 1)
							  wake(WAITER 2)
							  wake(WAITER 3)
							  reset-event
	wake		wake		wake
	what-happened?	what-happened?	what-happened?
	nothing!	nothing!	nothing!
	sleep		sleep		sleep

All three waiters should wake up with a note that the event triggered, but they don't. Plus a fourth waiter who begins to wait on the event after the set-event is issued probably shouldn't wake up.

> I wonder if it's possible to add _just_ this to poll()...

No... there's no way to pass this to poll (or select). Better to add a WaitForMultipleObjects() syscall and have that call do_select() with a flag.

David
Re: rwsem benchmark [was Re: [PATCH] rw_semaphores, optimisations try #3]
> Ok I finished now my asm optimized rwsemaphores and I improved a little my spinlock based one but without touching the icache usage.

And I can break it. There's a very good reason that I changed __up_write() to use CMPXCHG instead of SUBL. I found a sequence of operations that locked up on this. Unfortunately, I can't remember what it was. I do have it written down at home, I think, so I'll see about sending it to you later.

> The main advantage of my rewrite are:
>
> - my x86 version is visibly faster than yours in the write fast path and it also saves icache because it's smaller

That's because your fast write path is wrong. You can get away without clobbering EDX and ECX, which I can't.

> (read fast path is reproducibly a bit faster too)
>
> | +static inline void __down_read(struct rw_semaphore *sem)
> ...
> | +		: "+m" (sem->count), "+a" (sem)

From what I've been told, you're lucky here... you avoid a pipeline stall between whatever loads sem into EAX and the INCL that increments what it points to, because the "+a" constraint says that EAX will be changed. This means that the compiler saves EAX before entering the inline asm, thus delaying the INCL and thus avoiding the stall. However, you generally seem to gain this at the expense of clobbering another register, say EDX.

So, for a function that just calls down_read twice, I get:

	20:	8b 44 24 04		mov    0x4(%esp,1),%eax
	24:	f0 ff 00		lock incl (%eax)
	27:	0f 88 22 00 00 00	js     4f <dr2+0x2f>
	2d:	f0 ff 00		lock incl (%eax)
	30:	0f 88 30 00 00 00	js     66 <dr2+0x46>
	36:	c3			ret

And you get:

	 0:	83 ec 08		sub    $0x8,%esp
	 3:	8b 54 24 0c		mov    0xc(%esp,1),%edx
	 7:	89 d0			mov    %edx,%eax
	 9:	f0 ff 02		lock incl (%edx)
	 c:	0f 88 fc ff ff ff	js     e <dr+0xe>
	12:	89 d0			mov    %edx,%eax
	14:	f0 ff 02		lock incl (%edx)
	17:	0f 88 0f 00 00 00	js     2c <dr2+0xc>
	1d:	58			pop    %eax
	1e:	5a			pop    %edx
	1f:	c3			ret

Note also that your asm constraints cause the compiler to eat an extra 8 bytes of stack and then to pop it into registers to get rid of it. This is a gcc bug, and will only hurt if the up_read and down_read are done in separate subroutines.

> | +static inline void __up_read(struct rw_semaphore *sem)
> ...
> | +		"jnz	1b\n\t"
> | +		"pushl	%%ecx\n\t"
> | +		"call	rwsem_wake\n\t"

Putting a PUSHL or two there hurt performance when I tried it because, I'm told, it introduces a pipeline stall.

> - the common code automatically extends itself to support 2^32 concurrent sleepers on 64bit archs

You shouldn't do this in the XADD case, since you are explicitly using 32-bit registers and instructions. Actually, both of our generic cases allow 2^31 sleepers on a 32-bit arch, and by changing mine to a long I can make it so we both support up to 2^63 on a 64-bit arch. However, I suspect that is overkill...

> - there is no code duplication at all in supporting xchgadd common code logic for other archs (and I prepared a skeleton to fill for the alpha)

Why not make it shareable? It doesn't have to be mandatory...

> So I'd suggest Linus to apply one of my above -8 patches for pre7. (I hope I won't need any secondary try)

I recommend against this at the moment (there's a bug in __up_write in his X86 optimised code).

David
Re: [PATCH] rw_semaphores, optimisations try #3
Linus Torvalds [EMAIL PROTECTED] wrote:

> Note that the generic list structure already has support for batching. It only does it for multiple adds right now (see the list_splice merging code), but there is nothing to stop people from doing it for multiple deletions too. The code is something like
>
>	static inline void list_remove_between(x,y)
>	{
>		x->next = y;
>		y->prev = x;
>	}
>
> and notice how it's still just two unconditional stores for _any_ number of deleted entries.

Yes, but the struct rwsem_waiter batch would have to be entirely deleted from the list before any of them are woken, otherwise the waking processes may destroy their rwsem_waiter blocks before they are dequeued (this destruction is not guarded by a spinlock). This would then reintroduce a second loop to find out which was the last block we would be waking.

> Anyway, I've already applied your #2, how about a patch relative to that?

Attached.

David
=
diff -uNr linux-rwsem-opt2/include/linux/rwsem-spinlock.h linux/include/linux/rwsem-spinlock.h
--- linux-rwsem-opt2/include/linux/rwsem-spinlock.h	Tue Apr 24 10:51:58 2001
+++ linux/include/linux/rwsem-spinlock.h	Tue Apr 24 08:40:20 2001
@@ -1,6 +1,8 @@
 /* rwsem-spinlock.h: fallback C implementation
  *
  * Copyright (c) 2001 David Howells ([EMAIL PROTECTED]).
+ * - Derived partially from ideas by Andrea Arcangeli [EMAIL PROTECTED]
+ * - Derived also from comments by Linus
  */
 
 #ifndef _LINUX_RWSEM_SPINLOCK_H
@@ -11,6 +13,7 @@
 #endif
 
 #include <linux/spinlock.h>
+#include <linux/list.h>
 
 #ifdef __KERNEL__
 
@@ -19,14 +22,16 @@
 struct rwsem_waiter;
 
 /*
- * the semaphore definition
+ * the rw-semaphore definition
+ * - if activity is 0 then there are no active readers or writers
+ * - if activity is +ve then that is the number of active readers
+ * - if activity is -1 then there is one active writer
+ * - if wait_list is not empty, then there are processes waiting for the semaphore
  */
 struct rw_semaphore {
-	__u32			active;
-	__u32			waiting;
+	__s32			activity;
 	spinlock_t		wait_lock;
-	struct rwsem_waiter	*wait_front;
-	struct rwsem_waiter	**wait_back;
+	struct list_head	wait_list;
 #if RWSEM_DEBUG
 	int			debug;
 #endif
@@ -42,7 +47,7 @@
 #endif
 
 #define __RWSEM_INITIALIZER(name) \
-{ 0, 0, SPIN_LOCK_UNLOCKED, NULL, &(name).wait_front __RWSEM_DEBUG_INIT }
+{ 0, SPIN_LOCK_UNLOCKED, LIST_HEAD_INIT((name).wait_list) __RWSEM_DEBUG_INIT }
 
 #define DECLARE_RWSEM(name) \
 	struct rw_semaphore name = __RWSEM_INITIALIZER(name)
diff -uNr linux-rwsem-opt2/lib/rwsem-spinlock.c linux/lib/rwsem-spinlock.c
--- linux-rwsem-opt2/lib/rwsem-spinlock.c	Tue Apr 24 10:51:58 2001
+++ linux/lib/rwsem-spinlock.c	Tue Apr 24 08:40:20 2001
@@ -2,13 +2,15 @@
  * implementation
  *
  * Copyright (c) 2001 David Howells ([EMAIL PROTECTED]).
+ * - Derived partially from idea by Andrea Arcangeli [EMAIL PROTECTED]
+ * - Derived also from comments by Linus
  */
 
 #include <linux/rwsem.h>
 #include <linux/sched.h>
 #include <linux/module.h>
 
 struct rwsem_waiter {
-	struct rwsem_waiter	*next;
+	struct list_head	list;
 	struct task_struct	*task;
 	unsigned int		flags;
 #define RWSEM_WAITING_FOR_READ	0x0001
@@ -19,7 +21,8 @@
 void rwsemtrace(struct rw_semaphore *sem, const char *str)
 {
 	if (sem->debug)
-		printk("[%d] %s({%d,%d})\n",current->pid,str,sem->active,sem->waiting);
+		printk("[%d] %s({%d,%d})\n",
+		       current->pid,str,sem->activity,list_empty(&sem->wait_list)?0:1);
 }
 #endif
 
@@ -28,11 +31,9 @@
  */
 void init_rwsem(struct rw_semaphore *sem)
 {
-	sem->active = 0;
-	sem->waiting = 0;
+	sem->activity = 0;
 	spin_lock_init(&sem->wait_lock);
-	sem->wait_front = NULL;
-	sem->wait_back = &sem->wait_front;
+	INIT_LIST_HEAD(&sem->wait_list);
 #if RWSEM_DEBUG
 	sem->debug = 0;
 #endif
@@ -48,60 +49,58 @@
  */
 static inline struct rw_semaphore *__rwsem_do_wake(struct rw_semaphore *sem)
 {
-	struct rwsem_waiter *waiter, *next;
-	int woken, loop;
+	struct rwsem_waiter *waiter;
+	int woken;
 
 	rwsemtrace(sem,"Entering __rwsem_do_wake");
 
-	waiter = sem->wait_front;
-
-	if (!waiter)
-		goto list_unexpectedly_empty;
-
-	next = NULL;
+	waiter = list_entry(sem->wait_list.next,struct rwsem_waiter,list);
 
 	/* try to grant a single write lock if there's a writer at the front of the queue
 	 * - we leave the 'waiting count' incremented to signify potential contention
 	 */
 	if (waiter->flags & RWSEM_WAITING_FOR_WRITE) {
-		sem->active++;
-		next = waiter->next;
+		sem->activity = -1
Re: rwsem benchmark [was Re: [PATCH] rw_semaphores, optimisations try #3]
> I'd love to hear this sequence. Certainly regression testing never generated this sequence yet but yes that doesn't mean anything. Note that your slow path is very different than mine.

One of my testcases fell over on it...

> I don't feel the need of any xchg to enforce additional serialization.

I don't use XCHG anywhere... do you mean CMPXCHG?

> yours plays with cmpxchg in a way that I still cannot understand

It tries not to let the active count transition 1->0 happen if it can avoid it (ie: it would rather wake someone up and not decrement the count). It also only calls __rwsem_do_wake() if the caller manages to transition the active count 0->1. This avoids another subtle bug that I think I have a sequence written down for too.

> Infact eax will be changed because it will be clobbered by the slow path, so I have to. Infact you are not using the +a like I do there and you don't save EAX explicitly on the stack I think that's your bug.

Not so... my down-failed slowpath functions return sem in EAX.

> Again it's not a performance issue, the "+a" (sem) is a correctness issue because the slow path will clobber it.

There must be a performance issue too, otherwise our read up/down fastpaths are the same. Which clearly they're not.

> About the reason I'm faster than you in the down_write fast path is that I can do the subl instead of the cmpxchg, you say this is my big fault, I think my algorithm allows me to do that, but maybe I'm wrong.

I used to do that.

> Unfortunatly I have to put the pushl there because I don't want to save %ecx in the fast path (if I declare ecx clobbered it's even worse no?).

It benchmarked faster without it for me. I suspect this will be different on different CPUs anyway. I'm going to have to have a play with Intel's VTUNE program and see what it says.

> I said on 64bit archs. Of course on x86-64 there is xaddq and the rex registers.

But the instructions you've specified aren't 64-bit.

> It isn't mandatory, if you don't want the xchgadd infrastructure then you don't set even CONFIG_RWSEM_XCHGADD, no?

My point is that mine isn't mandatory either... You just don't set the XADD config option.

David
Re: rwsem benchmark [was Re: [PATCH] rw_semaphores, optimisations try #3]
> I see what you meant here and no, I'm not lucky, I thought about that. gcc x 2.95.* seems smart enough to produce (%%eax) that you hardcoded when the sem is not a constant (I'm not clobbering another register, if it does it's stupid and I consider this a compiler mistake).

It is a compiler mistake... the compiler clobbers another register for you. The compiler does not, however, know about timing issues with the contents of the inline assembly... otherwise it'd stick a delay in front of the XADD in my stuff.

> I tried with a variable pointer and gcc as I expected generated the (%%eax) but instead when it's a constant like in the bench my way it avoids to stall the pipeline by using the constant address for the locked incl, exactly as you said and that's probably why I beat you on the down read fast path too. (I also benchmarked with a variable semaphore and it was running a little slower)

*grin* Fun, ain't it... Try it on a dual Athlon or P4 and the answer may come out differently.

David
Re: rwsem benchmark [was Re: [PATCH] rw_semaphores, optimisations try #3]
> so you reproduced a deadlock with my patch applied, or you are saying you discovered that case with one of you testcases?

It was my implementation that triggered it (I haven't tried it with yours), but the bug occurred because the SUBL happened to make the change outside of the spinlocked region in the slowpath at the same time as the wakeup routine was running on the other CPU. I'll have a look at the sequence and make sure that it does actually apply to your implementation. It may not... but it doesn't hurt to check. The thing occurred with one of my simple testcases, but only happened once in a number of runs, fortunately whilst I had the rwsemtrace()'s enabled.

> yes of course, you use rwsem_cmpxchgw that is unnecessary.

Actually, I use this to try and avoid the following loop that you've got in your code:

	+static void __rwsem_wake(struct rw_semaphore *sem)
	...
	+ again:
	+	count = rwsem_xchgadd(-wait->retire, &sem->count);
	+	if (!wake_read && (count & RWSEM_READ_MASK)) {
	+		count = rwsem_xchgadd(wait->retire, &sem->count);
	+		if ((count & RWSEM_READ_MASK) == 1)
	+			goto again;

I now only have that loop in the rwsem_up_write_wake() function. But! In mine, if __up_write()'s CMPXCHG failed, then it has also read the counter, which it then passes as an argument to rwsem_up_write_wake(). This means I can avoid the aforementioned loop in most cases, I suspect, by seeing if the active counter was 1 at the time of the failed CMPXCHG.

This also means that if an ex-writer wakes up a writer-to-be, the only atomic instruction performed on sem->count is the CMPXCHG in __up_write(). For the ex-writer waking up readers case, we have to perform an additional XADD, but this must be done before anyone is woken up, else __up_read() can get called to decrement the count before we've incremented it. I count the number of things I want to wake up, adjust the count (one LOCKED ADD only) and then wake the batch up. You dive into a LOCKED XADD/XADD loop for each process you wake, which in the best case will give you one LOCKED XADD per process.

Looking again at rwsem_up_read_wake()... I can't actually eliminate the CMPXCHG there, because the active count has already been decremented, and so will need to be incremented again prior to a wake-up being performed. However, if the increment was performed and someone else had incremented the count in the meantime, we have to decrement it again... but this can cause a transition back to zero, which we have to check for... and if that occurred... You get the idea, anyway. Oh yes... this idea should be faster on SPARC/SPARC64 and IA64, which don't have useful XADD instructions (FETCHADD can only use small immediate values); only CMPXCHG/CAS are really useful there.

> I guess I'm faster because I avoid the pipeline stall using "+m" (sem->count) that is written as a constant, that was definitely intentional idea.

"+m" doesn't avoid the stall. That's just shorthand for:

	: "=m"(sem->count) : "m"(sem->count)

which is what mine has. "+a" luckily causes the avoidance by saying EAX gets clobbered, causing the compiler to save via an additional register. Note that the compiler saves anyway, even if the register will be discarded after being restored - yuk! I think this one will depend on the surrounding code anyway. I suspect in some chunks of C mine will be faster, and in others yours will be, all depending on when EAX is loaded.

> My point is that when you set XADD you still force duplication of the header stuff into the asm/*

If you want to use my XADD algorithm, then I think that's going to be fair enough for now. You have to provide some extra routines anyway.

David
Re: [PATCH] rw_semaphores, optimisations try #3
Linus Torvalds [EMAIL PROTECTED] wrote:

> - nobody will look up the list because we do have the spinlock at this point, so a destroyed list doesn't actually _matter_ to anybody

I suppose that it'll be okay, provided I take care not to access a block for a task I've just woken up.

> - list_remove_between() doesn't care about the integrity of the entries it destroys. It only uses, and only changes, the entries that are still on the list.

True. Okay, I can change it to use that.

David
Re: [PATCH] rw_semaphores, optimisations try #4
Andrea Arcangeli [EMAIL PROTECTED] wrote:

> It seems more similar to my code btw (you finally killed the useless chmxchg ;).

CMPXCHG ought to make things better by avoiding the XADD(+1)/XADD(-1) loop; however, I tried various combinations, and XADD beats CMPXCHG significantly. Here's a quote from a Borland assembler manual I managed to dig out, giving i486 timings on memory access:

	ADDL/SUBL	3 cycles
	XADDL		4 cycles
	CMPXCHG		8 cycles (success) / 10 cycles (failure)
	LOCK		+1 cycle minimum on this CPU

In reality, however, XADDL gives at least as good a result as ADDL/SUBL, maybe just a little bit better, but it's hard to say. However, the penalty imposed on the other CPU (when it has to flush its cache) probably more than makes up for the difference.

> I only had a short look at your attached patch, but the results are quite suspect to my eyes beacuse we should still be equally fast in the fast path and I should still beat you on the write fast path because I do a much faster subl; js while you do movl -1; xadd; js, while according to your results you beat me on both. Do you have an explanation or you don't know the reason either?

	MOVL $1,EDX
	SUBL EDX,(EAX)

works out faster than:

	SUBL $1,(EAX)

as well... probably due to an avoided stall when the instruction before the snippet loads EAX from memory. Oh yes... STC, SUBL may also be faster too. I will re-benchmark the whole thing shortly.

> But before re-benchmark if you have time could you fix the benchmark to use the variable pointer and send me a new tarball? For your code it probably doesn't matter because you dereference the pointer by hand anyways, but it matters for mine and we want to benchmark real world fast path of course.

No, not till this evening now, I'm afraid. As for real-world benchmarks, I suspect the fastpath is going to be sufficiently few cycles that it's drowned out by whatever bit of code is actually using it, like my Wine server module, which is where all this started for me.

David
[RFC] gcc compile-time assertions
I'm looking for comments on an idea I've thrashed out with David Woodhouse, Arjan Van de Ven and Andrew Haley... I've written a patch for gcc to implement compile-time assertions, with an eye to making use of this in the kernel in the future (assuming I can get it accepted into gcc). One of the main uses I can see for it is for things like udelay() that need their arguments range-checked. The current method of doing this is to cause an undefined symbol to be referenced, thereby causing the linker to emit an error that can be hard to trace. The gcc patch can be downloaded from: ftp://infradead.org/pub/people/dwh/ctassert.diff

Basically, what I've written is a small extension for gcc that implements compile-time assertion checking through a new built-in function (this has negligible impact on the rest of the source for gcc). The assertion function takes two arguments, a condition and a message string. The return value is an expression equal to condition != 0. The function's prototype would be something like:

| int __builtin_ct_assert(any-type condition, const char message[])

Additionally, if that expression can be evaluated to a constant of zero at compile time, an error will be issued that includes the message string in its text.

The main reason I'd like to see this added to gcc is to help improve the Linux kernel's robustness by catching certain conditions at compile time. For instance, Linux's udelay() function (which waits for a number of microseconds up to a limit of 2uS) is implemented in the i386 architecture thus:

| extern void __bad_udelay(void);
| extern void __udelay(unsigned long usecs);
| extern void __const_udelay(unsigned long usecs);
|
| #define udelay(n) (__builtin_constant_p(n) ? \
|	((n) > 2 ? __bad_udelay() : __const_udelay((n) * 0x10c6ul)) : \
|	__udelay(n))

This relies on __bad_udelay() getting referenced in the program when n is too large, causing a linker error - which is quite hard to trace, since the kernel build makes heavy use of incremental linking.
This can be re-implemented using my gcc patch:

| #define udelay(n) ( \
|	__builtin_ct_assert((n) <= 2, "udelay() value should be <= 2uS"), \
|	__builtin_constant_p(n) ? __const_udelay((n) * 0x10c6ul) : __udelay(n))

producing an error of the following sort:

| test.c:21: compile-time assertion failed: udelay() value should be <= 2uS

Cheers, David Howells
Re: [patch] 2.4.4 alpha semaphores optimization
Hello Ivan,

One reason I picked signed long as the count type in lib/rwsem.c is that this would be 64 bits on a 64-bit arch such as the alpha. So I've taken your idea for include/asm-alpha/rwsem.h and modified it a little. You'll find it attached at the bottom. I don't know whether it will (a) compile, or (b) work... I don't have an alpha to play with.

I also don't know the alpha function calling convention, so I can't put direct calls to the fallback routines in lib/rwsem.c from the .subsection 2 bits. Can you do that, or can you tell me how the calling convention works?

Cheers, David

===
#ifndef _ALPHA_RWSEM_H
#define _ALPHA_RWSEM_H

/*
 * Written by Ivan Kokshaysky [EMAIL PROTECTED], 2001.
 * Based on asm-alpha/semaphore.h and asm-i386/rwsem.h
 */

#ifndef _LINUX_RWSEM_H
#error "please don't include asm/rwsem.h directly, use linux/rwsem.h instead"
#endif

#ifdef __KERNEL__

#include <linux/list.h>
#include <linux/spinlock.h>

struct rwsem_waiter;

extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
extern struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem);
extern struct rw_semaphore *rwsem_wake(struct rw_semaphore *);

/*
 * the semaphore definition
 */
struct rw_semaphore {
	signed long		count;
#define RWSEM_UNLOCKED_VALUE		0x0000000000000000L
#define RWSEM_ACTIVE_BIAS		0x0000000000000001L
#define RWSEM_ACTIVE_MASK		0x00000000ffffffffL
#define RWSEM_WAITING_BIAS		(-0x0000000100000000L)
#define RWSEM_ACTIVE_READ_BIAS		RWSEM_ACTIVE_BIAS
#define RWSEM_ACTIVE_WRITE_BIAS		(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
	spinlock_t		wait_lock;
	struct list_head	wait_list;
#if RWSEM_DEBUG
	int			debug;
#endif
};

#if RWSEM_DEBUG
#define __RWSEM_DEBUG_INIT	, 0
#else
#define __RWSEM_DEBUG_INIT	/* */
#endif

#define __RWSEM_INITIALIZER(name) \
	{ ATOMIC_INIT(RWSEM_UNLOCKED_VALUE), SPIN_LOCK_UNLOCKED, \
	  LIST_HEAD_INIT((name).wait_list) __RWSEM_DEBUG_INIT }

#define DECLARE_RWSEM(name) \
	struct rw_semaphore name = __RWSEM_INITIALIZER(name)

static inline void init_rwsem(struct rw_semaphore *sem)
{
	sem->count = RWSEM_UNLOCKED_VALUE;
	spin_lock_init(&sem->wait_lock);
	INIT_LIST_HEAD(&sem->wait_list);
#if RWSEM_DEBUG
	sem->debug = 0;
#endif
}

static inline void __down_read(struct rw_semaphore *sem)
{
	signed long oldcount, temp;

	__asm__ __volatile__(
	"1:	ldq_l	%0,%1\n"
	"	addq	%0,%3,%2\n"
	"	stq_c	%2,%1\n"
	"	beq	%2,2f\n"
#ifdef CONFIG_SMP
	"	mb\n"
#endif
	".subsection 2\n"
	"2:	br	1b\n"
	".previous"
	: "=r" (oldcount), "=m" (sem->count), "=r" (temp)
	: "Ir" (RWSEM_ACTIVE_READ_BIAS), "m" (sem->count)
	: "memory");
	if (oldcount < 0)
		rwsem_down_read_failed(sem);
}

static inline void __down_write(struct rw_semaphore *sem)
{
	signed long granted, temp;

	__asm__ __volatile__(
	"1:	ldq_l	%0,%1\n"
	"	addq	%0,%3,%2\n"
	"	stq_c	%2,%1\n"
	"	beq	%2,2f\n"
#ifdef CONFIG_SMP
	"	mb\n"
#endif
	"	cmpeq	%0,0,%0\n"
	".subsection 2\n"
	"2:	br	1b\n"
	".previous"
	: "=r" (granted), "=m" (sem->count), "=r" (temp)
	: "Ir" (RWSEM_ACTIVE_WRITE_BIAS), "m" (sem->count)
	: "memory");
	if (!granted)
		rwsem_down_write_failed(sem);
}

static inline void __up_read(struct rw_semaphore *sem)
{
	signed long oldcount, temp;

	__asm__ __volatile__(
	"1:	ldq_l	%0,%1\n"
	"	subq	%0,%3,%2\n"
	"	stq_c	%2,%1\n"
	"	beq	%2,2f\n"
#ifdef CONFIG_SMP
	"	mb\n"
#endif
	".subsection 2\n"
	"2:	br	1b\n"
	".previous"
	: "=r" (oldcount), "=m" (sem->count), "=r" (temp)
	: "Ir" (RWSEM_ACTIVE_READ_BIAS), "m" (sem->count)
	: "memory");
	if (oldcount < 0)
		if ((oldcount & RWSEM_ACTIVE_MASK) == 0)
			rwsem_wake(sem);
}

static inline void __up_write(struct rw_semaphore *sem)
{
	signed long count, cmp;

	__asm__ __volatile__(
	"1:	ldq_l	%0,%1\n"
	"	subq	%0,%3,%2\n"
	"	stq_c	%2,%1\n"
	"	beq	%2,2f\n"
#ifdef CONFIG_SMP
	"	mb\n"
#endif
	"	cmpeq	%0,%3,%2\n"
	"	subq	%0,%3,%0\n"
	".subsection 2\n"
	"2:	br	1b\n"
	".previous"
	: "=r" (count), "=m" (sem->count), "=r" (cmp)
	: "Ir" (RWSEM_ACTIVE_WRITE_BIAS), "m" (sem->count)
	: "memory");
	if (!cmp)
		if ((count & RWSEM_ACTIVE_MASK) == 0)
			rwsem_wake(sem);
}

#define rwsem_atomic_add(val, sem)	atomic_add(val, &(sem)->count)
#define rwsem_atomic_update(val, sem)	atomic_add_return(val, &(sem)->count)

#endif /* __KERNEL__ */
#endif /* _ALPHA_RWSEM_H */
Re: 2.4.4 rwsem make error, please somebody take a look !!
What arch are you compiling for? i386?

When I compiled bzImage (using .config from 2.2.3) I got the following errors:

Did you run one of the "make *config" commands before building the kernel? You may need to do this to flush the changes from arch/*/config.in into .config. David
Re: PTRACE_ATTACH breaks wait4()
I have an idea for getting around this problem, but it's only half implemented at the moment (I use it for implementing a Wine server in the kernel). It involves having a list of things called task ornaments attached to each process. Each ornament has a table of event notification methods (so it can be informed about fork, execve, signal and exit events). Signal event notification methods are able to prevent a signal from propagating further down the chain, and the parent's wait handler would be the last element in this list. When a process attaches with ptrace(), it would insert another ornament into this list, before the parent's ornament. This means (a) the process doesn't have to be reparented, and (b) more than one debugger can actually attach to a process (eg: strace and gdb both). This would, however, mean that wait*() would have to look not only at a process's list of children, but also at its list of processes it has ornamented via ptrace(). David
[RFC] I/O Access Abstractions
In conjunction with David Woodhouse ([EMAIL PROTECTED]) and Arjan Van De Ven ([EMAIL PROTECTED]), I've come up with a way to abstract I/O accesses in the Linux kernel whilst trying to keep overheads minimal. These would be particularly useful on many non-i386 platforms. Any comments would be greatly appreciated. The example code I've written can be downloaded from: ftp://infradead.org/pub/people/dwh/ioaccess.tar.bz2

Cheers, David Howells [EMAIL PROTECTED]

I/O ACCESS ABSTRACTIONS
=======================

PURPOSES
========

* Makes all peripheral input and output accesses look the same:
  - I/O ports
  - Memory-mapped I/O registers
  - PCI configuration space
  - Device-specific registers

* Can hide all the strange hoops that some architectures have to jump through to provide certain bus access services. For example:

  (1) The SH arch has no I/O port space in the same way that the i386 bus does. It instead maps a window in the physical memory space to PCI I/O port accesses.

  (2) The AM33 arch also has no I/O port space, and likewise maps a window in memory space to PCI I/O port addresses. Furthermore, the PCI _memory_ space doesn't correspond on a 1:1 basis with the CPU's memory space. At any one time, a 64Mb window is mapped from a fixed location in the CPU memory space onto a mobile location in the PCI memory space. This can be moved by means of a control register on the host bridge.

* Can hide the exact details of the layout of a device's register space as it varies from arch to arch. For instance, on the PC, the 16550-compatible serial ports have 8 1-byte registers adjacent to each other in the I/O port space, whereas another arch may have the same 8 1-byte registers but at intervals of 4 bytes in memory space.

* The operations by which a device can be accessed can be bound at runtime rather than at compile time, without the need for conditional branches (this should simplify serial.c immensely).

* Permit iotrace on specific resources by operation table substitution.
* Provide a method of device emulation.

* Potentially permit transparent byte swapping.

DISADVANTAGES
=============

* Indirection. It will incur an overhead of a few extra cycles on the i386 (for the call and return). This can be minimised by carefully crafting bits of inline assembly and carefully choosing which registers are used to pass which values. It may also be offset under certain circumstances by the removal of conditional branches (for example in serial.c). Also, on an i386, the actual I/O instruction itself is going to take a comparatively long time anyway, given the speed differential between the CPU and external buses. On other archs, where the indirection exists anyway, there shouldn't be any overhead.

IMPLEMENTATION
==============

* The resource structure gains a pointer to a table of operations.

* The table of resource operations includes 3 single-input functions, 3 single-output functions, 3 string-input functions and 3 string-output functions. Each set of functions includes variants for byte, word and dword granularity.

* Inline functions are provided on a per-arch basis that take a resource structure amongst their arguments, dereference it to pull out the appropriate function address, set up the registers and then call that function. For example:

	struct resource *csr;
	__u32 x = inputw(csr, 0x10);
	outputw(csr, 0x10, x | 0x0020);

* Macros are provided for generating and using calling-convention translation stubs, making it possible to write operation functions in C (which would otherwise need to be assembly functions):

	__u16 __iocall pcnet32_inputw(struct resource *p, unsigned offset);
	__DECLARE_INPUT_CALLBACK(w, pcnet32_inputw);

	struct resource_ops pcnet32_ops = {
		__USE_INPUT_CALLBACK(b, pcnet32_inputb),
		__USE_INPUT_CALLBACK(w, pcnet32_inputw),
		__USE_INPUT_CALLBACK(l, pcnet32_inputl),
	};

Note that on most archs (where there are enough registers) these macros will do nothing but plug the user-supplied function straight in.
* There are three ops tables supplied for the i386:
  - I/O port accesses [ioaccess-io.S]
  - memory-mapped I/O accesses [ioaccess-mem.S]
  - PCI type 1 config space accesses [ioaccess-pciconf1.S]

Note that the PCI access example does not have the complete set of functions at the moment.

EXTENSIONS
==========

* Driver-specific operations can be provided (for example CSR register access functions in the AMD PCnet32 driver).

ISSUES
======

Having discussed this with others, the following points have been raised:

* Some i386 support routines
[BUG] directory renaming/removal
Run the following script (it's been tried on linux-2.2.x and linux-2.4.x):

	#!/bin/sh
	cd /tmp
	mkdir x
	cd x
	mkdir x y z
	strace -etrace=rename,mkdir,rmdir,chmod mv x z
	echo -
	chmod -w y
	strace -etrace=rename,mkdir,rmdir,chmod mv y z

The output:

	rename("x", "z/x")  = 0
	-
	rename("y", "z/y")  = -1 EACCES (Permission denied)
	mkdir("z/y", 040755) = 0
	chmod("z/y", 040555) = 0
	rmdir("y")          = 0

You'll notice the following: (1) Linux can't rename directories that are marked as read-only. This is strange because the directories actually being modified _do_ have write permission. (2) You can _remove_ a read-only directory. David
Re: PCI-SCI Drivers v1.1-7 released
More to add on the gcc 2.96 problems. After compiling a Linux 2.4.1 kernel on gcc 2.91, running SCI benchmarks, then compiling on RedHat 7.1 (Fischer) with gcc 2.96, the 2.96 build DROPPED 30% in throughput from the gcc 2.91 compiled version on the identical SAME 2.4.1 source tree.

Out of interest, could you run your benchmark test against a "latest snapshot build" of gcc? http://www.codesourcery.com/gcc-snapshots/ Cheers, David
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Linus Torvalds [EMAIL PROTECTED] wrote: Actually, I'd rather leave it in, but speed it up with the saner and faster:

	if (bh->b_size & (correct_size-1)) {

I presume that correct_size will always be a power of 2... David
Re: 2.4.x SMP blamed for Xfree 4.0 crashes
I had problems with XFree86 4.0 and 4.0.1 locking solidly up under Linux 2.4.x after about 10mins until I upgraded to XFree86 4.0.2. Now it seems rock solid. XFree86 3.3.x was always okay. I've got a Dual-PII machine and an NVidia GeForce. David
*grin* Windows 2000 HPC: Scalable, Inexpensive Supercomputing Solutions
How's this for a laugh: http://www.microsoft.com/WINDOWS2000/hpc/indstand.asp David
Re: [LONG RANT] Re: Linux stifles innovation...
I suspect part of the problem with commercial driver support on Linux is that the Linux driver API (such as it is) is relatively poorly documented and seems to change almost on a week-by-week basis anyway. I've done my share of chasing the current kernel revision with drivers that aren't part of the kernel tree: by the time you update the driver to work with the current kernel revision, there's a new one out, and the driver doesn't compile with it. David
rw_semaphore bug
I've found a bug in the write path of the read/write semaphore stuff (at least for the i386 arch). The attached kernel module (rwsem.c) and driver program (driver.c) demonstrate it. What happens is that once the driver finishes, you end up with a whole load of processes forked off of driver that are stuck in the D state and are waiting in down_write_failed(), even though the semaphore is in the unlocked state.

rwsem.c:

#define __NO_VERSION__
#include <linux/config.h>
#include <linux/module.h>
#include <linux/poll.h>
#include <linux/proc_fs.h>
#include <linux/stat.h>
#include <linux/init.h>
#include <linux/personality.h>
#include <linux/smp_lock.h>
#include <linux/delay.h>

struct proc_dir_entry *rwsem_proc;
struct rw_semaphore rwsem_sem;

static int rwsem_write_proc(struct file *file, const char *buffer,
			    unsigned long count, void *data)
{
	printk("[%05d %08x]: downing\n", current->pid, rwsem_sem.count);
	down_write(&rwsem_sem);
	printk("[%05d %08x]: downed\n", current->pid, rwsem_sem.count);
	mdelay(1);
	printk("[%05d %08x]: upping\n", current->pid, rwsem_sem.count);
	up_write(&rwsem_sem);
	printk("[%05d %08x]: upped\n", current->pid, rwsem_sem.count);

	mdelay(100);

	printk("[%05d %08x]: downing\n", current->pid, rwsem_sem.count);
	down_write(&rwsem_sem);
	printk("[%05d %08x]: downed\n", current->pid, rwsem_sem.count);
	mdelay(1);
	printk("[%05d %08x]: upping\n", current->pid, rwsem_sem.count);
	up_write(&rwsem_sem);
	printk("[%05d %08x]: upped\n", current->pid, rwsem_sem.count);

	return -ENOENT;
}

static int __init rwsem_init_module(void)
{
	printk(KERN_INFO "rwsem loading...\n");

	init_rwsem(&rwsem_sem);

	/* try and create the access point */
	rwsem_proc = create_proc_entry("rwsem", S_IRUGO|S_IWUGO, NULL);
	if (!rwsem_proc)
		return -EEXIST;
	rwsem_proc->write_proc = rwsem_write_proc;
	return 0;
}

static void __exit rwsem_cleanup_module(void)
{
	printk(KERN_INFO "rwsem unloading\n");
	remove_proc_entry("rwsem", NULL);
}

module_init(rwsem_init_module);
module_exit(rwsem_cleanup_module);

driver.c:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
#include <fcntl.h>

int main(int argc, char *argv[])
{
	int loop, fd, tmp;

	fd = open("/proc/rwsem", O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	for (loop = 0; loop < 50; loop++) {
		switch (fork()) {
		case -1:
			perror("fork");
			return 1;
		case 0:
			write(fd, " ", 1);
			exit(1);
		default:
			break;
		}
	}

	for (loop = 0; loop < 5; loop++) {
		if (wait(&tmp) < 0) {
			perror("wait");
			return 1;
		}
	}
	return 0;
}
Re: rw_semaphores
Since you're willing to use CMPXCHG in your suggested implementation, would it make life easier if you were willing to use XADD too? Plus, are you really willing to limit the number of readers or writers to 32767? If so, I think I can suggest a way that limits it to ~65535 apiece instead... David
[PATCH] i386 rw_semaphores fix
Here's a patch that fixes RW semaphores on the i386 architecture. It is very simple in the way it works. The lock counter is dealt with as two semi-independent words: the LSW is the number of active (granted) locks, and the MSW, if negated, is the number of active writers (0 or 1) plus the number of waiting lockers of any type.

The fast paths are either two or three instructions long. This algorithm should also be totally fair! Contentious lockers get put on the back of the wait queue, and a waker function wakes them starting at the front, but only wakes either one writer or the first consecutive bundle of readers.

The disadvantage is that the maximum number of active locks is 65535, and the maximum number of waiting locks is 32766 (this can be extended to 65534 by not relying on the S flag).

I've included a revised testing module (rwsem.c) that allows read locks to be obtained as well as write locks, and a revised driver program (driver.c) that can use rwsem.c. Try the following tests:

	driver -200 & driver 200	# fork 200 writers and then 200 readers
	driver 200 & driver -200	# fork 200 readers and then 200 writers

David Howells

diff -uNr linux-2.4.3/arch/i386/kernel/semaphore.c linux/arch/i386/kernel/semaphore.c
--- linux-2.4.3/arch/i386/kernel/semaphore.c	Thu Apr  5 14:38:34 2001
+++ linux/arch/i386/kernel/semaphore.c	Tue Apr 10 18:23:55 2001
@@ -14,7 +14,6 @@
  */
 #include <linux/config.h>
 #include <linux/sched.h>
-
 #include <asm/semaphore.h>
 
 /*
@@ -237,19 +236,10 @@
 __down_read_failed:
 	pushl	%edx
 	pushl	%ecx
-	jnc	2f
-
-3:	call	down_read_failed_biased
-
-1:	popl	%ecx
+	call	down_read_failed
+	popl	%ecx
 	popl	%edx
 	ret
-
-2:	call	down_read_failed
-	" LOCK "subl	$1,(%eax)
-	jns	1b
-	jnc	2b
-	jmp	3b
 "
 );
 
@@ -260,169 +250,204 @@
 __down_write_failed:
 	pushl	%edx
 	pushl	%ecx
-	jnc	2f
-
-3:	call	down_write_failed_biased
-
-1:	popl	%ecx
+	call	down_write_failed
+	popl	%ecx
 	popl	%edx
 	ret
-
-2:	call	down_write_failed
-	" LOCK "subl	$" RW_LOCK_BIAS_STR ",(%eax)
-	jz	1b
-	jnc	2b
-	jmp	3b
 "
 );
 
-struct rw_semaphore *FASTCALL(rwsem_wake_readers(struct rw_semaphore *sem));
-struct rw_semaphore *FASTCALL(rwsem_wake_writer(struct rw_semaphore *sem));
+asm(
+"
+.align 4
+.globl __rwsem_wake
+__rwsem_wake:
+	pushl	%edx
+	pushl	%ecx
+	call	rwsem_wake
+	popl	%ecx
+	popl	%edx
+	ret
+"
+);
 
-struct rw_semaphore *FASTCALL(down_read_failed_biased(struct rw_semaphore *sem));
-struct rw_semaphore *FASTCALL(down_write_failed_biased(struct rw_semaphore *sem));
+struct rw_semaphore *FASTCALL(rwsem_wake(struct rw_semaphore *sem));
 struct rw_semaphore *FASTCALL(down_read_failed(struct rw_semaphore *sem));
 struct rw_semaphore *FASTCALL(down_write_failed(struct rw_semaphore *sem));
 
-struct rw_semaphore *down_read_failed_biased(struct rw_semaphore *sem)
-{
-	struct task_struct *tsk = current;
-	DECLARE_WAITQUEUE(wait, tsk);
-	add_wait_queue(&sem->wait, &wait);	/* put ourselves at the head of the list */
-
-	for (;;) {
-		if (sem->read_bias_granted && xchg(&sem->read_bias_granted, 0))
-			break;
-		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
-		if (!sem->read_bias_granted)
-			schedule();
-	}
+static inline int atomic_add_and_read_orig(int delta, atomic_t *v)
+{
+	int oldmem;
 
-	remove_wait_queue(&sem->wait, &wait);
-	tsk->state = TASK_RUNNING;
+	__asm__ __volatile__(
+		LOCK_PREFIX "xadd %0,%2"
+		: "=r"(oldmem)
+		: "0"(delta), "m"(*__xg(v))
+		: "memory");
 
-	return sem;
+	return oldmem;
 }
 
-struct rw_semaphore *down_write_failed_biased(struct rw_semaphore *sem)
+static inline int atomic_add_and_read(int delta, atomic_t *v)
 {
-	struct task_struct *tsk = current;
-	DECLARE_WAITQUEUE(wait, tsk);
-
-	add_wait_queue_exclusive(&sem->write_bias_wait, &wait);	/* put ourselves at the end of the list */
-
-	for (;;) {
-		if (sem->write_bias_granted && xchg(&sem->write_bias_granted, 0))
-			break;
-		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
-		if (!sem->write_bias_granted)
-			schedule();
-	}
-
-	remove_wait_queue(&sem->write_bias_wait, &wait);
-	tsk->state = TASK_RUNNING;
+	return atomic_add_and_read_orig(delta, v) + delta;
+}
 
-	/* if the lock is currently unbiased, awaken the sleepers
-	 * FIXME: this wakes up the readers early in a bit of a
-	 * stampede - bad!
Re: [PATCH] i386 rw_semaphores fix
Can CONFIG_X86_XADD be equated to CONFIG_X86_CMPXCHG? David
[PATCH] 2nd try: i386 rw_semaphores fix
usive.
+/*
+ * wait for the write lock to be granted
+ */
 struct rw_semaphore *down_write_failed(struct rw_semaphore *sem)
 {
 	struct task_struct *tsk = current;
-	DECLARE_WAITQUEUE(wait, tsk);
+	DECLARE_WAITQUEUE(wait,tsk);
+	int count;
 
-	__up_write(sem);	/* this takes care of granting the lock */
+	rwsemdebug("[%d] Entering down_write_failed(%08x)\n",
+		   current->pid, atomic_read(&sem->count));
 
-	add_wait_queue_exclusive(&sem->wait, &wait);
+	/* this waitqueue context flag will be cleared when we are granted the lock */
+	__set_bit(RWSEM_WAITING_FOR_WRITE, &wait.flags);
+
+	add_wait_queue_exclusive(&sem->wait, &wait);	/* FIFO */
 
-	while (atomic_read(&sem->count) < 0) {
+	/* note that we're waiting on the lock, but no longer actively locking */
+	count = rwsem_atomic_update(-RWSEM_ACTIVE_BIAS, sem);
+	rwsemdebug("A(%08x)\n", count);
+
+	/* if there are no longer active locks, wake the front queued process(es) up
+	 * - it might even be this process, since the waker takes a more active part
+	 */
+	if (!(count & RWSEM_ACTIVE_MASK))
+		rwsem_wake(sem);
+
+	/* wait to be given the lock */
+	for (;;) {
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
-		if (atomic_read(&sem->count) >= 0)
-			break;	/* we must attempt to acquire or bias the lock */
+		if (!test_bit(RWSEM_WAITING_FOR_WRITE, &wait.flags))
+			break;
 		schedule();
 	}
 
-	remove_wait_queue(&sem->wait, &wait);
+	remove_wait_queue(&sem->wait, &wait);
 	tsk->state = TASK_RUNNING;
 
+	rwsemdebug("[%d] Leaving down_write_failed(%08x)\n",
+		   current->pid, atomic_read(&sem->count));
+
 	return sem;
 }
 
-asm(
-"
-.align 4
-.globl __rwsem_wake
-__rwsem_wake:
-	pushl	%edx
-	pushl	%ecx
-
-	jz	1f
-	call	rwsem_wake_readers
-	jmp	2f
+/*
+ * handle the lock being released whilst there are processes blocked on it that can now run
+ * - if we come here, then:
+ *   - the 'active part' of the count (0x0000ffff) reached zero (but may no longer be zero)
+ *   - the 'waiting part' of the count (0xffff0000) is negative (and will still be so)
+ */
+struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
+{
+	int woken;
 
-1:	call	rwsem_wake_writer
+	rwsemdebug("[%d] Entering rwsem_wake(%08x)\n",
+		   current->pid, atomic_read(&sem->count));
 
-2:	popl	%ecx
-	popl	%edx
-	ret
-"
-);
+	/* try to grab an 'activity' marker
+	 * - need to make sure two copies of rwsem_wake() don't do this for two separate
+	 *   processes simultaneously
+	 * - be horribly naughty, and only deal with the LSW of the atomic counter
+	 */
+	if (rwsem_cmpxchgw(sem, 0, RWSEM_ACTIVE_BIAS) != 0)
+		goto out;
 
-/* Called when someone has done an up that transitioned from
- * negative to non-negative, meaning that the lock has been
- * granted to whomever owned the bias.
- */
-struct rw_semaphore *rwsem_wake_readers(struct rw_semaphore *sem)
-{
-	if (xchg(&sem->read_bias_granted, 1))
+	/* try to grant a single write lock if there's a writer at the front of the queue
+	 * - note we leave the 'active part' of the count incremented by 1 and the waiting
+	 *   part incremented by 0x0001
+	 */
+	switch (wake_up_ctx(&sem->wait, 1, -RWSEM_WAITING_FOR_WRITE)) {
+	case 1:
+		goto out;
+	case 0:
+		break;
+	default:
 		BUG();
-	wake_up(&sem->wait);
-	return sem;
-}
+	}
 
+	rwsemdebug("E\n");
 
-struct rw_semaphore *rwsem_wake_writer(struct rw_semaphore *sem)
-{
-	if (xchg(&sem->write_bias_granted, 1))
+	/* grant an infinite number of read locks to the readers at the front of the queue
+	 * - note we increment the 'active part' of the count by the number of readers just
+	 *   woken, less one for the activity decrement we've already done
+	 */
+	woken = wake_up_ctx(&sem->wait, 65535, -RWSEM_WAITING_FOR_READ);
+	if (woken > 0) {
+		woken *= RWSEM_ACTIVE_BIAS - RWSEM_WAITING_BIAS;
+		woken -= RWSEM_ACTIVE_BIAS;
+		rwsem_atomic_update(woken, sem);
+	} else {
 		BUG();
-	wake_up(&sem->write_bias_wait);
+	}
+
+ out:
+	rwsemdebug("[%d] Leaving rwsem_wake(%08x)\n",
+		   current->pid, atomic_read(&sem->count));
 	return sem;
 }
 
+/*
+ * rw spinlock fallbacks
+ */
 #if defined(CONFIG_SMP)
 asm(
 "
@@ -451,4 +539,3 @@
 "
 );
 #endif
-
diff -uNr linux-2.4.3/include/asm-i386/rwsem-spin.h linux/include/asm-i386/rwsem-spin.h
--- linux-2.4.3/include/asm-i386/rwsem-spin.h	Thu Jan  1 01:00:00 1970
+++ linux/include/asm-i386
Re: [PATCH] 2nd try: i386 rw_semaphores fix
I'd like you to look over it. It seems newer GCCs (snapshots and the upcoming 3.0) will be more strict when modifying some values through assembler-passed pointers - in this case, the passed semaphore structure got freed too early, causing massive stack corruption on early bootup. The solution was to directly mention the modified element (in this case, sem->count) with an "=m" qualifier, which told GCC that the contents of the semaphore structure are still really needed. It does not seem to have any bad side effects on older GCC, but lets the code work for people trying to use the newer snapshots.

I've just consulted with one of the gcc people we have here, and he says that the '"memory"' constraint should do the trick. Do I take it that that is actually insufficient? David
Re: [PATCH] 2nd try: i386 rw_semaphores fix
I've been discussing it with some other kernel and GCC people, and they think that only "memory" is required. What are the reasons against mentioning sem->count directly as a "=m" reference? This makes the whole thing less fragile and no more dependent on the memory layout of the structure.

Apart from the risk of breaking it, you mean? Well, "=m" seems to reserve an extra register to hold a second copy of the semaphore address, probably since it thinks EAX might get clobbered. Also, as a minor point, it probably ought to be "+m" not "=m". David
[PATCH] 3rd try: i386 rw_semaphores fix
uct rw_semaphore *down_write_failed(struct rw_semaphore *sem)
 {
 	struct task_struct *tsk = current;
 	DECLARE_WAITQUEUE(wait, tsk);
+	int count;
 
-	__up_write(sem);	/* this takes care of granting the lock */
+	rwsemdebug("[%d] Entering down_write_failed(%08x)\n",
+		   current->pid, atomic_read(&sem->count));
 
-	add_wait_queue_exclusive(&sem->wait, &wait);
+	/* this waitqueue context flag will be cleared when we are granted the lock */
+	__set_bit(RWSEM_WAITING_FOR_WRITE, &wait.flags);
+
+	add_wait_queue_exclusive(&sem->wait, &wait); /* FIFO */
 
-	while (atomic_read(&sem->count) < 0) {
+	/* note that we're waiting on the lock, but no longer actively locking */
+	count = rwsem_atomic_update(-RWSEM_ACTIVE_BIAS, sem);
+	rwsemdebug("A(%08x)\n", count);
+
+	/* if there are no longer active locks, wake the front queued process(es) up
+	 * - it might even be this process, since the waker takes a more active part
+	 */
+	if (!(count & RWSEM_ACTIVE_MASK))
+		rwsem_wake(sem);
+
+	/* wait to be given the lock */
+	for (;;) {
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
-		if (atomic_read(&sem->count) >= 0)
-			break;	/* we must attempt to acquire or bias the lock */
+		if (!test_bit(RWSEM_WAITING_FOR_WRITE, &wait.flags))
+			break;
 		schedule();
 	}
 
 	remove_wait_queue(&sem->wait, &wait);
 	tsk->state = TASK_RUNNING;
 
+	rwsemdebug("[%d] Leaving down_write_failed(%08x)\n",
+		   current->pid, atomic_read(&sem->count));
+
 	return sem;
 }
 
-asm(
-"
-.align	4
-.globl	__rwsem_wake
-__rwsem_wake:
-	pushl	%edx
-	pushl	%ecx
-
-	jz	1f
-	call	rwsem_wake_readers
-	jmp	2f
+/*
+ * handle the lock being released whilst there are processes blocked on it that can now run
+ * - if we come here, then:
+ *   - the 'active part' of the count (0x0000ffff) reached zero (but may no longer be zero)
+ *   - the 'waiting part' of the count (0xffff0000) is negative (and will still be so)
+ */
+struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
+{
+	int woken;
 
-1:	call	rwsem_wake_writer
+	rwsemdebug("[%d] Entering rwsem_wake(%08x)\n",
+		   current->pid, atomic_read(&sem->count));
 
-2:	popl	%ecx
-	popl	%edx
-	ret
-"
-);
+	/* try to grab an 'activity' marker
+	 * - need to make sure two copies of rwsem_wake() don't do this for two separate processes
+	 *   simultaneously
+	 * - be horribly naughty, and only deal with the LSW of the atomic counter
+	 */
+	if (rwsem_cmpxchgw(sem, 0, RWSEM_ACTIVE_BIAS) != 0)
+		goto out;
 
-/* Called when someone has done an up that transitioned from
- * negative to non-negative, meaning that the lock has been
- * granted to whomever owned the bias.
- */
-struct rw_semaphore *rwsem_wake_readers(struct rw_semaphore *sem)
-{
-	if (xchg(&sem->read_bias_granted, 1))
+	/* try to grant a single write lock if there's a writer at the front of the queue
+	 * - note we leave the 'active part' of the count incremented by 1 and the waiting part
+	 *   incremented by 0x00010000
+	 */
+	switch (wake_up_ctx(&sem->wait, 1, -RWSEM_WAITING_FOR_WRITE)) {
+	case 1:
+		goto out;
+	case 0:
+		break;
+	default:
 		BUG();
-	wake_up(&sem->wait);
-	return sem;
-}
+	}
 
-struct rw_semaphore *rwsem_wake_writer(struct rw_semaphore *sem)
-{
-	if (xchg(&sem->write_bias_granted, 1))
+	rwsemdebug("E\n");
+
+	/* grant an infinite number of read locks to the readers at the front of the queue
+	 * - note we increment the 'active part' of the count by the number of readers just woken,
+	 *   less one for the activity decrement we've already done
+	 */
+	woken = wake_up_ctx(&sem->wait, 65535, -RWSEM_WAITING_FOR_READ);
+	if (woken > 0) {
+		woken *= RWSEM_ACTIVE_BIAS - RWSEM_WAITING_BIAS;
+		woken -= RWSEM_ACTIVE_BIAS;
+		rwsem_atomic_update(woken, sem);
+	}
+	else {
 		BUG();
-	wake_up(&sem->write_bias_wait);
+	}
+
+ out:
+	rwsemdebug("[%d] Leaving rwsem_wake(%08x)\n",
+		   current->pid, atomic_read(&sem->count));
 	return sem;
 }
 
+/*
+ * rw spinlock fallbacks
+ */
 #if defined(CONFIG_SMP)
 asm(
 "
@@ -451,4 +539,3 @@
 "
 );
 #endif
-
diff -uNr linux-2.4.3/include/asm-i386/rwsem-spin.h linux-rwsem/include/asm-i386/rwsem-spin.h
--- linux-2.4.3/include/asm-i386/rwsem-spin.h	Thu Jan  1 01:00:00 1970
+++ linux-rwsem/include/asm-i386/rwsem-spin.h	Wed Apr 11 17:18:07 2001
@@ -0,0 +1,228 @@
+/
Re: [PATCH] i386 rw_semaphores fix
Andrew Morton wrote: I think that's a very good approach. Sure, it's suboptimal when there are three or more waiters (and they're the right type and order). But that never happens. Nice design idea. Cheers. These numbers are infinity :) I know, but I think Linus may be happy with the resolution for the moment. It can be extended later by siphoning off excess quantities of waiters into a separate counter (as is done now) and by making the access count use a larger part of the variable. Unfortunately, managing the count and siphoned-off count together is tricky. You need sterner testing stuff :) I hit the BUG at the end of rwsem_wake() in about a second running rwsem-4. Removed the BUG and everything stops in D state. Grab rwsem-4 from ... Will do. It's very simple. But running fully in-kernel shortens the code paths enormously and allows you to find those little timing windows. I thought I'd got them all by using an activity counter incremented by both read and write lockers. - rwsemdebug(FMT, ...) doesn't compile with egcs-1.1.2. Need to remove the comma. This is tricky... you get all sorts of horrible warnings with gcc-2.96 if you remove the comma. What I've done is now ANSI-C99 compliant, but egcs is not. - The comments in down_write and down_read() are inaccurate. RWSEM_ACTIVE_WRITE_BIAS is 0x0001, not 0x00010001 Done. - It won't compile when WAITQUEUE_DEBUG is turned on. I guess you knew that. Currently putting in separate debugging stuff for rwsems. - The comments above the functions in semaphore.h need updating. Done. (BTW in the latest patch, they're actually split out into separate header files as per Linus's suggestion). - What on earth does __xg() do? (And why do people write code like that without explaining why? Don't answer this one). Stolen from the xchg() macro/function, but I'm not sure what it does. Plus I don't use it now. - Somewhat offtopic: the `asm' statements in semaphore.c are really dangerous. Now all got .text in. 
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] i386 rw_semaphores fix
You need sterner testing stuff :) I hit the BUG at the end of rwsem_wake() in about a second running rwsem-4. Removed the BUG and everything stops in D state. Grab rwsem-4 from http://www.uow.edu.au/~andrewm/linux/rwsem.tar.gz It's very simple. But running fully in-kernel shortens the code paths enormously and allows you to find those little timing windows. Run rwsem-4 in two modes: one with the schedule() in sched() enabled, and also with it commented out. If it passes that, it works. When you remove the module it'll print out the number of read-grants versus write-grants. If these run at 6:1 with schedule() disabled then you've kicked butt. Also, rwsem-4 checks that the rwsems are actually providing exclusion between readers and writers, and between writers and writers. A useful thing to check, that. It now works (patch to follow).

schedule() enabled:
	reads taken: 686273
	writes taken: 193414

schedule() disabled:
	reads taken: 585619
	writes taken: 292997

David
[PATCH] 4th try: i386 rw_semaphores fix
art' of the count by the number of readers just woken,
+	 *   less one for the activity decrement we've already done
+	 */
+	woken = wake_up_ctx(&sem->wait, 65535, -RWSEM_WAITING_FOR_READ);
+	if (woken <= 0)
+		goto counter_correction;
+
+	woken *= RWSEM_ACTIVE_BIAS - RWSEM_WAITING_BIAS;
+	woken -= RWSEM_ACTIVE_BIAS;
+	rwsem_atomic_update(woken, sem);
 
-struct rw_semaphore *rwsem_wake_writer(struct rw_semaphore *sem)
-{
-	if (xchg(&sem->write_bias_granted, 1))
-		BUG();
-	wake_up(&sem->write_bias_wait);
+ out:
+	rwsemdebug("[%d] Leaving rwsem_wake(%08x)\n",
+		   current->pid, atomic_read(&sem->count));
 	return sem;
+
+	/* come here if we need to correct the counter for odd SMP-isms */
+ counter_correction:
+	count = rwsem_atomic_update(-RWSEM_ACTIVE_BIAS, sem);
+	rwsemdebug("[%d] corrected(%08x)\n", current->pid, count);
+	if (!(count & RWSEM_ACTIVE_MASK))
+		goto try_again;
+	goto out;
 }
 
+/*
+ * rw spinlock fallbacks
+ */
 #if defined(CONFIG_SMP)
 asm(
 "
@@ -451,4 +549,3 @@
 "
 );
 #endif
-
diff -uNr linux-2.4.3/include/asm-i386/rwsem-spin.h linux-rwsem/include/asm-i386/rwsem-spin.h
--- linux-2.4.3/include/asm-i386/rwsem-spin.h	Thu Jan  1 01:00:00 1970
+++ linux-rwsem/include/asm-i386/rwsem-spin.h	Wed Apr 11 22:28:32 2001
@@ -0,0 +1,239 @@
+/* rwsem.h: R/W semaphores based on spinlocks
+ *
+ * Written by David Howells ([EMAIL PROTECTED]).
+ *
+ * Derived from asm-i386/semaphore.h and asm-i386/spinlock.h
+ */
+
+#ifndef _I386_RWSEM_SPIN_H
+#define _I386_RWSEM_SPIN_H
+
+#ifdef __KERNEL__
+
+#define CONFIG_USING_SPINLOCK_BASED_RWSEM 1
+
+/*
+ * the semaphore definition
+ */
+struct rw_semaphore {
+	atomic_t		count;
+#define RWSEM_UNLOCKED_VALUE		0x00000000
+#define RWSEM_ACTIVE_BIAS		0x00000001
+#define RWSEM_ACTIVE_MASK		0x0000ffff
+#define RWSEM_WAITING_BIAS		(-0x00010000)
+#define RWSEM_ACTIVE_READ_BIAS		RWSEM_ACTIVE_BIAS
+#define RWSEM_ACTIVE_WRITE_BIAS		(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+	spinlock_t		lock;
+#define RWSEM_SPINLOCK_OFFSET_STR	"4" /* byte offset of spinlock */
+	wait_queue_head_t	wait;
+#define RWSEM_WAITING_FOR_READ	WQ_FLAG_CONTEXT_0 /* bits to use in wait_queue_t.flags */
+#define RWSEM_WAITING_FOR_WRITE	WQ_FLAG_CONTEXT_1
+#if RWSEM_DEBUG
+	int			debug;
+#endif
+#if RWSEM_DEBUG_MAGIC
+	long			__magic;
+	atomic_t		readers;
+	atomic_t		writers;
+#endif
+};
+
+/*
+ * initialisation
+ */
+#if RWSEM_DEBUG
+#define __RWSEM_DEBUG_INIT	, 0
+#else
+#define __RWSEM_DEBUG_INIT	/* */
+#endif
+#if RWSEM_DEBUG_MAGIC
+#define __RWSEM_DEBUG_MINIT(name)	, (int)&(name).__magic, ATOMIC_INIT(0), ATOMIC_INIT(0)
+#else
+#define __RWSEM_DEBUG_MINIT(name)	/* */
+#endif
+
+#define __RWSEM_INITIALIZER(name,count) \
+{ ATOMIC_INIT(RWSEM_UNLOCKED_VALUE), SPIN_LOCK_UNLOCKED, \
+	__WAIT_QUEUE_HEAD_INITIALIZER((name).wait) \
+	__RWSEM_DEBUG_INIT __RWSEM_DEBUG_MINIT(name) }
+
+#define __DECLARE_RWSEM_GENERIC(name,count) \
+	struct rw_semaphore name = __RWSEM_INITIALIZER(name,count)
+
+#define DECLARE_RWSEM(name) __DECLARE_RWSEM_GENERIC(name,RW_LOCK_BIAS)
+#define DECLARE_RWSEM_READ_LOCKED(name) __DECLARE_RWSEM_GENERIC(name,RW_LOCK_BIAS-1)
+#define DECLARE_RWSEM_WRITE_LOCKED(name) __DECLARE_RWSEM_GENERIC(name,0)
+
+static inline void init_rwsem(struct rw_semaphore *sem)
+{
+	atomic_set(&sem->count, RWSEM_UNLOCKED_VALUE);
+	spin_lock_init(&sem->lock);
+	init_waitqueue_head(&sem->wait);
+#if RWSEM_DEBUG
+	sem->debug = 0;
+#endif
+#if RWSEM_DEBUG_MAGIC
+	sem->__magic = (long)&sem->__magic;
+	atomic_set(&sem->readers, 0);
+	atomic_set(&sem->writers, 0);
+#endif
+}
+
+/*
+ * lock for reading
+ */
+static inline void __down_read(struct rw_semaphore *sem)
+{
+	__asm__ __volatile__(
+		"# beginning down_read\n\t"
+#ifdef CONFIG_SMP
+LOCK_PREFIX	"  decb "RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t" /* try to grab the spinlock */
+		"  js 3f\n" /* jump if failed */
+		"1:\n\t"
+#endif
+		"  incl (%%eax)\n\t" /* adds 0x00000001, returns the old value */
+#ifdef CONFIG_SMP
+		"  movb $1,"RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t" /* release the spinlock */
+#endif
+		"  js 4f\n\t" /* jump if we weren't granted the lock */
+		"2:\n"
+		".section .text.lock,\"ax\"\n"
+#ifdef CONFIG_SMP
+		"3:\n\t" /* spin on the spinlock till we get it */
+		"  cmpb $0,"RWSEM_SPINLOCK_OFFSET_STR"
[PATCH] i386 rw_semaphores, general abstraction patch
-
-	return sem;
-}
-
-/*
- * handle the lock being released whilst there are processes blocked on it that can now run
- * - if we come here, then:
- *   - the 'active part' of the count (0x0000ffff) reached zero (but may no longer be zero)
- *   - the 'waiting part' of the count (0xffff0000) is negative (and will still be so)
- */
-struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
-{
-	int woken, count;
-
-	rwsemdebug("[%d] Entering rwsem_wake(%08x)\n", current->pid, atomic_read(&sem->count));
-
- try_again:
-	/* try to grab an 'activity' marker
-	 * - need to make sure two copies of rwsem_wake() don't do this for two separate processes
-	 *   simultaneously
-	 * - be horribly naughty, and only deal with the LSW of the atomic counter
-	 */
-	if (rwsem_cmpxchgw(sem, 0, RWSEM_ACTIVE_BIAS) != 0) {
-		rwsemdebug("[%d] rwsem_wake: abort wakeup due to renewed activity\n", current->pid);
-		goto out;
-	}
-
-	/* try to grant a single write lock if there's a writer at the front of the queue
-	 * - note we leave the 'active part' of the count incremented by 1 and the waiting part
-	 *   incremented by 0x00010000
-	 */
-	if (wake_up_ctx(&sem->wait, 1, -RWSEM_WAITING_FOR_WRITE) == 1)
-		goto out;
-
-	/* grant an infinite number of read locks to the readers at the front of the queue
-	 * - note we increment the 'active part' of the count by the number of readers just woken,
-	 *   less one for the activity decrement we've already done
-	 */
-	woken = wake_up_ctx(&sem->wait, 65535, -RWSEM_WAITING_FOR_READ);
-	if (woken <= 0)
-		goto counter_correction;
-
-	woken *= RWSEM_ACTIVE_BIAS - RWSEM_WAITING_BIAS;
-	woken -= RWSEM_ACTIVE_BIAS;
-	rwsem_atomic_update(woken, sem);
-
- out:
-	rwsemdebug("[%d] Leaving rwsem_wake(%08x)\n", current->pid, atomic_read(&sem->count));
-	return sem;
-
-	/* come here if we need to correct the counter for odd SMP-isms */
- counter_correction:
-	count = rwsem_atomic_update(-RWSEM_ACTIVE_BIAS, sem);
-	rwsemdebug("[%d] corrected(%08x)\n", current->pid, count);
-	if (!(count & RWSEM_ACTIVE_MASK))
-		goto try_again;
-	goto out;
-}
-
 /*
  * rw spinlock fallbacks
  */
diff -uNr linux-2.4.4-pre2/arch/i386/lib/Makefile linux/arch/i386/lib/Makefile
--- linux-2.4.4-pre2/arch/i386/lib/Makefile	Thu Apr 12 08:57:23 2001
+++ linux/arch/i386/lib/Makefile	Thu Apr 12 10:07:33 2001
@@ -9,7 +9,7 @@
 
 obj-y = checksum.o old-checksum.o delay.o \
 	usercopy.o getuser.o putuser.o \
-	memcpy.o strstr.o
+	memcpy.o strstr.o rwsem.o
 
 obj-$(CONFIG_X86_USE_3DNOW) += mmx.o
 obj-$(CONFIG_HAVE_DEC_LOCK) += dec_and_lock.o
diff -uNr linux-2.4.4-pre2/arch/i386/lib/rwsem.S linux/arch/i386/lib/rwsem.S
--- linux-2.4.4-pre2/arch/i386/lib/rwsem.S	Thu Jan  1 01:00:00 1970
+++ linux/arch/i386/lib/rwsem.S	Thu Apr 12 15:49:25 2001
@@ -0,0 +1,36 @@
+/* rwsem.S: R/W semaphores, register saving wrapper function stubs
+ *
+ * Written by David Howells ([EMAIL PROTECTED]).
+ * Derived from arch/i386/kernel/semaphore.c
+ */
+
+.text
+.align	4
+.globl	__rwsem_down_read_failed
+__rwsem_down_read_failed:
+	pushl	%edx
+	pushl	%ecx
+	call	rwsem_down_read_failed
+	popl	%ecx
+	popl	%edx
+	ret
+
+.align	4
+.globl	__rwsem_down_write_failed
+__rwsem_down_write_failed:
+	pushl	%edx
+	pushl	%ecx
+	call	rwsem_down_write_failed
+	popl	%ecx
+	popl	%edx
+	ret
+
+.align	4
+.globl	__rwsem_wake
+__rwsem_wake:
+	pushl	%edx
+	pushl	%ecx
+	call	rwsem_wake
+	popl	%ecx
+	popl	%edx
+	ret
diff -uNr linux-2.4.4-pre2/include/asm-i386/rwsem-spin.h linux/include/asm-i386/rwsem-spin.h
--- linux-2.4.4-pre2/include/asm-i386/rwsem-spin.h	Thu Apr 12 08:57:28 2001
+++ linux/include/asm-i386/rwsem-spin.h	Thu Apr 12 15:52:13 2001
@@ -8,6 +8,12 @@
 #ifndef _I386_RWSEM_SPIN_H
 #define _I386_RWSEM_SPIN_H
 
+#ifndef _LINUX_RWSEM_H
+#error please dont include asm/rwsem-spin.h directly, use linux/rwsem.h instead
+#endif
+
+#include <linux/spinlock.h>
+
 #ifdef __KERNEL__
 
 #define CONFIG_USING_SPINLOCK_BASED_RWSEM 1
@@ -16,7 +22,7 @@
  * the semaphore definition
  */
 struct rw_semaphore {
-	atomic_t		count;
+	signed long		count;
 #define RWSEM_UNLOCKED_VALUE		0x00000000
 #define RWSEM_ACTIVE_BIAS		0x00000001
 #define RWSEM_ACTIVE_MASK		0x0000ffff
@@ -53,7 +59,7 @@
 #endif
 
 #define __RWSEM_INITIALIZER(name,count) \
-{ ATOMIC_INIT(RWSEM_UNLOCKED_VALUE), SPIN_LOCK_UNLOCKED, \
+{ RWSEM_UNLOCKED_VALUE, SPIN_LOCK_UNLOCKED, \
 	__WAIT_QUEUE_HEAD_INITIALIZER((name).wait) \
 	__RWSEM_DEBUG_INIT __RWSEM_DEBUG_MINIT(name) }
 
@@ -66,7 +72,7 @@
Re: generic rwsem [Re: Alpha process table hang]
Andrea, How did you generate the 00_rwsem-generic-1 patch? Against what did you diff? You seem to have removed all the optimised i386 rwsem stuff... Did it not work for you? (the generic rwsemaphores in those kernels is broken, try to use them in other archs or x86 and you will notice) and I cannot reproduce the hang any longer. Can you supply a test case that demonstrates it not working? My generic rwsem should be also cleaner and faster than the generic ones in 2.4.4pre3 and they can be turned off completely so an architecture can really take over with its own asm implementation. A quick look says it shouldn't be faster (inline functions and all that). However, I think you might be right about it being too dependent on the algorithm I put in, and that is easy to change. (while with the 2.4.4pre3 design this is obviously not possible because lib/rwsem.c compilation isn't conditional and such file knows the internals of the struct rw_semaphore). Could be very easily changed. David
Re: [PATCH] CONFIG_PACKET_MMAP should depend on MMU
Robin Getz [EMAIL PROTECTED] wrote: David - Does this give you what you need? Possibly. I'll have a look at it tomorrow. David
Re: [PATCH] CONFIG_PACKET_MMAP should depend on MMU
Aubrey Li [EMAIL PROTECTED] wrote: Here, in the attachment I wrote a small test app. Please correct if there is anything wrong, and feel free to improve it. Okay... I have that working... probably. I don't know what output it's supposed to produce, but I see this: # /packet-mmap/sample_packet_mmap 00-00-00-01-00-00-00-8a-00-00-00-8a-00-42-00-50- 38-43-13-a0-00-07-ff-3c-00-00-00-00-00-00-00-00- 00-11-08-00-00-00-00-01-00-01-00-06-00-d0-b7-de- 32-7b-00-00-00-00-00-00-00-00-00-00-00-00-00-00- 00-00-00-90-cc-a2-75-6b-00-d0-b7-de-32-7b-08-00- 45-00-00-7c-00-00-40-00-40-11-b4-13-c0-a8-02-80- c0-a8-02-8d-08-01-03-20-00-68-8e-65-7f-5b-7e-03- 00-00-00-01-00-00-00-00-00-00-00-00-00-00-00-00- 00-00-00-00-00-00-00-00-00-00-00-01-00-00-81-a4- 00-00-00-01-00-00-00-00-00-00-00-00-00-1d-b8-86- 00-00-10-00-ff-ff-ff-ff-00-00-0e-f0-00-00-09-02- 01-cb-03-16-46-26-38-0d-00-00-00-00-46-26-38-1e- 00-00-00-00-46-26-38-1e-00-00-00-00-00-00-00-00- 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00- [repeated] Does that look reasonable? I've attached the preliminary patch. Note four things about it: (1) I've had to add the get_unmapped_area() op to the proto_ops struct, but I've only done it for CONFIG_MMU=n as making it available for CONFIG_MMU=y could cause problems. (2) There's a race between packet_get_unmapped_area() being called and packet_mmap() being called. (3) I've added an extra check into packet_set_ring() to make sure the caller isn't asking for a combination of buffer size and count that will exceed ULONG_MAX. This protects a multiply done elsewhere. (4) The entire data buffer is allocated as one contiguous lump in NOMMU-mode. David --- [PATCH] NOMMU: Support mmap() on AF_PACKET sockets From: David Howells [EMAIL PROTECTED] Support mmap() on AF_PACKET sockets in NOMMU-mode kernels. 
Signed-Off-By: David Howells [EMAIL PROTECTED] --- include/linux/net.h|7 +++ include/net/sock.h |8 +++ net/core/sock.c| 10 net/packet/af_packet.c | 118 net/socket.c | 77 +++ 5 files changed, 219 insertions(+), 1 deletions(-) diff --git a/include/linux/net.h b/include/linux/net.h index 4db21e6..9e77cf6 100644 --- a/include/linux/net.h +++ b/include/linux/net.h @@ -161,6 +161,11 @@ struct proto_ops { int (*recvmsg) (struct kiocb *iocb, struct socket *sock, struct msghdr *m, size_t total_len, int flags); +#ifndef CONFIG_MMU + unsigned long (*get_unmapped_area)(struct file *file, struct socket *sock, +unsigned long addr, unsigned long len, +unsigned long pgoff, unsigned long flags); +#endif int (*mmap) (struct file *file, struct socket *sock, struct vm_area_struct * vma); ssize_t (*sendpage) (struct socket *sock, struct page *page, @@ -191,6 +196,8 @@ extern int sock_sendmsg(struct socket *sock, struct msghdr *msg, extern int sock_recvmsg(struct socket *sock, struct msghdr *msg, size_t size, int flags); extern int sock_map_fd(struct socket *sock); +extern void sock_make_mappable(struct socket *sock, + unsigned long prot); extern struct socket *sockfd_lookup(int fd, int *err); #define sockfd_put(sock) fput(sock-file) extern int net_ratelimit(void); diff --git a/include/net/sock.h b/include/net/sock.h index 2c7d60c..d91edea 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -841,6 +841,14 @@ extern int sock_no_sendmsg(struct kiocb *, struct socket *, struct msghdr *, size_t); extern int sock_no_recvmsg(struct kiocb *, struct socket *, struct msghdr *, size_t, int); +#ifndef CONFIG_MMU +extern unsigned long sock_no_get_unmapped_area(struct file *, + struct socket *, + unsigned long, + unsigned long, + unsigned long, + unsigned long); +#endif extern int sock_no_mmap(struct file *file, struct socket *sock, struct vm_area_struct *vma); diff --git a/net/core/sock.c b/net/core/sock.c index 27c4f62..b288799
Re: [PATCH] fs/afs: Convert to kthread API.
Eric W. Biederman [EMAIL PROTECTED] wrote: This patch modifies the startup of kafscmd, kafsasyncd, and kafstimod to use kthread_run instead of a combination of kernel_thread and daemonize making the code slightly simpler and more maintainable. Please drop this patch for the moment as I have my own patches to convert them to keventd-type threads, in addition to implementing a host of other changes. David
Re: [PATCH] net/rxrpc: Convert to kthread API.
Eric W. Biederman [EMAIL PROTECTED] wrote: This patch modifies the startup of krxtimod, krxiod, and krxsecd to use kthread_run instead of a combination of kernel_thread and daemonize making the code slightly simpler and more maintainable. Again, please drop in favour of my RxRPC patches. David
Re: [PATCH] CONFIG_PACKET_MMAP should depend on MMU
Aubrey Li [EMAIL PROTECTED] wrote: Yes, it's reasonable for me, as long as your host IP is 192.168.2.128 and target IP is 192.168.2.141 That is correct, yes :-) I expect it's an NFS packet as my board is using an NFS root at the moment. David
Getting the new RxRPC patches upstream
Eric W. Biederman [EMAIL PROTECTED] wrote: What is the ETA on your patches? That depends on Dave Miller now, I think. I'm assuming they need to go through the network GIT tree to get to Linus. Certainly Andrew Morton seems to think so. David
Re: Getting the new RxRPC patches upstream
Eric W. Biederman [EMAIL PROTECTED] wrote: Ok. I don't see any patches in -mm so I was assuming these patches have not been queued up anywhere. They haven't been quite yet. Is it your intention to kill these features in 2.6.22? David
Re: [PATCH] net/rxrpc: Convert to kthread API.
Andrew Morton [EMAIL PROTECTED] wrote: Do those patches convert all this code over to full use of the kthread API? Because it seems that a conversion would be straightforward, and is needed. No. They delete all that code entirely and use workqueues instead. So, I suppose merging Eric's patches first should be a simple matter of just deleting his revised code instead. David
Re: [PATCH] CONFIG_PACKET_MMAP should depend on MMU
Aubrey Li [EMAIL PROTECTED] wrote: The patch works properly on my side. But 1) I'm not sure why you re-wrote alloc/free_pg_vec function, doesn't the current implementation work for NOMMU? I know you want to allocate the entire data buffer as one contiguous lump, but is it really necessary? Yes. It's not possible to map the whole buffer otherwise. Think about it! mmap() returns _one_ reference address. In MMU-mode, the non-contiguous physical buffers can be made to appear virtually contiguous by fudging the page tables and using the MMU. This is not possible in NOMMU-mode. The app will expect the buffer to be one contiguous lump in its address space, and will not be able to locate the other segments of the buffer. Actually, what I said is not quite true. It is possible to map the whole buffer otherwise: I could lift the restriction that requires that you map the whole buffer or not at all, and then userspace could stitch the whole lot together itself. This would then require userspace to be bimodal. 2) So the mapped pages doesn't count into NR_FILE_MAPPED, is it a problem? Not really, no - there are no pagetables. Furthermore, issuing the PACKET_RX_RING sockopt does the entire allocation. Any subsequent mmaps on it have little effect. We could do that accounting though if you think it'd be better. I don't suppose it hurts. David
Re: Getting the new RxRPC patches upstream
David Miller [EMAIL PROTECTED] wrote: I applied already the patches I thought were appropriate, you had some crypto layer changes that you need to work out with Herbert Xu before the rest can be applied. Should the rest of it go via Andrew's tree then? David
Re: [PATCH] CONFIG_PACKET_MMAP should depend on MMU
Aubrey Li [EMAIL PROTECTED] wrote: as checked in packet_set_ring, buffer size must be a multiple of PAGE_SIZE: if (unlikely(req->tp_block_size & (PAGE_SIZE - 1))) So why not use __get_free_pages rather than kmalloc, Because kmalloc() may be able to get us a smaller chunk of memory. Actually, calling __get_free_pages() might be better, and then release the excess pages. so that we have pagetables to count? There are no pagetables in NOMMU-mode. David
Re: Getting the new RxRPC patches upstream
David Miller [EMAIL PROTECTED] wrote: Now that Herbert cleared up the crypto layer issues the only problem left is that there are generic changes in there which are not strictly networking but which your subsequent networking changes depend upon. This is a mess, and makes merging your work into the net-2.6.22 tree more difficult. There are only two non-net patches that AF_RXRPC depends on: (1) The key facility changes. That's all my code anyway, and shouldn't be a problem to merge unless someone else has put some changes in there that I don't know about. (2) try_to_cancel_delayed_work(). I suppose I could use cancel_delayed_work() instead, but that's less efficient as it waits for the timer completion function to finish. And one that AFS depends on: (3) Cache the key in nameidata. I still don't have Al's agreement on this, but it's purely caching, so I could drop that patch for the moment and excise the stuff that uses it from my AFS patches if that would help. Do you class the AFS patches as networking changes? Do you want me to consolidate my patches to make things simpler for you? Do you want me to rebase my patches onto net-2.6.22? 
I have the following patches, in order, available now, though I haven't yet released the last few (they can all be downloaded from my RH people pages):

	move-skb-generic.diff (you've got this)
	timers.diff
	keys.diff
	af_rxrpc.diff
	afs-cleanup.diff
	af_rxrpc-kernel.diff
	af_rxrpc-afs.diff
	af_rxrpc-delete-old.diff
	af_rxrpc-own-workqueues.diff
	af_rxrpc-fixes.diff
	afs-callback-wq.diff
	afs-vlocation.diff
	afs-multimount.diff
	afs-rxrpc-key.diff
	afs-nameidata-key.diff
	afs-security.diff
	afs-doc.diff
	netlink-support-MSG_TRUNC.diff (you've got this)
	afs-get-capabilities.diff
	afs-initcallbackstate3.diff
	afs-dir-write-support.diff

David
Re: [PATCH] CONFIG_PACKET_MMAP should depend on MMU
Eric Dumazet [EMAIL PROTECTED] wrote: Is it really possible to allocate an order-10 page, then release part of it (say an order-8 subpage) ? Yes. David
Re: Getting the new RxRPC patches upstream
We only care when del_timer() returns true. In that case, if the timer function still runs (possible for single-threaded wqs), it has already passed __queue_work(). Why do you assume that? David
Re: Getting the new RxRPC patches upstream
Oleg Nesterov [EMAIL PROTECTED] wrote: We only care when del_timer() returns true. In that case, if the timer function still runs (possible for single-threaded wqs), it has already passed __queue_work(). Why do you assume that? Sorry, I should have been more clear. I meant the assumption that we only care about a true return from del_timer(). If del_timer() returns true, the timer was pending. This means it was started by work->func() (note that __run_timers() clears timer_pending() before calling timer->function). This in turn means that delayed_work_timer_fn() has already called __queue_work(dwork), otherwise work->func() has no chance to run. But if del_timer() returns 0, then there may be a problem. We can't tell the difference between the following two cases: (1) The timer hadn't been started. (2) The timer had been started, has expired and is no longer pending, but another CPU is running its handler routine. try_to_del_timer_sync() _does_, however, distinguish between these cases: the first is the 0 return, the second is the -1 return, and the case where it dequeued the timer is the 1 return. BTW, can a timer handler be preempted? I assume not... But it can be delayed by interrupt processing. David
Re: Getting the new RxRPC patches upstream
Oleg Nesterov [EMAIL PROTECTED] wrote: Let's look at (2). cancel_delayed_work() (on top of del_timer()) returns 0, and this is correct, we failed to cancel the timer, and we don't know whether work->func() finished, or not. Yes. The current code uses del_timer_sync(). It will also return 0. However, it will spin waiting for timer->function() to complete. So we are just wasting CPU. That's my objection to using cancel_delayed_work() as it stands, although in most cases it's a relatively minor waste of time. However, if the timer expiry routine gets interrupted then it may not be so minor... So, yes, I'm in full agreement with you there. I guess I misunderstood you. Perhaps, you propose a new helper which use try_to_del_timer_sync(), yes? Unless I missed something, this doesn't help. Because the return value == -1 should be treated as 0. We failed to stop the timer, and we can't free dwork. Consider how I'm using try_to_cancel_delayed_work(): I use this when I want to queue a delayed work item with a particular timeout (usually for immediate processing), but the work item may already be pending. If try_to_cancel_delayed_work() returns 0 or 1 (not pending or pending but dequeued) then I can go ahead and just schedule the work item (I'll be holding a lock to prevent anyone else from interfering). However, if try_to_cancel_delayed_work() returns -1 then there's usually no point attempting to schedule the work item because I know the timer expiry handler is doing that or going to do that. The code looks like this in pretty much all cases:

	if (try_to_cancel_delayed_work(&afs_server_reaper) >= 0)
		schedule_delayed_work(&afs_server_reaper, 0);

And so could well be packaged into a convenience routine and placed in workqueue.[ch]. However, this would still concern Dave Miller as my patches would still be altering non-net stuff or depending on non-net patches he doesn't have in his GIT tree.
Using cancel_delayed_work() instead would be acceptable, functionally, as
that just waits till the -1 return case no longer holds true, and so always
returns 0 or 1.

In RxRPC, this is only used to cancel a pair of global delayed work items
in the rmmod path, and so the inefficiency of cancel_delayed_work() is
something I can live with, though it's something I'd want to reduce in the
longer term.  In AFS, this is not only used in object destruction paths,
but is also used to cancel the callback timer and initiate synchronisation
processing with immediate effect.

David
Re: Getting the new RxRPC patches upstream
Oleg Nesterov <[EMAIL PROTECTED]> wrote:

> > > The current code uses del_timer_sync().  It will also return 0.
> > > However, it will spin waiting for timer->function() to complete.  So
> > > we are just wasting CPU.
> >
> > That's my objection to using cancel_delayed_work() as it stands,
> > although in most cases it's a relatively minor waste of time.  However,
> > if the timer expiry routine gets interrupted then it may not be so
> > minor...  So, yes, I'm in full agreement with you there.
>
> Great.  I'll send the s/del_timer_sync/del_timer/ patch.

I didn't say I necessarily agreed that this was a good idea.  I just meant
that I agree that it will waste CPU.  You must still audit all uses of
cancel_delayed_work().

> Aha, now I see what you mean.  However, why is the code above better
> than:
>
>	cancel_delayed_work(&afs_server_reaper);
>	schedule_delayed_work(&afs_server_reaper, 0);
>
> (I assume we already changed cancel_delayed_work() to use del_timer.)

Because calling schedule_delayed_work() is a waste of CPU if the timer
expiry handler is currently running at this time as *that* is going to also
schedule the delayed work item.

> If delayed_work_timer_fn() is not running - both variants (let's denote
> them as 1 and 2) do the same.

Yes, but that's not the point.

> Now suppose that delayed_work_timer_fn() is running.
>
>   1: lock_timer_base(), return -1, skip schedule_delayed_work().
>
>   2: check timer_pending(), return 0, call schedule_delayed_work(),
>      return immediately because test_and_set_bit(WORK_STRUCT_PENDING)
>      fails.

I don't see what you're illustrating here.  Are these meant to be two steps
in a single process?  Or are they two alternate steps?

> So I still don't think try_to_del_timer_sync() can help in this
> particular case.

It permits us to avoid the test_and_set_bit() under some circumstances.
> To some extent, try_to_cancel_delayed_work is:
>
>	int try_to_cancel_delayed_work(dwork)
>	{
>		ret = cancel_delayed_work(dwork);
>		if (!ret && work_pending(&dwork->work))
>			ret = -1;
>		return ret;
>	}
>
> iow, work_pending() looks like a more precise indication that
> work->func() is going to run soon.

Ah, but the timer routine may try to set the work item pending flag *after*
the work_pending() check you have here.  Furthermore, it would be better to
avoid the work_pending() check entirely because that check involves
interacting with atomic ops done on other CPUs.

try_to_del_timer_sync() returning -1 tells us without a shadow of a doubt
that the work item is either scheduled now or will be scheduled very
shortly, thus allowing us to avoid having to do it ourselves.

David
Re: Getting the new RxRPC patches upstream
Oleg Nesterov <[EMAIL PROTECTED]> wrote:

> Sure, I'll grep for cancel_delayed_work().  But unless I missed
> something, this change should be completely transparent for all users.
> Otherwise, it is buggy.

I guess you will have to make sure that cancel_delayed_work() is always
followed by a flush of the workqueue, otherwise you might get this
situation:

	CPU 0				CPU 1
	===				===
	timer expires			cancel_delayed_work(x) == 0
	-->delayed_work_timer_fn(x)	kfree(x);
	-->do_IRQ()			y = kmalloc(); // reuses x
	<--do_IRQ()
	__queue_work(x) --- OOPS ---

That's my main concern.  If you are certain that can't happen, then fair
enough.

Note that although you can call cancel_delayed_work() from within a work
item handler, you can't then follow it up with a flush as it's very likely
to deadlock.

> > Because calling schedule_delayed_work() is a waste of CPU if the timer
> > expiry handler is currently running at this time as *that* is going to
> > also schedule the delayed work item.
>
> Yes.  But otoh, try_to_del_timer_sync() is a waste of CPU compared to
> del_timer(), when the timer is not pending.

I suppose that's true.  As previously stated, my main objection to
del_timer() is the fact that it doesn't tell you if the timer expiry
function is still running.

Can you show me a patch illustrating exactly how you want to change
cancel_delayed_work()?  I can't remember whether you've done so already,
but if you have, I can't find it.  Is it basically this?:

	 static inline int cancel_delayed_work(struct delayed_work *work)
	 {
	 	int ret;

	-	ret = del_timer_sync(&work->timer);
	+	ret = del_timer(&work->timer);
	 	if (ret)
	 		work_release(&work->work);
	 	return ret;
	 }

I was thinking this situation might be a problem:

	CPU 0				CPU 1
	===				===
	timer expires			cancel_delayed_work(x) == 0
	-->delayed_work_timer_fn(x)	schedule_delayed_work(x,0)
	-->do_IRQ()			keventd scheduled x->work()
	<--do_IRQ()
	__queue_work(x)

But it won't, will it?

> > Ah, but the timer routine may try to set the work item pending flag
> > *after* the work_pending() check you have here.
> No, delayed_work_timer_fn() doesn't set the _PENDING flag.

Good point.

> I don't think that's a problem because cancel_delayed_work() won't clear
> the pending flag if it didn't remove a timer.  First, this is a very
> unlikely event; delayed_work_timer_fn() is very fast unless interrupted.

Yeah, I guess so.  Okay, you've convinced me, I think - provided you
consider the case I outlined at the top of this email.

If you give me a patch to alter cancel_delayed_work(), I'll substitute it
for mine and use that instead.  Dave Miller will just have to live with
that patch being there :-)

David