[RFC] Task Ornaments

2001-01-26 Thread David Howells


PROPOSAL


I have written a patch (appended) that gives a task structure a list of
arbitrary "ornaments". Any number of ornaments can then be added without
increasing the size of the task_struct.

Each ornament has an operations table, much as do inodes and address
spaces. These operations include a destructor and routines for destroying and
duplicating ornaments under signal, exit, execve and fork/clone conditions.

For instance, the frame buffer drivers use a three fields in the task
structure to gain notification of signals that are blocked from being
delivered to a process:
- notifier
- notifier_data
- notifier_mask

Instead, these fields could be attached to an ornament which is only present
when the FB driver is doing this. They would use the signal notification
operation to perform this function.

In this way, the task_struct size could actually be _reduced_ by 4 bytes.

Functions are provided for traversing the list and calling the appropriate
notification routine on each ornament. Multiple traversals can be in progress
at once, and they can be recursive, and the list can be safely modified at the
same time.

The implementation I've appended does work in userspace tests (as far as it is
possible to test it there), though it has not been tried in the kernel as yet.

One optimisation that can be made is to make alloc_lock in the task_struct an
rwlock_t instead of a spinlock_t. This means that the list can be searched
simultaneously be several different processes.


REASON FOR IMPLEMENTATION
=

In the Win32 kernel module that I'm writing to help accelerate Wine, one of
the things I want to do is to be able to store a Win32 handle map in a task
structure (currently I attach it to an open fd).

There are a number of things my code needs to be able to do:

  (1) Destroy the handle map upon exit, signal and potentially execve.

  (2) Trim the handle map during execve.

  (3) Duplicate/share the handle map of a process with a child process during
  forking/cloning.

  (4) A process needs to be able to add objects to the handle map of another
  arbitrary process that also has a handle map.

  (5) A process needs to be able to gain notification of the death of another
  arbitrary process.

The way I'm proposing to do this is by providing notifications from certain
routines, such as do_fork().

There are a number of further considerations:

  (1) A notification service routine can sleep (eg: on kmalloc).

  (2) A service routine may modify the list (eg: remove itself).

  (3) The list must be guarded against modifications during traversal without
  preventing modifications from happening.


Thanks for your consideration/comments/better ways of doing it,
David

==
diff -uNr linux-2.4.1-pre10.orig/fs/exec.c linux-2.4.1-pre10/fs/exec.c
--- linux-2.4.1-pre10.orig/fs/exec.cFri Jan 26 15:57:12 2001
+++ linux-2.4.1-pre10/fs/exec.c Fri Jan 26 17:28:32 2001
@@ -541,6 +541,7 @@
if (retval) goto mmap_failed;
 
/* This is the point of no return */
+   task_ornament_notify_execve(current);
release_old_signals(oldsig);
 
current-sas_ss_sp = current-sas_ss_size = 0;
diff -uNr linux-2.4.1-pre10.orig/include/linux/sched.h 
linux-2.4.1-pre10/include/linux/sched.h
--- linux-2.4.1-pre10.orig/include/linux/sched.hFri Jan 26 15:57:16 2001
+++ linux-2.4.1-pre10/include/linux/sched.h Fri Jan 26 16:08:32 2001
@@ -394,7 +394,8 @@
u32 parent_exec_id;
u32 self_exec_id;
 /* Protection of (de-)allocation: mm, files, fs, tty */
-   spinlock_t alloc_lock;
+   rwlock_t alloc_lock;
+   struct list_head ornaments;
 };
 
 /*
@@ -474,7 +475,7 @@
 sig:   init_signals,  \
 pending:   { NULL, tsk.pending.head, {{0}}},  \
 blocked:   {{0}},  \
-alloc_lock:SPIN_LOCK_UNLOCKED  \
+alloc_lock:RW_LOCK_UNLOCKED\
 }
 
 
@@ -859,12 +860,12 @@
 
 static inline void task_lock(struct task_struct *p)
 {
-   spin_lock(p-alloc_lock);
+   write_lock(p-alloc_lock);
 }
 
 static inline void task_unlock(struct task_struct *p)
 {
-   spin_unlock(p-alloc_lock);
+   write_unlock(p-alloc_lock);
 }
 
 /* write full pathname into buffer and return start of pathname */
diff -uNr linux-2.4.1-pre10.orig/include/linux/taskornament.h 
linux-2.4.1-pre10/include/linux/taskornament.h
--- linux-2.4.1-pre10.orig/include/linux/taskornament.h Thu Jan  1 01:00:00 1970
+++ linux-2.4.1-pre10/include/linux/taskornament.h  Fri Jan 26 17:16:22 2001
@@ -0,0 +1,129 @@
+/* taskornament.h: task ornaments
+ *
+ * (c) Copyright 2001   David Howells ([EMAIL PROTECTED])
+ */
+
+#ifndef _H_F92CE22E_93FF_11D4_8781_C000574

Re: [PATCH] guard mm-rss with page_table_lock (241p11)

2001-01-30 Thread David Howells

...
 + spin_lock(mm-page_table_lock);
   mm-rss++;
 + spin_unlock(mm-page_table_lock);
...

Would it not be better to use some sort of atomic add/subtract/clear operation
rather than a spinlock? (Which would also give you fewer atomic memory access
cycles).

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: rwsem, gcc3 again

2001-05-16 Thread David Howells


Hi Andrea:

Here you go:

/usr/local/bin/gcc -D__KERNEL__ -I/inst-kernels/linux-2.4.5-pre2-aa/include -Wall 
-Wstrict-prototypes -O2 -fomit-frame-pointer -fno-strict-aliasing -pipe 
-mpreferred-stack-boundary=2 -march=i686-DEXPORT_SYMTAB -c sys.c
sys.c: In function `sys_gethostname':
/inst-kernels/linux-2.4.5-pre2-aa/include/asm/rwsem_xchgadd.h:51: inconsistent
operand constraints in an `asm'

I've lit fires underneath some of our gcc people, and they say it's definitely
a bug in gcc. They're currently looking at it.

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: 00_rwsem-11, 2.4.4-ac11 and gcc-3(2001-05-14)

2001-05-21 Thread David Howells


The compiler should be now fixed in this respect, for both my stuff that's in
the kernel and Andrea's desired replacement. The problem appears to have been
triggered by having two input+output constraints (eg: +r, +m). However,
I can't test this because the head of the CVS trunk doesn't seem to have been
able to build and test successfully since just before the fix was
applied. (I'm going on the codesourcery builds for this).

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: __asm__

2001-05-23 Thread David Howells

 __asm__(and  1 %%esp.%0; :=r (current) : 0 (~8191UL));

This doesn't look right... Where did you get this from.

Taking the one in include/asm-i386/current.h as an example:

| __asm__(

This signifies a piece of inline assembly that the compiler must insert into
it's output code. The __asm__ is the same as asm, but can't be disabled by
command line flags.

| andl %%esp,%0

%% is a macro that expands to a %.
%0 is a macro that expands to the first input/output specification.

So in this case, it takes the stack pointer (register %esp) and ANDs it
into a register that contains 0xE000, leaving the result in that register.

Basically, the a task's task_struct and a task's kernel stack occupy an 8KB
block that is 8KB aligned, with the task_struct at the beginning and the stack
growing from the end downwards. So you can find the task_struct by clearing
the bottom 13 bits of the stack pointer value.

| ; 

The semicolon can be used to separate assembly statements, as can the newline
character escape sequence (\n).

| :=r (current)

This specifies an output constraint (all of which occur after the first colon,
but before the second). The '=' also specifies that this is an output. The 'r'
indicates that a general purpose register should be allocated such that the
instruction can place the output value into it. The bit inside the brackets -
'current' - is the intended destination of the output value (normally a local
variable) once the C part is returned to.

| : 0 (~8191UL));

This specifies an input constraint (all of which occur after the second colon,
but before the third). The '0' references another constraint (in this case,
the first output constraint), saying that the same register or memory location
should be used for both. The '~8191UL' inside the brackets is a constant that
should be loaded into the register allocated for the output value before using
the instructions inside the asm block.

See also the gcc info pages, Topic C Extensions, subtopic Extended Asm.

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [Re: __asm__ ]

2001-05-23 Thread David Howells

Okay,

current is a macro on i386 that expands to get_current(). This gets the
task_struct for the task currently running on the CPU executing the code.

It does this by masking out the bottom bits of its kernel stack pointer.

For example, assuming that some running process has the following task record
stored in an 8KB-aligned 8KB block.

0xD520BFFF  +---+
|   |
| kernel stack  |
|   |
0xD520B498  + TOS --+  -- stack pointer: %esp
|   |
| empty space   |
|   |
+---+
|   |
| task_struct   |
|   |
0xD520A000  +---+  -- get_current()

get_current() can work out where the base of this block is because the kernel
(1) stack pointer is always within it, (2) it's aligned in memory with respect
to its size:

get_current() { return %esp~8191; }
get_current() { return 0xD520B498  0xE000; }
get_current() = 0xD520A000

So current-fs is a structure that holds the current task's idea of its root
filesystem (chroot), current working directory (chdir) and current umask.

And so current-fs-root is the task's filesystem root dentry.

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



rwsems and asm-constraint gcc bug

2001-05-23 Thread David Howells


The bug in gcc 3.0 that stopped the inline asm constraints being interpreted
properly, and thus prevented linux from compiling is now fixed.

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: linux-2.4.3-ac14 spinlock problem?

2001-05-30 Thread David Howells


 I was running something on my Dell dual p3 box (optiplex gx300). my kernel
 is linux-2.4.3-ac14. I got the following message:

How often did this message occur?

 __rwsem_do_wake(): wait_list unexpectedly empty
 [4191] c5966f60 = { 0001 })
 kenel BUG at rwsem.c:99!
 invalid operand: 
 CPU:1
 EIP:0010:[c0236b99]
 EFLAGS: 00010282
 kenel BUG at /usr/src/2.4.3-ac14/include/asm/spinlock.h:104!
 
 
 I upgrade the kernel to 2.4.5, no such problem any more.

I suspect something else corrupted the rw-semaphore structure, but that's very
hard to prove unless you catch it in the act. If it happens again with any
frequency, you might want to try turning on rwsem debugging.

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [RFC] Wine speedup through kernel module

2000-09-07 Thread David Howells


 Please by no way don't include this patch into the official tree.

I wasn't intending to put it up for inclusion into the main kernel tree... I
think it is far better that it stays as a module, especially as it doesn't
require any changes to the main tree to be used, it can just be loaded when
one wants to use Wine.

 It's insane due to the following:
 
 1. Linux is UNIX not NT... (in terms of API)
 
 2. WINE in itself is barely usefull - even in fact non existant, since
 there is
no official stable release out there.

It is actually proving very useful, at least to me... It allows me to use PVCS
(a horrible java-based version control system for windows only) at work whilst
running Linux. It also allows me to run MSDEV without having to reboot.

David Howells
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Wine speedup through kernel module

2000-09-07 Thread David Howells


Tigran Aivazian [EMAIL PROTECTED] wrote:
 how about the classical standard way of "getting to kernel space", i.e.
 plain system calls? Unless you really need a huge number of new system
 calls,

Just the one... this'll take a parameter specifying the Win32 call to make:

 int win32(int fd, enum win32op, void *args, u_int *win32error)

The fd is currently used to find the Win32 handle map, which is attached to
struct file-f_private on that fd. I don't know any other way of doing this at
the moment (at least so the handle map automatically gets destroyed when the
owning process expires).

One thing that might be useful in this regard is the ability to add callbacks
into a list on the task structure, such that they are invoked when the process
exits/execs (and at no other time).

 what you can do is in the module, scan for sys_ni_syscall entries
 in the sys_call_table[] table (see entry.S) and replace them with pointers
 to functions in your module's address space. Then you need to communicate
 the slot numbers thus occupied back to userspace, e.g. via /proc file
 which then can invoke them in a standard int 0x80 manner.

Yes... that's a good idea... I already use a /proc file to provide access to
the kernel module, I can just extend that.

 Or you can hardcode the slots but that is not flexible because unoccupied
 slots may become occupied at any time in the future.

Unless I can get one reserved, though I suspect that won't happen.

 Also, if you do this - don't forget to inc/dec module refcount in each of
 your new syscalls, for obvious reasons - if the syscall sleeps and the
 module is unloaded, it will return to "nowhere".

That is actually quite a big problem... How do you stop someone accessing the
syscall whilst the module's cleanup function is running, but before it has a
chance to clear the syscall slot? Does module unloading lock out other
syscalls?

This may require some sort of in-the-main-kernel syscall register/unregister
functionality that takes a handler pointer _and_ a module definition pointer.

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Wine speedup through kernel module

2000-09-07 Thread David Howells

Andi Kleen [EMAIL PROTECTED] wrote:
 It could use a read-only-to-clients shared memory with some locking tricks.

You still have to be able to emulate WaitForMultipleObjects() which I think is
quite difficult from userspace. It can perhaps be done with signals, but that
then incurs costs in invoking signal handlers. And however it is done, it
would require thread-safe wait queues to be implemented in userspace.

The kernel's wait queues make this a lot easier, quicker and safer.

 I looked a bit over the code. Your Mutex classes do not look very SMP safe,
 have they been tested with SMP ?

Look carefully... It uses the atomic bit set/clear functions to modify the
state, and the wait-queue carries its own lock.

And yes, they have been SMP tested... I have a dual Pentium-II box at home. I
ran my five-thread test programs on it and they ran extremely fast without
crashing (faster than Win2000 on the same box with the same test).

 Also I guess it would be better to implement Mutexes in user space as far as
 possible and only use kernel help for inter process mutexes.

What's the difference? And how do you tell that a mutex is going to be used
only between threads and not between processes? Anonymous mutexes don't
actually help - think of DuplicateHandle!

Plus it only greatly complicates your WaitFor* functions.

 For the object name management I did not found any limits that would prevent
 an user from allocating all memory. Are there any?

getname() implies MAXPATHLEN (the VFS name to kernel-space function).

 wineserver_read is definitely buggy, it holds spinlocks while calling
 functions that may sleep (e.g. the copy_to_user in wineserver_read_printf)

Good point.

 Overall it does not look too bad. It would be nice though if there was an
 easy way to extend the object management from user space, because Windows
 probably has a lot of object types Linux definitely does not want in kernel
 space.

There will be... though there aren't actually all that many "kernel" object
handles... most of what are called "handles" are really just userspace object
handles issued by userspace DLL's.

However... for instance, the way I'm planning on implementing the registry
through this module is to have a cut down RPC server (using the RPC mechanism
available in that module) that gets invoked implicitly by RegOpenKey, et al.

David Howells
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Wine speedup through kernel module

2000-09-07 Thread David Howells


Andi Kleen [EMAIL PROTECTED] wrote:
 But that's not race free on SMP. Two CPUs can set the bit in parallel
 and you'll never notice. You would need at least a protecting spinlock 
 between the test bit and set bit (or a cmpxchg on x86) 

Are you sure? I understood that the "lock" prefix on a i386 made the
instruction it guarded SMP safe.

If not, I suppose I can use the xchg() macro instead.

Hold on a moment... You said "between the test bit and set bit"... this is a
single CPU instruction! With the lock prefix, there should be no between.

Also, a quote from asm/bitops.h:
- /*
-  * These have to be done with inline assembly: that way the bit-setting
-  * is guaranteed to be atomic. All bit operations return 0 if the bit
-  * was cleared before the operation and != 0 if it was not.
-  *
-  * bit 0 is the LSB of addr; bit 32 is the LSB of (addr+1).
-  */

Doesn't "atomic" mean SMP safe?

What's the point in the lock prefix if it doesn't make things SMP safe (after
all, the unadorned instructions are UP safe...)?

 That also does not help for protecting your objects, it only protects
 the wait queue.

Yes.

I don't see what more needs to be done, at least on the mutex. It is either
available when we come to grab it, and we find this out and grab it in a
single operation, or it isn't, and we wait.

Furthermore, the objects also have atomic_t reference counts like other kernel
modules, so they aren't going to go away whilst in use.

 The processes look it up in a shared name table (shmem). Yes they have 
 to negoitate in this case -- it'll only be a win when they usually do not
 share.

And really horrible when they do share. Plus you still have to jump through
all the same synchronisation hoops to control access to the shmem that the
kernel does when governing access to its objects.

 Probably, it is a DoS. 

Surely a DoS attack is only possible if there is a limit in force? Namespace
DoS attacks are possible whatever the underlying system, wineserver, kernel
module, or Windows itself.

Of course, when I said "number of processes (not threads) * ~1020" what I
actually meant is "number of interested processed...", since any process not
interested can't actually issue the Win32 calls anyway.

David Howells
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Wine speedup through kernel module

2000-09-08 Thread David Howells


Horst von Brand [EMAIL PROTECTED] wrote:
 Lost me there. If after releasing the mutex it is free, the release was
 sucessful AFAIAC. If two threads try to do it at the same time, so what?
 Releasing an already free mutex is broken, OK. But two threads owning the
 mutex at the same time is much worse...

Because the mutex release is done in two stages: (1) clear the state bit and
(2) set the owner pointer to NULL.

This is a sufficient window for a potential mutex grabber to jump in grab the
mutex and then have stage (2) clobber the owner pointer.

Anyway, I've changed this now (though I haven't got around to testing it yet):
On platforms that support cmpxchg(), it uses the owner pointer as the state
and exchanges it to NULL to release the mutex, and exchanges it to the owner
pointer when grabbing.

On platforms that don't, a spinlock is added to the WineMutex structure.

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Wine speedup through kernel module

2000-09-11 Thread David Howells


Dan Maas [EMAIL PROTECTED] wrote:
 This is done all the time -- with ioctl(). It's perfectly normal to create a
 special character device that just responds to an ioctl for each operation
 you want to perform. See eg any sound card driver...

Yes, that's how I'm doing it at the moment. However, this incurs a penalty
because of the the number of standard ioctl()'s the system checks for first,
before passing the command to my handler.

What Robert suggested was to use write() to do the deed, which I don't think
is a good idea.

David Howells
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Wine speedup through kernel module

2000-09-11 Thread David Howells


Oliver Neukum [EMAIL PROTECTED] wrote:
 What is the difference to get one reserved syscall and multiplex it ?

This is what I'd like to be able to do... that way the checks that ioctl()
performs can be avoided.

However, there are problems with doing this:

(1) There's currently no definitive way to grab an unused syscall, whether a
random one or a pre-determined one.

(2) The syscall table is not exported...

(3) Even if it was... just filling in the syscall slot from a module means
that it is possible for the module to be unloaded whilst the syscall is in
use.

David Howells
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Wine speedup through kernel module

2000-09-11 Thread David Howells


Oliver Neukum [EMAIL PROTECTED] wrote:
 So ask Linus for one. The streams group got one. Why shouldn't yo ?

Well, that's up to Linus... but from his email on this subject, he might well.

 Having a static syscall should be more efficient, too.

A little... otherwise it's a matter once per process of reading a line from
the /proc file, pulling out the number (or reading via an ioctl) and keeping
it in a global variable, rather than having it as immediate data on an
assembly instruction.

 It need not be.

True... but it isn't at the moment.

 You should be able to steal a solution from knfsd. It faces the same problem.
 Of course it takes a small stub that's always in kernel, but so what.

Interesting idea... I'll take a look. Maybe it'll be possible to share
something.

David Howells
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Wine speedup through kernel module

2000-09-11 Thread David Howells

"Albert D. Cahalan" [EMAIL PROTECTED] wrote:
 The system call is needed of course, since that lets Linux executables
 (perhaps ones being ported from Win32) use the new features.

It also means that non-i386 and non-wine use these services if they want to.

 You might as well also handle int 0x2e with the same code.

I suppose... provide three modules maybe: (1) the object and handle support
mechanism; (2) the Win32 API (what of it I decide to implement in a kernel
module) and syscall; and (3) the Native API and interrupt.

 See www.sysinternals.com for some native API documentation.
 In addition to the web site, there are articles and a book.

Been there... he says that you can't get all the API documentation.

 The native API is generally better for Linux, because it is smaller
 and less bloated than Win32.

Agreed.

 Putting Win32 in the kernel is almost like putting libc in the kernel. You
 can make exceptions for calls that need the greatest performance and/or
 don't translate well.

Also agreed, particularly when GDI calls are considered. Though this also
applies to the Native API.

David Howells
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Wine speedup through kernel module

2000-09-12 Thread David Howells


David Woodhouse [EMAIL PROTECTED] wrote:
 [EMAIL PROTECTED] said:
   The best solution is to have a stub which is always resident that
  does the inc/dec of the module count.  This stub can reserve the
  syscall number as well. 
 
 _Please_ don't have a stub in the kernel which is conditionally compiled 
 according to the value of a _MODULE option. 

I have to admit, the thought hadn't occurred to me to do that. It almost
sounds like a good idea... after all the stub can't be used if no modules can
be loaded. Are you thinking, then, that the stub can't be used by compiled in
(ie non-module) code? Or am I misunderstanding what you mean?

David Howells
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Wine speedup through kernel module

2000-09-12 Thread David Howells


David Woodhouse [EMAIL PROTECTED] wrote:
 Unless it's _absolutely_ necessary, the kernel image (i.e. vmlinux) should
 not contain any code which is dependent on CONFIG_*_MODULE options. 
 
 Therefore, stuff like...
 
 #ifdef CONFIG_WIN32_MODULE
   EXPORT_SYMBOL(my_win32_helper_func);
 #endif
 
 ...would mean that you have to recompile the kernel and reboot to enable the
 module, rather than just compiling and loading the module.

Ah... I did misunderstand you. I thought you meant CONFIG_MODULES in general,
which'd be okay - obviously, if module support is disabled, you can't load a
module anyway.

No, I wasn't planning to do that. I was thinking more along the lines of
adding another handler in to struct exec_domain, but that isn't especially
generic.

I think now that I'm probably best providing a generic pluggable syscall
handler, one that is very careful to make sure the syscall can't be entered
whilst the module is being unloaded.

David Howells
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Wine speedup through kernel module

2000-09-12 Thread David Howells


Richard Guenther [EMAIL PROTECTED] wrote:
 Whats the problem with just not allowing the module to unload at all?
 Whats the point in unloading a module anyways. Ok, I know - debugging,
 etc. - but for a "release" version this is not important. Also upgrading -
 but for desktop boxes (for which this wine stuff is) rebooting seems
 appropriate here, too.

It's not all that hard to do... and it does mean you can have demand-loadable
and -unloadable modules. Also I am writing/debugging it at the moment, and so
I like being able to unload and reload the module.

David Howells
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Wine speedup through kernel module

2000-09-12 Thread David Howells


David Woodhouse [EMAIL PROTECTED] wrote:
 You don't need to add another handler. I already overloaded the lcall7
 handler by passing an extra int into it to tell it the type of call which is
 causing it to be invoked. Values which are already used are 7 for iBCS calls
 (lcall7) and 0x27 for Solaris/x86 syscalls (lcall 0x27). Pick another
 number, and possibly reassign those two to 1 and 2 respectively because it's
 cleaner like that.

But where do I get the other argument (struct pt_regs *) from? A normal Linux
syscall does not appear to have access to such a beast...

David Howells
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Wine speedup through kernel module

2000-09-12 Thread David Howells


Oliver Neukum [EMAIL PROTECTED] wrote:
  I think now that I'm probably best providing a generic pluggable syscall
  handler, one that is very careful to make sure the syscall can't be entered
  whilst the module is being unloaded.
 
 This seems to me the best idea. However I would argue against dynamically
 allocating syscalls. Reserving numbers makes for better code and allows you
 to do autoloading.

Now there's an idea... have the kernel autoload modules based on a particular
syscall number being handled in that module. Madness *grin*.

You still have to have a static syscall number though...

David Howells
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Wine speedup through kernel module

2000-09-13 Thread David Howells


David Woodhouse [EMAIL PROTECTED] wrote:
 The code I posted yesterday is a bit of an abuse of the personality 
 mechanism, but ought to work nonetheless. Didn't you like it?

I didn't know how to get hold of a "struct pt_regs*" till someone sent me a
message this morning (it was obvious really). Plus, it occurred to me that I
already have this fd I can use for reference counting.

 [EMAIL PROTECTED] said:
   thus retaining the existence of the struct file, which then retains
  the module in memory.
 
 er.. why does it keep the module in memory?

Because "getting" the struct file from the fd increments the usage count on
the struct file. The VFS automatically maintains the module count pointed to
by the file_operations structure based on the existence of the struct file.

David Howells
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Wine speedup through kernel module

2000-09-15 Thread David Howells

At the moment, the Win32 syscalls I implement require an fd to be attached to
a particular proc file. This fd holds the Win32 handle map.

The VFS provides auto-refcounting on modules that provide file operations,
thus the syscall stub only needs to check that the fd provided has the correct
set of file operations attached and increment the refcount on the fd itself.

This has the added advantage that not only will the module not go away during
a syscall, but the handle map will not depart either.

There are other ways to make sure the module doesn't go away, for instance
using personalities, but I still need to maintain the refcount on the handle
map fd.

David Howells
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



[PATCH] To export the access_process_vm function

2000-09-15 Thread David Howells

Could you please apply this little patch to export access_process_vm()?

Cheers,
David Howells
=

diff -uNr linux-2.4.0-test8-orig/kernel/ksyms.c linux-2.4.0-test8/kernel/ksyms.c
--- linux-2.4.0-test8-orig/kernel/ksyms.c   Fri Sep 15 00:04:36 2000
+++ linux-2.4.0-test8/kernel/ksyms.cMon Sep 11 23:39:38 2000
@@ -123,6 +123,7 @@
 EXPORT_SYMBOL(find_vma);
 EXPORT_SYMBOL(get_unmapped_area);
 EXPORT_SYMBOL(init_mm);
+EXPORT_SYMBOL(access_process_vm);
 #ifdef CONFIG_HIGHMEM
 EXPORT_SYMBOL(kmap_high);
 EXPORT_SYMBOL(kunmap_high);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Wine speedup dynamic system calls are not going to happen.

2000-09-18 Thread David Howells


[EMAIL PROTECTED] (Linus Torvalds) wrote:
 David Howells  [EMAIL PROTECTED] wrote:
 
 Oliver Neukum [EMAIL PROTECTED] wrote:
   (3) Even if it was... just filling in the syscall slot from a module means
   that it is possible for the module to be unloaded whilst the syscall is in
   use.

 Note that all this is not the problem.

 The problem is that dynamic system calls are not going to happen.

 License issues. I will not allow system calls to be added from modules.
 Because I do not think that adding a system call is a valid thing for a
 module to do. It's that easy.

 It's the old thing about "hooks". You must not sidestep the GPL by just
 putting a hook in place. And dynamic system calls are the ultimate hook.

What about nfsd's "nfsservctl" syscall then? This appears to be modulable...

Anyway, I didn't mean using a random syscall number provided by the kernel
upon loading the module, but having one number preallocated for the purpose as
is done in knfsd.

Do you think, then, that I should be putting my stuff up for inclusion in the
kernel proper (say v2.5)? Or do you think I shouldn't be using a syscall at
all? Or should I just be offering it as a kernel patch that has to be applied
for use?

David Howells

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Wine speedup through kernel module

2000-09-18 Thread David Howells


  At the moment, the Win32 syscalls I implement require an fd to be
  attached to a particular proc file. This fd holds the Win32 handle map.

 Huh?

Each process needs a handle map... To avoid playing with the task structure,
fork, exec, exit, signals, etc., I used an fd attached to a proc file to keep
track of the handle map (it gives me reference counting, module tracking, and
auto-cleanup on process death).

The handle map page is just attached to the private data pointer on the struct
file. It was just convenient for starting the whole thing off.

  There are other ways to make sure the module doesn't go away, for
  instance using personalities, but I still need to maintain the
  refcount on the handle map fd.

 Please don't mess with personalities. I intend to use any useful
 Win32 or NT-native features in Linux software. Personalities
 would make these calls only useful for Wine.

Agreed.

 If a Win32 handle can be though of as a file descripter,
 then just do that. You can invent new file types and new
 filesystems as needed. These are available:

 0x3000 003
 0x5000 005
 0x7000 007
 0x9000 011
 0xb000 013

I was trying to avoid using file/inode/dentry structs as much as
possible... they aren't really necessary for most of this, and are very
heavyweight.

 If that isn't right, add a CLONE_HANDLES flag for clone() and
 add a fd-like table hanging off the task struct. Go all the way,
 rather than leaving these features as second-class citizens.

I have thought of separating the code in to separate lumps:

  (1) An in-kernel resident lump, providing very basic services:

  * handle/handle-map management
  * object management
  * some object types
  * process death-knell notification
  * file-change notification
  * unicode string handling/conversion (steal/share from NTFS)
  * simple RPC mechanism for building certain services in userspace
  * Win32 syscall stub
  * int 0x2e stub
  * /proc view

  (2) A partial basic Win32 implementation (modulable):

  * basic Win32 syscalls, eg: CreateMutexA/W
  * _no_ GDI calls

  (3) A partial basic Native API implementation (modulable):

  * basic NT syscalls
  * again, no GDI calls

  (4) A userspace registry service, communicating with registry stubs in the
  kernel by the simple RPC mechanism.

 I'd rather not have kernel code that is _only_ good for Win32.
 I don't see much reason to have a nasty add-on that can't be
 used by regular Linux programs.

Agreed.

The question is, how much will Linux go for?

David Howells
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Wine speedup through kernel module

2000-09-26 Thread David Howells


"Albert D. Cahalan" [EMAIL PROTECTED] wrote:
 Well, if this isn't worth doing right... Patch size is just something
 you have to deal with. Hopefully you can get an early-2.5.x merge.

I wouldn't like to bet on it.

 Why limit it? If the existing poll method needs extension, do it.

I suppose so. On a separate issue, the more I look at the poll and select
implementation, the more I'm convinced there's a leak, but I'm not sure it
matters.

 That may be, but we deal with it all the time. The VFS code has to
 be well-optimized and SMP-safe. Besides, many of these Win32 handles
 will refer to files anyway, won't they?

Run HandleEx sometime... most of the handles are generally not files,
particularly for a program that uses the perfmon stuff, when it uses heaps of
registry handles and section handles.

* The dentry filename support does not handle separate namespaces.
 
 By this you mean...? Can you not just apply a prefix?
 Perhaps you could use the half-way mounted filesystems,
 with some way to specify the desired filesystem root.

Actually, I suppose could look at how /proc does things. Make anonymous
objects just numbered files in a special directory.

 Well, that would be generally useful for Linux software.

But hard to achieve at the moment... Not only would the VFS have to change,
but all the filesystems too.

 Also useful: an open-read-close operation that doesn't muck around
 with fd allocation. Maybe hang the data off of a struct stat, so that
 callers can atomicly get everything.

Better still, just give the syscall a filename, a buffer and a buffer size.

* Win32 access/share flags would have to be retained in the file
  struct, and the inode struct would have to maintain a list of these.
 
 OK. Problem?

Not a nice thing to add. Care has been taken to decouple the file from the
inode (a dentry sandwich). Also it means more maintanence work on the part of
the VFS.

Care would also have to be taken to avoid a reference count loop: File points
to Inode points to File. Not insurmountable though.

 The list would be NULL most of the time. If Linux apps start
 using this feature a lot, then it can be optimized.

No. The list could never be NULL.

 Excuse me? Does Win32 not allow rename of open files?

It does, but the internal object name does not change. You cannot create a new
file of the old name whilst it is still open internally.

 Assuming that is the problem, you have a few options I think.
 You could let handles block rename, except that you might let
 root override restrictions caused by non-root users.

*shiver*

 You could just ignore this problem and see what, if anything, breaks.

I really don't like doing that. It's always an important customer who falls
over it.

 You could lie I suppose, supplying old names as needed.

Then you should stick with the current system.

 Doesn't that strike you as being a hack that will cause you
 to need _more_ code in the end?

No.

 You'll end up writing a second VFS layer if I'm not mistaken.

No I won't. I wasn't really thinking in terms of having a process handles map
as a directory full of fd's, more like a flat file with a list in it.

David Howells
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Wine speedup through kernel module

2000-09-26 Thread David Howells


Alexander Viro [EMAIL PROTECTED] wrote:
 Then these apps are non-portable to other Unices and either get fixed or
 get rm'd. Period.

Definitely.

 In case you've missed that, Win32 is misdesigned crap. It's one thing
 to add binary compatibility _if_ it doesn't affect the native stuff, but
 merging that abortion into the native API Just Is Not Going To Happen.

I agree with you on that. I'd quite like a Win32 API to be available, but I
wouldn't like it to have too much impact on the main Linux kernel.

I presume that you don't have a problem with it implemented in a module for
use in accelerating Wine?

 The longer I'm looking at that thread the more it looks like an attempt to
 merge incompatible models. Last time that happened we got Missed'em'V and
 that's going to be _much_ worse. Mixing v7 with CP/M gave DOS. List goes
 on and on and there is _nothing_ but shit on it.

True the two are pretty much incompatible... but I can implement a large
amount of the basic Win32 API in kernel space without affecting on the rest of
the kernel.

I will, however, supply a couple of patches to add useful extra features that
are internal to the kernel anyway, and don't affect the userspace API
semantics. These include process death notification (which someone else
expressed interest in for some other purpose) and some optional hooks to make
it faster and easier to use my implementation.

 Folks, get real - if you want NT you know where to find it.

Yes - written on the reverse of my cup mat *grin*.

David Howells
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Wine speedup through kernel module

2000-09-20 Thread David Howells


Daniel Pittman [EMAIL PROTECTED] wrote:
 *blink*  I confess that I can't see that much of a requirement for the
 ptrace stuff, but I take your word for it. :)

I was under the impression that the wineserver used ptrace() to retrieve
argument data to requests such as CreateFile(), but thinking about it, that
can't be so, since that'd mean you couldn't strace a wine program, which you
can.

It must instead pass all the argument data either by socket or by shared
memory (I'm not sure which offhand).

It does, however, (at least, I think it does), use this for debugging support,
but that's a special case.

  Secondly, the client (wine) just has to issue a syscall that looks
  exactly like a Win32 registry function, how it is handled is hidden by
  the kernel.
 
 Not really something that I would have thought was needed -
 specifically, the userspace libraries that the process was dynamic
 linked to would have been the place to do that, I thought.

I suppose this is more of an implementation detail, but it's primarily why I
added an RPC mechanism.

  Thirdly, registry functions should issue system handles, as is done on
  Windows. If system handles move into the kernel, then registry handles
  should do too. This also means registry change notifications can be
  implemented by system handles that WaitFor*() functions can deal with.
 
 That, however, is a convincing reason.

And that's why I wanted a kernel interface to the registry stuff, for which
the RPC mechanism was added.

 IIRC there are not any waitable GHI objects, precisely because under
 Win32 the GDI is a userspace library that does most of the magic
 internally and then, finally, requests that the server does something.
 
 Like sockets, the GDI isn't actually a Win32 kernel thing, it's a Win32
 userspace wrapper that makes the minimalist kernel support more
 palatable. 

However, I could make a Window a waitable GDI object... implementing
GetMessage() as Wait, GetAsync.

What I'd need to do is make it possible for the RPC server to create
signallable handles of effectively user-defined type that can be issued by and
passed through RPC calls. Hmmm... I'll have to think about that, but I'll
probably have to do something along these lines to support registry change
notifications.

  Now there's an idea... Have the local X server as an RPC server,
  handling wine GDI calls directly *grin*. Then I could hide all the GDI
  stuff behind syscall stubs in the same way. This would allow me to
  implement a lot of the NT Native API.
 
 Well, that's a point of view thing. The native NT API isn't anything
 beyond kernel32, as such. Certainly, the GDI isn't part of the core API
 and, in some cases, MicroSoft do release or use the NT core without the
 GUI system, at the least.

But it is there. I think it may use WinStation handles somehow, but I don't
know for sure without a bit of disassembly work.

 Anyhow, I am hardly an expert on the area, but... I would have avoided
 putting stubs into the kernel anywhere that it was possible and used the
 userspace dynamic Win32 libraries to do most of the work.

It depends on what you want... What I'd like is basically a range of Win32
system handles that map to (or apparently map to) kernel services. I'd also
like these to be available for ordinary Linux programs to play with if they
want to.

However, I think I might be able to see what you mean... I could implement an
RPC server, say, that has one exported operation: do registry op. Then have
all the registry operations mapped down by a userspace library to the RPC
invocation call to a single operation on that particular server.

 OTOH, I would also have implemented the WaitFor* set as a wrapper over
 select, with the kernel module providing suitable file descriptors for
 adding to the set that you waited for.

This is one of the things I was trying to avoid:

 (1) select and poll do not implement the entire range of things you can do
 with WaitForMultipleObjects;

 (2) the struct file+struct dentry+struct inode complex is incredibly heavy
 for most of what I want to do (though does have advantages);

 (3) Linux file structures do not hold enough information to support
 CreateFile (access  sharing interactions).

 We have fairly different views on how this stuff would hang together
 and, in all likelyhood, mine are based on incorrect assumptions. :)

True... Isn't it nice when you can provide what API's you want rather than
having to make do with whatever is dictated to you. Yes... I'm implementing
some of the Windows system APIs because I'd like to have that functionality
available under Linux running as fast as possible, not because it's the only
one worth using:-)

David Howells
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Wine speedup through kernel module

2000-09-20 Thread David Howells


 Reiserfs would make a fine registry.

Interesting idea... Have a Linux-style registry hive file mounted as a
reiserfs filesystem.

However, I'd prefer to be able to use Windows-style registry hive files if
possible.

   (3) Linux file structures do not hold enough information to support
   CreateFile (access  sharing interactions).
 
 Didn't IRIX add the access  sharing support for Samba to use?
 This looks like a generally useful feature.

I presume you mean SGI added it to IRIX? As for generally useful... possibly.

The sys_open() and sys_unlink() functions could be modified to make use of it
if it was stored on the inode. You could then use sys_CreateFileA() to
actually use it fully.

However, it won't work over NFS...

David Howells



Re: [RFC] Wine speedup through kernel module

2000-09-21 Thread David Howells


Waldek Hebisch [EMAIL PROTECTED] wrote:
   (2) the struct file+struct dentry+struct inode complex is incredibly heavy
   for most of what I want to do (though does have advantages);
 Well, all (ok most) of this complexity is to support Unix semantics. 
 Adding structures with different sematics will (IMHO) mean huge 
 increase in complexity (if interactions are handled correctly) and 
 invites bugs 

I think we have a misunderstanding here... I meant that using the VFS
structures for Win32 objects like mutexes, semaphores and events is massive
overkill, and uses a great deal of unnecessary memory.

What I did was to have a separate structure for representing Win32 objects
that is kept totally separate from the rest of the Linux kernel VFS stuff. It
is also a lot smaller than the three-layer VFS system, and only has the bits
that are necessary for implementing Win32 object handling and waiting.

Note though, in the case of a file handle, this Win32 object contains a
pointer to a struct file, but this is the only way to gain access to the VFS
through this system, and it still isolates the VFS effectively.

   (3) Linux file structures do not hold enough information to support
   CreateFile (access  sharing interactions).
 Any information not in kernel structure is Wine specific anyway, so 
 should be separate

Definitely, and I've kept it so.

What I meant on the CreateFile() front is that this function takes an access
bitmask and a share bitmask which interact with other CreateFile()'s and some
other functions. For instance, if two separate CreateFile()'s are issued on a
file then the second is rejected if it's share mask excludes the first's
access mask or if the first's share mask excludes the second's access mask.

It gets a bit more complex as more CreateFile()'s succeed on a particular
file, since the access flags are additive, and the share masks are
subtractive.

The point, though, is that this information must be retained to reproduce the
semantics of the relevant functions. Wine keeps it in the wineserver process,
and uses RPC over sockets to get to it.

 Everybody sometimes gets an idea which looks very promising (and yes, I 
 too like my ideas more then the other). 
 However, Unix has quite a lot of mechanisms for interprocess 
 comunication: signals, shared memory, SYSV IPC semaphores and messages, 
 sockets, pipes. A lot of kernel structers and code is just to make 
 this work efficiently and relaiably. Putting in the kernel subsystem 
 which to large degree has the same functionality (but with incompatible
 interface) seems like a recipe for disaster (small is the project 
 dies, big is it makes its way to the kernel). 

Why should it be a disaster? It can just be kept as a module that is loaded to
accelerate Wine.

 On construtive grounds: I think that nobody would object putting 
 into kernel few primitives if they are critical for Wine performance. 

With what API?

 However, I would rather belive that they already exists. As long 
 as I understand the main problem is Wine I/O performance. The other 
 problem is to have efficient mutexes. I expect SYSV stuff to be 
 reasonably efficient, and very fast mutexes may be implemented 
 (almost) in user space using ix86 assembly. With shared memory and 
 fast mutexes it should be possible to move I/O to client side even 
 in multithreading programs. 

That entails all sorts of other problems... On Win32, a lot of these objects
are named... That means you have to have some sort of atomic-access name table
in shared memory. And you also have to implement WaitFor*() without doing a
tight polling loop (whilst you could use signals, that is slow). Also you have
to be able to do release-on-death management (how to catch SIGKILL?).

The thing about the kernel is that it has pre-existing services to solve these
problems: locks, wait-queues and atomic counters.

You can solve _most_ of these problems in userspace without a separate server
process, but at the cost of a great deal of extra complexity and at the cost
of not being able to provide the full behaviour.

 I think that Linux programmers rather than new syscalls would like 
 Wine to provide ability to handle metafiles (it happened), OLE 
 streams or W32 widgets. So, from such a point of view us much 
 as possible interfaces should be Unix-like, and W32 interfaces 
 used only when needed.

Or maybe, they'd like both to be available. The two aren't mutually exclusive.

 I understand that Wine has good reasons to use W32 interfaces internally,
 but pushing them to Linux kernel seems bad for me.

Don't use it then... it'll not be mandatory. Wine has also to support OS's
that can't or won't add Win32 support in the kernel.

David Howells
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Wine speedup through kernel module

2000-09-19 Thread David Howells


Bernd Eckenfels [EMAIL PROTECTED] wrote:
* file-change notification
 this is interesting for other stuff too, i think irix has an interface for
 that, i think its an ioctl?

Someone did a file-change notification patch for Linux. I'm not sure exactly
what became of it, but it'd be nice to be able to build on top of it.

* unicode string handling/conversion (steal/share from NTFS)
 are you sure this needs to be done in the kernel? if you wat to steal the
 sourcecode of the ntfs driver do it, but why would u wat to share object
 code for that?

Implementation of Win32 wide-character functions and NT Native API calls would
require such things. And if I can share them with other bits of the kernel
(there's more than just NTFS - smbfs, msdosfs, iso9660), then I can possibly
get away with implementing less myself, and bloating the kernel less.

* simple RPC mechanism for building certain services in userspace
 what is that? message passing? in kernel space?

Not quite, though it could be used for that... It allows a client to give a
server access to parts of its address space, so that the server can provide
certain services. The main use I'm thinking of is having the kernelspace stubs
of the registry functions transparently invoke RPC requests in a userspace
registry server.

(2) A partial basic Win32 implementation (modulable):
 
* basic Win32 syscalls, eg: CreateMutexA/W
* _no_ GDI calls
 
 actuelly kCGI is there?

kCGI? What's that?

I suppose GDI calls _could_ be done as RPC requests, similar to how I'm
intending to do registry calls, but that would incur major overheads as MS
found between NT3.51 and NT4, hence why they moved the GDI implementation from
CSRSS.EXE into the kernel.

However, I was thinking more along the lines of GDI being implemented in
userspace by the client process, eg: wine.

David Howells
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



latest cut of wine server kernel module

2000-09-15 Thread David Howells


What's new:

 * It can now make use of a dedicated Linux syscall rather than an ioctl() to 
   provide some Win32 functions.

 * There is a patch for strace-4.2 included to view usage of this syscall.

 * It has an RPC mechanism allowing a server process to gain controlled access
   to a client's VM.

 * Some bug fixes.

It can be downloaded from:

   ftp://ftp.infradead.org/pub/people/dwh/wineservmod-2915.tar.bz2

   [Thanks to David Woodhouse for this]

David Howells
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Wine speedup through kernel module

2000-09-21 Thread David Howells


"Albert D. Cahalan" [EMAIL PROTECTED] wrote:
 In spite of that, it should be considered. It allows this:
 
 $ ls -log /proc/self/fd
 total 0
 lrwx--1 acahalan   64 Sep 21 09:12 0 - /dev/pts/4
 lrwx--1 acahalan   64 Sep 21 09:12 1 - /dev/pts/4
 lrwx--1 acahalan   64 Sep 21 09:12 2 - /dev/pts/4
 lrwx--1 acahalan   64 Sep 21 09:12 3 - mutex:[720429]
 lrwx--1 acahalan   64 Sep 21 09:12 4 - event:[592]
 lr-x--1 acahalan   64 Sep 21 09:12 5 - /proc/14527/fd

Looks nice, I know, but it may mean file handles are indirect, in that you
have:

  struct file-struct dentry-struct inode-struct winefile-struct file-...

if you can't store all the extra wine attributes on the struct file. It also
means that the inode or the dentry has to maintain a list of attached file's,
which I don't think it does at the moment.

  Any information not in kernel structure is Wine specific anyway, so
  should be separate
 
 It goes into the kernel structure, so that it won't be Wine-specific.
 SGI has already done this so that Samba would interact properly with
 regular UNIX software and the NFS server.
 
 (one might support SGI's API for this)

Hmmm... worth investigating. I take it that this was only done for IRIX, not
Linux.

 Yep, share bits definitely belong in the kernel. How else could
 you properly protect against regular (clueless) Linux software?

Probably do...

 We already have 3 ways to do file locking, so this is only 33% more.

66%... Don't forget LockFile/UnlockFile. These work very similarly to fcntl,
I think, but demand mandatory locking.

 Now THAT would be a disaster. Linux software ought to be able to rely
 on having these features available.

I agree, but does Linus?

 Modules are very bad for this.

Probably, but unless the code goes into the kernel proper, patch maintainance
can be a real nightmare. The greater the amount of kernel actually changed by
a patch, the worse it is.

 Named? That means you use the VFS.

Not necessarily. For file handles, you probably should (unless you want
drive-letter mappings to occur in the kernel - yuk), but the other things are
effectively in separate namespaces.

 Don't use it then... it'll not be mandatory. Wine has also to support OS's
 that can't or won't add Win32 support in the kernel.

You misunderstand me... _Using_ this API will not be mandatory, whether or not
it exists in the kernel proper.

 See a pattern here? System calls are not supposed to be modular.

knfsd? But I agree, really. But what is the likelyhood?

 We shouldn't have calls that appear and disappear on a whim.

True... that smacks of MS policy:-)

 Binary compatibility requires that system calls be available on
 all Linux systems.

Also true. You should be able to compile the module I currently have on all
Linux systems. Unfortunately, I can't test this at the moment.

David Howells
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: GCC proposal for @ asm constraint

2000-09-19 Thread David Howells


I've been writing a kernel module and I've noticed a measurable performance
drop between the same code compiled against linux-2.4.0-test7 and against
test8. I disassembled the code to try and work out what was going on and I saw
the following happen:

 * [test8]
   One of the atomic memory access primitives supplied by the kernel
   was putting immediate data into a register outside of the inline-asm
   instruction group and then using the register inside.

 * [test7]
   The immediate data was passed directly to the inline-asm instruction.

In test8, of course, this means that the compiler has to scratch up a spare
register, which is totally unnecessary, as the immediate data could be
attached directly to the instruction opcode as was done in test7.

This had the effect of making the compiler have to push the old contents of
the register into a slot on the stack (I think it held a local variable at the
time), which had the further effects of using more stack memory, introducing
more register rearrangement (the code ended up longer), and burning up more
CPU cycles.

I can't remember exactly what it was now, but I think it was either something
to do with spinlocks or bitops. I'll re-investigate tonight and see if I can
come back with some benchmarks/code-snippets tomorrow.

David Howells
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Wine speedup through kernel module

2000-09-19 Thread David Howells


Daniel Pittman [EMAIL PROTECTED] wrote:
 Hrm. It would seem more sensible to me that the registry, like the GDI,
 live outside the kernel. This would have some cost in terms of
 performance, I admit, but I don't think it's significant enough to care.
 
 This is, I confess, based on my personal experience with Win32
 development and may not represent the real state of affairs for all
 developers.
 
 Having registry access be performance critical or for large data is also
 against the MicroSoft best practice for software, which makes the costs
 lower - no multiple calls to ship large quantities of data. :)
 
 Anyway, why did you see the need for the registry stubs in this case, if
 I may ask?

Firstly, in wine the registry is handled by the wineserver process. This means
it can be shared between all the wine processes running for a particular
used. So when wine wants to access the registry, it has to do an RPC call to
the wineserver (sending a request across a UNIX socket and then, I think, a
bit of ptrace()'ing to get at the data). This provides a hopefully more
suitable RPC mechanism.

Secondly, the client (wine) just has to issue a syscall that looks exactly
like a Win32 registry function, how it is handled is hidden by the kernel.

Thirdly, registry functions should issue system handles, as is done on
Windows. If system handles move into the kernel, then registry handles should
do too. This also means registry change notifications can be implemented by
system handles that WaitFor*() functions can deal with.

 Well... wont the GDI calls be RPC anyway - specifically, across the X11
 pipe?

Yes, but if I implement them to go to a GDI server, which then talks X11 out
the back, that's two lots of RPC calls.

However, what I said about system handles and the registry, may also apply to
GDI objects.

 In any case, I would have expected (personally, from assumptions) that
 the GDI would live inside the process space and pay the same cost as
 existing X11 applications to do it's display.

Yes, that is my current plan.

 I can't see any sensible reason for implementing anything extra in the
 kernel to support it. Now, adding an X11 protocol extension to make some
 of the Win32 stuff work better...

Now there's an idea... Have the local X server as an RPC server, handling wine
GDI calls directly *grin*. Then I could hide all the GDI stuff behind syscall
stubs in the same way. This would allow me to implement a lot of the NT Native
API.

David Howells
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Wine speedup through kernel module

2000-09-22 Thread David Howells


[EMAIL PROTECTED] (Albert D. Cahalan) wrote:
 In spite of that, it should be considered. It allows this:
 
 $ ls -log /proc/self/fd
 total 0
 lrwx--1 acahalan   64 Sep 21 09:12 0 - /dev/pts/4
 lrwx--1 acahalan   64 Sep 21 09:12 1 - /dev/pts/4
 lrwx--1 acahalan   64 Sep 21 09:12 2 - /dev/pts/4
 lrwx--1 acahalan   64 Sep 21 09:12 3 - mutex:[720429]
 lrwx--1 acahalan   64 Sep 21 09:12 4 - event:[592]
 lr-x--1 acahalan   64 Sep 21 09:12 5 - /proc/14527/fd

There's another minor problem with using fd's as handles... Windows expects
the handle number to be a multiple of 4, and sometimes, probably only on rare
occasions, uses the bottom two bits to pass extra information. I was flicking
through the Native API reference last night, and came across a couple of
instances. I don't know whether this ever applies to Win32, though.

David Howells
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [RFC] Wine speedup through kernel module

2000-09-25 Thread David Howells


"Albert D. Cahalan" [EMAIL PROTECTED] wrote:
 #define HANDLE_TO_FD(x) ((x)2)
 #define FD_TO_HANDLE(x) ((x)2)

(not quite as simple as that since fd 0 is valid and handle 0 is not, but
 that's a very minor issue.)

I'm still not keen on the idea, though... One of the things I'm trying to
avoid is having to maintain a large patch to the kernel. I've done it before
with the configuration manager I wrote. At the time, there was a new version
of the kernel out at least once a week, and I spent nearly all my time porting
the patch to the latest kernels.

I can see your point... though it could get very heavy on fd's. There are some
advantages to the fd approach:

  * You could use WaitFor*() on ordinary fds. It could use the internal poll
method for non-Win32 fds.

  * The /proc/*/fd/ driver could be modified to support Win32 handles,
probably by extending the file_operations or inode_operations struct.

  * The dentry stuff has filename support.

  * Object retension is managed by the VFS.

  * No need for kernel based fd-handle translation.

  * You could make the Win32 objects appear in the normal UNIX VFS view of the
world by having a new filesystem mounted somewhere appropriate.

Also some disadvantages:

  * The VFS is quite heavy on kernel memory: three layers of structure - file,
dentry and inode. The inode structure being a bit of a monster.

  * The dentry filename support does not handle anonymous files.

  * The dentry filename support does not handle separate namespaces.

  * Can't pass initialisation data to the open routine.

  * Win32 access/share flags would have to be retained in the file struct, and
the inode struct would have to maintain a list of these.

  * A file handle's view of the filename shouldn't get changed even if the
file is renamed whilst open.

And some problems either way:

  * The dentry filename support does not handle wide character names or
case-independent matching.

One other thing: the idea you put forth about looking at /proc/12345/fd/ could
be kept, but as /proc/12345/handles instead with a non-fd approach.

David Howells
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Wine + kernel ?? How to do that?

2001-02-28 Thread David Howells


[EMAIL PROTECTED] wrote:
 hey, I hear that wine ( windows emulator ) can port into kernel and make 
 it running faster, How can I do it? 
 or anyone can make a patch to add wine code into kernel? 
 waiting for answer, Thanks 

I've been writing one to provide all the Windows kernel objects in Linux
kernel space (the speed up appears as though it should be impressive). It is,
however, not entirely complete yet. You can grab a copy by CVS from the wine
repository:

export CVSROOT=:pserver:[EMAIL PROTECTED]:/home/wine
cvs login
  (the password is cvs)
cvs -z3 checkout kernel-win32

Or you can browse it:

http://cvs.winehq.com/cvsweb/kernel-win32/

The numbers are looking good: the system call latency appears to be about half
that of Win2000 on the same box! (however, use this number with caution).

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Kernel is unstable

2001-03-02 Thread David Howells


Linus Torvalds [EMAIL PROTECTED] wrote:
 Also note that the merging tests were not free, so at least under my set of
 normal load the non-merging code is actually _faster_ than the clever
 optimized merging. That was what clinched it for me: I absolutely hate to
 see complexity that doesn't really buy you anything noticeable.

Surely, doing the merge will always have take longer than not doing the merge,
no matter how finely optimised the algorithm... But merging wouldn't be done
very often... only on memory allocation calls.

Or do you mean that use of the resultant VMA chain/tree is slower? This I find
suprising since after merging the VMA tree should at worst as complex as it
was before merging, and fairly likely simpler.

Perhaps it'd be reasonable to only do VMA merging on mmap calls and not brk
calls, and let brk manually extend an existing VMA if possible.

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: patch: test12-pre7 cd stuff

2000-12-08 Thread David Howells

 Dec  8 06:05:29 agate kernel: hdc: packet command error: status=0x51 { 
 DriveReady SeekComplete Error }
 Dec  8 06:05:29 agate kernel: hdc: packet command error: error=0x50
 Dec  8 06:05:29 agate kernel: ATAPI device hdc:
 Dec  8 06:05:29 agate kernel:   Error: Illegal request -- (Sense key=0x05)
 Dec  8 06:05:29 agate kernel:   Invalid field in command packet -- 
 (asc=0x24, ascq=0x00)
 Dec  8 06:05:29 agate kernel:   The failed "Play Audio MSF" packet 
 command was:
 Dec  8 06:05:29 agate kernel:   "47 00 00 00 02 00 3f 24 ff 00 00 00 "

I think I've seen that as well, though I'm using a test10-ish kernel. I
assumed it was something to do with the CD being duff though.

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PATCH,preliminary] cleanup shm handling

2000-12-08 Thread David Howells


 here is my first shot for cleaning up the shm handling. It did survive
 some basic testing but is not ready for inclusion. 

Can you help me with an SHM related problem?

I'm currently writing a Win32 emulation kernel module to help speed Wine up,
and I'm writing the file mapping support stuff at the moment
(CreateFileMapping and MapViewOfFile).

I have PE Image mapping just about working (fixups, misaligned file sections
and all), but I'm trying to think of a good way of doing anonymous shared
mappings without having to hack the main kernel around too much (so far I've
only had to add to kernel/ksyms.c).

Is there a reasonable way I could hook into the SHM system to "reserve" a
chunk of shared memory of a particular size, and then a second hook by which I
can "map" _part_ of that into a process's address space?

Cheers,
David

PS: to anyone else reading this, if you wrote do_generic_file_read(), it
rocks!
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PATCH,preliminary] cleanup shm handling

2000-12-13 Thread David Howells

 I'm currently writing a Win32 emulation kernel module to help speed Wine up,
   ^
 fd = shm_open ("xxx",...)
 ptr = mmap (NULL, size, ..., fd, offset);

I am doing this from within kernel space. I'd like to avoid doing the full
open and mmap if possible. I was wondering if there're some shortcuts I could
make use of.

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PATCH,preliminary] cleanup shm handling

2000-12-13 Thread David Howells


 If you want to change the vma ops table you can replace the f_ops
 table with your own one. SYSV ipc uses this also to be able to catch
 unmaps.

I'd thought of that, but that means I need to concoct an f_ops table of my own
for every f_ops table I might have to override. All I want to do is provide a
single VMA ops table (or maybe two), possibly with only a few ops in.

Also, I can't actually go through do_mmap()... PE Image sections in files do
not have to be page aligned. If they are, I can call do_mmap() a number of
times (once per section), but mostly they're not (they have to be at least
512b aligned though - DOS floppy alignment, I suspect).

Plus if I change the f_ops table, then I affect normal Linuxy processes doing
mmap().

  I'm not sure how shared sections in PE Images are handled on all
 ...
 Oh, that's too much Windows for me ;-)

*grin*

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [andrea@suse.de: Re: generic rwsem [Re: Alpha process table hang]]

2001-04-20 Thread David Howells

Linus Torvalds [EMAIL PROTECTED] wrote:
 I think Andrea is right. Although this file seems to be entirely
 old-fashioned and should never be used, right?

I presume you're talking about "include/asm-i386/rwsem-spin.h"... If so,
Andrea is right, there is a bug in it (repeated a number of times), though why
the tests succeeded, I'm not sure.

The file should only be used for the 80386 and maybe early 80486's where
CMPXCHG doesn't work properly, everything above that can use the XADD
implementation.

 Also, I _really_ don't see why the code is inlined at all (in the real
 linux/rwsem-spinlock.h. It shouldn't be. It should be a real function
 call, and all be done inside lib/rwsem.c inside a
 
   #ifdef CONFIG_RWSEM_GENERIC_SPINLOCK
 
 or whatever.

Andrea seems to have changed his mind on the non-inlining in the generic case.

But if you want it totally non-inline, then that can be done. However, whilst
developing it, I did notice that that slowed things down, hence why I wanted
it kept in line.

I have some ideas on how to improve efficiency in that one anyway, based on
some a comment from Alan Cox.

 Please either set me straight, or send me a patch to remove
 asm-i386/rwsem-spin.h and fix up linux/rwsem-spinlock.h. Ok?

I think there are two seperate issues here:

  (1) asm-i386/rwsem-spin.h is wrong, and can probably be replaced with the
  generic spinlock implementation without inconveniencing people much.
  (though someone has commented that they'd want this to be inline as
   cycles are precious on the slow 80386).

  (2) "fix up linux/rwsem-spinlock.h": do you want the whole generic spinlock
  implementation made non-inline then?

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: rwsem benchmarks [Re: generic rwsem [Re: Alpha process table hang]]

2001-04-20 Thread David Howells


 About the benchmark you wrote it looks good measure to me, thanks.

As with all benchmarks, take with one pinch of salt and two of Mindcraft:-)

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



i386 assembly timing issue

2001-04-20 Thread David Howells

I've attached two slightly different bits of i386 assembly that achieve the
same end, but in slightly different ways. Can some one tell me why Case 1 is
faster than Case 2? Case 1 involves an extra CALL instruction.

* Case 1 has a little wrapper function that saves ECX and EDX before
  calling rwsem_wake().

* Case 2 merges the contents of the wrapper with the caller.

Case 1 is what's generated by the rw-semaphore inline assembly code as of
2.4.4-pre5. Case 2 looks like it ought to be a faster version of the same
thing.

David

###
#
# CASE 1: registers saved in the rwsem_wake register saving stub
#
.text
.align 16

#
# void test_up_read(struct rw_semaphore *sem)
# {
#   up_read(sem);
# }
#
.globl test_up_read
.type   test_up_read,@function
test_up_read:
movl4(%esp), %eax
movl$-1, %edx
xadd%edx,(%eax)
js  test_up_read_contention
test_up_read_done:
ret

#
# Register saving stub for rwsem_wake
#
.globl __rwsem_wake
__rwsem_wake:
pushl   %edx
pushl   %ecx
callrwsem_wake
popl%ecx
popl%edx
ret

#
# Contention handler stub for up_read
#
.section .text.lock,"ax"
test_up_read_contention:
decl%edx
testl   $65535,%edx
jnz test_up_read_done
call__rwsem_wake
jmp test_up_read_done

###
#
# CASE 2: registers saved in the contention handler stub
#
.text
.align 16

#
# void test_up_read(struct rw_semaphore *sem)
# {
#   up_read(sem);
# }
#
.globl test_up_read
.type   test_up_read,@function
test_up_read:
movl4(%esp), %eax
movl$-1, %edx
xadd%edx,(%eax)
js  test_up_read_contention
test_up_read_done:
ret

#
# Contention handler stub for up_read
#
.section .text.lock,"ax"
test_up_read_contention:
decl%edx
testl   $65535,%edx
jnz test_up_read_done
pushl   %edx
pushl   %ecx
call__rwsem_wake
popl%ecx
popl%edx
jmp test_up_read_done
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH] generic rw_semaphores, compile warnings patch

2001-04-20 Thread David Howells

David S. Miller [EMAIL PROTECTED] wrote:
 D.W.Howells writes:
   This patch (made against linux-2.4.4-pre4) gets rid of some warnings obtained 
   when using the generic rwsem implementation.

 Have a look at pre5, this is already fixed.

Not entirely so...

There's also a missing "struct rw_semaphore;" declaration in linux/rwsem.h. It
needs to go in the gap below "#include linux/wait.h". Otherwise the
declarations for the contention handling functions will give warnings about
the struct being declared in the parameter list.

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: light weight user level semaphores

2001-04-23 Thread David Howells

David Woodhouse [EMAIL PROTECTED] wrote:
 [EMAIL PROTECTED] said:
   [BTW, another solution is to truly support opaque handles to kernel
  objects; I believe David Howells is already working on something like
  this for Wine?

Yes. However, it uses a different system call set to use them. They translate
to small object structures internally.

  The poll interface can be trivially extended to support
  waiting on those...]

No, they aren't files. I did not want to use files because this would incur
a fairly major penalty for each object:

struct file + struct dentry + struct inode

Which would mean that Win32 File objects would require two of each, one set to
hold the extra Win32 attributes and one set for the actual Linux file.

The way I've chosen uses somewhat less memory and should be faster.

 ISTR it wasn't quite trivial to do it that way - it would require the 
 addition of an extra argument to the fops-poll() method.

Yes, the PulseEvent operation demands that all processes currently waiting on
the event should be woken, but that no processes attaching immediately
afterward get triggered.

This means that the PulseEvent handler has to be able to notify all the
processes currently waiting on the queue and only those processes. I got it to
do this by marking the waiter records each process links into the queue.

Oh... and WaitForMultipleObjects also has a wait for all option.

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: light weight user level semaphores

2001-04-23 Thread David Howells

Alon Ziv [EMAIL PROTECTED] wrote:
 Obviously... since they're handles, not FDs...
 [BTW, are you using Windows' idea of storing the objects in process space,
 in a page that's inaccessible to the app itself, and passing pointers into
 this page as the handles?]

No... I grab a page in kernel space and use it as an array. One problem is
that if an exit occurs, I have to be able to discard all attached objects
after the process's VM has been cleaned up (ie: what if it gets swapped
out?). Plus, mmap can clobber existing mappings, MapViewOfFile can't.

 So what if they aren't files?

Small structures private to my Win32 module.

 I'm afraid I'm not following your logic in this; I believe most Win32 attrs
 can be mapped to more generic abstractions which should be able to exist at
 'struct file' level.

Most...

It'd mean adding extra fields into struct file (and possibly struct inode)
just for the use of this module (which probably wouldn't be accepted).

 (And even if not, a Win32 file handle could just hold two pointers---

No. the extra data has to be accessible from CreateFile (potentially running
in other processes), and this'd mean it'd have to go speculatively searching
all Win32 handle tables currently in use.

 And breaks _completely_ with the existing scheme :-/

So what? This is for a WINE accelerator/Win32 module only. There's already
been an argument over making the whole lot available as general Linux
functionality, but most people said that it'd be a bad idea because it'd not
be portable.

 Huh? Where did you get this?
 Looking at my copy of MSDN (July '00), the PulseEvent remarks more-or-less
 suggest an implementation like
 SetEvent(e)
 ResetEvent(e)

Consider the following:

WAITER 1WAITER 2WAITER 3WAKER
wait-on-event   wait-on-event   wait-on-event
sleep   sleep   sleep
PulseEvent
set-event
wake(WAITER 1)
wake(WAITER 2)
wake(WAITER 3)
reset-event
wakewakewake
what-happened?  what-happened?  what-happened?
nothing!nothing!nothing!
sleep   sleep   sleep

All three waiters should wake up with a note that the event triggered, but
they don't. Plus a fourth waiter who begins to wait on the event after the
set-event is issue probably shouldn't wake up.

 I wonder if it's possible to add _just_ this to poll()...

No... there's no way to pass this to poll (or select).

Better to add a WaitForMultipleObjects() syscall and have that call
do_select() with a flag.

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: rwsem benchmark [was Re: [PATCH] rw_semaphores, optimisations try #3]

2001-04-24 Thread David Howells

 Ok I finished now my asm optimized rwsemaphores and I improved a little my
 spinlock based one but without touching the icache usage.

And I can break it. There's a very good reason the I changed __up_write() to
use CMPXCHG instead of SUBL. I found a sequence of operations that locked up
on this.

Unfortunately, I can't remember what it was. I do have it written down at
home, I think, so I see about sending it to you later.


 The main advantage of my rewrite are:

 - my x86 version is visibly faster than yours in the write fast path and
   it also saves icache because it's smaller

That's because your fast write path is wrong. You can get away without
clobbering EDX and ECX, which I can't.

 (read fast path is reproducibly a bit faster too)

| +static inline void __down_read(struct rw_semaphore *sem)
...
| +  : +m (sem-count), +a (sem)

From what I've been told, you're lucky here... you avoid a pipeline stall
between whatever loads sem into EAX and the INCL that increments what it
points to because the +a constraint says that EAX will be changed. This
means that the compiler saves EAX before entering the inline asm, thus
delaying the INCL and thus avoiding the stall. However, you generally seem to
gain this at the expense of clobbering another register, say EDX.

So, for a function that just calls down_read twice, I get:
  20:   8b 44 24 04 mov0x4(%esp,1),%eax
  24:   f0 ff 00lock incl (%eax)
  27:   0f 88 22 00 00 00   js 4f dr2+0x2f
  2d:   f0 ff 00lock incl (%eax)
  30:   0f 88 30 00 00 00   js 66 dr2+0x46
  36:   c3  ret

And you get:
   0:   83 ec 08sub$0x8,%esp
   3:   8b 54 24 0c mov0xc(%esp,1),%edx
   7:   89 d0   mov%edx,%eax
   9:   f0 ff 02lock incl (%edx)
   c:   0f 88 fc ff ff ff   js e dr+0xe
  12:   89 d0   mov%edx,%eax
  14:   f0 ff 02lock incl (%edx)
  17:   0f 88 0f 00 00 00   js 2c dr2+0xc
  1d:   58  pop%eax
  1e:   5a  pop%edx
  1f:   c3  ret

Note also your asm constraints cause the compiler to eat an extra 8 bytes of
stack and then to pop it into registers to get rid of it. This is a gcc bug,
and will only hurt if the up_read and down_read are done in separate
subroutines.

| +static inline void __up_read(struct rw_semaphore *sem)
...
| +  jnz 1b\n\t
| +  pushl %%ecx\n\t
| +  call rwsem_wake\n\t

Putting a PUSHL or two there hurt performance when I tried it, because, I'm
told, it introduces a pipeline stall.

 - the common code automatically extends itself to support 2^32
   concurrent sleepers on 64bit archs

You shouldn't do this in the XADD case, since you are explicitly using 32-bit
registers and instructions.

Actually, both of our generic cases allow 2^31 sleepers on a 32-bit arch, and
by changing mine to a long I can make it so we both support up to 2^63 on a
64-bit arch. However, I suspect that is overkill...

 - there is no code duplication at all in supporting xchgadd common code
   logic for other archs (and I prepared a skeleton to fill for the alpha)

Why not make it shareable? It doesn't have to be mandatory...

 So I'd suggest Linus to apply one of my above -8 patches for pre7. (I hope I
 won't need any secondary try)

I recommend against this at the moment (there's a bug in __up_write in his X86
optimised code).

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH] rw_semaphores, optimisations try #3

2001-04-24 Thread David Howells

Linus Torvalds [EMAIL PROTECTED] wrote:
 Note that the generic list structure already has support for batching.
 It only does it for multiple adds right now (see the list_splice
 merging code), but there is nothing to stop people from doing it for
 multiple deletions too. The code is something like
 
   static inline void list_remove_between(x,y)
   {
   n-next = y;
   y-prev = x;
   }
 
 and notice how it's still just two unconditional stores for _any_ number
 of deleted entries.

Yes but the struct rwsem_waiter batch would have to be entirely deleted from
the list before any of them are woken, otherwise the waking processes may
destroy their rwsem_waiter blocks before they are dequeued (this destruction
is not guarded by a spinlock).

This would then reintroduce a second loop to find out which was the last block
we would be waking.

 Anyway, I've already applied your #2, how about a patch relative to that?

Attached.

David
=
diff -uNr linux-rwsem-opt2/include/linux/rwsem-spinlock.h 
linux/include/linux/rwsem-spinlock.h
--- linux-rwsem-opt2/include/linux/rwsem-spinlock.h Tue Apr 24 10:51:58 2001
+++ linux/include/linux/rwsem-spinlock.hTue Apr 24 08:40:20 2001
@@ -1,6 +1,8 @@
 /* rwsem-spinlock.h: fallback C implementation
  *
  * Copyright (c) 2001   David Howells ([EMAIL PROTECTED]).
+ * - Derived partially from ideas by Andrea Arcangeli [EMAIL PROTECTED]
+ * - Derived also from comments by Linus
  */
 
 #ifndef _LINUX_RWSEM_SPINLOCK_H
@@ -11,6 +13,7 @@
 #endif
 
 #include linux/spinlock.h
+#include linux/list.h
 
 #ifdef __KERNEL__
 
@@ -19,14 +22,16 @@
 struct rwsem_waiter;
 
 /*
- * the semaphore definition
+ * the rw-semaphore definition
+ * - if activity is 0 then there are no active readers or writers
+ * - if activity is +ve then that is the number of active readers
+ * - if activity is -1 then there is one active writer
+ * - if wait_list is not empty, then there are processes waiting for the semaphore
  */
 struct rw_semaphore {
-   __u32   active;
-   __u32   waiting;
+   __s32   activity;
spinlock_t  wait_lock;
-   struct rwsem_waiter *wait_front;
-   struct rwsem_waiter **wait_back;
+   struct list_headwait_list;
 #if RWSEM_DEBUG
int debug;
 #endif
@@ -42,7 +47,7 @@
 #endif
 
 #define __RWSEM_INITIALIZER(name) \
-{ 0, 0, SPIN_LOCK_UNLOCKED, NULL, (name).wait_front __RWSEM_DEBUG_INIT }
+{ 0, SPIN_LOCK_UNLOCKED, LIST_HEAD_INIT((name).wait_list) __RWSEM_DEBUG_INIT }
 
 #define DECLARE_RWSEM(name) \
struct rw_semaphore name = __RWSEM_INITIALIZER(name)
diff -uNr linux-rwsem-opt2/lib/rwsem-spinlock.c linux/lib/rwsem-spinlock.c
--- linux-rwsem-opt2/lib/rwsem-spinlock.c   Tue Apr 24 10:51:58 2001
+++ linux/lib/rwsem-spinlock.c  Tue Apr 24 08:40:20 2001
@@ -2,13 +2,15 @@
  *   implementation
  *
  * Copyright (c) 2001   David Howells ([EMAIL PROTECTED]).
+ * - Derived partially from idea by Andrea Arcangeli [EMAIL PROTECTED]
+ * - Derived also from comments by Linus
  */
 #include linux/rwsem.h
 #include linux/sched.h
 #include linux/module.h
 
 struct rwsem_waiter {
-   struct rwsem_waiter *next;
+   struct list_headlist;
struct task_struct  *task;
unsigned intflags;
 #define RWSEM_WAITING_FOR_READ 0x0001
@@ -19,7 +21,8 @@
 void rwsemtrace(struct rw_semaphore *sem, const char *str)
 {
if (sem-debug)
-   printk([%d] %s({%d,%d})\n,current-pid,str,sem-active,sem-waiting);
+   printk([%d] %s({%d,%d})\n,
+  current-pid,str,sem-activity,list_empty(sem-wait_list)?0:1);
 }
 #endif
 
@@ -28,11 +31,9 @@
  */
 void init_rwsem(struct rw_semaphore *sem)
 {
-   sem-active = 0;
-   sem-waiting = 0;
+   sem-activity = 0;
spin_lock_init(sem-wait_lock);
-   sem-wait_front = NULL;
-   sem-wait_back = sem-wait_front;
+   INIT_LIST_HEAD(sem-wait_list);
 #if RWSEM_DEBUG
sem-debug = 0;
 #endif
@@ -48,60 +49,58 @@
  */
 static inline struct rw_semaphore *__rwsem_do_wake(struct rw_semaphore *sem)
 {
-   struct rwsem_waiter *waiter, *next;
-   int woken, loop;
+   struct rwsem_waiter *waiter;
+   int woken;
 
rwsemtrace(sem,Entering __rwsem_do_wake);
 
-   waiter = sem-wait_front;
-
-   if (!waiter)
- goto list_unexpectedly_empty;
-
-   next = NULL;
+   waiter = list_entry(sem-wait_list.next,struct rwsem_waiter,list);
 
/* try to grant a single write lock if there's a writer at the front of the 
queue
 * - we leave the 'waiting count' incremented to signify potential contention
 */
if (waiter-flags  RWSEM_WAITING_FOR_WRITE) {
-   sem-active++;
-   next = waiter-next;
+   sem-activity = -1

Re: rwsem benchmark [was Re: [PATCH] rw_semaphores, optimisations try #3]

2001-04-24 Thread David Howells

 I'd love to hear this sequence. Certainly regression testing never generated
 this sequence yet but yes that doesn't mean anything. Note that your slow
 path is very different than mine.

One of my testcases fell over on it...

 I don't feel the need of any xchg to enforce additional serialization.

I don't use XCHG anywhere... do you mean CMPXCHG?

 yours plays with cmpxchg in a way that I still cannot understand

It tries not to let the active count transition 1-0 happen if it can avoid
it (ie: it would rather wake someone up and not decrement the count). It also
only calls __rwsem_do_wake() if the caller manages to transition the active
count 0-1.

This avoids another subtle bug that I think I have a sequence written down for
too.

 Infact eax will be changed because it will be clobbered by the slow path, so
 I have to. Infact you are not using the +a like I do there and you don't
 save EAX explicitly on the stack I think that's your bug.

Not so... my down-failed slowpath functions return sem in EAX.

 Again it's not a performance issue, the +a (sem) is a correctness issue
 because the slow path will clobber it.

There must be a performance issue too, otherwise our read up/down fastpaths
are the same. Which clearly they're not.

 About the reason I'm faster than you in the down_write fast path is that I
 can do the subl instead of the cmpxchg, you say this is my big fault, I
 think my algorithm allows me to do that, but maybe I'm wrong.

I used to do that.

 Unfortunatly I have to put the pushl there because I don't want to save
 %ecx in the fast path (if I declare ecx clobbered it's even worse no?).

It benchmarked faster without it for me. I suspect this will be different on
different CPUs anyway.

I'm going to have to have a play with Intel's VTUNE program and see what it
says.

 I said on 64bit archs. Of course on x86-64 there is xaddq and the rex
 registers.

But the instructions you've specified aren't 64-bit.

 It isn't mandatory, if you don't want the xchgadd infrastructure then you
 don't set even CONFIG_RWSEM_XCHGADD, no?

My point is that mine isn't mandatory either... You just don't set the XADD
config option.

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: rwsem benchmark [was Re: [PATCH] rw_semaphores, optimisations try #3]

2001-04-24 Thread David Howells


 I see what you meant here and no, I'm not lucky, I thought about that. gcc x
 2.95.* seems smart enough to produce (%%eax) that you hardcoded when the
 sem is not a constant (I'm not clobbering another register, if it does it's
 stupid and I consider this a compiler mistake).

It is a compiler mistake... the compiler clobbers another register for
you. The compiler does not, however, know about timing issues with the
contents of the inline assembly... otherwise it'd stick a delay in front of
the XADD in my stuff.

 I tried with a variable pointer and gcc as I expected generated the (%%eax)
 but instead when it's a constant like in the bench my way it avoids to stall
 the pipeline by using the constant address for the locked incl, exactly as
 you said and that's probably why I beat you on the down read fast path too.
 (I also benchmarked with a variable semaphore and it was running a little
 slower)

*grin* Fun ain't it... Try it on a dual athlon or P4 and the answer may come
out differently.

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: rwsem benchmark [was Re: [PATCH] rw_semaphores, optimisations try #3]

2001-04-24 Thread David Howells

 so you reproduced a deadlock with my patch applied, or you are saying
 you discovered that case with one of you testcases?

It was my implementation that triggered it (I haven't tried it with yours),
but the bug occurred because the SUBL happened to make the change outside of
the spinlocked region in the slowpath at the same time as the wakeup routine
was running on the other CPU.

I'll have a look at the sequence and make sure that it does actually apply to
your implementation. It may not... but it doesn't hurt to check.

The thing occurred with one of my simple testcases, but only happened once in
a number of runs, fortunately whilst I had the rwsemtrace()'s enabled.

 yes of course, you use rwsem_cmpxchgw that is unnecessary.

Actually, I use this to try and avoid the following loop that you've got in
your code:

 +static void __rwsem_wake(struct rw_semaphore *sem)
...
 + again:
 + count = rwsem_xchgadd(-wait-retire, sem-count);
 + if (!wake_read  (count  RWSEM_READ_MASK)) {
 + count = rwsem_xchgadd(wait-retire, sem-count);
 + if ((count  RWSEM_READ_MASK) == 1)
 + goto again;

I now only have that loop in the rwsem_up_write_wake() function.

But! In mine, if __up_write()'s CMPXCHG failed, then it has also read the
counter, which it then passes as an argument to rwsem_up_write_wake(). This
means I can avoid the aforementioned loop in most cases, I suspect, by seeing
if the active counter was 1 at the time of the failed CMPXCHG.

This also means that if a ex-writer wakes up a writer-to-be, the only atomic
instruction performed on sem-count is the CMPXCHG in __up_write().

For the ex-writer waking up readers case, we have to perform an additional
XADD, but this must be done before anyone is woken up, else __up_read() can
get called to decrement the count before we've incremented it. I count the
number of things I want to wake up, adjust the count (one LOCKED ADD only) and
then wake the batch up.

You dive into a LOCKED XADD/XADD loop for each process you wake, which in the
best case will give you one LOCKED XADD per process.

Looking again at rwsem_up_read_wake()... I can't actually eliminate the
CMPXCHG there because the active count has already been decremented, and so
will need to be incremented again prior to a wake-up being performed. However,
if the increment was performed and someone else had incremented the count in
the meantime, we have to decrement it again... but this can cause a transition
back to zero, which we have to check for... and if that occurred...

You get the idea, anyway.

Oh yes... this idea should be faster on SPARC/SPARC64 and IA64 which don't
have useful XADD instructions (FETCHADD can only use small immediate values),
only CMPXCHG/CAS are really useful there.

 I guess I'm faster because I avoid the pipeline stall using +m (sem-count)
 that is written as a constant, that was definitely intentional idea.

+m doesn't avoid the stall. That's just shorthand for:

: =m(sem-count) : m(sem-count)

which is what mine has.

a+ luckily causes the avoidance by saying EAX gets clobbered, causing the
compiler to save via an additional register. Note that the compiler saves
anyway, even if the register will be discarded after being restored - yuk!

I think this one will depend on the surrounding code anyway. I suspect in some
chunks of C mine will be faster, and in others yours will be, all depending on
when EAX is loaded.

 My point is that when you set XADD you still force duplication of the header
 stuff into the the asm/*

If you want to use my XADD algorithm, then I think that's going to be fair
enough for now. You have to provide some extra routines anyway.

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH] rw_semaphores, optimisations try #3

2001-04-24 Thread David Howells

Linus Torvalds [EMAIL PROTECTED] wrote:
 - nobody will look up the list because we do have the spinlock at this
   point, so a destroyed list doesn't actually _matter_ to anybody

I suppose that it'll be okay, provided I take care not to access a block for a
task I've just woken up.

 - list_remove_between() doesn't care about the integrity of the entries
   it destroys. It only uses, and only changes, the entries that are still
   on the list.

True. Okay, I can change it to use that.

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH] rw_semaphores, optimisations try #4

2001-04-26 Thread David Howells

Andrea Arcangeli [EMAIL PROTECTED] wrote:
 It seems more similar to my code btw (you finally killed the useless
 chmxchg ;).

CMPXCHG ought to make things better by avoiding the XADD(+1)/XADD(-1) loop,
however, I tried various combinations and XADD beats CMPXCHG significantly.

Here's a quote from Borland assembler manual I managed to dig out, giving i486
timings on memory access:

   ADDL/SUBL3 cycles
   XADDL4 cycles
   CMPXCHG  8 cycles (success) / 10 cycles (failure)
   LOCK +1 cycle minimum on this CPU

In reality, however, XADDL gives at least as good a result as ADDL/SUBL, maybe
just a little bit better, but its hard to say. However, the penalty imposed on
the other CPU (when it has to flush it's cache) probably more than makes up
for the difference.

 I only had a short low at your attached patch, but the results are quite
 suspect to my eyes beacuse we should still be equally fast in the fast
 path and I should still beat you on the write fast path because I do a
 much faster subl; js while you do movl -1; xadd ; js, while according to
 your results you beat me on both. Do you have an explanation or you
 don't know the reason either?

MOVL $1,EDX
SUBL EDX,(EAX)

Works out faster than:

SUBL $1,(EAX)

as well... probably due to an avoided stall when the instruction before the
snippet loads EAX from memory. Oh yes... STC, SUBL may also be faster too.

 I will re-benchmark the whole thing shortly. But before re-benchmark if you
 have time could you fix the benchmark to use the variable pointer and send
 me a new tarball?  For your code it probably doesn't matter because you
 dereference the pointer by hand anyways, but it matters for mine and we want
 to benchmark real world fast path of course.

No, not till this evening now, I'm afraid.

As for real-world benchmarks, I suspect the fastpath is going to be
sufficiently few cycles that it's drowned out by whatever bit of code is
actually using it, like my Wine server module, which is where all this started
for me.

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



[RFC] gcc compile-time assertions

2001-05-02 Thread David Howells


I'm looking for comments on an idea I've thrashed out with David Woodhouse,
Arjan Van de Ven and Andrew Haley...

I've written a patch for gcc to implement compile-time assertions with an eye
to making use of this in the kernel in the future (assuming I can get it to be
accepted into gcc).

One of the main uses I can see for it is for things like udelay() that need
their arguments range checking. The current method of doing this is by causing
an undefined symbol to be referenced, thereby causing the linker to emit an
error that can be hard to trace.

The gcc patch can be downloaded:

ftp://infradead.org/pub/people/dwh/ctassert.diff

Basically, what I've written is a small extension for gcc that implements
compile-time assertion checking through a new built in function (this has
negligible impact on the rest of the source for gcc).

The assertion function takes two arguments, a condition and a message
string. The return value is an expression condition!=0. The function would
prototype something like:

int __builtin_ct_assert(any-type condition, const char message[])

Additionally, if that expression can be evaluated to a constant of zero at
compile time, an error will be issued that includes the message string in its
text.

The main reason I'd like to see this added to gcc is to help improve the Linux
kernel's robustness by catching certain conditions at compile time. For
instance, Linux's udelay() function (which waits for a number of microseconds
up to a limit of 2uS), is implemented in the i386 architecture thus:

| extern void __bad_udelay(void);
| extern void __udelay(unsigned long usecs);
| extern void __const_udelay(unsigned long usecs);
|
| #define udelay(n) (__builtin_constant_p(n) ? \
|   ((n)  2 ? __bad_udelay() : __const_udelay((n) * 0x10c6ul)) : \
|   __udelay(n))

This relies on __bad_udelay() getting referenced in the program when n is too
large, and causing a linker error, which is quite hard to trace since the
kernel build makes heavy use of incremental linking.

This can be re-implemented using my gcc patch:

| #define udelay(n) ( \
|__builtin_ct_assert((n)=2,udelay() value should be =2uS), \
|__builtin_constant_p(n) ? __const_udelay((n) * 0x10c6ul) : __udelay(n))

And producing an error of the following sort:

| test.c:21: compile-time assertion failed: udelay() value should be =2uS

Cheers,
David Howells
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [patch] 2.4.4 alpha semaphores optimization

2001-05-04 Thread David Howells

Hello Ivan,

One reason I picked signed long as the count type in the lib/rwsem.c is that
this would be 64 bits on a 64-bit arch such as the alpha.

So I've taken your idea for include/asm-alpha/rwsem.h and modified it a
little. You'll find it attached at the bottom.

I don't know whether it will (a) compile, or (b) work... I don't have an alpha
to play with.

I also don't know the alpha function calling convention, so I can't put direct
calls to the fallback routines in lib/rwsem.c from the .subsection 2
bits. Can you do that, or can you tell me how the calling convention works?

Cheers,
David

===
#ifndef _ALPHA_RWSEM_H
#define _ALPHA_RWSEM_H

/*
 * Written by Ivan Kokshaysky [EMAIL PROTECTED], 2001.
 * Based on asm-alpha/semaphore.h and asm-i386/rwsem.h
 */

#ifndef _LINUX_RWSEM_H
#error please dont include asm/rwsem.h directly, use linux/rwsem.h instead
#endif

#ifdef __KERNEL__

#include linux/list.h
#include linux/spinlock.h

struct rwsem_waiter;

extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
extern struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem);
extern struct rw_semaphore *rwsem_wake(struct rw_semaphore *);

/*
 * the semaphore definition
 */
struct rw_semaphore {
signed long count;
#define RWSEM_UNLOCKED_VALUE0x
#define RWSEM_ACTIVE_BIAS   0x0001
#define RWSEM_ACTIVE_MASK   0x
#define RWSEM_WAITING_BIAS  (-0x0001)
#define RWSEM_ACTIVE_READ_BIAS  RWSEM_ACTIVE_BIAS
#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
spinlock_t  wait_lock;
struct list_headwait_list;
#if RWSEM_DEBUG
int debug;
#endif
};

#if RWSEM_DEBUG
#define __RWSEM_DEBUG_INIT  , 0
#else
#define __RWSEM_DEBUG_INIT  /* */
#endif

#define __RWSEM_INITIALIZER(name) \
{ ATOMIC_INIT(RWSEM_UNLOCKED_VALUE), SPIN_LOCK_UNLOCKED, \
LIST_HEAD_INIT((name).wait_list) __RWSEM_DEBUG_INIT }

#define DECLARE_RWSEM(name) \
struct rw_semaphore name = __RWSEM_INITIALIZER(name)

static inline void init_rwsem(struct rw_semaphore *sem)
{
sem-count = RWSEM_UNLOCKED_VALUE;
spin_lock_init(sem-wait_lock);
INIT_LIST_HEAD(sem-wait_list);
#if RWSEM_DEBUG
sem-debug = 0;
#endif
}

static inline void __down_read(struct rw_semaphore *sem)
{
signed long oldcount, temp;
__asm__ __volatile__(
1: ldq_l %0,%1\n
   addq %0,%3,%2\n
   stq_c %2,%1\n
   beq %2,2f\n
#ifdef CONFIG_SMP
   mb\n
#endif
.subsection 2\n
2: br 1b\n
.previous
:=r (oldcount), =m (sem-count), =r (temp)
:Ir (RWSEM_ACTIVE_READ_BIAS), m (sem-count) : memory);

if (oldcount  0)
rwsem_down_read_failed(sem);
}

static inline void __down_write(struct rw_semaphore *sem)
{
signed long granted, temp;
__asm__ __volatile__(
1: ldq_l %0,%1\n
   addq %0,%3,%2\n
   stq_c %2,%1\n
   beq %2,2f\n
#ifdef CONFIG_SMP
   mb\n
   cmpeq %0,0,%0\n
#endif
.subsection 2\n
2: br 1b\n
.previous
:=r (granted), =m (sem-count), =r (temp)
:Ir (RWSEM_ACTIVE_WRITE_BIAS), m (sem-count) : memory);

if (!granted)
rwsem_down_write_failed(sem);
}

static inline void __up_read(struct rw_semaphore *sem)
{
signed long oldcount, temp;
__asm__ __volatile__(
1: ldq_l %0,%1\n
   subq %0,%3,%2\n
   stq_c %2,%1\n
   beq %2,2f\n
#ifdef CONFIG_SMP
   mb\n
#endif
.subsection 2\n
2: br 1b\n
.previous
:=r (oldcount), =m (sem-count), =r (temp)
:Ir (RWSEM_ACTIVE_READ_BIAS), m (sem-count) : memory);

if (oldcount  0)
if ((count  RWSEM_ACTIVE_MASK) == 0)
rwsem_wake(sem);
}

static inline void __up_write(struct rw_semaphore *sem)
{
signed long count, cmp;
__asm__ __volatile__(
1: ldq_l %0,%1\n
   subq %0,%3,%2\n
   stq_c %2,%1\n
   beq %2,2f\n
#ifdef CONFIG_SMP
   mb\n
   cmpeq %0,%3,%2\n
   subq %0,%3,%0\n
#endif
.subsection 2\n
2: br 1b\n
.previous
:=r (count), =m (sem-count), =r (cmp)
:Ir (RWSEM_ACTIVE_WRITE_BIAS), m (sem-count) : memory);

if (!cmp)
if ((count  RWSEM_ACTIVE_MASK) == 0)
rwsem_wake(sem);
}

#define rwsem_atomic_add(val, sem)  atomic_add(val, (sem)-count)
#define rwsem_atomic_update(val, sem)   atomic_add_return(val, (sem)-count)

#endif /* __KERNEL__ */
#endif /* 

Re: 2.4.4 rwsem make error, please somebody take a look !!

2001-05-08 Thread David Howells


What arch are you compiling for? i386?

  When I compiled bzImage (using .config from 2.2.3) I got the following
  errors:

Did you run one of the make config commands before building the kernel? You
may need to do this to flush the changes from arch/*/config.in into .config.

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: PTRACE_ATTACH breaks wait4()

2001-06-08 Thread David Howells

I have an idea for getting around this problem, but it's only half implemented
at the moment (I use it for implementing a Wine server in the kernel). It
involves having a list things called task ornaments attached to each
process. Each ornament has a table of event notification methods (so it can be
informed about fork, execve, signal and exit events). Signal event
notification methods are able to prevent a signal from propegating further
down the chain, and the parent's wait handler would be the last element in
this list. When a process attaches with ptrace(), it would insert another
ornament into this list, before the parent's ornament. This means (a) the
process doesn't have to be reparented, and (b) more than one debugger can
actually attach to a process (eg: strace and gdb both).

This would, however, mean that wait*() would have to not only look at a
process's list of children, but also it's list of processes it has ornamented
via ptrace

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



[RFC] I/O Access Abstractions

2001-06-28 Thread David Howells


In conjunction with David Woodhouse ([EMAIL PROTECTED]) and Arjan Van De Ven
([EMAIL PROTECTED]), I've come up with a way to abstract I/O accesses in the
Linux kernel whilst trying to keep overheads minimal. These would be
particularly useful on many non-i386 platforms.

Any comments would be greatly appreciated.

The example code I've written can be downloaded from:

ftp://infradead.org/pub/people/dwh/ioaccess.tar.bz2

Cheers,
David Howells
[EMAIL PROTECTED]
___

   I/O ACCESS ABSTRACTIONS
   ===

PURPOSES


 * Makes all peripheral input and output accesses look the same.

   - I/O ports
   - Memory-mapped I/O registers
   - PCI Configuration space
   - Device Specific registers

 * Can hide all the strange hoops that some architectures have to jump through
   to provide certain bus access services. For example:

   (1) The SH arch has no I/O port space in the same way that the i386 bus
   does. It instead maps a window in the physical memory space to PCI IO
   port accesses.

   (2) The AM33 arch also has no I/O port space, and likewise maps a window in
   memory space to PCI IO port addresses. Furthermore, the PCI _memory_
   space doesn't correspond on 1:1 basis with the CPU's memory space. At
   any one time, a 64Mb window is mapped from a fixed location in the CPU
   memory space into a mobile location on the PCI memory space. This can
   be moved by means of a control register on the host bridge.

 * Can hide the exact details of the layout of a devices register space as it
   varies from arch to arch. For instance, on the PC, a the 16550 compatible
   serial ports have 8 1-byte registers adjacent to each other in the I/O port
   space, whereas another arch may have the same 8 1-byte registers but at
   intervals of 4-bytes in memory space.

 * The operations by which a device can be accessed can be bound at runtime
   rather than at compile time without the need for conditional branches (this
   should simplify serial.c immensely).

 * Permit iotrace on specific resources by operation table substitution.

 * Provide a method of device emulation.

 * Potentially permit transparent byte swapping.


DISADVANTAGES
=

 * Indirection. It will incur an overhead of a few extra cycles on the i386
   (for the call and return). This can be minimised by carefully crafting bits
   of inline assembly and carefully choosing what registers are used to pass
   what values. This may also be offset under certain circumstances by the
   removal of conditional branches (for example in serial.c).

   Also on an i386, the actual I/O instruction itself is going to take a
   comparatively long time anyway, given the speed differential between CPU
   and external buses.

   On other archs where the indirection exists anyway, there shouldn't be any
   overhead


IMPLEMENTATION
==

 * The resource structure gains a pointer to a table of operations.

 * The table of resource operations includes 3 single-input functions, 3
   single-output functions, 3 string-input functions and 3 string-output
   functions. Each set of functions includes variation for byte, word and
   dword granularity.

 * Inline functions are provided on a per-arch basis that take a resource
   structure amongst their arguments, dereference it to pull out the
   appropriate function address, set up the registers and then call that
   function. For example:

struct resource *csr;
__u32 x = inputw(csr,0x10);
outputw(cst,0x10,x|0x0020);

 * Macros are provided for generating and using calling convention translation
   stubs for making it possible to write operation functions in C (which if
   necessary will be assembly functions):

__u16 __iocall pcnet32_inputw(struct resource *p, unsigned offset);
__DECLARE_INPUT_CALLBACK(w,pcnet32_inputw);
struct resource_ops pcnet32_ops = {
__USE_INPUT_CALLBACK(b,pcnet32_inputb),
__USE_INPUT_CALLBACK(w,pcnet32_inputw),
__USE_INPUT_CALLBACK(l,pcnet32_inputl),
};

   Note that on most archs (where there are enough registers) these macros
   will do nothing but plug the user-supplied function straight in.

 * There are three ops tables supplied for the i386:

   - I/O port accesses  [ioaccess-io.S]
   - memory-mapped I/O accesses [ioaccess-mem.S]
   - PCI type 1 config space accesses   [ioaccess-pciconf1.S]

   Note that the PCI access example does not have the complete set of
   functions at the moment.


EXTENSIONS
==

 * Driver-specific operations can be provided (for example CSR register access
   functions in the AMD PCnet32 driver).


ISSUES
==

Having discussed this with others the following points have been raised:

 * Some i386 support routines

[BUG] directory renaming/removal

2001-02-02 Thread David Howells


Run the following script (It's been tried on linux-2.2.x and linux-2.4.x):

#!/bin/sh
cd /tmp
mkdir x
cd x
mkdir x y z
strace -etrace=rename,mkdir,rmdir,chmod mv x z
echo -
chmod -w y
strace -etrace=rename,mkdir,rmdir,chmod mv y z

The output:

rename("x", "z/x")  = 0
-
rename("y", "z/y")  = -1 EACCES (Permission denied)
mkdir("z/y", 040755)= 0
chmod("z/y", 040555)= 0
rmdir("y")  = 0

You'll notice the following:

 (1) Linux can't rename directories that are marked as read-only. This is
 strange because the directories actually being modified _do_ have write
 permission.

 (2) You can _remove_ a read-only directory.

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: PCI-SCI Drivers v1.1-7 released

2001-02-07 Thread David Howells


 More to add on the gcc 2.96 problems.  After compiling a Linux 2.4.1 
 kernel on gcc 2.91, running SCI benchmarks, then compiling on RedHat 
 7.1 (Fischer) with gcc 2.96, the 2.96 build DROPPED 30% in throughput
 from the gcc 2.91 compiled version on the identical SAME 2.4.1 
 source tree. 

Out of interest, could you run your benchmark test against a "latest snapshot
build" of gcc?

http://www.codesourcery.com/gcc-snapshots/

Cheers,
David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

2001-02-07 Thread David Howells


Linus Torvalds [EMAIL PROTECTED] wrote:
 Actually, I'd rather leave it in, but speed it up with the saner and
 faster

   if (bh-b_size  (correct_size-1)) {

I presume that correct_size will always be a power of 2...

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: 2.4.x SMP blamed for Xfree 4.0 crashes

2001-02-13 Thread David Howells

I had problems with XFree86 4.0 and 4.0.1 locking solidly up under Linux 2.4.x
after about 10mins until I upgraded to XFree86 4.0.2. Now it seems rock solid.
XFree86 3.3.x was always okay.

I've got a Dual-PII machine and an NVidia GeForce.

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



*grin* Windows 2000 HPC: Scalable, Inexpensive Supercomputing Solutions

2001-02-14 Thread David Howells


How this for a laugh:

http://www.microsoft.com/WINDOWS2000/hpc/indstand.asp

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [LONG RANT] Re: Linux stifles innovation...

2001-02-19 Thread David Howells


I suspect part of the problem with commercial driver support on Linux is that
the Linux driver API (such as it is) is relatively poorly documented and seems
to change almost on a week-by-week basis anyway. I've done my share of chasing
the current kernel revision with drivers that aren't part of the kernel tree:
by the time you update the driver to work with the current kernel revision,
there's a new one out, and the driver doesn't compile with it.

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



rw_semaphore bug

2001-04-04 Thread David Howells

I've found a bug in the write path of the read/write semaphore stuff (at least
for the i386 arch). The attached kernel module (rwsem.c) and driver program
(driver.c) demonstrate it.

What happens is that once the driver finishes, you end up with a whole load of
processes forked off of driver that are stuck in the D state and are waiting
in down_write_failed(), even though the semaphore is in the unlocked state.


#define __NO_VERSION__
#include linux/config.h
#include linux/module.h
#include linux/poll.h
#include linux/proc_fs.h
#include linux/stat.h
#include linux/init.h
#include linux/personality.h
#include linux/smp_lock.h
#include linux/delay.h

struct proc_dir_entry *rwsem_proc;

struct rw_semaphore rwsem_sem;

static int rwsem_write_proc(struct file *file, const char *buffer, unsigned long 
count, void *data)
{
printk("[%05d %08x]: downing\n",current-pid,rwsem_sem.count);
down_write(rwsem_sem);
printk("[%05d %08x]: downed\n",current-pid,rwsem_sem.count);
mdelay(1);
printk("[%05d %08x]: upping\n",current-pid,rwsem_sem.count);
up_write(rwsem_sem);
printk("[%05d %08x]: upped\n",current-pid,rwsem_sem.count);

mdelay(100);

printk("[%05d %08x]: downing\n",current-pid,rwsem_sem.count);
down_write(rwsem_sem);
printk("[%05d %08x]: downed\n",current-pid,rwsem_sem.count);
mdelay(1);
printk("[%05d %08x]: upping\n",current-pid,rwsem_sem.count);
up_write(rwsem_sem);
printk("[%05d %08x]: upped\n",current-pid,rwsem_sem.count);

return -ENOENT;
}

static int __init rwsem_init_module(void)
{
printk(KERN_INFO "rwsem loading...\n");

init_rwsem(rwsem_sem);

/* try and create the access point */
rwsem_proc = create_proc_entry("rwsem",S_IRUGO|S_IWUGO,NULL);
if (!rwsem_proc)
return -EEXIST;

rwsem_proc-write_proc = rwsem_write_proc;
return 0;
}

static void __exit rwsem_cleanup_module(void)
{
printk(KERN_INFO "rwsem unloading\n");

remove_proc_entry("rwsem",NULL);

}

module_init(rwsem_init_module);
module_exit(rwsem_cleanup_module);


#include stdio.h
#include stdlib.h
#include unistd.h
#include sys/wait.h
#include fcntl.h

int main(int argc, char *argv[])
{
int loop, fd, tmp;

fd = open("/proc/rwsem",O_RDWR);
if (fd0) {
perror("open");
return 1;
}

for (loop=0; loop50; loop++) {
switch (fork()) {
case -1:
perror("fork");
return 1;
case 0:
write(fd," ",1);
exit(1);
default:
break;
}
}

for (loop=0; loop5; loop++) {
if (wait(tmp)0) {
perror("wait");
return 1;
}
}

return 0;
}



Re: rw_semaphores

2001-04-10 Thread David Howells


Since you're willing to use CMPXCHG in your suggested implementation, would it
make it make life easier if you were willing to use XADD too?

Plus, are you really willing to limit the number of readers or writers to be
32767? If so, I think I can suggest a way that limits it to ~65535 apiece
instead...

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



[PATCH] i386 rw_semaphores fix

2001-04-10 Thread David Howells

Here's a patch that fixes RW semaphores on the i386 architecture. It is very
simple in the way it works.

The lock counter is dealt with as two semi-independent words: the LSW is the
number of active (granted) locks, and the MSW, if negated, is the number of
active writers (0 or 1) plus the number of waiting lockers of any type.

The fast paths are either two or three instructions long.

This algorithm should also be totally fair! Contentious lockers get put on the
back of the wait queue, and a waker function wakes them starting at the front,
but only wakes either one writer or the first consecutive bundle of readers.

The disadvantage is that the maximum number of active locks is 65535, and the
maximum number of waiting locks is 32766 (this can be extended to 65534 by not
relying on the S flag).

I've included a revised testing module (rwsem.c) that allows read locks to be
obtained as well as write locks and a revised driver program (driver.c) that
can use rwsem.c. Try the following tests:

driver -200  driver 200 # fork 200 writers and then 200 readers
driver 200  driver -200 # fork 200 readers and then 200 writers

David Howells


diff -uNr linux-2.4.3/arch/i386/kernel/semaphore.c linux/arch/i386/kernel/semaphore.c
--- linux-2.4.3/arch/i386/kernel/semaphore.cThu Apr  5 14:38:34 2001
+++ linux/arch/i386/kernel/semaphore.c  Tue Apr 10 18:23:55 2001
@@ -14,7 +14,6 @@
  */
 #include linux/config.h
 #include linux/sched.h
-
 #include asm/semaphore.h
 
 /*
@@ -237,19 +236,10 @@
 __down_read_failed:
pushl   %edx
pushl   %ecx
-   jnc 2f
-
-3: calldown_read_failed_biased
-
-1: popl%ecx
+   calldown_read_failed
+   popl%ecx
popl%edx
ret
-
-2: calldown_read_failed
-   " LOCK "subl$1,(%eax)
-   jns 1b
-   jnc 2b
-   jmp 3b
 "
 );
 
@@ -260,169 +250,204 @@
 __down_write_failed:
pushl   %edx
pushl   %ecx
-   jnc 2f
-
-3: calldown_write_failed_biased
-
-1: popl%ecx
+   calldown_write_failed
+   popl%ecx
popl%edx
ret
-
-2: calldown_write_failed
-   " LOCK "subl$" RW_LOCK_BIAS_STR ",(%eax)
-   jz  1b
-   jnc 2b
-   jmp 3b
 "
 );
 
-struct rw_semaphore *FASTCALL(rwsem_wake_readers(struct rw_semaphore *sem));
-struct rw_semaphore *FASTCALL(rwsem_wake_writer(struct rw_semaphore *sem));
+asm(
+"
+.align 4
+.globl __rwsem_wake
+__rwsem_wake:
+   pushl   %edx
+   pushl   %ecx
+   callrwsem_wake
+   popl%ecx
+   popl%edx
+   ret
+"
+);
 
-struct rw_semaphore *FASTCALL(down_read_failed_biased(struct rw_semaphore *sem));
-struct rw_semaphore *FASTCALL(down_write_failed_biased(struct rw_semaphore *sem));
+struct rw_semaphore *FASTCALL(rwsem_wake(struct rw_semaphore *sem));
 struct rw_semaphore *FASTCALL(down_read_failed(struct rw_semaphore *sem));
 struct rw_semaphore *FASTCALL(down_write_failed(struct rw_semaphore *sem));
 
-struct rw_semaphore *down_read_failed_biased(struct rw_semaphore *sem)
-{
-   struct task_struct *tsk = current;
-   DECLARE_WAITQUEUE(wait, tsk);
 
-   add_wait_queue(sem-wait, wait);  /* put ourselves at the head of the 
list */
-
-   for (;;) {
-   if (sem-read_bias_granted  xchg(sem-read_bias_granted, 0))
-   break;
-   set_task_state(tsk, TASK_UNINTERRUPTIBLE);
-   if (!sem-read_bias_granted)
-   schedule();
-   }
+static inline int atomic_add_and_read_orig(int delta, atomic_t *v)
+{
+   int oldmem;
 
-   remove_wait_queue(sem-wait, wait);
-   tsk-state = TASK_RUNNING;
+   __asm__ __volatile__(
+   LOCK_PREFIX "xadd %0,%2"
+   : "=r"(oldmem)
+   : "r0"(delta), "m"(*__xg(v))
+   : "memory");
 
-   return sem;
+   return oldmem;
 }
 
-struct rw_semaphore *down_write_failed_biased(struct rw_semaphore *sem)
+static inline int atomic_add_and_read(int delta, atomic_t *v)
 {
-   struct task_struct *tsk = current;
-   DECLARE_WAITQUEUE(wait, tsk);
-
-   add_wait_queue_exclusive(sem-write_bias_wait, wait); /* put ourselves at 
the end of the list */
-
-   for (;;) {
-   if (sem-write_bias_granted  xchg(sem-write_bias_granted, 0))
-   break;
-   set_task_state(tsk, TASK_UNINTERRUPTIBLE);
-   if (!sem-write_bias_granted)
-   schedule();
-   }
-
-   remove_wait_queue(sem-write_bias_wait, wait);
-   tsk-state = TASK_RUNNING;
+   return atomic_add_and_read_orig(delta,v)+delta;
+}
 
-   /* if the lock is currently unbiased, awaken the sleepers
-* FIXME: this wakes up the readers early in a bit of a
-* stampede - bad!
-

Re: [PATCH] i386 rw_semaphores fix

2001-04-11 Thread David Howells

Can CONFIG_X86_XADD be equated to CONFIG_X86_CMPXCHG?

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



[PATCH] 2nd try: i386 rw_semaphores fix

2001-04-11 Thread David Howells
usive.
+/*
+ * wait for the write lock to be granted
  */
 struct rw_semaphore *down_write_failed(struct rw_semaphore *sem)
 {
struct task_struct *tsk = current;
-   DECLARE_WAITQUEUE(wait, tsk);
+   DECLARE_WAITQUEUE(wait,tsk);
+   int count;
 
-   __up_write(sem);/* this takes care of granting the lock */
+   rwsemdebug("[%d] Entering down_write_failed(%08x)\n",
+  current-pid,atomic_read(sem-count));
 
-   add_wait_queue_exclusive(sem-wait, wait);
+   /* this waitqueue context flag will be cleared when we are granted the lock */
+   __set_bit(RWSEM_WAITING_FOR_WRITE,wait.flags);
+
+   add_wait_queue_exclusive(sem-wait, wait); /* FIFO */
 
-   while (atomic_read(sem-count)  0) {
+   /* note that we're waiting on the lock, but no longer actively locking */
+   count = rwsem_atomic_update(-RWSEM_ACTIVE_BIAS,sem);
+   rwsemdebug("A(%08x)\n",count);
+
+   /* if there are no longer active locks, wake the front queued process(es) up
+* - it might even be this process, since the waker takes a more active part
+*/
+   if (!(count  RWSEM_ACTIVE_MASK))
+   rwsem_wake(sem);
+
+   /* wait to be given the lock */
+   for (;;) {
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
-   if (atomic_read(sem-count) = 0)
-   break;  /* we must attempt to acquire or bias the lock */
+   if (!test_bit(RWSEM_WAITING_FOR_WRITE,wait.flags))
+   break;
schedule();
}
 
-   remove_wait_queue(sem-wait, wait);
+   remove_wait_queue(sem-wait,wait);
tsk-state = TASK_RUNNING;
 
+   rwsemdebug("[%d] Leaving 
+down_write_failed(%08x)\n",current-pid,atomic_read(sem-count));
+
return sem;
 }
 
-asm(
-"
-.align 4
-.globl __rwsem_wake
-__rwsem_wake:
-   pushl   %edx
-   pushl   %ecx
-
-   jz  1f
-   callrwsem_wake_readers
-   jmp 2f
+/*
+ * handle the lock being released whilst there are processes blocked on it that can 
+now run
+ * - if we come here, then:
+ *   - the 'active part' of the count (0x) reached zero (but may no longer 
+be zero)
+ *   - the 'waiting part' of the count (0x) is negative (and will still be 
+so)
+ */
+struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
+{
+   int woken;
 
-1: callrwsem_wake_writer
+   rwsemdebug("[%d] Entering 
+rwsem_wake(%08x)\n",current-pid,atomic_read(sem-count));
 
-2: popl%ecx
-   popl%edx
-   ret
-"
-);
+   /* try to grab an 'activity' marker
+* - need to make sure two copies of rwsem_wake() don't do this for two 
+separate processes
+*   simultaneously
+* - be horribly naughty, and only deal with the LSW of the atomic counter
+*/
+   if (rwsem_cmpxchgw(sem,0,RWSEM_ACTIVE_BIAS)!=0)
+   goto out;
 
-/* Called when someone has done an up that transitioned from
- * negative to non-negative, meaning that the lock has been
- * granted to whomever owned the bias.
- */
-struct rw_semaphore *rwsem_wake_readers(struct rw_semaphore *sem)
-{
-   if (xchg(sem-read_bias_granted, 1))
+   /* try to grant a single write lock if there's a writer at the front of the 
+queue
+* - note we leave the 'active part' of the count incremented by 1 and the 
+waiting part
+*   incremented by 0x0001
+*/
+   switch (wake_up_ctx(sem-wait,1,-RWSEM_WAITING_FOR_WRITE)) {
+   case 1:
+   goto out;
+   case 0:
+   break;
+   default:
BUG();
-   wake_up(sem-wait);
-   return sem;
-}
+   }
 
-struct rw_semaphore *rwsem_wake_writer(struct rw_semaphore *sem)
-{
-   if (xchg(sem-write_bias_granted, 1))
+   rwsemdebug("E\n");
+
+   /* grant an infinite number of read locks to the readers at the front of the 
+queue
+* - note we increment the 'active part' of the count by the number of readers 
+just woken,
+*   less one for the activity decrement we've already done
+*/
+   woken = wake_up_ctx(sem-wait,65535,-RWSEM_WAITING_FOR_READ);
+   if (woken0) {
+   woken *= RWSEM_ACTIVE_BIAS-RWSEM_WAITING_BIAS;
+   woken -= RWSEM_ACTIVE_BIAS;
+   rwsem_atomic_update(woken,sem);
+   }
+   else {
BUG();
-   wake_up(sem-write_bias_wait);
+   }
+
+ out:
+   rwsemdebug("[%d] Leaving 
+rwsem_wake(%08x)\n",current-pid,atomic_read(sem-count));
return sem;
 }
 
+/*
+ * rw spinlock fallbacks
+ */
 #if defined(CONFIG_SMP)
 asm(
 "
@@ -451,4 +539,3 @@
 "
 );
 #endif
-
diff -uNr linux-2.4.3/include/asm-i386/rwsem-spin.h linux/include/asm-i386/rwsem-spin.h
--- linux-2.4.3/include/asm-i386/rwsem-spin.h   Thu Jan  1 01:00:00 1970
+++ linux/include/asm-i386

Re: [PATCH] 2nd try: i386 rw_semaphores fix

2001-04-11 Thread David Howells

 I'd like you to look over it. It seems newer GCC's (snapshots and the
 upcoming 3.0) will be more strict when modifying some values through
 assembler-passed pointers - in this case, the passed semaphore structure got
 freed too early, causing massive stack corruption on early bootup.

 The solution was to directly mention the modified element (in this case,
 sem-count) with a "=m" qualifier, which told GCC that the contents of the
 semaphore structure are still really needed. It does not seem to have any
 bad side effects on older GCC, but lets the code work on people trying to
 use the newer snapshots.

I've just consulted with one of the gcc people we have here, and he says that
the '"memory"' constraint should do the trick.

Do I take it that that is actually insufficient?

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH] 2nd try: i386 rw_semaphores fix

2001-04-11 Thread David Howells

I've been discussing it with some other kernel and GCC people, and they think
that only "memory" is required.

 What are the reasons against mentioning sem-count directly as a "=m" 
 reference? This makes the whole thing less fragile and no more dependent
 on the memory layout of the structure.

Apart from the risk of breaking it, you mean? Well, "=m" seems to reserve an
extra register to hold a second copy of the semaphore address, probably since
it thinks EAX might get clobbered.

Also, as a minor point, it probably ought to be "+m" not "=m".

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



[PATCH] 3rd try: i386 rw_semaphores fix

2001-04-11 Thread David Howells
uct rw_semaphore *down_write_failed(struct rw_semaphore *sem)
 {
struct task_struct *tsk = current;
-   DECLARE_WAITQUEUE(wait, tsk);
+   DECLARE_WAITQUEUE(wait,tsk);
+   int count;
 
-   __up_write(sem);/* this takes care of granting the lock */
+   rwsemdebug("[%d] Entering down_write_failed(%08x)\n",
+  current-pid,atomic_read(sem-count));
 
-   add_wait_queue_exclusive(sem-wait, wait);
+   /* this waitqueue context flag will be cleared when we are granted the lock */
+   __set_bit(RWSEM_WAITING_FOR_WRITE,wait.flags);
+
+   add_wait_queue_exclusive(sem-wait, wait); /* FIFO */
 
-   while (atomic_read(sem-count)  0) {
+   /* note that we're waiting on the lock, but no longer actively locking */
+   count = rwsem_atomic_update(-RWSEM_ACTIVE_BIAS,sem);
+   rwsemdebug("A(%08x)\n",count);
+
+   /* if there are no longer active locks, wake the front queued process(es) up
+* - it might even be this process, since the waker takes a more active part
+*/
+   if (!(count  RWSEM_ACTIVE_MASK))
+   rwsem_wake(sem);
+
+   /* wait to be given the lock */
+   for (;;) {
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
-   if (atomic_read(sem-count) = 0)
-   break;  /* we must attempt to acquire or bias the lock */
+   if (!test_bit(RWSEM_WAITING_FOR_WRITE,wait.flags))
+   break;
schedule();
}
 
-   remove_wait_queue(sem-wait, wait);
+   remove_wait_queue(sem-wait,wait);
tsk-state = TASK_RUNNING;
 
+   rwsemdebug("[%d] Leaving 
+down_write_failed(%08x)\n",current-pid,atomic_read(sem-count));
+
return sem;
 }
 
-asm(
-"
-.align 4
-.globl __rwsem_wake
-__rwsem_wake:
-   pushl   %edx
-   pushl   %ecx
-
-   jz  1f
-   callrwsem_wake_readers
-   jmp 2f
+/*
+ * handle the lock being released whilst there are processes blocked on it that can 
+now run
+ * - if we come here, then:
+ *   - the 'active part' of the count (0x) reached zero (but may no longer 
+be zero)
+ *   - the 'waiting part' of the count (0x) is negative (and will still be 
+so)
+ */
+struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
+{
+   int woken;
 
-1: callrwsem_wake_writer
+   rwsemdebug("[%d] Entering 
+rwsem_wake(%08x)\n",current-pid,atomic_read(sem-count));
 
-2: popl%ecx
-   popl%edx
-   ret
-"
-);
+   /* try to grab an 'activity' marker
+* - need to make sure two copies of rwsem_wake() don't do this for two 
+separate processes
+*   simultaneously
+* - be horribly naughty, and only deal with the LSW of the atomic counter
+*/
+   if (rwsem_cmpxchgw(sem,0,RWSEM_ACTIVE_BIAS)!=0)
+   goto out;
 
-/* Called when someone has done an up that transitioned from
- * negative to non-negative, meaning that the lock has been
- * granted to whomever owned the bias.
- */
-struct rw_semaphore *rwsem_wake_readers(struct rw_semaphore *sem)
-{
-   if (xchg(sem-read_bias_granted, 1))
+   /* try to grant a single write lock if there's a writer at the front of the 
+queue
+* - note we leave the 'active part' of the count incremented by 1 and the 
+waiting part
+*   incremented by 0x0001
+*/
+   switch (wake_up_ctx(sem-wait,1,-RWSEM_WAITING_FOR_WRITE)) {
+   case 1:
+   goto out;
+   case 0:
+   break;
+   default:
BUG();
-   wake_up(sem-wait);
-   return sem;
-}
+   }
 
-struct rw_semaphore *rwsem_wake_writer(struct rw_semaphore *sem)
-{
-   if (xchg(sem-write_bias_granted, 1))
+   rwsemdebug("E\n");
+
+   /* grant an infinite number of read locks to the readers at the front of the 
+queue
+* - note we increment the 'active part' of the count by the number of readers 
+just woken,
+*   less one for the activity decrement we've already done
+*/
+   woken = wake_up_ctx(sem-wait,65535,-RWSEM_WAITING_FOR_READ);
+   if (woken0) {
+   woken *= RWSEM_ACTIVE_BIAS-RWSEM_WAITING_BIAS;
+   woken -= RWSEM_ACTIVE_BIAS;
+   rwsem_atomic_update(woken,sem);
+   }
+   else {
BUG();
-   wake_up(sem-write_bias_wait);
+   }
+
+ out:
+   rwsemdebug("[%d] Leaving 
+rwsem_wake(%08x)\n",current-pid,atomic_read(sem-count));
return sem;
 }
 
+/*
+ * rw spinlock fallbacks
+ */
 #if defined(CONFIG_SMP)
 asm(
 "
@@ -451,4 +539,3 @@
 "
 );
 #endif
-
diff -uNr linux-2.4.3/include/asm-i386/rwsem-spin.h 
linux-rwsem/include/asm-i386/rwsem-spin.h
--- linux-2.4.3/include/asm-i386/rwsem-spin.h   Thu Jan  1 01:00:00 1970
+++ linux-rwsem/include/asm-i386/rwsem-spin.h   Wed Apr 11 17:18:07 2001
@@ -0,0 +1,228 @@
+/

Re: [PATCH] i386 rw_semaphores fix

2001-04-11 Thread David Howells

Andrew Morton wrote:
 I think that's a very good approach.  Sure, it's suboptimal when there
 are three or more waiters (and they're the right type and order).  But
 that never happens.  Nice design idea.

Cheers.

 These numbers are infinity :)

I know, but I think Linus may be happy with the resolution for the moment. It
can be extended later by siphoning off excess quantities of waiters into a
separate counter (as is done now) and by making the access count use a larger
part of the variable.

Unfortunately, managing the count and siphoned-off count together is tricky.

 You need sterner testing stuff :)  I hit the BUG at the end of rwsem_wake()
 in about a second running rwsem-4.  Removed the BUG and everything stops
 in D state.

 Grab rwsem-4 from
 ...

Will do.

 It's very simple.  But running fully in-kernel shortens the
 code paths enormously and allows you to find those little
 timing windows.

I thought I'd got them all by using an activity counter incremented by both
read and write lockers.

 - rwsemdebug(FMT, ...) doesn't compile with egcs-1.1.2.  Need
 to remove the comma.

This is tricky... you get all sorts of horrible warnings with gcc-2.96 if you
remove the comma. What I've done is now ANSI-C99 compliant, but egcs is not.

 - The comments in down_write and down_read() are inaccurate.
 RWSEM_ACTIVE_WRITE_BIAS is 0x0001, not 0x00010001

Done.

 - It won't compile when WAITQUEUE_DEBUG is turned on. I
 guess you knew that.

Currently putting in separate debugging stuff for rwsems.

 - The comments above the functions in semaphore.h need
 updating.

Done. (BTW in the latest patch, they're actually split out into separate
header files as per Linus's suggestion).

 - What on earth does __xg() do?  (And why do people write
 code like that without explaining why?  Don't answer this
 one).

Stolen from the xchg() macro/function, but I'm not sure what it does. Plus I
don't use it now.

 - Somewhat offtopic: the `asm' statements in semaphore.c
 are really dangerous.

Now all got .text in.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH] i386 rw_semaphores fix

2001-04-11 Thread David Howells


 You need sterner testing stuff :)  I hit the BUG at the end of rwsem_wake()
 in about a second running rwsem-4.  Removed the BUG and everything stops
 in D state.
 
 Grab rwsem-4 from
 
   http://www.uow.edu.au/~andrewm/linux/rwsem.tar.gz
 
 It's very simple.  But running fully in-kernel shortens the
 code paths enormously and allows you to find those little
 timing windows.  Run rmsem-4 in two modes: one with
 the schedule() in sched() enabled, and also with it
 commented out.  If it passes that, it works.  When
 you remove the module it'll print out the number of
 read-grants versus write-grants.  If these run at 6:1
 with schedule() disabled then you've kicked butt.
 
 Also, rwsem-4 checks that the rwsems are actually providing
 exclusion between readers and writers, and between
 writers and writers.  A useful thing to check, that.

It now works (patch to follow).

schedule() enabled:
 reads taken: 686273
 writes taken: 193414

schedule() disabled:
 reads taken: 585619
 writes taken: 292997

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



[PATCH] 4th try: i386 rw_semaphores fix

2001-04-11 Thread David Howells
art' of the count by the number of readers 
+just woken,
+*   less one for the activity decrement we've already done
+*/
+   woken = wake_up_ctx(sem-wait,65535,-RWSEM_WAITING_FOR_READ);
+   if (woken=0)
+   goto counter_correction;
+
+   woken *= RWSEM_ACTIVE_BIAS-RWSEM_WAITING_BIAS;
+   woken -= RWSEM_ACTIVE_BIAS;
+   rwsem_atomic_update(woken,sem);
 
-struct rw_semaphore *rwsem_wake_writer(struct rw_semaphore *sem)
-{
-   if (xchg(sem-write_bias_granted, 1))
-   BUG();
-   wake_up(sem-write_bias_wait);
+ out:
+   rwsemdebug("[%d] Leaving 
+rwsem_wake(%08x)\n",current-pid,atomic_read(sem-count));
return sem;
+
+   /* come here if we need to correct the counter for odd SMP-isms */
+ counter_correction:
+   count = rwsem_atomic_update(-RWSEM_ACTIVE_BIAS,sem);
+   rwsemdebug("[%d] corrected(%08x)\n",current-pid,count);
+   if (!(count  RWSEM_ACTIVE_MASK))
+   goto try_again;
+   goto out;
 }
 
+/*
+ * rw spinlock fallbacks
+ */
 #if defined(CONFIG_SMP)
 asm(
 "
@@ -451,4 +549,3 @@
 "
 );
 #endif
-
diff -uNr linux-2.4.3/include/asm-i386/rwsem-spin.h 
linux-rwsem/include/asm-i386/rwsem-spin.h
--- linux-2.4.3/include/asm-i386/rwsem-spin.h   Thu Jan  1 01:00:00 1970
+++ linux-rwsem/include/asm-i386/rwsem-spin.h   Wed Apr 11 22:28:32 2001
@@ -0,0 +1,239 @@
+/* rwsem.h: R/W semaphores based on spinlocks
+ *
+ * Written by David Howells ([EMAIL PROTECTED]).
+ *
+ * Derived from asm-i386/semaphore.h and asm-i386/spinlock.h
+ */
+
+#ifndef _I386_RWSEM_SPIN_H
+#define _I386_RWSEM_SPIN_H
+
+#ifdef __KERNEL__
+
+#define CONFIG_USING_SPINLOCK_BASED_RWSEM 1
+
+/*
+ * the semaphore definition
+ */
+struct rw_semaphore {
+   atomic_tcount;
+#define RWSEM_UNLOCKED_VALUE   0x
+#define RWSEM_ACTIVE_BIAS  0x0001
+#define RWSEM_ACTIVE_MASK  0x
+#define RWSEM_WAITING_BIAS (-0x0001)
+#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
+#define RWSEM_ACTIVE_WRITE_BIAS(RWSEM_WAITING_BIAS + 
+RWSEM_ACTIVE_BIAS)
+   spinlock_t  lock;
+#define RWSEM_SPINLOCK_OFFSET_STR  "4" /* byte offset of spinlock */
+   wait_queue_head_t   wait;
+#define RWSEM_WAITING_FOR_READ WQ_FLAG_CONTEXT_0   /* bits to use in 
+wait_queue_t.flags */
+#define RWSEM_WAITING_FOR_WRITEWQ_FLAG_CONTEXT_1
+#if RWSEM_DEBUG
+   int debug;
+#endif
+#if RWSEM_DEBUG_MAGIC
+   long__magic;
+   atomic_treaders;
+   atomic_twriters;
+#endif
+};
+
+/*
+ * initialisation
+ */
+#if RWSEM_DEBUG
+#define __RWSEM_DEBUG_INIT  , 0
+#else
+#define __RWSEM_DEBUG_INIT /* */
+#endif
+#if RWSEM_DEBUG_MAGIC
+#define __RWSEM_DEBUG_MINIT(name)  , (int)(name).__magic, ATOMIC_INIT(0), 
+ATOMIC_INIT(0)
+#else
+#define __RWSEM_DEBUG_MINIT(name)  /* */
+#endif
+
+#define __RWSEM_INITIALIZER(name,count) \
+{ ATOMIC_INIT(RWSEM_UNLOCKED_VALUE), SPIN_LOCK_UNLOCKED, \
+   __WAIT_QUEUE_HEAD_INITIALIZER((name).wait) \
+   __RWSEM_DEBUG_INIT __RWSEM_DEBUG_MINIT(name) }
+
+#define __DECLARE_RWSEM_GENERIC(name,count) \
+   struct rw_semaphore name = __RWSEM_INITIALIZER(name,count)
+
+#define DECLARE_RWSEM(name) __DECLARE_RWSEM_GENERIC(name,RW_LOCK_BIAS)
+#define DECLARE_RWSEM_READ_LOCKED(name) __DECLARE_RWSEM_GENERIC(name,RW_LOCK_BIAS-1)
+#define DECLARE_RWSEM_WRITE_LOCKED(name) __DECLARE_RWSEM_GENERIC(name,0)
+
+static inline void init_rwsem(struct rw_semaphore *sem)
+{
+   atomic_set(sem-count, RWSEM_UNLOCKED_VALUE);
+   spin_lock_init(sem-lock);
+   init_waitqueue_head(sem-wait);
+#if RWSEM_DEBUG
+   sem-debug = 0;
+#endif
+#if RWSEM_DEBUG_MAGIC
+   sem-__magic = (long)sem-__magic;
+   atomic_set(sem-readers, 0);
+   atomic_set(sem-writers, 0);
+#endif
+}
+
+/*
+ * lock for reading
+ */
+static inline void __down_read(struct rw_semaphore *sem)
+{
+   __asm__ __volatile__(
+   "# beginning down_read\n\t"
+#ifdef CONFIG_SMP
+LOCK_PREFIX"  decb  "RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t" /* try to grab 
+the spinlock */
+   "  js3f\n" /* jump if failed */
+   "1:\n\t"
+#endif
+   "  incl  (%%eax)\n\t" /* adds 0x0001, returns the old value */
+#ifdef CONFIG_SMP
+   "  movb  $1,"RWSEM_SPINLOCK_OFFSET_STR"(%%eax)\n\t" /* release the 
+spinlock */
+#endif
+   "  js4f\n\t" /* jump if we weren't granted the lock */
+   "2:\n"
+   ".section .text.lock,\"ax\"\n"
+#ifdef CONFIG_SMP
+   "3:\n\t" /* spin on the spinlock till we get it */
+   "  cmpb  $0,"RWSEM_SPINLOCK_OFFSET_STR"

[PATCH] i386 rw_semaphores, general abstraction patch

2001-04-12 Thread David Howells
-
-   return sem;
-}
-
-/*
- * handle the lock being released whilst there are processes blocked on it that can 
now run
- * - if we come here, then:
- *   - the 'active part' of the count (0x) reached zero (but may no longer 
be zero)
- *   - the 'waiting part' of the count (0x) is negative (and will still be 
so)
- */
-struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
-{
-   int woken, count;
-
-   rwsemdebug("[%d] Entering 
rwsem_wake(%08x)\n",current-pid,atomic_read(sem-count));
-
- try_again:
-   /* try to grab an 'activity' marker
-* - need to make sure two copies of rwsem_wake() don't do this for two 
separate processes
-*   simultaneously
-* - be horribly naughty, and only deal with the LSW of the atomic counter
-*/
-   if (rwsem_cmpxchgw(sem,0,RWSEM_ACTIVE_BIAS)!=0) {
-   rwsemdebug("[%d] rwsem_wake: abort wakeup due to renewed 
activity\n",current-pid);
-   goto out;
-   }
-
-   /* try to grant a single write lock if there's a writer at the front of the 
queue
-* - note we leave the 'active part' of the count incremented by 1 and the 
waiting part
-*   incremented by 0x0001
-*/
-   if (wake_up_ctx(sem-wait,1,-RWSEM_WAITING_FOR_WRITE)==1)
-   goto out;
-
-   /* grant an infinite number of read locks to the readers at the front of the 
queue
-* - note we increment the 'active part' of the count by the number of readers 
just woken,
-*   less one for the activity decrement we've already done
-*/
-   woken = wake_up_ctx(sem-wait,65535,-RWSEM_WAITING_FOR_READ);
-   if (woken=0)
-   goto counter_correction;
-
-   woken *= RWSEM_ACTIVE_BIAS-RWSEM_WAITING_BIAS;
-   woken -= RWSEM_ACTIVE_BIAS;
-   rwsem_atomic_update(woken,sem);
-
- out:
-   rwsemdebug("[%d] Leaving 
rwsem_wake(%08x)\n",current-pid,atomic_read(sem-count));
-   return sem;
-
-   /* come here if we need to correct the counter for odd SMP-isms */
- counter_correction:
-   count = rwsem_atomic_update(-RWSEM_ACTIVE_BIAS,sem);
-   rwsemdebug("[%d] corrected(%08x)\n",current-pid,count);
-   if (!(count  RWSEM_ACTIVE_MASK))
-   goto try_again;
-   goto out;
-}
-
 /*
  * rw spinlock fallbacks
  */
diff -uNr linux-2.4.4-pre2/arch/i386/lib/Makefile linux/arch/i386/lib/Makefile
--- linux-2.4.4-pre2/arch/i386/lib/Makefile Thu Apr 12 08:57:23 2001
+++ linux/arch/i386/lib/MakefileThu Apr 12 10:07:33 2001
@@ -9,7 +9,7 @@
 
 obj-y = checksum.o old-checksum.o delay.o \
usercopy.o getuser.o putuser.o \
-   memcpy.o strstr.o
+   memcpy.o strstr.o rwsem.o
 
 obj-$(CONFIG_X86_USE_3DNOW) += mmx.o
 obj-$(CONFIG_HAVE_DEC_LOCK) += dec_and_lock.o
diff -uNr linux-2.4.4-pre2/arch/i386/lib/rwsem.S linux/arch/i386/lib/rwsem.S
--- linux-2.4.4-pre2/arch/i386/lib/rwsem.S  Thu Jan  1 01:00:00 1970
+++ linux/arch/i386/lib/rwsem.S Thu Apr 12 15:49:25 2001
@@ -0,0 +1,36 @@
+/* rwsem.S: R/W semaphores, register saving wrapper function stubs
+ *
+ * Written by David Howells ([EMAIL PROTECTED]).
+ * Derived from arch/i386/kernel/semaphore.c
+ */
+
+.text
+.align 4
+.globl __rwsem_down_read_failed
+__rwsem_down_read_failed:
+   pushl   %edx
+   pushl   %ecx
+   callrwsem_down_read_failed
+   popl%ecx
+   popl%edx
+   ret
+
+.align 4
+.globl __rwsem_down_write_failed
+__rwsem_down_write_failed:
+   pushl   %edx
+   pushl   %ecx
+   callrwsem_down_write_failed
+   popl%ecx
+   popl%edx
+   ret
+
+.align 4
+.globl __rwsem_wake
+__rwsem_wake:
+   pushl   %edx
+   pushl   %ecx
+   callrwsem_wake
+   popl%ecx
+   popl%edx
+   ret
diff -uNr linux-2.4.4-pre2/include/asm-i386/rwsem-spin.h 
linux/include/asm-i386/rwsem-spin.h
--- linux-2.4.4-pre2/include/asm-i386/rwsem-spin.h  Thu Apr 12 08:57:28 2001
+++ linux/include/asm-i386/rwsem-spin.h Thu Apr 12 15:52:13 2001
@@ -8,6 +8,12 @@
 #ifndef _I386_RWSEM_SPIN_H
 #define _I386_RWSEM_SPIN_H
 
+#ifndef _LINUX_RWSEM_H
+#error please dont include asm/rwsem-spin.h directly, use linux/rwsem.h instead
+#endif
+
+#include linux/spinlock.h
+
 #ifdef __KERNEL__
 
 #define CONFIG_USING_SPINLOCK_BASED_RWSEM 1
@@ -16,7 +22,7 @@
  * the semaphore definition
  */
 struct rw_semaphore {
-   atomic_tcount;
+   signed long count;
 #define RWSEM_UNLOCKED_VALUE   0x
 #define RWSEM_ACTIVE_BIAS  0x0001
 #define RWSEM_ACTIVE_MASK  0x
@@ -53,7 +59,7 @@
 #endif
 
 #define __RWSEM_INITIALIZER(name,count) \
-{ ATOMIC_INIT(RWSEM_UNLOCKED_VALUE), SPIN_LOCK_UNLOCKED, \
+{ RWSEM_UNLOCKED_VALUE, SPIN_LOCK_UNLOCKED, \
__WAIT_QUEUE_HEAD_INITIALIZER((name).wait) \
__RWSEM_DEBUG_INIT __RWSEM_DEBUG_MINIT(name) }
 
@@ -66,7 +72,7 @@
 

Re: generic rwsem [Re: Alpha process table hang]

2001-04-17 Thread David Howells

Andrea,

How did you generate the 00_rwsem-generic-1 patch? Against what did you diff?
You seem to have removed all the optimised i386 rwsem stuff... Did it not work
for you?

 (the generic rwsemaphores in those kernels is broken, try to use them in
 other archs or x86 and you will notice) and I cannot reproduce the hang any
 longer.

Can you supply a test case that demonstrates it not working?

 My generic rwsem should be also cleaner and faster than the generic ones in
 2.4.4pre3 and they can be turned off completly so an architecture can really
 takeover with its own asm implementation.

I quick look says it shouldn't be faster (inline functions and all that).

However, I think you might be right about it being too dependent on the
algorithm I put in, and that is easy to change.

 (while with the 2.4.4pre3 design this is obviously not possible because
 lib/rwsem.c compilation isn't conditional and such file knows the internals
 of the struct rw_semaphore).

Could be very easily changed.

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH] CONFIG_PACKET_MMAP should depend on MMU

2007-04-17 Thread David Howells
Robin Getz [EMAIL PROTECTED] wrote:

 David - Does this give you what you need?

Possibly.  I'll have a look at it tomorrow.

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] CONFIG_PACKET_MMAP should depend on MMU

2007-04-18 Thread David Howells
Aubrey Li [EMAIL PROTECTED] wrote:

 Here, in the attachment I wrote a small test app. Please correct if
 there is anything wrong, and feel free to improve it.

Okay... I have that working... probably.  I don't know what output it's
supposed to produce, but I see this:

# /packet-mmap/sample_packet_mmap
00-00-00-01-00-00-00-8a-00-00-00-8a-00-42-00-50-
38-43-13-a0-00-07-ff-3c-00-00-00-00-00-00-00-00-
00-11-08-00-00-00-00-01-00-01-00-06-00-d0-b7-de-
32-7b-00-00-00-00-00-00-00-00-00-00-00-00-00-00-
00-00-00-90-cc-a2-75-6b-00-d0-b7-de-32-7b-08-00-
45-00-00-7c-00-00-40-00-40-11-b4-13-c0-a8-02-80-
c0-a8-02-8d-08-01-03-20-00-68-8e-65-7f-5b-7e-03-
00-00-00-01-00-00-00-00-00-00-00-00-00-00-00-00-
00-00-00-00-00-00-00-00-00-00-00-01-00-00-81-a4-
00-00-00-01-00-00-00-00-00-00-00-00-00-1d-b8-86-
00-00-10-00-ff-ff-ff-ff-00-00-0e-f0-00-00-09-02-
01-cb-03-16-46-26-38-0d-00-00-00-00-46-26-38-1e-
00-00-00-00-46-26-38-1e-00-00-00-00-00-00-00-00-
00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00- [repeated]

Does that look reasonable?

I've attached the preliminary patch.  Note four things about it:

 (1) I've had to add the get_unmapped_area() op to the proto_ops struct, but
 I've only done it for CONFIG_MMU=n as making it available for CONFIG_MMU=y
 could cause problems.

 (2) There's a race between packet_get_unmapped_area() being called and
 packet_mmap() being called.

 (3) I've added an extra check into packet_set_ring() to make sure the caller
 isn't asking for a combination of buffer size and count that will exceed
 ULONG_MAX.  This protects a multiply done elsewhere.

 (4) The entire data buffer is allocated as one contiguous lump in NOMMU-mode.

David

---
[PATCH] NOMMU: Support mmap() on AF_PACKET sockets

From: David Howells [EMAIL PROTECTED]

Support mmap() on AF_PACKET sockets in NOMMU-mode kernels.

Signed-Off-By: David Howells [EMAIL PROTECTED]
---

 include/linux/net.h|7 +++
 include/net/sock.h |8 +++
 net/core/sock.c|   10 
 net/packet/af_packet.c |  118 
 net/socket.c   |   77 +++
 5 files changed, 219 insertions(+), 1 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index 4db21e6..9e77cf6 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -161,6 +161,11 @@ struct proto_ops {
int (*recvmsg)   (struct kiocb *iocb, struct socket *sock,
  struct msghdr *m, size_t total_len,
  int flags);
+#ifndef CONFIG_MMU
+   unsigned long   (*get_unmapped_area)(struct file *file, struct socket 
*sock,
+unsigned long addr, unsigned long 
len,
+unsigned long pgoff, unsigned long 
flags);
+#endif
int (*mmap)  (struct file *file, struct socket *sock,
  struct vm_area_struct * vma);
ssize_t (*sendpage)  (struct socket *sock, struct page *page,
@@ -191,6 +196,8 @@ extern int   sock_sendmsg(struct socket *sock, 
struct msghdr *msg,
 extern int  sock_recvmsg(struct socket *sock, struct msghdr *msg,
  size_t size, int flags);
 extern int  sock_map_fd(struct socket *sock);
+extern void sock_make_mappable(struct socket *sock,
+   unsigned long prot);
 extern struct socket *sockfd_lookup(int fd, int *err);
 #define sockfd_put(sock) fput(sock-file)
 extern int  net_ratelimit(void);
diff --git a/include/net/sock.h b/include/net/sock.h
index 2c7d60c..d91edea 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -841,6 +841,14 @@ extern int  sock_no_sendmsg(struct 
kiocb *, struct socket *,
struct msghdr *, size_t);
 extern int  sock_no_recvmsg(struct kiocb *, struct socket 
*,
struct msghdr *, size_t, int);
+#ifndef CONFIG_MMU
+extern unsigned long   sock_no_get_unmapped_area(struct file *,
+ struct socket *,
+ unsigned long,
+ unsigned long,
+ unsigned long,
+ unsigned long);
+#endif
 extern int sock_no_mmap(struct file *file,
 struct socket *sock,
 struct vm_area_struct *vma);
diff --git a/net/core/sock.c b/net/core/sock.c
index 27c4f62..b288799

Re: [PATCH] fs/afs: Convert to kthread API.

2007-04-19 Thread David Howells
Eric W. Biederman [EMAIL PROTECTED] wrote:

 This patch modifies the startup of kafscmd, kafsasyncd, and kafstimod
 to use kthread_run instead of a combination of kernel_thread and
 daemonize making the code slightly simpler and more maintainable.

Please drop this patch for the moment as I have my own patches to convert them
to keventd-type threads, in addition to implementing a host of other changes.

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] net/rxrpc: Convert to kthread API.

2007-04-19 Thread David Howells
Eric W. Biederman [EMAIL PROTECTED] wrote:

 This patch modifies the startup of krxtimod, krxiod, and krxsecd
 to use kthread_run instead of a combination of kernel_thread
 and daemonize making the code slightly simpler and more maintainable.

Again, please drop in favour of my RxRPC patches.

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] CONFIG_PACKET_MMAP should depend on MMU

2007-04-19 Thread David Howells
Aubrey Li [EMAIL PROTECTED] wrote:

 Yes, it's reasonable for me, as long as your
 host IP is 192.168.2.128
 and
 target IP is 192.168.2.141

That is correct, yes:-)

I expect it's an NFS packet as my board is using an NFS root at the moment.

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Getting the new RxRPC patches upstream

2007-04-19 Thread David Howells
Eric W. Biederman [EMAIL PROTECTED] wrote:

 What is the ETA on your patches?

That depends on Dave Miller now, I think.  I'm assuming they need to go
through the network GIT tree to get to Linus.  Certainly Andrew Morton seems
to think so.

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Getting the new RxRPC patches upstream

2007-04-19 Thread David Howells
Eric W. Biederman [EMAIL PROTECTED] wrote:

 Ok.  I don't see any patches in -mm so I was assuming these patches have
 not been queued up anywhere.

They haven't been quite yet.  Is it your intention to kill these features in
2.6.22?

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] net/rxrpc: Convert to kthread API.

2007-04-20 Thread David Howells
Andrew Morton [EMAIL PROTECTED] wrote:

 Do those patches convert all this code over to full use of the kthread
 API?  Because it seems that a conversion would be straightforward, and
 is needed.

No.  They delete all that code entirely and use workqueues instead.  So, I
suppose merging Eric's patches first should be a simple matter of just
deleting his revised code instead.

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] CONFIG_PACKET_MMAP should depend on MMU

2007-04-20 Thread David Howells
Aubrey Li [EMAIL PROTECTED] wrote:

 The patch works properly on my side. But
 1) I'm not sure why you re-wrote alloc/free_pg_vec function, doesn't
 the current implement work for NOMMU? I know you want to allocate the
 entire data buffer as one contiguous lump, but is it really necessary?

Yes.  It's not possible to map the whole buffer otherwise.  Think about it!
mmap() returns _one_ reference address.  In MMU-mode, the non-contiguous
physical buffers can be made to appear virtually contiguous by fudging the
page tables and using the MMU.  This is not possible in NOMMU-mode.  The app
will expect the buffer to be one contiguous lump in its address space, and
will not be able to locate the other segments of the buffer.

Actually, what I said is not quite true.  It is possible to map the whole
buffer otherwise: I could lift the restriction that requires that you map the
whole buffer or not at all, and then userspace could stitch the whole lot
together itself.  This would then require userspace to be bimodal.

 2) So the mapped pages doesn't count into NR_FILE_MAPPED, is it a problem?

Not really, no - there are no pagetables.

Furthermore, issuing the PACKET_RX_RING sockopt does the entire allocation.
Any subsequent mmaps on it have little effect.

We could do that accounting though if you think it'd be better.  I don't
suppose it hurts.

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Getting the new RxRPC patches upstream

2007-04-20 Thread David Howells
David Miller [EMAIL PROTECTED] wrote:

 I applied already the patches I thought were appropriate,
 you had some crypto layer changes that you need to work
 out with Herbert Xu before the rest can be applied.

Should the rest of it go via Andrew's tree then?

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] CONFIG_PACKET_MMAP should depend on MMU

2007-04-20 Thread David Howells
Aubrey Li [EMAIL PROTECTED] wrote:

 as checked in packet_set_ring, buffer size must be a multiple of PAGE_SIZE,
 packet_set_ring
 if (unlikely(req-tp_block_size  (PAGE_SIZE - 1)))
 
 So why not use __get_free_pages rather than kmalloc,

Because kmalloc() may be able to get us a smaller chunk of memory.  Actually,
calling __get_free_pages() might be a better, and then release the excess
pages.

 so that we have pagetables to count?

There are no pagetables in NOMMU-mode.

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Getting the new RxRPC patches upstream

2007-04-20 Thread David Howells
David Miller [EMAIL PROTECTED] wrote:

 Now that Herbert cleared up the crypto layer issues
 the only problem left is that there are generic changes
 in there which are not strictly networking but which
 your subsequent networking changes depend upon.

 This is a mess, and makes merging your work into the
 net-2.6.22 tree more difficult.

There are only two non-net patches that AF_RXRPC depends on:

 (1) The key facility changes.  That's all my code anyway, and shouldn't be a
 problem to merge unless someone else has put some changes in there that I
 don't know about.

 (2) try_to_cancel_delayed_work().  I suppose I could use
 cancel_delayed_work() instead, but that's less efficient as it waits for
 the timer completion function to finish.

And one that AFS depends on:

 (3) Cache the key in nameidata.  I still don't have Al's agreement on this,
 but it's purely caching, so I could drop that patch for the moment and
 excise the stuff that uses it from my AFS patches if that would help.

Do you class the AFS patches as networking changes?

Do you want me to consolidate my patches to make things simpler for you?

Do you want me to rebase my patches onto net-2.6.22?

I have the following patches, in order, available now, though I haven't yet
released the last few (they can all be downloaded from my RH people pages):

move-skb-generic.diff  (you've got this)
timers.diff
keys.diff
af_rxrpc.diff
afs-cleanup.diff
af_rxrpc-kernel.diff
af_rxrpc-afs.diff
af_rxrpc-delete-old.diff
af_rxrpc-own-workqueues.diff
af_rxrpc-fixes.diff
afs-callback-wq.diff
afs-vlocation.diff
afs-multimount.diff
afs-rxrpc-key.diff
afs-nameidata-key.diff
afs-security.diff
afs-doc.diff
netlink-support-MSG_TRUNC.diff  (you've got this)
afs-get-capabilities.diff
afs-initcallbackstate3.diff
afs-dir-write-support.diff

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] CONFIG_PACKET_MMAP should depend on MMU

2007-04-20 Thread David Howells
Eric Dumazet [EMAIL PROTECTED] wrote:

 Is it really possible to allocate an order-10 page, then release part of it
 (say an order-8 subpage) ?

Yes.

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Getting the new RxRPC patches upstream

2007-04-23 Thread David Howells

 We only care when del_timer() returns true. In that case, if the timer
 function still runs (possible for single-threaded wqs), it has already
 passed __queue_work().

Why do you assume that?

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Getting the new RxRPC patches upstream

2007-04-24 Thread David Howells
Oleg Nesterov [EMAIL PROTECTED] wrote:

   We only care when del_timer() returns true. In that case, if the timer
   function still runs (possible for single-threaded wqs), it has already
   passed __queue_work().
  
  Why do you assume that?

Sorry, I should have been more clear.  I meant the assumption that we only
care about a true return from del_timer().

 If del_timer() returns true, the timer was pending. This means it was
 started by work-func() (note that __run_timers() clears timer_pending()
 before calling timer-function). This in turn means that
 delayed_work_timer_fn() has already called __queue_work(dwork), otherwise
 work-func() has no chance to run.

But if del_timer() returns 0, then there may be a problem.  We can't tell the
difference between the following two cases:

 (1) The timer hadn't been started.

 (2) The timer had been started, has expired and is no longer pending, but
 another CPU is running its handler routine.

try_to_del_timer_sync() _does_, however, distinguish between these cases: the
first is the 0 return, the second is the -1 return, and the case where it
dequeued the timer is the 1 return.

BTW, can a timer handler be preempted?  I assume not...  But it can be delayed
by interrupt processing.

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Getting the new RxRPC patches upstream

2007-04-24 Thread David Howells
Oleg Nesterov [EMAIL PROTECTED] wrote:

 Let's look at (2). cancel_delayed_work() (on top of del_timer()) returns 0,
 and this is correct, we failed to cancel the timer, and we don't know whether
 work-func() finished, or not.

Yes.

 The current code uses del_timer_sync(). It will also return 0. However, it
 will spin waiting for timer-function() to complete. So we are just wasting
 CPU.

That's my objection to using cancel_delayed_work() as it stands, although in
most cases it's a relatively minor waste of time.  However, if the timer
expiry routine gets interrupted then it may not be so minor...  So, yes, I'm
in full agreement with you there.

 I guess I misunderstood you. Perhaps, you propose a new helper which use
 try_to_del_timer_sync(), yes? Unless I missed something, this doesn't help.
 Because the return value == -1 should be treated as 0. We failed to stop
 the timer, and we can't free dwork.

Consider how I'm using try_to_cancel_delayed_work(): I use this when I want to
queue a delayed work item with a particular timeout (usually for immediate
processing), but the work item may already be pending.

If try_to_cancel_delayed_work() returns 0 or 1 (not pending or pending but
dequeued) then I can go ahead and just schedule the work item (I'll be holding
a lock to prevent anyone else from interfering).

However, if try_to_cancel_delayed_work() returns -1 then there's no usually no
point attempting to schedule the work item because I know the timer expiry
handler is doing that or going to do that.


The code looks like this in pretty much all cases:

if (try_to_cancel_delayed_work(afs_server_reaper) = 0)
schedule_delayed_work(afs_server_reaper, 0);

And so could well be packaged into a convenience routine and placed in
workqueue.[ch].  However, this would still concern Dave Miller as my patches
would still be altering non-net stuff or depending on non-net patches he
doesn't have in his GIT tree.

Using cancel_delayer_work() instead would be acceptable, functionally, as that
just waits till the -1 return case no longer holds true, and so always returns
0 or 1.


In RxRPC, this is only used to cancel a pair global delayed work items in the
rmmod path, and so the inefficiency of cancel_delayed_work() is something I
can live with, though it's something I'd want to reduce in the longer term.

In AFS, this is not only used in object destruction paths, but is also used to
cancel the callback timer and initiate synchronisation processing with
immediate effect.

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Getting the new RxRPC patches upstream

2007-04-24 Thread David Howells
Oleg Nesterov [EMAIL PROTECTED] wrote:

   The current code uses del_timer_sync(). It will also return 0. However,
   it will spin waiting for timer-function() to complete. So we are just
   wasting CPU.
  
  That's my objection to using cancel_delayed_work() as it stands, although in
  most cases it's a relatively minor waste of time.  However, if the timer
  expiry routine gets interrupted then it may not be so minor...  So, yes, I'm
  in full agreement with you there.
 
 Great. I'll send the s/del_timer_sync/del_timer/ patch.

I didn't say I necessarily agreed that this was a good idea.  I just meant that
I agree that it will waste CPU.  You must still audit all uses of
cancel_delayed_work().

 Aha, now I see what you mean. However. Why the code above is better then
 
   cancel_delayed_work(afs_server_reaper);
   schedule_delayed_work(afs_server_reaper, 0);
 
 ? (I assume we already changed cancel_delayed_work() to use del_timer).

Because calling schedule_delayed_work() is a waste of CPU if the timer expiry
handler is currently running at this time as *that* is going to also schedule
the delayed work item.

 If delayed_work_timer_fn() is not running - both variants (let's denote them
 as 1 and 2) do the same.

Yes, but that's not the point.

 Now suppose that delayed_work_timer_fn() is running.
 
   1: lock_timer_base(), return -1, skip schedule_delayed_work().

   2: check timer_pending(), return 0, call schedule_delayed_work(),
  return immediately because test_and_set_bit(WORK_STRUCT_PENDING)
  fails.

I don't see what you're illustrating here.  Are these meant to be two steps in
a single process?  Or are they two alternate steps?

 So I still don't think try_to_del_timer_sync() can help in this particular
 case.

It permits us to avoid the test_and_set_bit() under some circumstances.

 To some extent, try_to_cancel_delayed_work is
 
   int try_to_cancel_delayed_work(dwork)
   {
   ret = cancel_delayed_work(dwork);
   if (!ret  work_pending(dwork-work))
   ret = -1;
   return ret;
   }
 
 iow, work_pending() looks like a more precise indication that work-func()
 is going to run soon.

Ah, but the timer routine may try to set the work item pending flag *after* the
work_pending() check you have here.  Furthermore, it would be better to avoid
the work_pending() check entirely because that check involves interacting with
atomic ops done on other CPUs.  try_to_del_timer_sync() returning -1 tells us
without a shadow of a doubt that the work item is either scheduled now or will
be scheduled very shortly, thus allowing us to avoid having to do it ourself.

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Getting the new RxRPC patches upstream

2007-04-24 Thread David Howells
Oleg Nesterov [EMAIL PROTECTED] wrote:

 Sure, I'll grep for cancel_delayed_work(). But unless I missed something,
 this change should be completely transparent for all users. Otherwise, it
 is buggy.

I guess you will have to make sure that cancel_delayed_work() is always
followed by a flush of the workqueue, otherwise you might get this situation:

CPU 0   CPU 1
=== ===
timer expires
cancel_delayed_work(x) == 0 --delayed_work_timer_fn(x)
kfree(x);   --do_IRQ()
y = kmalloc(); // reuses x
--do_IRQ()
__queue_work(x)
--- OOPS ---

That's my main concern.  If you are certain that can't happen, then fair
enough.

Note that although you can call cancel_delayed_work() from within a work item
handler, you can't then follow it up with a flush as it's very likely to
deadlock.

  Because calling schedule_delayed_work() is a waste of CPU if the timer
  expiry handler is currently running at this time as *that* is going to
  also schedule the delayed work item.
 
 Yes. But otoh, try_to_del_timer_sync() is a waste of CPU compared to
 del_timer(), when the timer is not pending.

I suppose that's true.  As previously stated, my main objection to del_timer()
is the fact that it doesn't tell you if the timer expiry function is still
running.

Can you show me a patch illustrating exactly how you want to change
cancel_delayed_work()?  I can't remember whether you've done so already, but
if you have, I can't find it.  Is it basically this?:

 static inline int cancel_delayed_work(struct delayed_work *work)
 {
int ret;

-   ret = del_timer_sync(work-timer);
+   ret = del_timer(work-timer);
if (ret)
work_release(work-work);
return ret;
 }

I was thinking this situation might be a problem:

CPU 0   CPU 1
=== ===
timer expires
cancel_delayed_work(x) == 0 --delayed_work_timer_fn(x)
schedule_delayed_work(x,0)  --do_IRQ()
keventd scheduled
x-work()
--do_IRQ()
__queue_work(x)

But it won't, will it?

  Ah, but the timer routine may try to set the work item pending flag
  *after* the work_pending() check you have here.
 
 No, delayed_work_timer_fn() doesn't set the _PENDING flag.

Good point.  I don't think that's a problem because cancel_delayed_work()
won't clear the pending flag if it didn't remove a timer.

 First, this is very unlikely event, delayed_work_timer_fn() is very fast
 unless interrupted.

Yeah, I guess so.


Okay, you've convinced me, I think - provided you consider the case I
outlinded at the top of this email.

If you give me a patch to alter cancel_delayed_work(), I'll substitute it for
mine and use that that instead.  Dave Miller will just have to live with that
patch being there:-)

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


  1   2   3   4   5   6   7   8   9   10   >