On Wed, 23 Feb 2005, Bodo Eggert wrote:

linux-os <[EMAIL PROTECTED]> wrote:

You don't seem to understand. A process that's stuck in 'D' state
shows a SEVERE error, usually with a hardware driver.

Or a network filesystem mount to a no longer existing server or share.


But that's a whole different problem. That's a systemic problem of "fail-over". Network file-systems really need to interface with an intermediate virtual device that can isolate failed systems and make them look "perfect" to individual machines.

If you don't do this, then as soon as somebody trips over a
wire, your database is trashed. I'm surprised that NFS, PCNFS,
SMB, etc., actually work as well as everybody seems to
think they do. Until the architectural problem is resolved,
there are still going to be hung processes, trashed databases,
etc.

For instance,
somebody may have coded something in a critical section that will
wait forever for some bit to be set when, in fact, that bit may
never be set because of a hardware glitch. Such problems must
be found. One can't just suck some process out of the 'D' state.
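
As a rough user-space sketch of the difference (not kernel code), a wait on a status bit can be given a poll budget so that a bit that never arrives becomes a reported error instead of a hang. The name wait_for_bit(), its parameters, and the return convention are assumptions made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a kernel API: poll a status word until any bit
 * in `mask` is set, giving up after `max_polls` reads instead of waiting
 * forever. Returns 0 if the bit was seen, -1 if the budget ran out. */
int wait_for_bit(const volatile uint32_t *status, uint32_t mask,
                 unsigned int max_polls)
{
    assert(status != NULL);
    for (unsigned int i = 0; i < max_polls; i++) {
        if (*status & mask)
            return 0;       /* the hardware answered */
    }
    return -1;              /* the bit never showed up: report it */
}
```

A caller that gets -1 back can log the fault and fail the request, which is what makes the underlying glitch findable.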

But you can easily fall into one, e.g. by mounting an SMB share to ~/mnt, working until after the Windows box breaks down and trying to save the work of the last hour (which involves enumerating and stat()ing all entries in ~).


Yes. See above.

The 'D' state usually stands for 'Down' where a task
was 'down()' on a semaphore. To get out of that state,
that task (and none other) needs to execute 'up()'.
This means that whatever that task was waiting for
needs to happen or it won't call 'up()'.

Maybe the device/mountpoint causing the processes to hang can be declared dead (This is the more important part to me) and/or the syscall can be forced to fail. If it involves wasting some MB of RAM for copying all possibly affected memory in order to avoid corrupting used RAM, that will be the price to pay for not losing your data.


That's not how it's done.

How to clean up the stuck processes: (This requires an MMU)
Add an error path to each syscall (or create some generic error paths) and
keep the original stack frame. On errors, you can "longjump" (not exactly,
but similar) to the error path after copying the memory. The semaphore will
not be taken, and the code depending on the semaphore will not be executed.
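
The control flow described above can be sketched in user-space C with setjmp()/longjmp() from the standard library. This only illustrates the idea of a saved frame plus an error path; it is not how the kernel handles syscalls, it copies no memory, and guarded_call(), risky_operation() and ERR_IO are names invented for the example.

```c
#include <assert.h>
#include <setjmp.h>

#define ERR_IO 5    /* illustrative error code, chosen to mirror EIO */

static jmp_buf error_path;  /* saved context of the caller's frame */

/* Stand-in for the work that would otherwise block forever. */
static void risky_operation(int device_dead)
{
    if (device_dead)
        longjmp(error_path, 1);     /* unwind straight to the error path */
}

/* Returns 0 on success, or -ERR_IO if the operation bailed out. */
int guarded_call(int device_dead)
{
    if (setjmp(error_path) != 0)
        return -ERR_IO;     /* reached via longjmp: normal path is skipped */
    risky_operation(device_dead);
    return 0;
}
```

Here guarded_call(1) comes back with -ERR_IO without ever running the code after the failure point, while guarded_call(0) returns 0.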


Again, you are attacking the symptom. The problem could be resolved by using a local disk (or a disk file) for the immediate I/O and the I/O to the file-servers could occur whenever they are available. It's just ordinary transaction processing. Nothing new. It's just that people continue to use primitive garbage (really, usually developed by amateur hackers with no formal education) that is then specified by the likes of Microsoft and then, to be compatible, other operating systems create clones with the same kinds of unfixable bugs.
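
A minimal sketch of that store-and-forward idea, assuming an in-memory array as a stand-in for the local disk file and a plain flag in place of a real reachability check. The struct spool layout and the spool_write()/spool_flush() names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SPOOL_SLOTS 8
#define RECORD_LEN  32

/* In-memory stand-in for a local spool file (hypothetical layout). */
struct spool {
    char   records[SPOOL_SLOTS][RECORD_LEN];
    size_t pending;     /* records accepted locally, not yet forwarded */
    size_t forwarded;   /* records delivered to the file server */
};

/* Accept a write locally; never touches the network. Returns 0, or -1 if
 * the spool is full or the record is too long. */
int spool_write(struct spool *s, const char *record)
{
    assert(s != NULL && record != NULL);
    if (s->pending == SPOOL_SLOTS || strlen(record) >= RECORD_LEN)
        return -1;
    strcpy(s->records[s->pending++], record);
    return 0;
}

/* Forward pending records only while the server is reachable.
 * Returns the number of records forwarded by this call. */
size_t spool_flush(struct spool *s, int server_up)
{
    assert(s != NULL);
    if (!server_up)
        return 0;               /* keep everything queued; nobody blocks */
    size_t sent = s->pending;   /* a real system would transmit here */
    s->forwarded += sent;
    s->pending = 0;
    return sent;
}
```

The application only ever waits on spool_write(), so a dead server delays delivery rather than hanging the writer.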


BTW: Your Reply-To: should be omitted if it's equal to the From:


The problem with From: is that this machine is not "known" to the outside world, although somebody has entries in the auth02.ns.uu.net name-server that claim to be my machine, and those get cached and cloned everywhere. Mail to this system needs to go to the Reply-To: address.

Our network "experts" here have tried to track down the
bad name-server entry and they say it's not here.

All of my machine names mysteriously appear in
auth02.ns.uu.net with 204.178.40.nnn IP addresses.
This really screws up email because email tries
to verify the sender by contacting those bogus
addresses.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.10 on an i686 machine (5537.79 BogoMips).
 Notice : All mail here is now cached for review by Dictator Bush.
                 98.36% of all statistics are fiction.
