Re: 2.6.23-rc6: hanging ext3 dbench tests

2007-09-24 Thread Badari Pulavarty
Hi Andy,

I managed to reproduce the dbench problem (not sure whether it's the
same thing, but the symptoms are the same). My problem has nothing to do
with ext3; I can reproduce it on ext2 and jfs as well.

What's happening on my machine is:

dbench forks off 4 children and sends them a signal to start the work.
3 out of 4 children get the signal and do the work. One of the children
never gets the signal, so it waits forever in pause(). So the parent waits
a long time before killing it.

BTW, I was trying to find out when this problem started showing up.
So far, I managed to track it to 2.6.23-rc4. (2.6.23-rc3 doesn't seem
to have this problem). I am going to bisect to find out which
patch caused this.

I am using dbench-2.0, which consistently reproduces the problem on
my x86-64 box. Did you find anything new with your setup?

Thanks,
Badari



-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.23-rc6: hanging ext3 dbench tests

2007-09-24 Thread Linus Torvalds


On Mon, 24 Sep 2007, Badari Pulavarty wrote:
 
> What's happening on my machine is:
> 
> dbench forks off 4 children and sends them a signal to start the work.
> 3 out of 4 children get the signal and do the work. One of the children
> never gets the signal, so it waits forever in pause(). So the parent waits
> a long time before killing it.

Since this *seems* to have nothing to do with the filesystem, and since it 
*seems* to have been introduced between -rc3 and -rc4, I did

gitk v2.6.23-rc3..v2.6.23-rc4 -- kernel/

to see what has changed. One of the commits was signal-related, and that 
one doesn't look like it could possibly matter.

The rest were scheduler-related, which doesn't surprise me. In fact, even 
before I looked, my reaction to your bug report was "That sounds like an 
application race condition".

Applications shouldn't use pause() for waiting for a signal. It's a 
fundamentally racy interface - the signal could have happened just 
*before* calling pause. So it's almost always a bug to use pause(), and 
any users should be fixed to use sigsuspend() instead, which can 
atomically (and correctly) pause for a signal while the process has masked 
it outside of the system call.

Now, I took a look at the dbench sources, and I have to say that the race 
looks *very* unlikely: there's quite a small window in which it does

children[i].status = getpid();
** race window here **
pause();

and it would require *just* the right timing so that the parent doesn't 
end up doing the sleep(1) (which would make the window even less likely 
to be hit), but there does seem to be a race condition there. And it 
*could* be that you just happen to hit it on your hw setup.

So before you do anything else, does this patch (TOTALLY UNTESTED! DONE 
ENTIRELY LOOKING AT THE SOURCE! IT MAY RAPE ALL YOUR PETS, AND CALL YOU 
BAD NAMES!) make any difference?

(patch against unmodified dbench-2.0)

Linus

---
diff --git a/dbench.c b/dbench.c
index ccf5624..4be5712 100644
--- a/dbench.c
+++ b/dbench.c
@@ -91,10 +91,15 @@ static double create_procs(int nprocs, void (*fn)(struct child_struct * ))
 
 	for (i=0;i<nprocs;i++) {
 		if (fork() == 0) {
+			sigset_t old, blocked;
+
+			sigemptyset(&blocked);
+			sigaddset(&blocked, SIGCONT);
+			sigprocmask(SIG_BLOCK, &blocked, &old);
 			setbuffer(stdout, NULL, 0);
 			nb_setup(&children[i]);
 			children[i].status = getpid();
-			pause();
+			sigsuspend(&old);
 			fn(&children[i]);
 			_exit(0);
 		}


2.6.23-rc6: hanging ext3 dbench tests

2007-09-11 Thread Andy Whitcroft
I have a couple of failed test runs against 2.6.23-rc6 where the
job timed out while running dbench over ext3.  Both were on powerpc,
though on significantly different hardware setups.  A failed
run like this implies that the machine was still responsive to
other processes but the dbench was making no progress.  There were
no console diagnostics during the failure.

beavis was lost during a plain ext3 dbench run, having just
successfully completed an ext2 run.  elm3b19 was lost during an
ext3 data=writeback dbench run, having already completed plain
ext2 and ext3 runs.

A quick poke at the dbench logs on the second machine shows this
for the working ext3 dbench run:

  4 clients started
  4 35288  814.49 MB/sec
  0 62477  822.99 MB/sec
  Throughput 822.954 MB/sec 4 procs

Whereas the hanging run shows the following continuing until the
machine is reset, which confirms that the machine as a whole was
still with us:

  4 clients started
  4 36479  824.92 MB/sec
  1 46857  519.98 MB/sec
  1 46857  346.65 MB/sec
  1 46857  259.99 MB/sec
  1 46857  207.99 MB/sec
  1 46857  173.32 MB/sec
  1 46857  148.56 MB/sec
  1 46857  129.99 MB/sec
  1 46857  115.55 MB/sec
  1 46857  103.99 MB/sec
  1 46857  94.54 MB/sec
  1 46857  86.66 MB/sec
  1 46857  80.00 MB/sec
  [...]

The first machine is very similar:

  4 clients started
  4 18468  445.29 MB/sec
  4 41945  469.36 MB/sec
  1 46857  346.68 MB/sec
  1 46857  260.00 MB/sec
  1 46857  208.00 MB/sec
  [...]

Not sure if there is any significance to the 46857, though it feels
like we may be at the end of the run when it fails.

I will try and reproduce this on one of the machines and see if I
can get any further info.

-apw


Re: 2.6.23-rc6: hanging ext3 dbench tests

2007-09-11 Thread Andy Whitcroft
Annoyingly this seems to be intermittent, and I have not managed to get
a machine into this state again yet.  Will keep trying.

-apw