Re: 2.6.23-rc6: hanging ext3 dbench tests
Hi Andy,

I managed to reproduce the dbench problem (not sure whether it's the same thing, but the symptoms are the same). My problem has nothing to do with ext3; I can reproduce it on ext2 and jfs as well.

What's happening on my machine is this: dbench forks off 4 children and sends them a signal to start the work. 3 of the 4 children get the signal and do the work. One child never gets the signal, so it waits forever in pause(), and the parent waits a long time before killing it.

BTW, I was trying to find out when this problem started showing up. So far I have tracked it to 2.6.23-rc4 (2.6.23-rc3 doesn't seem to have this problem). I am going to bisect to find out which patch caused it.

I am using dbench-2.0, which consistently reproduces the problem on my x86-64 box.

Did you find anything new with your setup?

Thanks,
Badari

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.23-rc6: hanging ext3 dbench tests
On Mon, 24 Sep 2007, Badari Pulavarty wrote:
>
> Whats happening on my machine is .. dbench forks of 4 children and
> sends them a signal to start the work. 3 out of 4 children gets the
> signal and does the work. One of the child never gets the signal so,
> it waits forever in pause(). So, parent waits for a longtime to kill
> it.

Since this *seems* to have nothing to do with the filesystem, and since it *seems* to have been introduced between -rc3 and -rc4, I did

	gitk v2.6.23-rc3..v2.6.23-rc4 -- kernel/

to see what has changed. One of the commits was signal-related, and that one doesn't look like it could possibly matter.

The rest were scheduler-related, which doesn't surprise me. In fact, even before I looked, my reaction to your bug report was "That sounds like an application race condition".

Applications shouldn't use pause() for waiting for a signal. It's a fundamentally racy interface - the signal could have happened just *before* calling pause. So it's almost always a bug to use pause(), and any users should be fixed to use sigsuspend() instead, which can atomically (and correctly) pause for a signal while the process has masked it outside of the system call.

Now, I took a look at the dbench sources, and I have to say that the race looks *very* unlikely (there's quite a small window in which it does

	children[i].status = getpid();
	** race window here **
	pause();

and it would require *just* the right timing so that the parent doesn't end up doing the sleep(1) (which would make the window even less likely to be hit), but there does seem to be a race condition there. And it *could* be that you just happen to hit it on your hw setup.

So before you do anything else, does this patch (TOTALLY UNTESTED! DONE ENTIRELY LOOKING AT THE SOURCE! IT MAY RAPE ALL YOUR PETS, AND CALL YOU BAD NAMES!) make any difference?
(patch against unmodified dbench-2.0)

		Linus

---
diff --git a/dbench.c b/dbench.c
index ccf5624..4be5712 100644
--- a/dbench.c
+++ b/dbench.c
@@ -91,10 +91,15 @@ static double create_procs(int nprocs, void (*fn)(struct child_struct * ))
 
 	for (i=0;i<nprocs;i++) {
 		if (fork() == 0) {
+			sigset_t old, blocked;
+
+			sigemptyset(&blocked);
+			sigaddset(&blocked, SIGCONT);
+			sigprocmask(SIG_BLOCK, &blocked, &old);
 			setbuffer(stdout, NULL, 0);
 			nb_setup(&children[i]);
 			children[i].status = getpid();
-			pause();
+			sigsuspend(&old);
 			fn(&children[i]);
 			_exit(0);
 		}
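The block-then-sigsuspend() pattern from the patch can be tried outside dbench. What follows is only a minimal sketch under stated assumptions, not dbench code: the helper name start_signal_demo(), the no-op handler, the pipe used to announce readiness (standing in for the write to children[i].status), and the 100 ms delay that deliberately widens the race window are all hypothetical. It assumes SIGCONT is not already blocked in the caller. The parent sends SIGCONT while the child is still "busy"; because the signal is blocked it stays pending, and sigsuspend() then returns immediately instead of sleeping forever as pause() would.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* No-op handler: sigsuspend() returns only after a handler has run. */
static void on_start(int sig) { (void)sig; }

/*
 * Hypothetical helper (not part of dbench). Returns 0 if the child
 * noticed a start signal that was sent BEFORE it began waiting.
 */
int start_signal_demo(void)
{
	struct sigaction sa, old_sa;
	int ready[2], status;
	char byte = 0;
	pid_t pid;

	if (pipe(ready) < 0)
		return -1;
	sa.sa_handler = on_start;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;
	sigaction(SIGCONT, &sa, &old_sa);	/* inherited by the child */

	pid = fork();
	if (pid == 0) {
		sigset_t old, blocked;
		struct timespec delay = { 0, 100 * 1000 * 1000 }; /* 100 ms */

		alarm(5);	/* safety net: die rather than hang forever */
		close(ready[0]);

		/* Block SIGCONT so that an early signal stays pending. */
		sigemptyset(&blocked);
		sigaddset(&blocked, SIGCONT);
		sigprocmask(SIG_BLOCK, &blocked, &old);

		/* Announce readiness (dbench sets children[i].status here). */
		if (write(ready[1], &byte, 1) != 1)
			_exit(2);

		/* Simulated race window: the parent signals during this. */
		nanosleep(&delay, NULL);

		/* Atomically restore the old mask and wait for the signal. */
		sigsuspend(&old);
		_exit(errno == EINTR ? 0 : 1);
	}

	close(ready[1]);
	if (pid > 0) {
		/* Signal as soon as the child says it is ready. */
		if (read(ready[0], &byte, 1) == 1)
			kill(pid, SIGCONT);
		else
			kill(pid, SIGKILL);
	}
	close(ready[0]);
	sigaction(SIGCONT, &old_sa, NULL);	/* restore the caller's handler */

	if (pid < 0 || waitpid(pid, &status, 0) < 0)
		return -1;
	return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Calling start_signal_demo() from a small main() should return 0. Replacing the sigprocmask() and sigsuspend() calls with a bare pause() would be expected to leave the child asleep until the alarm(5) safety net kills it, which is the hang described in this thread.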
2.6.23-rc6: hanging ext3 dbench tests
I have a couple of failed test runs against 2.6.23-rc6 where the job timed out while running dbench over ext3. Both were on powerpc, though on significantly different hardware setups. A failed run like this implies that the machine was still responsive to other processes but that dbench was making no progress. There were no console diagnostics during the failure.

beavis was lost during a plain ext3 dbench run, having just successfully completed an ext2 run. elm3b19 was lost during an ext3 data=writeback dbench run, having already completed plain ext2 and ext3 runs.

A quick poke at the dbench logs on the second machine shows this for the working ext3 dbench run:

	4 clients started
	   4     35288   814.49 MB/sec
	   0     62477   822.99 MB/sec
	Throughput 822.954 MB/sec 4 procs

Whereas the hanging run shows the following, continuing until the machine is reset, which confirms that the machine as a whole was still with us:

	4 clients started
	   4     36479   824.92 MB/sec
	   1     46857   519.98 MB/sec
	   1     46857   346.65 MB/sec
	   1     46857   259.99 MB/sec
	   1     46857   207.99 MB/sec
	   1     46857   173.32 MB/sec
	   1     46857   148.56 MB/sec
	   1     46857   129.99 MB/sec
	   1     46857   115.55 MB/sec
	   1     46857   103.99 MB/sec
	   1     46857   94.54 MB/sec
	   1     46857   86.66 MB/sec
	   1     46857   80.00 MB/sec
	[...]

The first machine is very similar:

	4 clients started
	   4     18468   445.29 MB/sec
	   4     41945   469.36 MB/sec
	   1     46857   346.68 MB/sec
	   1     46857   260.00 MB/sec
	   1     46857   208.00 MB/sec
	[...]

Not sure if there is any significance to the 46857, though it feels like we may be at the end of the run when it fails.

I will try to reproduce this on one of the machines and see if I can get any further info.

-apw
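The decaying MB/sec figures in the hanging run can be checked with a little arithmetic. The sketch below is an illustration under assumptions that the logs do not state: that each progress line is printed at evenly spaced ticks, that the reported figure is total bytes divided by elapsed ticks, and that the first stalled sample (519.98) falls on tick 2. The second assumption is suggested by the ratio 519.98 / 346.65, which is almost exactly 3/2. The function name total_spread() is hypothetical.

```c
#include <stddef.h>

/* MB/sec figures copied from the hanging elm3b19 log (count stuck at 46857). */
static const double stalled_mbps[] = {
	519.98, 346.65, 259.99, 207.99, 173.32, 148.56,
	129.99, 115.55, 103.99,  94.54,  86.66,  80.00,
};

/*
 * Hypothetical helper: treat sample i as taken at tick (first_tick + i),
 * compute the implied total mbps[i] * tick for every sample, and return
 * the spread (max - min) of those totals. A spread near zero means the
 * byte total is frozen while only the elapsed time keeps growing.
 */
double total_spread(const double *mbps, size_t n, int first_tick)
{
	double lo = 0.0, hi = 0.0;
	size_t i;

	for (i = 0; i < n; i++) {
		double total = mbps[i] * (double)(first_tick + (int)i);

		if (i == 0 || total < lo)
			lo = total;
		if (i == 0 || total > hi)
			hi = total;
	}
	return hi - lo;
}
```

With first_tick set to 2, every implied total lands within roughly 0.1 of 1040, which is consistent with no further data being transferred while the clock keeps running. The three stalled samples from beavis (346.68, 260.00, 208.00) give about 1040 as well when placed on ticks 3, 4 and 5.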
Re: 2.6.23-rc6: hanging ext3 dbench tests
Annoyingly this seems to be intermittent, and I have not managed to get a machine into this state again yet. Will keep trying.

-apw