Re: parallel test failures
Tomi Ollila writes:

> So, AFAIU, you got 124 since timeout(1) exited with that status (and
> killed all parallel(1) executions, after 2 minutes in that case?)...
> ... and when you set NOTMUCH_TEST_TIMEOUT=0 then timeout(1) was not
> executed and a test hung (probably T355-smime).

That sounds right.

> If you manage to get it into a hung state again (without using
> timeout(1) to mess around), you can probably peek at things with ps,
> /proc, strace, gdb, or some other (potentially more sophisticated ;)
> tools.

In fact it looks like I already reported this issue (or a different
issue causing T355 to hang, which seems less likely) at
id:87h7pxiek3@tethera.net

Past me seems to have thought it was some kind of gpgsm failure. I would
welcome input from people who use or understand gpgsm.

d
___
notmuch mailing list -- notmuch@notmuchmail.org
To unsubscribe send an email to notmuch-le...@notmuchmail.org
Re: parallel test failures
On Fri, Feb 26 2021, David Bremner wrote:

> David Bremner writes:
>
>>
>> Thanks to both of you for your feedback / suggestions. I did read today
>> that timeout exits with 124 when the time limit is reached. I haven't
>> investigated further (nor do I know how the timelimit should be reached,
>> since the whole build+test cycle takes about 10s on this machine).
>
> Maybe a timeout is not so crazy. I ran a couple of trials with
> NOTMUCH_TEST_TIMEOUT=0, and it eventually hung (after 6, and 110
> repetitions) in T355-smime, as far as I can tell on the first test.
> I'm currently running some trials to see if I can duplicate that without
> parallel execution, but that of course takes longer.

So, AFAIU, you got 124 since timeout(1) exited with that status (and
killed all parallel(1) executions, after 2 minutes in that case?)...

... and when you set NOTMUCH_TEST_TIMEOUT=0 then timeout(1) was not
executed and a test hung (probably T355-smime).

If you manage to get it into a hung state again (without using
timeout(1) to mess around), you can probably peek at things with ps,
/proc, strace, gdb, or some other (potentially more sophisticated ;)
tools.

>
> d

Tomi
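For readers who want to see the exit status discussed above, here is a minimal sketch. It assumes GNU coreutils timeout(1); the `sleep 5` command is only a stand-in for a hung test and is not taken from the notmuch test suite.

```shell
# Assumes GNU coreutils timeout(1). "sleep 5" stands in for a hung test.

# The command outlives the 1 second limit, so timeout kills it and
# exits with status 124.
status=0
timeout 1 sleep 5 || status=$?
echo "timed-out command: exit status $status"

# The command finishes within the limit, so timeout passes the
# command's own exit status through (0 here).
ok_status=0
timeout 5 true || ok_status=$?
echo "finished command: exit status $ok_status"
```

A status of 124 from the whole run is therefore consistent with the time limit being hit rather than with an individual test reporting failure.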
Re: parallel test failures
David Bremner writes:

> Tomi Ollila writes:
>
>>
>> Anyway, the log.gz did not show any tests failing but parallel exiting
>> nonzero possibly for some other reason. Cannot say. Probably stracing (even
>> with --seccomp-bpf) would make it happen even less likely :/
>>
>
> Thanks to both of you for your feedback / suggestions. I did read today
> that timeout exits with 124 when the time limit is reached. I haven't
> investigated further (nor do I know how the timelimit should be reached,
> since the whole build+test cycle takes about 10s on this machine).

Maybe a timeout is not so crazy. I ran a couple of trials with
NOTMUCH_TEST_TIMEOUT=0, and it eventually hung (after 6, and 110
repetitions) in T355-smime, as far as I can tell on the first test.
I'm currently running some trials to see if I can duplicate that without
parallel execution, but that of course takes longer.

d
Re: parallel test failures
Tomi Ollila writes:

>
> Anyway, the log.gz did not show any tests failing but parallel exiting
> nonzero possibly for some other reason. Cannot say. Probably stracing (even
> with --seccomp-bpf) would make it happen even less likely :/
>

Thanks to both of you for your feedback / suggestions. I did read today
that timeout exits with 124 when the time limit is reached. I haven't
investigated further (nor do I know how the timelimit should be reached,
since the whole build+test cycle takes about 10s on this machine).

d
Re: parallel test failures
On Fri, Feb 19 2021, David Bremner wrote:

> I have intermittent failures when running the test suite on sufficiently
> parallel machines. I have attached a log of such a failing build,
> although it does not seem especially illuminating.
>
> It takes anywhere from 5 to 300 runs to get a failure for me running on
> 60 hardware threads (30 cores). At least on this machine the number of
> tests that pass seems consistent at 1205.

I made the following changes to see file write accesses:

diff --git a/test/notmuch-test b/test/notmuch-test
index b58fd3b3..903a5dff 100755
--- a/test/notmuch-test
+++ b/test/notmuch-test
@@ -62,13 +62,16 @@ if test -z "$NOTMUCH_TEST_SERIALIZE" && command -v parallel >/dev/null ; then
 META_FAILURE="parallel test suite returned error code $RES"
 fi
 else
+rm -rf inw; mkdir inw
 for test in $TESTS; do
+testname=$(basename $test .sh)
+inotifywait -d --outfile $PWD/inw/inw-$testname -r -e close_write,delete $PWD/test /tmp
 $TEST_TIMEOUT_CMD $test "$@" &
 wait $!
+pkill inotifywa
 # If the test failed without producing results, then it aborted,
 # so we should abort, too.
 RES=$?
-testname=$(basename $test .sh)
 if [[ $RES != 0 && ! -e "$NOTMUCH_BUILDDIR/test/test-results/$testname" ]]; then
 META_FAILURE="Aborting on $testname (returned $RES)"
 break

Then I ran the tests with NOTMUCH_TEST_SERIALIZE=t, and then ran

for f in inw/*; do echo $f; sed -e 's,.*notmuch/test/, ,' -e '/tmp.T/ s,/.*,,' $f | sort -u; echo; done | less

to examine the "fallout". Based on that (random gazes at the listing) I
did not see any potentially overlapping writes, but saw unrelated
inconsistency in test directories.

Anyway, the log.gz did not show any tests failing but parallel exiting
nonzero possibly for some other reason. Cannot say. Probably stracing (even
with --seccomp-bpf) would make it happen even less likely :/

Tomi
Re: parallel test failures
I did not look at the logs, but I have had similar problems in other
scenarios. The way I debugged them was to use strace to get a list of
all files the tests accessed. From that list I could recognize that some
files that should have been in separate temp directories were not
thread-specific, and the solution was to put the temp files in a
separate directory for each test.

Not sure if this is helpful, but wanted to share.

Kind regards and best of luck,
Xu

On Fri, Feb 19, 2021 at 7:24 AM David Bremner wrote:
>
>
> I have intermittent failures when running the test suite on sufficiently
> parallel machines. I have attached a log of such a failing build,
> although it does not seem especially illuminating.
>
> It takes anywhere from 5 to 300 runs to get a failure for me running on
> 60 hardware threads (30 cores). At least on this machine the number of
> tests that pass seems consistent at 1205.
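The "recognize shared files" step described above could be sketched roughly as follows. This is not Xu's actual procedure. It assumes the per-test path lists already exist as plain text files, one path per line and one file per test, in a single directory (for instance extracted from `strace -f -e trace=file -o FILE` logs). The function name `shared_paths` and the directory layout are hypothetical.

```shell
# shared_paths DIR
# Hypothetical helper: DIR holds one file per test, each listing the
# paths that test touched (one path per line). Prints every path that
# appears in the lists of two or more tests.
shared_paths() {
    for list in "$1"/*; do
        [ -f "$list" ] || continue
        # Collapse repeats within a single test first, so that a path
        # only counts once per test.
        sort -u "$list"
    done | sort | uniq -d
}
```

Any path printed is a candidate for a file that should live in a per-test temporary directory but does not. Paths that are legitimately shared, such as system libraries, would need to be filtered out by hand.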
parallel test failures
I have intermittent failures when running the test suite on sufficiently
parallel machines. I have attached a log of such a failing build,
although it does not seem especially illuminating.

It takes anywhere from 5 to 300 runs to get a failure for me running on
60 hardware threads (30 cores). At least on this machine the number of
tests that pass seems consistent at 1205.

Attachment: log.xz (application/xz)
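Since the failure only shows up after many runs, a small loop that repeats a command until it fails can help when reproducing it. The following is a minimal sketch, not the script used for the runs reported above. The function name `repeat_until_fail` is hypothetical, and `make test` in the usage comment is an assumption about how the suite is invoked.

```shell
# repeat_until_fail MAX_RUNS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND up to MAX_RUNS times and stops at
# the first non-zero exit status, reporting which run failed.
repeat_until_fail() {
    max_runs=$1
    shift
    run=1
    while [ "$run" -le "$max_runs" ]; do
        "$@" || {
            # $? here is still the status of the failed command.
            status=$?
            echo "run $run failed with status $status"
            return "$status"
        }
        run=$((run + 1))
    done
    echo "no failure in $max_runs runs"
    return 0
}

# Example usage (assumed invocation, not executed here):
#   repeat_until_fail 300 make test
```

Reporting the exit status along with the run number makes it easier to tell a test failure apart from a wrapper such as timeout(1) giving up.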