Bug#931921: clutter's autopkgtests hang when ran with a libglib2.0-0 built with gcc-9
Control: reassign -1 clutter-1.0-tests Control: severity -1 serious We were trying so hard to solve this in either gcc-9 or libglib2.0-0 that we didn't consider whether it could be a clutter bug. (It is.) On Fri, 12 Jul 2019 at 11:16:53 +0100, Iain Lane wrote: > Here's the bit of code. > > > https://sources.debian.org/src/clutter-1.0/1.26.2+dfsg-10/tests/conform/actor-offscreen-redirect.c/#L172 > > It's adding some stuff to a main loop and expecting it to finish when a > particular signal handler is called. ... > Things which make it work again > > - Building glib2.0 w/gcc-9 -O1 (and -O0) > - Building w/gcc-8 This appears to have been because building gtestutils.c with different optimizations results in different junk being left on the stack afterwards. When running the clutter test under valgrind, we get: # Start of actor tests # Start of offscreen tests ==13864== Conditional jump or move depends on uninitialised value(s) ==13864==at 0x10AD7C: actor_offscreen_redirect (actor-offscreen-redirect.c:331) ==13864==by 0x10AD7C: actor_offscreen_redirect (actor-offscreen-redirect.c:299) ==13864==by 0x492F889: clutter_test_func_wrapper (clutter-test-utils.c:138) ==13864==by 0x4B6F3BD: ??? (in /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6000.6) This is the variable 'data' here: > static void > actor_offscreen_redirect (void) > { > Data data; > > ... data.was_painted is never initialized ... > > while (!data.was_painted) > g_main_context_iteration (NULL, FALSE); > } It seems that data.was_painted was intended to be initialized to FALSE (all-zeroes), but this never actually happened. If the uninitialized value of data.was_painted happens to be nonzero, this results in basically the entire test being skipped - we never enter the main loop, and never have the opportunity for the test to hang while waiting for a paint signal that will never happen. Adding some debug code to hexdump the contents of the data struct reveals that gcc-9 -O1, or gcc-9 -O2 with -fno-tree-pre, fairly reliably fills data.was_painted with a nonzero value, so most of the test is effectively never run. gcc-9 -O2 fills it with zeroes, so the test runs. The paint signal never happens (at least in my testing) and the test hangs. The attached is probably a good starting point for someone who has some sort of understanding of Clutter to start to investigate this. smcv diff --git a/tests/conform/actor-offscreen-redirect.c b/tests/conform/actor-offscreen-redirect.c index f47af3617..44b2e43c6 100644 --- a/tests/conform/actor-offscreen-redirect.c +++ b/tests/conform/actor-offscreen-redirect.c @@ -150,25 +150,37 @@ verify_results (Data *data, g_free (pixel); } +static void +paint_handler_cb (GMainLoop *main_loop, + gpointer nil) +{ + g_debug ("in paint_handler_cb"); + g_main_loop_quit (main_loop); +} + static void verify_redraw (Data *data, int expected_paint_count) { GMainLoop *main_loop = g_main_loop_new (NULL, TRUE); guint paint_handler; + g_debug ("in verify_redraw"); + paint_handler = g_signal_connect_data (data->stage, "paint", - G_CALLBACK (g_main_loop_quit), + G_CALLBACK (paint_handler_cb), main_loop, NULL, G_CONNECT_SWAPPED | G_CONNECT_AFTER); /* Queue a redraw on the stage */ + g_debug ("queueing redraw"); clutter_actor_queue_redraw (data->stage); data->foo_actor->paint_count = 0; /* Wait for it to paint */ + g_debug ("running main loop"); g_main_loop_run (main_loop); g_signal_handler_disconnect (data->stage, paint_handler); @@ -181,6 +193,8 @@ run_verify (gpointer user_data) { Data *data = user_data; + g_debug ("in run_verify"); + group_has_overlaps = FALSE; /* By default the actor shouldn't be redirected so the redraw should @@ -298,7 +312,7 @@ run_verify (gpointer user_data) static void actor_offscreen_redirect (void) { - Data data; + Data data = {}; if (!cogl_features_available (COGL_FEATURE_OFFSCREEN)) return;
Bug#931921: clutter's autopkgtests hang when ran with a libglib2.0-0 built with gcc-9
On Tue, 20 Aug 2019 at 12:21:43 +0200, Matthias Klose wrote: > Control: tags -1 + upstream moreinfo > > now https://gcc.gnu.org/PR91491 That bug report is a bit misleading: the glib2.0 build doesn't fail (it succeeds and passes all its tests!), but it results in a GLib library binary package that makes the clutter tests fail to terminate. clutter is a library that depends on glib2.0. As Iain previously noted, this is very, very far from a minimal reproducer, but it's the only one we have. > On 02.08.19 23:27, Simon McVittie wrote: > > On Fri, 02 Aug 2019 at 19:49:20 +0100, Simon McVittie wrote: > >> If you compile test_run_seed() with -O1, and the rest of gtestutils.c > >> with -O2, the clutter test hangs. > > so building this all with -O2, and with the flags above, it both fails? The "bad" behaviour is that glib2.0 builds successfully, and the glib2.0 tests all pass, but the clutter test does not terminate and eventually fails with a timeout. The "good" behaviour is that everything succeeds. With those definitions, the results were: - Build glib2.0 with gcc-8 -O2: good - Build glib2.0 with gcc-9 -O2: bad - Build glib2.0 with gcc-9 -O1: good - Build glib2.0 with gcc-9 -O2, except test_run_seed() with -O1 via an __attribute__: good - Build glib2.0 with gcc-9 -O2, except test_run_seed() with __attribute__((optimize("no-tree-pre"))): good This was a few gcc-9 revisions ago, so I'll have to check whether subsequent gcc-9 changes have affected the result. > can you tell where exactly the test case loops? It isn't in a busy-loop or anything like that; it's in the GLib event loop (a typical general-purpose event loop based on poll(), similar to libevent/libuv/etc.), waiting for an event that is meant to terminate the event loop by waking it up via an eventfd, but that event never happens. I have no idea why changing the optimization level of test_run_seed() would affect the behaviour of the event loop much later in the program's execution, but it seems like it does. > can you outline how you did setup your test case for that? I can try to do that later. Sorry, it was quite an improvised setup, initially involving hacking the generated ninja.build to use a different compiler for different files, so I cannot easily summarize it in a couple of lines. Again, this is very far from a minimal reproducer but it's the best we have. > can you have a look at the search script in > https://gcc.gnu.org/viewcvs/gcc/branches/gcc-9-branch/gcc/dbgcnt.def?view=markup > and run 'gcc "-fdbg-cnt=treepre_insert:$lb:$ub"' in $cmd? so the whole $cmd > should rebuild gtestutils, perform the link steps and run the test. I'll try. smcv
Bug#931921: clutter's autopkgtests hang when ran with a libglib2.0-0 built with gcc-9
Control: tags -1 + upstream moreinfo now https://gcc.gnu.org/PR91491 On 02.08.19 23:27, Simon McVittie wrote: > On Fri, 02 Aug 2019 at 19:49:20 +0100, Simon McVittie wrote: >> If you compile test_run_seed() with -O1, and the rest of gtestutils.c >> with -O2, the clutter test hangs. so building this all with -O2, and with the flags above, it both fails? > Binary-searching through the extra optimizations enabled by -O2 [1] > led me to the minimal change being: if you modify test_run_seed() to add > > __attribute__((optimize("no-tree-pre"))) > > then the clutter test passes. Without that attribute it fails. > > I have no idea why. > > smcv > > [1] gcc-9 -Q -O2 --help=optimizers > O2 > gcc-9 -Q -O1 --help=optimizers > O1 > diff -u O1 O2 > can you tell where exactly the test case loops? can you outline how you did setup your test case for that? can you have a look at the search script in https://gcc.gnu.org/viewcvs/gcc/branches/gcc-9-branch/gcc/dbgcnt.def?view=markup and run 'gcc "-fdbg-cnt=treepre_insert:$lb:$ub"' in $cmd? so the whole $cmd should rebuild gtestutils, perform the link steps and run the test.
Bug#931921: clutter's autopkgtests hang when ran with a libglib2.0-0 built with gcc-9
On Fri, 02 Aug 2019 at 19:49:20 +0100, Simon McVittie wrote: > If you compile test_run_seed() with -O1, and the rest of gtestutils.c > with -O2, the clutter test hangs. Binary-searching through the extra optimizations enabled by -O2 [1] led me to the minimal change being: if you modify test_run_seed() to add __attribute__((optimize("no-tree-pre"))) then the clutter test passes. Without that attribute it fails. I have no idea why. smcv [1] gcc-9 -Q -O2 --help=optimizers > O2 gcc-9 -Q -O1 --help=optimizers > O1 diff -u O1 O2 From: Simon McVittie Date: Fri, 2 Aug 2019 20:13:39 +0100 Subject: Disable an optimization when building with gcc 9 This appears to break the clutter autopkgtest, but I don't know why. Signed-off-by: Simon McVittie Bug: https://bugs.debian.org/931921 --- glib/gtestutils.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/glib/gtestutils.c b/glib/gtestutils.c index c9bc3dd..0908686 100644 --- a/glib/gtestutils.c +++ b/glib/gtestutils.c @@ -1607,6 +1607,9 @@ void test_built_files_dir = test_argv0_dirname; } +#if defined(__GNUC__) && __GNUC__ >= 9 && !defined(__clang__) +__attribute__((optimize("no-tree-pre"))) +#endif static void test_run_seed (const gchar *rseed) {
Bug#931921: clutter's autopkgtests hang when ran with a libglib2.0-0 built with gcc-9
On Fri, 02 Aug 2019 at 11:41:05 +0100, Iain Lane wrote: > For the record: at Debconf doko suggested to me that a way to start > debugging this from the GCC end would be to produce one working and one > "broken" build of the same version of glib2.0, and then copy object > files from the broken to the working one until that too becomes broken. > That sounded long / tedious / fiddly so we didn't manage to get round to > doing it while we were both there, unfortunately. This is not straightforward in the Meson/Ninja build system, but what I could do was to hack build.ninja to compile different files with different compilers, which narrowed it down to gtestutils.c (?!). Playing with #pragma GCC optimize ("O1") and #pragma GCC optimize ("O2") I was able to narrow it down further, to this: If you compile test_run_seed() with -O1, and the rest of gtestutils.c with -O2, the clutter test hangs. (But I need to test that minimal change in a clean build to make sure that's really true and not being affected by other hacks.) smcv
Bug#931921: clutter's autopkgtests hang when ran with a libglib2.0-0 built with gcc-9
On Fri, Aug 02, 2019 at 11:26:48AM +0100, Simon McVittie wrote: > On Fri, 12 Jul 2019 at 11:16:53 +0100, Iain Lane wrote: > > I saw this on Ubuntu (9.1.0-8ubuntu1) but I've also reproduced this with > > 9.1.0-8 on sid (w/gcc-defaults from experimental to use gcc-9 by > > default). > > > > clutter's tests hang: > ... > > It's adding some stuff to a main loop and expecting it to finish when a > > particular signal handler is called. > > I can reproduce this with current unstable GLib, 2.60.6-1 (after reverting > the change that makes it explicitly build with gcc 8). > > In the non-working case, it looks as though the paint signal is not > getting emitted at all. The Clutter master clock appears to be the thing > that is meant to be scheduling GLib main-loop events in this case (and > then I got lost and couldn't work out what was wrong). Thanks for reproducing. For the record: at Debconf doko suggested to me that a way to start debugging this from the GCC end would be to produce one working and one "broken" build of the same version of glib2.0, and then copy object files from the broken to the working one until that too becomes broken. That sounded long / tedious / fiddly so we didn't manage to get round to doing it while we were both there, unfortunately. > I have been trying to reproduce this with the upstream GLib source code > so that I can try a build of GLib with ThreadSanitizer or AddressSanitizer, > but so far this has not been successful. Hmm, I thought I tried that and did manage to reproduce. Perhaps that's my memory being slightly faulty. Will try again (probably in a week after I'm back from holidays) if you don't beat me to it. Cheers, -- Iain Lane [ i...@orangesquash.org.uk ] Debian Developer [ la...@debian.org ] Ubuntu Developer [ la...@ubuntu.com ] signature.asc Description: PGP signature
Bug#931921: clutter's autopkgtests hang when ran with a libglib2.0-0 built with gcc-9
On Fri, 12 Jul 2019 at 11:16:53 +0100, Iain Lane wrote: > I saw this on Ubuntu (9.1.0-8ubuntu1) but I've also reproduced this with > 9.1.0-8 on sid (w/gcc-defaults from experimental to use gcc-9 by > default). > > clutter's tests hang: ... > It's adding some stuff to a main loop and expecting it to finish when a > particular signal handler is called. I can reproduce this with current unstable GLib, 2.60.6-1 (after reverting the change that makes it explicitly build with gcc 8). In the non-working case, it looks as though the paint signal is not getting emitted at all. The Clutter master clock appears to be the thing that is meant to be scheduling GLib main-loop events in this case (and then I got lost and couldn't work out what was wrong). > Things which make it work again > > - Building glib2.0 w/gcc-9 -O1 (and -O0) > - Building w/gcc-8 (obviously) I can confirm that a GLib built from the same source code with gcc-9 -O1 works fine. I have been trying to reproduce this with the upstream GLib source code so that I can try a build of GLib with ThreadSanitizer or AddressSanitizer, but so far this has not been successful. smcv
Bug#931921: clutter's autopkgtests hang when ran with a libglib2.0-0 built with gcc-9
Source: gcc-9 Version: 9.1.0-8 Severity: important Control: affects -1 gcc libglib2.0-0 Hiya, I saw this on Ubuntu (9.1.0-8ubuntu1) but I've also reproduced this with 9.1.0-8 on sid (w/gcc-defaults from experimental to use gcc-9 by default). clutter's tests hang like this: 0x77b316f4 in __GI___poll (fds=0x558d0420, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29 29 ../sysdeps/unix/sysv/linux/poll.c: No such file or directory. (gdb) bt #0 0x77b316f4 in __GI___poll (fds=0x558d0420, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29 #1 0x77c62c9e in g_main_context_poll (priority=, n_fds=2, fds=0x558d0420, timeout=, context=0x555b0cd0) at ../glib/gmain.c:4213 #2 g_main_context_iterate (context=0x555b0cd0, block=block@entry=1, dispatch=dispatch@entry=1, self=) at ../glib/gmain.c:3909 #3 0x77c63023 in g_main_loop_run (loop=0x55f01d10) at ../glib/gmain.c:4108 #4 0x6634 in verify_redraw (expected_paint_count=1, data=, data=) at actor-offscreen-redirect.c:172 Here's the bit of code. https://sources.debian.org/src/clutter-1.0/1.26.2+dfsg-10/tests/conform/actor-offscreen-redirect.c/#L172 It's adding some stuff to a main loop and expecting it to finish when a particular signal handler is called. There's obviously a lot of code going on here (this isn't anything like a minimal testcase, and glib2.0's own testsuite & autopkgtests pass so it's not like GMainLoop is completely broken). I found an upstream report which I suspect is the same thing: https://gitlab.gnome.org/GNOME/clutter/issues/6 Things which make it work again - Building glib2.0 w/gcc-9 -O1 (and -O0) - Building w/gcc-8 (obviously) Happy to help try to narrow this down to the optimisation that's breaking glib and/or the code in glib that's breaking the optimisation. Maybe in person at Debconf. For now I'm going to upload a glib2.0 that builds with gcc-8 explicitly. Cheers, -- Iain Lane [ i...@orangesquash.org.uk ] Debian Developer [ la...@debian.org ] Ubuntu Developer [ la...@ubuntu.com ]