Bug#931921: clutter's autopkgtests hang when ran with a libglib2.0-0 built with gcc-9

2019-08-20 Thread Simon McVittie
Control: reassign -1 clutter-1.0-tests
Control: severity -1 serious

We were trying so hard to solve this in either gcc-9 or libglib2.0-0
that we didn't consider whether it could be a clutter bug. (It is.)

On Fri, 12 Jul 2019 at 11:16:53 +0100, Iain Lane wrote:
> Here's the bit of code.
> 
>   
> https://sources.debian.org/src/clutter-1.0/1.26.2+dfsg-10/tests/conform/actor-offscreen-redirect.c/#L172
> 
> It's adding some stuff to a main loop and expecting it to finish when a
> particular signal handler is called.
...
> Things which make it work again
> 
>   - Building glib2.0 w/gcc-9 -O1 (and -O0)
>   - Building w/gcc-8

This appears to have been because building gtestutils.c with different
optimizations results in different junk being left on the stack afterwards.
When running the clutter test under valgrind, we get:

# Start of actor tests
# Start of offscreen tests
==13864== Conditional jump or move depends on uninitialised value(s)
==13864==at 0x10AD7C: actor_offscreen_redirect 
(actor-offscreen-redirect.c:331)
==13864==by 0x10AD7C: actor_offscreen_redirect 
(actor-offscreen-redirect.c:299)
==13864==by 0x492F889: clutter_test_func_wrapper (clutter-test-utils.c:138)
==13864==by 0x4B6F3BD: ??? (in 
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6000.6)

This is the variable 'data' here:

> static void
> actor_offscreen_redirect (void)
> {
>   Data data;
>
>   ... data.was_painted is never initialized ...
>
>   while (!data.was_painted)
> g_main_context_iteration (NULL, FALSE);
> }

It seems that data.was_painted was intended to be initialized to FALSE
(all-zeroes), but this never actually happened.

If the uninitialized value of data.was_painted happens to be nonzero,
this results in basically the entire test being skipped - we never enter
the main loop, and never have the opportunity for the test to hang while
waiting for a paint signal that will never happen.

Adding some debug code to hexdump the contents of the data struct reveals
that gcc-9 -O1, or gcc-9 -O2 with -fno-tree-pre, fairly reliably fills
data.was_painted with a nonzero value, so most of the test is effectively
never run. gcc-9 -O2 fills it with zeroes, so the test runs. The paint
signal never happens (at least in my testing) and the test hangs.

The attached is probably a good starting point for someone who has some
sort of understanding of Clutter to start to investigate this.

smcv
diff --git a/tests/conform/actor-offscreen-redirect.c b/tests/conform/actor-offscreen-redirect.c
index f47af3617..44b2e43c6 100644
--- a/tests/conform/actor-offscreen-redirect.c
+++ b/tests/conform/actor-offscreen-redirect.c
@@ -150,25 +150,37 @@ verify_results (Data *data,
   g_free (pixel);
 }
 
+static void
+paint_handler_cb (GMainLoop *main_loop,
+  gpointer nil)
+{
+  g_debug ("in paint_handler_cb");
+  g_main_loop_quit (main_loop);
+}
+
 static void
 verify_redraw (Data *data, int expected_paint_count)
 {
   GMainLoop *main_loop = g_main_loop_new (NULL, TRUE);
   guint paint_handler;
 
+  g_debug ("in verify_redraw");
+
   paint_handler = g_signal_connect_data (data->stage,
  "paint",
- G_CALLBACK (g_main_loop_quit),
+ G_CALLBACK (paint_handler_cb),
  main_loop,
  NULL,
  G_CONNECT_SWAPPED | G_CONNECT_AFTER);
 
   /* Queue a redraw on the stage */
+  g_debug ("queueing redraw");
   clutter_actor_queue_redraw (data->stage);
 
   data->foo_actor->paint_count = 0;
 
   /* Wait for it to paint */
+  g_debug ("running main loop");
   g_main_loop_run (main_loop);
 
   g_signal_handler_disconnect (data->stage, paint_handler);
@@ -181,6 +193,8 @@ run_verify (gpointer user_data)
 {
   Data *data = user_data;
 
+  g_debug ("in run_verify");
+
   group_has_overlaps = FALSE;
 
   /* By default the actor shouldn't be redirected so the redraw should
@@ -298,7 +312,7 @@ run_verify (gpointer user_data)
 static void
 actor_offscreen_redirect (void)
 {
-  Data data;
+  Data data = {};
 
   if (!cogl_features_available (COGL_FEATURE_OFFSCREEN))
 return;


Bug#931921: clutter's autopkgtests hang when ran with a libglib2.0-0 built with gcc-9

2019-08-20 Thread Simon McVittie
On Tue, 20 Aug 2019 at 12:21:43 +0200, Matthias Klose wrote:
> Control: tags -1 + upstream moreinfo
> 
> now https://gcc.gnu.org/PR91491

That bug report is a bit misleading: the glib2.0 build doesn't fail
(it succeeds and passes all its tests!), but it results in a GLib library
binary package that makes the clutter tests fail to terminate. clutter is
a library that depends on glib2.0.

As Iain previously noted, this is very, very far from a minimal reproducer,
but it's the only one we have.

> On 02.08.19 23:27, Simon McVittie wrote:
> > On Fri, 02 Aug 2019 at 19:49:20 +0100, Simon McVittie wrote:
> >> If you compile test_run_seed() with -O1, and the rest of gtestutils.c
> >> with -O2, the clutter test hangs.
> 
> so building this all with -O2, and with the flags above, it both fails?

The "bad" behaviour is that glib2.0 builds successfully, and the glib2.0
tests all pass, but the clutter test does not terminate and eventually
fails with a timeout. The "good" behaviour is that everything succeeds.

With those definitions, the results were:

- Build glib2.0 with gcc-8 -O2: good
- Build glib2.0 with gcc-9 -O2: bad
- Build glib2.0 with gcc-9 -O1: good
- Build glib2.0 with gcc-9 -O2, except test_run_seed() with -O1 via
  an __attribute__: good
- Build glib2.0 with gcc-9 -O2, except test_run_seed()
  with __attribute__((optimize("no-tree-pre"))): good

This was a few gcc-9 revisions ago, so I'll have to check whether subsequent
gcc-9 changes have affected the result.

> can you tell where exactly the test case loops?

It isn't in a busy-loop or anything like that; it's in the GLib event
loop (a typical general-purpose event loop based on poll(), similar to
libevent/libuv/etc.), waiting for an event that is meant to terminate the
event loop by waking it up via an eventfd, but that event never happens.

I have no idea why changing the optimization level of test_run_seed()
would affect the behaviour of the event loop much later in the program's
execution, but it seems like it does.

> can you outline how you did setup your test case for that?

I can try to do that later. Sorry, it was quite an improvised setup,
initially involving hacking the generated ninja.build to use a different
compiler for different files, so I cannot easily summarize it in a couple
of lines.

Again, this is very far from a minimal reproducer but it's the best
we have.

> can you have a look at the search script in
> https://gcc.gnu.org/viewcvs/gcc/branches/gcc-9-branch/gcc/dbgcnt.def?view=markup
> and run 'gcc "-fdbg-cnt=treepre_insert:$lb:$ub"' in $cmd? so the whole $cmd
> should rebuild gtestutils, perform the link steps and run the test.

I'll try.

smcv



Bug#931921: clutter's autopkgtests hang when ran with a libglib2.0-0 built with gcc-9

2019-08-20 Thread Matthias Klose
Control: tags -1 + upstream moreinfo

now https://gcc.gnu.org/PR91491

On 02.08.19 23:27, Simon McVittie wrote:
> On Fri, 02 Aug 2019 at 19:49:20 +0100, Simon McVittie wrote:
>> If you compile test_run_seed() with -O1, and the rest of gtestutils.c
>> with -O2, the clutter test hangs.

so building this all with -O2, and with the flags above, it both fails?

> Binary-searching through the extra optimizations enabled by -O2 [1]
> led me to the minimal change being: if you modify test_run_seed() to add
> 
> __attribute__((optimize("no-tree-pre")))
> 
> then the clutter test passes. Without that attribute it fails.
> 
> I have no idea why.
> 
> smcv
> 
> [1] gcc-9 -Q -O2 --help=optimizers > O2
> gcc-9 -Q -O1 --help=optimizers > O1
> diff -u O1 O2
> 

can you tell where exactly the test case loops?

can you outline how you did setup your test case for that?

can you have a look at the search script in
https://gcc.gnu.org/viewcvs/gcc/branches/gcc-9-branch/gcc/dbgcnt.def?view=markup
and run 'gcc "-fdbg-cnt=treepre_insert:$lb:$ub"' in $cmd? so the whole $cmd
should rebuild gtestutils, perform the link steps and run the test.



Bug#931921: clutter's autopkgtests hang when ran with a libglib2.0-0 built with gcc-9

2019-08-02 Thread Simon McVittie
On Fri, 02 Aug 2019 at 19:49:20 +0100, Simon McVittie wrote:
> If you compile test_run_seed() with -O1, and the rest of gtestutils.c
> with -O2, the clutter test hangs.

Binary-searching through the extra optimizations enabled by -O2 [1]
led me to the minimal change being: if you modify test_run_seed() to add

__attribute__((optimize("no-tree-pre")))

then the clutter test passes. Without that attribute it fails.

I have no idea why.

smcv

[1] gcc-9 -Q -O2 --help=optimizers > O2
gcc-9 -Q -O1 --help=optimizers > O1
diff -u O1 O2
From: Simon McVittie 
Date: Fri, 2 Aug 2019 20:13:39 +0100
Subject: Disable an optimization when building with gcc 9

This appears to break the clutter autopkgtest, but I don't know why.

Signed-off-by: Simon McVittie 
Bug: https://bugs.debian.org/931921
---
 glib/gtestutils.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/glib/gtestutils.c b/glib/gtestutils.c
index c9bc3dd..0908686 100644
--- a/glib/gtestutils.c
+++ b/glib/gtestutils.c
@@ -1607,6 +1607,9 @@ void
 test_built_files_dir = test_argv0_dirname;
 }
 
+#if defined(__GNUC__) && __GNUC__ >= 9 && !defined(__clang__)
+__attribute__((optimize("no-tree-pre")))
+#endif
 static void
 test_run_seed (const gchar *rseed)
 {


Bug#931921: clutter's autopkgtests hang when ran with a libglib2.0-0 built with gcc-9

2019-08-02 Thread Simon McVittie
On Fri, 02 Aug 2019 at 11:41:05 +0100, Iain Lane wrote:
> For the record: at Debconf doko suggested to me that a way to start
> debugging this from the GCC end would be to produce one working and one
> "broken" build of the same version of glib2.0, and then copy object
> files from the broken to the working one until that too becomes broken.
> That sounded long / tedious / fiddly so we didn't manage to get round to
> doing it while we were both there, unfortunately.

This is not straightforward in the Meson/Ninja build system, but what I
could do was to hack build.ninja to compile different files with different
compilers, which narrowed it down to gtestutils.c (?!).

Playing with #pragma GCC optimize ("O1") and #pragma GCC optimize ("O2")
I was able to narrow it down further, to this:

If you compile test_run_seed() with -O1, and the rest of gtestutils.c
with -O2, the clutter test hangs.

(But I need to test that minimal change in a clean build to make sure
that's really true and not being affected by other hacks.)

smcv



Bug#931921: clutter's autopkgtests hang when ran with a libglib2.0-0 built with gcc-9

2019-08-02 Thread Iain Lane
On Fri, Aug 02, 2019 at 11:26:48AM +0100, Simon McVittie wrote:
> On Fri, 12 Jul 2019 at 11:16:53 +0100, Iain Lane wrote:
> > I saw this on Ubuntu (9.1.0-8ubuntu1) but I've also reproduced this with
> > 9.1.0-8 on sid (w/gcc-defaults from experimental to use gcc-9 by
> > default).
> > 
> > clutter's tests hang:
> ...
> > It's adding some stuff to a main loop and expecting it to finish when a
> > particular signal handler is called.
> 
> I can reproduce this with current unstable GLib, 2.60.6-1 (after reverting
> the change that makes it explicitly build with gcc 8).
> 
> In the non-working case, it looks as though the paint signal is not
> getting emitted at all. The Clutter master clock appears to be the thing
> that is meant to be scheduling GLib main-loop events in this case (and
> then I got lost and couldn't work out what was wrong).

Thanks for reproducing.

For the record: at Debconf doko suggested to me that a way to start
debugging this from the GCC end would be to produce one working and one
"broken" build of the same version of glib2.0, and then copy object
files from the broken to the working one until that too becomes broken.
That sounded long / tedious / fiddly so we didn't manage to get round to
doing it while we were both there, unfortunately.

> I have been trying to reproduce this with the upstream GLib source code
> so that I can try a build of GLib with ThreadSanitizer or AddressSanitizer,
> but so far this has not been successful.

Hmm, I thought I tried that and did manage to reproduce. Perhaps that's
my memory being slightly faulty. Will try again (probably in a week
after I'm back from holidays) if you don't beat me to it.

Cheers,

-- 
Iain Lane  [ i...@orangesquash.org.uk ]
Debian Developer   [ la...@debian.org ]
Ubuntu Developer   [ la...@ubuntu.com ]


signature.asc
Description: PGP signature


Bug#931921: clutter's autopkgtests hang when ran with a libglib2.0-0 built with gcc-9

2019-08-02 Thread Simon McVittie
On Fri, 12 Jul 2019 at 11:16:53 +0100, Iain Lane wrote:
> I saw this on Ubuntu (9.1.0-8ubuntu1) but I've also reproduced this with
> 9.1.0-8 on sid (w/gcc-defaults from experimental to use gcc-9 by
> default).
> 
> clutter's tests hang:
...
> It's adding some stuff to a main loop and expecting it to finish when a
> particular signal handler is called.

I can reproduce this with current unstable GLib, 2.60.6-1 (after reverting
the change that makes it explicitly build with gcc 8).

In the non-working case, it looks as though the paint signal is not
getting emitted at all. The Clutter master clock appears to be the thing
that is meant to be scheduling GLib main-loop events in this case (and
then I got lost and couldn't work out what was wrong).

> Things which make it work again
> 
>   - Building glib2.0 w/gcc-9 -O1 (and -O0)
>   - Building w/gcc-8 (obviously)

I can confirm that a GLib built from the same source code with gcc-9 -O1
works fine.

I have been trying to reproduce this with the upstream GLib source code
so that I can try a build of GLib with ThreadSanitizer or AddressSanitizer,
but so far this has not been successful.

smcv



Bug#931921: clutter's autopkgtests hang when ran with a libglib2.0-0 built with gcc-9

2019-07-12 Thread Iain Lane
Source: gcc-9
Version: 9.1.0-8
Severity: important

Control: affects -1 gcc libglib2.0-0

Hiya,

I saw this on Ubuntu (9.1.0-8ubuntu1) but I've also reproduced this with
9.1.0-8 on sid (w/gcc-defaults from experimental to use gcc-9 by
default).

clutter's tests hang like this:

0x77b316f4 in __GI___poll (fds=0x558d0420, nfds=2, timeout=-1) at 
../sysdeps/unix/sysv/linux/poll.c:29
29  ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
(gdb) bt
#0  0x77b316f4 in __GI___poll (fds=0x558d0420, nfds=2, timeout=-1) 
at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x77c62c9e in g_main_context_poll (priority=, 
n_fds=2, fds=0x558d0420, timeout=,
context=0x555b0cd0) at ../glib/gmain.c:4213
#2  g_main_context_iterate (context=0x555b0cd0, block=block@entry=1, 
dispatch=dispatch@entry=1, self=)
at ../glib/gmain.c:3909
#3  0x77c63023 in g_main_loop_run (loop=0x55f01d10) at 
../glib/gmain.c:4108
#4  0x6634 in verify_redraw (expected_paint_count=1, 
data=, data=)
at actor-offscreen-redirect.c:172

Here's the bit of code.

  
https://sources.debian.org/src/clutter-1.0/1.26.2+dfsg-10/tests/conform/actor-offscreen-redirect.c/#L172

It's adding some stuff to a main loop and expecting it to finish when a
particular signal handler is called.

There's obviously a lot of code going on here (this isn't anything like
a minimal testcase, and glib2.0's own testsuite & autopkgtests pass so
it's not like GMainLoop is completely broken).

I found an upstream report which I suspect is the same thing:

  https://gitlab.gnome.org/GNOME/clutter/issues/6

Things which make it work again

  - Building glib2.0 w/gcc-9 -O1 (and -O0)
  - Building w/gcc-8 (obviously)

Happy to help try to narrow this down to the optimisation that's
breaking glib and/or the code in glib that's breaking the optimisation.
Maybe in person at Debconf. For now I'm going to upload a glib2.0 that
builds with gcc-8 explicitly.

Cheers,

-- 
Iain Lane  [ i...@orangesquash.org.uk ]
Debian Developer   [ la...@debian.org ]
Ubuntu Developer   [ la...@ubuntu.com ]