Hi,

On 15/11/17 10:35, Adrian Bunk wrote:
> On Wed, Nov 15, 2017 at 02:30:54PM +0500, Lev Lamberov wrote:
>> Ср 15 ноя 2017 @ 08:06 Adrian Bunk <b...@debian.org>:
>> ...
>>> Same randomness on powerpcspe, and both have it already with 7.6.1+dfsg-1:
>>> https://buildd.debian.org/status/logs.php?pkg=swi-prolog&arch=powerpc
>>> https://buildd.debian.org/status/logs.php?pkg=swi-prolog&arch=powerpcspe
>> ...
>> At least, I cannot think of any other reason that 7.6.1-1
>> was built successfully on mips and 7.6.1-2 failed, where the only one
>> change is disabling Java tests (due to CVE-2017-1000364). I've uploaded
>> 7.6.1-2 on the next day after 7.6.1-1 upload.
> 
> you already quoted the reason:
> 
>> The main issue
>> seems to be weaker read/write ordering constraints that break our
>> lock-free data structures, resulting in more or less random bugs.
> 
> Based on the mips/powerpc/powerpcspe results one could say that there
> is a 50% chance that a build attempt fails.

I had a look at this and indeed the build fails on mips randomly usually
hanging in the test_cgc test (as mentioned earlier).

One thread always hangs in the global_generation function which is only
enabled if ATOMIC_GENERATION_HACK is enabled (only on arches without
64-bit atomics).

Excerpt from pl-inline.h:465
> static inline gen_t
> global_generation(void)
> { gen_t g;
>   gen_t last;
> 
>   do
>   { last = GD->_last_generation;
>     g = (gen_t)GD->_generation.gen_u<<32 | GD->_generation.gen_l;
>   } while ( unlikely(g < last) );
> 
>   if ( unlikely(last != g) )
>     GD->_last_generation = g;
> 
>   return g;
> }

In the above loop, GD->_generation and GD->_last_generation are not
modified within the loop, nor are they declared volatile. Therefore the
"last" and "g" assignments are invariant and GCC will hoist them out of
the loop (into something like what is shown below). This causes the
hangs on mips if g < last for some reason.

last = GD->_last_generation;
g = (gen_t)GD->_generation.gen_u<<32 | GD->_generation.gen_l;
do {} while (g < last);

As a side note, I also notice GD->_generation is loaded without any
memory barriers on all architectures (pl-incl.h:999) which looks a bit
dodgy (although I don't know if it actually breaks anything).

IMO the best solution is to remove all the ATOMIC_GENERATION_HACK code
and use libatomic, but this will take some porting work because
swi-prolog uses the old __sync primitives everywhere.

I have attached a hack which marks _generation and _last_generation as
volatile. This seems to work but isn't a long term solution.

James
--- a/src/pl-global.h
+++ b/src/pl-global.h
@@ -105,9 +105,9 @@ struct PL_global_data
   } signals;
 #endif
 #ifdef O_LOGICAL_UPDATE
-  ggen_t       _generation;            /* generation of the database */
+  volatile ggen_t      _generation;            /* generation of the database */
 #ifdef ATOMIC_GENERATION_HACK
-  gen_t                _last_generation;       /* see pl-inline.h, global_generation() */
+  volatile gen_t               _last_generation;       /* see pl-inline.h, global_generation() */
 #endif
 #endif

Attachment: signature.asc
Description: OpenPGP digital signature

Reply via email to