Hi, On 15/11/17 10:35, Adrian Bunk wrote: > On Wed, Nov 15, 2017 at 02:30:54PM +0500, Lev Lamberov wrote: >> Ср 15 ноя 2017 @ 08:06 Adrian Bunk <b...@debian.org>: >> ... >>> Same randomness on powerpcspe, and both have it already with 7.6.1+dfsg-1: >>> https://buildd.debian.org/status/logs.php?pkg=swi-prolog&arch=powerpc >>> https://buildd.debian.org/status/logs.php?pkg=swi-prolog&arch=powerpcspe >> ... >> At least, I cannot think of any other reason that 7.6.1-1 >> was built successfully on mips and 7.6.1-2 failed, where the only one >> change is disabling Java tests (due to CVE-2017-1000364). I've uploaded >> 7.6.1-2 on the next day after 7.6.1-1 upload. > > you already quoted the reason: > >> The main issue >> seems to be weaker read/write ordering constraints that break our >> lock-free data structures, resulting in more or less random bugs. > > Based on the mips/powerpc/powerpcspe results one could say that there > is a 50% chance that a build attempt fails.
I had a look at this and indeed the build fails on mips randomly usually hanging in the test_cgc test (as mentioned earlier). One thread always hangs in the global_generation function which is only enabled if ATOMIC_GENERATION_HACK is enabled (only on arches without 64-bit atomics). Excerpt from pl-inline.h:465 > static inline gen_t > global_generation(void) > { gen_t g; > gen_t last; > > do > { last = GD->_last_generation; > g = (gen_t)GD->_generation.gen_u<<32 | GD->_generation.gen_l; > } while ( unlikely(g < last) ); > > if ( unlikely(last != g) ) > GD->_last_generation = g; > > return g; > } In the above loop, GD->_generation and GD->_last_generation are not modified within the loop, nor are they declared volatile. Therefore the "last" and "g" assignments are invariant and GCC will hoist them out of the loop (into something like what is shown below). This causes the hangs on mips if g < last for some reason. last = GD->_last_generation; g = (gen_t)GD->_generation.gen_u<<32 | GD->_generation.gen_l; do {} while (g < last); As a side note, I also notice GD->_generation is loaded without any memory barriers on all architectures (pl-incl.h:999) which looks a bit dodgy (although I don't know if it actually breaks anything). IMO the best solution is to remove all the ATOMIC_GENERATION_HACK code and use libatomic, but this will take some porting work because swi-prolog uses the old __sync primitives everywhere. I have attached a hack which marks _generation and _last_generation as volatile. This seems to work but isn't a long term solution. James
--- a/src/pl-global.h +++ b/src/pl-global.h @@ -105,9 +105,9 @@ struct PL_global_data } signals; #endif #ifdef O_LOGICAL_UPDATE - ggen_t _generation; /* generation of the database */ + volatile ggen_t _generation; /* generation of the database */ #ifdef ATOMIC_GENERATION_HACK - gen_t _last_generation; /* see pl-inline.h, global_generation() */ + volatile gen_t _last_generation; /* see pl-inline.h, global_generation() */ #endif #endif
signature.asc
Description: OpenPGP digital signature