On Fri, 2013-06-14 at 19:44 -0500, Peter Bergner wrote:
> I'm currently implementing support for hardware transactional memory in
> the rs6000 backend for POWER8.  Things seem to be mostly working, but I
> have run into a few issues I'm wondering whether other people are seeing.
> 
> For me, all of the libitm execution test cases in libitm/testsuite/libitm.c/
> compile and execute without error, except for reentrant.c, which hangs.
> My gdb hasn't been ported to support HTM on Power yet, so debugging has been
> slow, but what I've learned is that my tbegin. instruction succeeds, but I
> fail the test (meaning someone has the write lock) at beginend.cc:200:
> 
>     if (unlikely(serial_lock.is_write_locked()))
>       htm_abort();
> 
> ...so we abort the transaction.  The failure is not persistent, so we do
> not break out of the loop due to:
> 
>     if (!htm_abort_should_retry(ret))
>       break;
> 
> We then fall into the following code, where we hang trying to get the
> read lock:
> 
>     serial_lock.read_lock(tx);
> 
> I have yet to track down who has the write lock and why, but I am working
> towards that.  Talking with Andreas, he said he is seeing the same failure
> on S390, so I'm wondering whether this might be a generic libitm issue
> and it might hit Intel too.

I think that this is a bug in libitm's HTM fastpath.  What I suppose
happens is that we have a relaxed outermost transaction that executes
unsafe code (see reentrant.c), thus switches to serial-irrevocable mode,
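and then tries to start a nested transaction.  The nested txn then
observes in the HTM fastpath that there is a serial-mode txn already,
but it never checks whether it is enclosed in an already serial
outermost transaction.  If so, it would abort the hardware txn and then
wait in read_lock() for a write lock that its own thread holds.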
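
To make the suspected sequence concrete, here is a toy model in C++.
It is not libitm code and not the eventual patch: ToySerialLock, ToyTx,
Path and both begin_nested_* functions are hypothetical names made up
for illustration, and the real fastpath in beginend.cc has more state
than this.  It only shows why looking at the lock alone cannot tell
"someone else is serial" apart from "I am already serial".

```cpp
#include <cassert>

// Toy model only: every name below is hypothetical, not libitm's API.
enum class Path { HardwareTxn, AlreadySerial, WaitForReadLock };

struct ToySerialLock {
  bool write_locked = false;  // true while some txn runs in serial mode
};

struct ToyTx {
  // true if this thread's outermost txn already went serial-irrevocable
  bool serial_irrevocable = false;
};

// Suspected current behavior: the nested begin only inspects the lock.
// If the write lock is held by our own outermost txn, we end up waiting
// for ourselves.
Path begin_nested_unchecked(const ToySerialLock& lock, const ToyTx&) {
  if (lock.write_locked)
    return Path::WaitForReadLock;  // abort HW txn, then block in read_lock
  return Path::HardwareTxn;
}

// One possible guard (an assumption, not the actual fix): consult the
// thread's own state before looking at the lock.
Path begin_nested_checked(const ToySerialLock& lock, const ToyTx& tx) {
  if (tx.serial_irrevocable)
    return Path::AlreadySerial;    // just nest inside the serial txn
  if (lock.write_locked)
    return Path::WaitForReadLock;  // a different thread is serial
  return Path::HardwareTxn;
}
```

With write_locked and serial_irrevocable both set, which is the state I
suppose reentrant.c reaches, the unchecked variant picks WaitForReadLock
(the hang) and the checked one picks AlreadySerial.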

> Does anyone know whether this executes correctly
> on Intel hardware with RTM?

I don't know currently, but I suppose the bug should trigger there too
(unless, for some reason, the nested txn always aborts immediately with
RTM).

> I'll note that if I hack the call to
> htm_abort_should_retry(ret) so that we break out of the loop and fall back
> to SW TM, then the test case executes correctly.

That matches what I suppose the bug is.

Please feel free to create a bug report.  I will work on a patch.

Torvald
