On Fri, 2013-06-14 at 19:44 -0500, Peter Bergner wrote:
> I'm currently implementing support for hardware transactional memory in
> the rs6000 backend for POWER8. Things seem to be mostly working, but I
> have run into a few issues I'm wondering whether other people are seeing.
>
> For me, all of the libitm execution test cases in libitm/testsuite/libitm.c/
> compile and execute without error, except for reentrant.c, which hangs for me.
> My gdb hasn't been ported to support HTM on Power yet, so debugging has been
> slow, but what I've learned is, that my tbegin. instruction succeeds, but I
> fail the test (meaning someone has the write lock) at beginend.cc:200:
>
>     if (unlikely(serial_lock.is_write_locked()))
>       htm_abort();
>
> ...so we abort the transaction. The failure is not persistent, so we do
> not break out of the loop due to:
>
>     if (!htm_abort_should_retry(ret))
>       break;
>
> We then fall into the following code, where we hang trying to get the
> read lock:
>
>     serial_lock.read_lock(tx);
>
> I have yet to track down who has the write lock and why, but I am working
> towards that. Talking with Andreas, he said he is seeing the same failure
> on S390, so I'm wondering whether this might be a generic libitm issue
> and it might hit Intel too.

I think that this is a bug in libitm's HTM fastpath. What I suppose
happens is that we have a relaxed outermost transaction that executes
unsafe code (see reentrant.c), thus switches to serial-irrevocable mode,
and then tries to start a nested transaction. The nested txn then
observes in the HTM fastpath that there is a serial-mode txn already,
but it never checks whether it is itself enclosed in an outermost
transaction that is already serial.

> Does anyone know whether this executes correctly
> on Intel hardware with RTM?

I don't know currently, but I suppose the bug should trigger there too
(unless, for some reason, the nested txn always aborts immediately with
RTM).

> I'll note that if I hack the call to
> htm_abort_should_retry(ret) so that we break out of the loop and fall back
> to SW TM, then the test case executes correctly.

That matches what I suppose the bug is. Please feel free to create a
bug report. I will work on a patch.

Torvald
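
To make the suspected sequence concrete, here is a minimal, single-threaded model of the begin path as described in this thread. It is not libitm's code: SerialLockModel, BeginResult, begin_transaction, the check_enclosing_serial flag and the max_htm_attempts bound are all hypothetical names and assumptions made for illustration, and the real patch may look quite different. The hardware begin is simply assumed to succeed, and the write lock is modelled as a thread id.

```cpp
// Hypothetical stand-in for libitm's serial lock: it only records which
// thread, if any, holds the write lock (-1 means nobody holds it).
struct SerialLockModel {
    int writer = -1;
    bool is_write_locked() const { return writer != -1; }
};

// Possible outcomes of starting a transaction in this model.
enum class BeginResult {
    HtmFastpath,      // hardware transaction started and kept running
    NestedInSerial,   // recognized as nested inside our own serial txn
    SoftwareFallback, // read lock is free, so the software TM path runs
    WaitForWriter,    // would block until another thread drops the lock
    SelfDeadlock      // would block on a write lock this thread holds
};

// Models the begin path: retry the HTM fastpath a bounded number of
// times, then fall back to taking the read lock. The retry bound and
// the check_enclosing_serial switch are assumptions of this sketch.
BeginResult begin_transaction(const SerialLockModel& lock, int self,
                              bool check_enclosing_serial,
                              int max_htm_attempts = 2) {
    // The check that appears to be missing: are we already inside a
    // serial-mode transaction owned by this very thread?
    if (check_enclosing_serial && lock.writer == self)
        return BeginResult::NestedInSerial;

    for (int attempt = 0; attempt < max_htm_attempts; ++attempt) {
        // Assume the hardware begin (tbegin. on POWER8) succeeds here.
        if (!lock.is_write_locked())
            return BeginResult::HtmFastpath;
        // Otherwise this corresponds to htm_abort(); the abort is not
        // persistent, so the loop retries instead of breaking out.
    }

    // Retries exhausted: this models serial_lock.read_lock(tx).
    if (!lock.is_write_locked())
        return BeginResult::SoftwareFallback;
    return lock.writer == self ? BeginResult::SelfDeadlock
                               : BeginResult::WaitForWriter;
}
```

With the write lock held by thread 1, calling begin_transaction(lock, 1, false) ends in SelfDeadlock, which corresponds to the hang in reentrant.c, while begin_transaction(lock, 1, true) returns NestedInSerial before the fastpath is ever tried. A different thread still gets WaitForWriter in both cases, which is the intended behaviour.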