On 10/26/2012 06:50 AM, Liviu Nicoara wrote:
On 10/03/12 11:10, Martin Sebor wrote:
[...]
I was just thinking of a few simple loops along the lines of:

void* thread_func (void*) {
    for (int i = 0; i < N; ++i)
        test 1: do some simple stuff inline
        test 2: call a virtual function to do the same stuff
        test 3: lock and unlock a mutex and do the same stuff
}

Test 1 should be the fastest and test 3 the slowest. This should
hold regardless of what "simple stuff" is (eventually, even when
it's getting numpunct::grouping() data).
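
A minimal harness along those lines might look like the sketch below. Everything
in it (N, NTHREADS, do_stuff (), the gettimeofday timing) is invented for
illustration and is not part of any existing test:

// Hypothetical micro-benchmark sketch; all names and constants are made up.
#include <pthread.h>
#include <sys/time.h>
#include <cstdio>

enum { N = 10000000, NTHREADS = 4 };

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static volatile long sink;   // ballast; the race on it is deliberate

static void do_stuff () { sink = sink + 1; }

struct Worker {
    virtual ~Worker () { }
    virtual void work () { do_stuff (); }
};

static Worker* const wp = new Worker;   // called through a pointer so the
                                        // dispatch stays (mostly) virtual

static void* test_inline (void*) {      // test 1
    for (int i = 0; i != N; ++i)
        do_stuff ();
    return 0;
}

static void* test_virtual (void*) {     // test 2
    for (int i = 0; i != N; ++i)
        wp->work ();
    return 0;
}

static void* test_mutex (void*) {       // test 3
    for (int i = 0; i != N; ++i) {
        pthread_mutex_lock (&mtx);
        do_stuff ();
        pthread_mutex_unlock (&mtx);
    }
    return 0;
}

static double now () {
    timeval tv;
    gettimeofday (&tv, 0);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

static void run (const char* name, void* (*fn)(void*)) {
    pthread_t tid [NTHREADS];
    const double start = now ();
    for (int i = 0; i != NTHREADS; ++i)
        pthread_create (tid + i, 0, fn, 0);
    for (int i = 0; i != NTHREADS; ++i)
        pthread_join (tid [i], 0);
    std::printf ("%-8s %.3f sec\n", name, now () - start);
}

int main () {
    run ("inline", test_inline);
    run ("virtual", test_virtual);
    run ("mutex", test_mutex);
    return 0;
}

Running the three tests with the same iteration and thread counts should make
the relative cost of the virtual dispatch and of the mutex round trip directly
visible.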

tl;dr: removing the facet data cache is a priority. All else can be put
on the back-burner.

Conflicting test results aside, there is still the matter of the incorrect
handling of the cached data in the facet. I don't think there is any
disagreement on that. Considering that std::string is moving in the
direction of dropping the handle-body implementation, simply getting rid
of the cache is a step in the same direction.

I think that we should preserve the lock-free reading of the facet data,
as a benign race, but making it benign is perhaps more complicated than
previously suggested.

As a reminder, the core of the facet access and initialization code
essentially looks like this (pseudocode-ish):


// facet data accessor
...
if (0 == _C_impsize) {               // 1
    mutex_lock ();
    if (_C_impsize) {
        mutex_unlock ();
        return _C_data;
    }
    _C_data = get_facet_data ();     // 2
    ??                               // 3
    _C_impsize = 1;                  // 4
    mutex_unlock ();
}
??                                   // 5
return _C_data;                      // 6
...


with question marks marking the missing, necessary fixes. The compiler needs
to be prevented from re-ordering both the writes in 2-4 and the reads in 1-6.
Just for the sake of argument, I can imagine an optimization that reorders
the reads in 1-6:

register x = _C_data;
if (_C_impsize)
    return x;

and if the loads are executed in this order, the caller may see a stale
_C_data.

First, the writes in 2-4 need to be executed in program order. This requires
both a compiler barrier and a store-store memory barrier to keep the writes
ordered.

Then, the reads in 1-6 need to be ordered such that _C_data is read
after _C_impsize, via a compiler barrier and a load-load memory barrier
that will preserve the program order of the loads.

Various compilers provide these features in various forms, but at the
moment we don't have a unified STDCXX API to implement this.
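
Just to illustrate what the fixes at 3 and 5 amount to, here is roughly how
the accessor would read if we could assume a C++11 compiler and <atomic>,
which STDCXX cannot, so this is a sketch only; the class wrapper and the
get_facet_data () stub are invented:

// Illustration only: assumes C++11 <atomic>/<mutex>; member names mirror
// the pseudocode above.
#include <atomic>
#include <mutex>
#include <cstddef>

struct facet_cache {
    std::atomic<std::size_t> _C_impsize;   // 0 until _C_data is published
    const void*              _C_data;
    std::mutex               _C_mutex;

    facet_cache (): _C_impsize (0), _C_data (0) { }

    const void* data () {
        // 1: the acquire load orders the read of _C_data at 6 after it
        if (0 == _C_impsize.load (std::memory_order_acquire)) {
            std::lock_guard<std::mutex> guard (_C_mutex);
            if (0 == _C_impsize.load (std::memory_order_relaxed)) {
                _C_data = get_facet_data ();                   // 2
                // 3-4: the release store keeps the write to _C_data
                // ordered before the write to _C_impsize
                _C_impsize.store (1, std::memory_order_release);
            }
        }
        return _C_data;                                        // 5-6
    }

    // stand-in for the real facet data initializer
    static const void* get_facet_data () {
        static const int dummy = 0;
        return &dummy;
    }
};

The release store and acquire load provide (at least) the store-store and
load-load ordering described above; on compilers without <atomic> the same
effect needs per-target barrier intrinsics, which is exactly the API we are
missing.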

Of course, I might be wrong. Input is appreciated.

You're right (we don't have an API for it, and it is the double-checked
locking problem mentioned earlier in this thread). Implementing it here
would bloat the function (just in terms of source code, not necessarily
object code), and potentially introduce a binary compatibility dependency
on the new code (i.e., we wouldn't be able to remove the fix in the future
w/o breaking it).

I suggested moving the body of the outer if from the header
into a .cpp file in the library where we could implement
ugly, bloated locking without the risk of breaking things
if we removed/replaced it in the future. That's when we ran
into questions about how exactly to do this cleanly, etc.
It didn't seem to be doable very cleanly but I still think
it's a viable approach.
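
Purely as a sketch of that split (the member names, __rw_get_facet_data ()
and the pthread locking in the .cpp file are invented for illustration):

// --- header: the fast path stays inline -------------------------------
#include <cstddef>

struct facet_cache {
    std::size_t _C_impsize;   // 0 until _C_data has been published
    const void* _C_data;

    const void* data () {
        if (0 == _C_impsize)
            return _C_init ();   // slow path, out of line in the library
        return _C_data;
    }

    // defined in a .cpp file; its locking/barriers can be changed or
    // removed later without touching code compiled against the header
    const void* _C_init ();
};

// --- facet_cache.cpp, in the library -----------------------------------
#include <pthread.h>

static pthread_mutex_t __rw_facet_mutex = PTHREAD_MUTEX_INITIALIZER;

// stand-in for the real facet data initializer
static const void* __rw_get_facet_data () {
    static const int dummy = 0;
    return &dummy;
}

const void* facet_cache::_C_init ()
{
    pthread_mutex_lock (&__rw_facet_mutex);
    if (0 == _C_impsize) {
        _C_data = __rw_get_facet_data ();
        // whatever store barrier the target needs goes here, before
        // _C_impsize is set
        _C_impsize = 1;
    }
    pthread_mutex_unlock (&__rw_facet_mutex);
    return _C_data;
}

The unsynchronized read in data () is still the benign race discussed above,
and the reader-side load ordering Liviu describes would still have to be
handled in the header (or accepted), but the part that may need ugly,
target-specific code is now private to the library.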

Martin


Thanks,
Liviu

