On 9/16/12 3:20 AM, Stefan Teleman wrote:
> On Sat, Sep 15, 2012 at 4:53 PM, Liviu Nicoara <[email protected]> wrote:
>
>> Now, to clear the confusion I created: the timing numbers I posted in the
>> attachment stdcxx-1056-timings.tgz to STDCXX-1066 (09/11/2012) showed that a
>> perfectly forwarding, no caching public interface (exemplified by a changed
>> grouping) performs better than the current implementation. It was that test
>> case that I hoped you could time, perhaps on SPARC, in both MT and ST
>> builds. The t.cpp program is for MT, s.cpp for ST.
>
> I got your patch, and have tested it.
>
> I have created two Experiments (that's what they are called) with the
> SunPro Performance Analyzer. Both experiments are targeting race
> conditions and deadlocks in the instrumented program, and both
> experiments are running the 22.locale.numpunct.mt program from the
> stdcxx test harness. One experiment is with your patch applied. The
> other experiment is with our (Solaris) patch applied.
>
> Here are the results:
I looked at the analysis more closely.
>
> 1. with your patch applied:
>
> http://s247136804.onlinehome.us/22.locale.numpunct.mt.1.er.nts/
I see here (http://tinyurl.com/94pbmzc) that the implementation of the facet
public interface is forwarding, with no caching.
>
> 2. with our (Solaris) patch applied:
>
> http://s247136804.onlinehome.us/22.locale.numpunct.mt.1.er.ts/
Unfortunately, I can't do the same here. Could you please refresh my memory on what this patch contains? Is it not part of the patch set you published here earlier (http://tinyurl.com/8pyql4g)?
AFAICT, the race accesses that the analyzer points out are writes to shared locations which occur along the thread execution path. They do not necessarily mean that a race condition exists, and in fact we know that no race condition exists if the public facet interface forwards to the protected virtual interface. That is what was tested in the first analysis, as can be seen in
_numpunct.h: http://tinyurl.com/94pbmzc
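Just so we are talking about the same thing, here is a minimal sketch of what I mean by a forwarding, non-caching public interface; the names below are made up for illustration and are not the actual declarations from _numpunct.h:

    #include <string>

    struct forwarding_numpunct {
        virtual ~forwarding_numpunct () { }

        // public interface: forwards to the protected virtual on every
        // call and never stores a cached copy in the facet object, so
        // concurrent callers only ever read shared state
        std::string grouping () const {
            return do_grouping ();
        }

    protected:
        // protected virtual interface, overridden by derived facets
        virtual std::string do_grouping () const {
            return "";   // "C" locale: no grouping
        }
    };

    int main ()
    {
        forwarding_numpunct np;
        return np.grouping ().empty () ? 0 : 1;
    }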
Looking elsewhere, also in the first analysis, there is the __rw_get_numpunct function (the src link points here: http://tinyurl.com/8ez85e2). All highlighted lines, each performing a write to a shared location, are potential race points, but they do not lead to race conditions because of the proper synchronization we know occurs in the __rw_setlocale class.
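A rough model of why I think those flagged writes are benign, assuming (as described above) that __rw_setlocale serializes them; the guard class and mutex below are stand-ins of my own, not the real stdcxx internals:

    #include <mutex>

    static std::mutex rw_locale_mutex;      // stand-in for the stdcxx lock

    struct setlocale_guard {                // stand-in for __rw_setlocale
        std::lock_guard<std::mutex> lock_;
        setlocale_guard () : lock_ (rw_locale_mutex) { }
    };

    static const char* punct_data = 0;      // the shared location

    static void build_punct_data ()
    {
        // every writer holds the guard for the duration of the writes, so
        // the writes the analyzer highlights cannot race one another
        setlocale_guard guard;
        punct_data = "\003";
    }

    int main ()
    {
        build_punct_data ();
        return punct_data ? 0 : 1;
    }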
The race accesses in __rw_get_numpunct sum up to ~3400 with the forwarding patch, as you pointed out in a later email. That number was a bit puzzling, but looking at the thread function I see that the test uses the numpunct test suite code, which creates a locale and extracts the facet from it
in each iteration.
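For reference, the loop I have in mind is roughly the one below. This is my approximation of the 22.locale.numpunct.mt thread function, not the actual harness code, and it uses std::thread only to keep the sketch self-contained:

    #include <locale>
    #include <thread>
    #include <vector>

    static void thread_func ()
    {
        for (int i = 0; i != 10000; ++i) {
            // a locale is constructed and the facet extracted from it on
            // every iteration, so every iteration goes through
            // __rw_get_numpunct in the stdcxx build
            const std::locale loc ("C");
            const std::numpunct<char> &np =
                std::use_facet<std::numpunct<char> > (loc);
            (void)np.grouping ();
        }
    }

    int main ()
    {
        std::vector<std::thread> threads;
        for (int i = 0; i != 4; ++i)    // 4 threads x 10000 iterations
            threads.push_back (std::thread (thread_func));
        for (std::size_t i = 0; i != threads.size (); ++i)
            threads[i].join ();
        return 0;
    }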
That means that, ideally, for 4 threads each iterating 10,000 times, I would expect locales to be created 40K times, and likewise for the facet extractions, the __rw_get_numpunct calls, etc. Could the number of race accesses collected, far less than that, be explained by a lesser degree of thread overlap? I.e., some threads start earlier, others later, and they only partially overlap?
If that is the case, I would not ascribe much importance to these numbers. As I think was pointed out earlier, a numpunct facet is initialized on the first trip through __rw_get_numpunct, and only that trip is properly synchronized. All subsequent trips through __rw_get_numpunct find the facet data already there; they just read it, with no synchronization needed, and return it. Therefore, the cost of
initialization/synchronization is paid only once.
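Put differently, the structure I have in mind is the usual initialize-once pattern sketched below. The names are invented, and the atomic flag is only there to keep the sketch well-defined in today's terms; it is not how __rw_get_numpunct itself is written:

    #include <atomic>
    #include <mutex>
    #include <string>

    static std::atomic<bool> initialized (false);
    static std::mutex        init_mutex;      // stands in for __rw_setlocale
    static std::string       grouping_data;   // the shared facet data

    static const std::string& get_numpunct_data ()
    {
        if (!initialized.load (std::memory_order_acquire)) {
            // first trip: build the facet data under the lock
            std::lock_guard<std::mutex> guard (init_mutex);
            if (!initialized.load (std::memory_order_relaxed)) {
                grouping_data = "\003";
                initialized.store (true, std::memory_order_release);
            }
        }
        // subsequent trips: the data is already there and is only read
        return grouping_data;
    }

    int main ()
    {
        return get_numpunct_data ().size () == 1 ? 0 : 1;
    }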
Thanks.
Liviu