On 9/16/12 3:20 AM, Stefan Teleman wrote:
> On Sat, Sep 15, 2012 at 4:53 PM, Liviu Nicoara <[email protected]> wrote:
>
>> Now, to clear the confusion I created: the timing numbers I posted in the
>> attachment stdcxx-1056-timings.tgz to STDCXX-1066 (09/11/2012) showed that a
>> perfectly forwarding, no caching public interface (exemplified by a changed
>> grouping) performs better than the current implementation. It was that test
>> case that I hoped you could time, perhaps on SPARC, in both MT and ST
>> builds. The t.cpp program is for MT, s.cpp for ST.
>
> I got your patch, and have tested it.
>
> I have created two Experiments (that's what they are called) with the
> SunPro Performance Analyzer. Both experiments are targeting race
> conditions and deadlocks in the instrumented program,  and both
> experiments are running the 22.locale.numpunct.mt program from the
> stdcxx test harness. One experiment is with  your patch applied. The
> other experiment is with our (Solaris) patch applied.
>
> Here are the results:

I looked at the analysis more closely.

>
> 1. with your patch applied:
>
> http://s247136804.onlinehome.us/22.locale.numpunct.mt.1.er.nts/

I see here (http://tinyurl.com/94pbmzc) that the implementation of the facet public interface is forwarding, with no caching.

>
> 2. with our (Solaris) patch applied:
>
> http://s247136804.onlinehome.us/22.locale.numpunct.mt.1.er.ts/

Unfortunately, I can't do the same here. Could you please refresh my memory about what this patch contains? Is it not part of the patch set you published here earlier (http://tinyurl.com/8pyql4g)?

AFAICT, the race accesses that the analyzer points out are writes to shared locations which occur along the thread execution paths. They do not necessarily mean that a race condition exists; in fact, we know that no race condition exists if the public facet interface forwards to the protected virtual interface, which is what was tested in the first analysis, looking at _numpunct.h: http://tinyurl.com/94pbmzc
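
For illustration, here is a minimal sketch of what I mean by a forwarding, non-caching public interface. The names are made up for the example; this is not the actual stdcxx declaration of the facet:

    #include <string>

    // Sketch only: the public member forwards straight to the protected
    // virtual on every call, so no cached data member is ever written
    // after construction and there is nothing for two threads to race on.
    struct numpunct_sketch {
        std::string grouping () const {
            return do_grouping ();   // forward, never cache
        }
    protected:
        virtual std::string do_grouping () const {
            return "";               // derived facets override this
        }
    };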

Looking elsewhere, also in the first analysis, at the __rw_get_numpunct function (the src link points here: http://tinyurl.com/8ez85e2): all the highlighted lines, each performing a write to a shared location, are potential race points, but they do not lead to race conditions because of the proper synchronization we know occurs via the __rw_setlocale class.

The race accesses in __rw_get_numpunct add up to ~3400 with the forwarding patch applied, as you pointed out in a later email. That number was a bit puzzling at first, but looking at the thread function I see the test uses the numpunct test suite code, which creates a locale and extracts the facet from it in each iteration.
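
This is roughly what I understand the thread function to do; the names and the loop count below are illustrative, not the actual 22.locale.numpunct.mt source:

    #include <locale>

    // Each iteration constructs a locale and extracts the numpunct facet
    // from it, so __rw_get_numpunct is reached on every iteration in
    // every thread.
    extern "C" void* thread_func (void*)
    {
        for (int i = 0; i != 10000; ++i) {
            const std::locale loc ("C");
            const std::numpunct<char> &np =
                std::use_facet<std::numpunct<char> >(loc);
            (void)np.grouping ();   // goes through the public interface
        }
        return 0;
    }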

That means that, ideally, for 4 threads iterating 10,000 times each, I would expect the locale to be created 40K times, and likewise for the facet extractions, the __rw_get_numpunct calls, etc. Could the number of race accesses collected, which is far less than that, be explained by a lesser degree of thread overlap? I.e., some threads start earlier, others later, and they only partially overlap?

If that is the case, I would not ascribe much importance to these numbers. As I think was pointed out earlier, a numpunct facet is initialized on the first trip through __rw_get_numpunct, and only that trip is properly synchronized. All subsequent trips through __rw_get_numpunct find the facet data already there; they simply read it, with no synchronization needed, and return it. Therefore, the cost of initialization/synchronization is paid only once.
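
In other words, the pattern is the usual initialize-once scheme, roughly like the sketch below. This is only an approximation under my assumptions, not the actual __rw_get_numpunct code, and the plain pthread mutex merely stands in for the __rw_setlocale synchronization:

    #include <pthread.h>
    #include <string>

    struct punct_data {
        bool        initialized;
        std::string grouping;
    };

    static pthread_mutex_t guard = PTHREAD_MUTEX_INITIALIZER;
    static punct_data      data;           // shared facet data

    const punct_data& get_punct_data ()
    {
        if (!data.initialized) {           // unsynchronized check: this is
                                           // the write/read the analyzer flags
            pthread_mutex_lock (&guard);
            if (!data.initialized) {       // re-check under the lock
                data.grouping    = "\3";   // expensive setup, done once
                data.initialized = true;
            }
            pthread_mutex_unlock (&guard);
        }
        return data;                       // later calls only read
    }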

Thanks.

Liviu
