On 09/11/2012 04:15 PM, Stefan Teleman wrote:
On Mon, Sep 10, 2012 at 4:24 PM, Stefan Teleman
I think I have something which doesn't break BC - stay tuned because
I'm testing it now.

OK.

So, here's a possible implementation of __rw_get_numpunct() with
minimal locking, which passes the MT tests and does not break ABI:

s247136804.onlinehome.us/stdcxx-1056-20120911/punct.cpp

And the same for include/loc/_numpunct.h:

http://s247136804.onlinehome.us/stdcxx-1056-20120911/_numpunct.h

In _numpunct.h, all the functions perform no checks and no lazy
initialization; they simply pass through to __rw_get_numpunct().
std::numpunct<T>'s data members are now dead variables.
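
For illustration, the shape of such a pass-through member is roughly
the following. This is only a simplified sketch, not the exact patch;
__rw_get_numpunct() is the real helper, everything else is made up:

    #include <string>

    // sketch: the public member does no checking and no caching;
    // it simply forwards to the out-of-line virtual, which in the
    // library ends up fetching its data via __rw_get_numpunct()
    struct numpunct_sketch {
        std::string _C_grouping;        // now effectively dead

        virtual ~numpunct_sketch () { }

        virtual std::string do_grouping () const {
            return "";                  // stand-in for the real lookup
        }

        std::string grouping () const {
            return do_grouping ();      // no checks, no lazy init
        }
    };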

The bad: performance is no better than with locking the mutex inside
each of the std::numpunct<T>::*() functions, and with lazy
instantiation.

I wouldn't expect this to be faster than the original. In fact,
I would expect it to be slower because each call to one of the
public, non-virtual members results in a call to the out-of-line
virtual functions (and another to __rw_get_moneypunct). Avoiding
the overhead of such calls is the main and only reason why the
caching exists.
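
For contrast, the cached variant that keeps the public members cheap
looks roughly like this (again only a sketch, not the stdcxx code):

    #include <string>

    // sketch: the cached value is returned inline, so grouping()
    // costs a plain member load instead of a virtual dispatch plus
    // a lookup through __rw_get_numpunct()
    struct cached_numpunct_sketch {
        std::string _C_grouping;        // initialized once, up front

        std::string grouping () const {
            return _C_grouping;         // no virtual call on this path
        }
    };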

FWIW, there is one thing we might be able to take advantage of:
the most important case to optimize is the "C" locale, i.e.,
numpunct<char>. numpunct_byname<charT> is secondary, and
numpunct<wchar_t> is a distant third (although it should be just
as easy, or as hard, to optimize as the char specialization).
And the values of all the members in the "C" locale are known
at compile time.
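
For instance, a hypothetical fast path for numpunct<char> could return
the classic-"C" values as constants, without ever touching the locale
data. The names below are made up; only the values come from the
standard:

    #include <string>

    // values the C++ standard mandates for numpunct<char> in the
    // classic "C" locale; a specialization could return them
    // directly instead of going through __rw_get_numpunct()
    struct c_numpunct_sketch {
        static char        decimal_point () { return '.'; }
        static char        thousands_sep () { return ','; } // unused
        static std::string grouping ()      { return ""; }  // empty
        static std::string truename ()      { return "true"; }
        static std::string falsename ()     { return "false"; }
    };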


1. SunPro:

real 2628.78
user 3020.97
sys 178.65

2. Intel:

real 2439.77
user 2712.62
sys 166.92
Tue Sep 11 16:32:28 EDT 2012

These numbers are for the full test with no arguments (i.e. 8 threads,
32 locales, 200000 iterations each). The SunPro build is not optimized
(-g -xO0 -xkeepframe=%all). The Intel build is optimized (-O2).

I'm afraid unoptimized timings don't tell us much. Neither does
a comparison between two compilers, even on the same OS.

I looked at Liviu's timings today. I was puzzled by the difference
between (1), which, IIUC, is the current implementation (presumably
an optimized, thread-safe build with the same compiler and OS), and
(4), which, again IIUC, is the equivalent of your latest patch here
(again, presumably optimized, thread-safe, same compiler/OS). I'm
having trouble envisioning how calling a virtual function to
retrieve the value of grouping can possibly be faster than not
calling it (and simply returning the value cached in the data
member of the facet).


Other things to take into consideration:

Implementing the private data member access hack for std::numpunct<T>
and std::moneypunct<T> in src/access.h will not reduce the mutex
locking in __rw_get_numpunct(). Whether the data is copied from the
__rw_punct_t struct directly into a private data member accessed via
the hack, or the private data member is initialized with the data
returned from __rw_get_numpunct(), makes no difference in cost: the
mutex still has to be locked while the copying takes place.

True. The hack simply avoids the deadlock. But other solutions
are possible (e.g., calling do_xxx() and storing the result in
a temporary, and then locking the member mutex and assigning to
the _C_xxx string data member).
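
A rough sketch of that alternative, using standard <mutex> purely for
illustration (the shipping facet has no mutex member, and stdcxx would
use its own synchronization primitives rather than these):

    #include <mutex>
    #include <string>

    struct facet_sketch {
        virtual ~facet_sketch () { }

        virtual std::string do_grouping () const {
            return "";                  // stand-in for the real virtual
        }

        std::string grouping () const {
            // call the virtual with no lock held, so a derived
            // override can re-enter the facet without deadlocking
            const std::string tmp = do_grouping ();

            // lock only around the assignment to the cached copy
            std::lock_guard<std::mutex> guard (_C_mutex);
            _C_grouping = tmp;
            return _C_grouping;
        }

        mutable std::mutex  _C_mutex;   // hypothetical member
        mutable std::string _C_grouping;
    };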

The other advantage of such a hack is that it hides the "ugly"
implementation details in the .cpp file. This makes it possible
to change the details in the future without affecting the ABI
(unlike changing locking in an inline function, which can have
an impact on binary compatibility).


This leaves us with considering the option of breaking the ABI for
the sake of optimization, and introducing a mutex in each of the
std::numpunct<T> and std::moneypunct<T> classes. I am personally not
in favor of breaking the ABI.
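
To spell out the break: adding a mutex member changes the facet's size
and shifts the offsets of the members that follow it, so code built
against the old layout mis-addresses the object. A sketch, with
hypothetical types and standard <mutex> used only for illustration:

    #include <mutex>
    #include <string>

    struct facet_v1 {                   // layout old binaries expect
        std::string _C_grouping;
        std::string _C_truename;
    };

    struct facet_v2 {                   // layout with the new mutex
        std::mutex  _C_mutex;           // changes sizeof and shifts
        std::string _C_grouping;        // every member that follows
        std::string _C_truename;
    };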

Breaking the ABI means revving the major version number (i.e.,
waiting to fix it until 5.0). Otherwise, it's simply not an
option for 4.2.x or 4.3 (as per the version policy).

Martin


--Stefan

