https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71744

--- Comment #23 from Gleb Natapov <gleb at scylladb dot com> ---
(In reply to Jakub Jelinek from comment #20)
> (In reply to Gleb Natapov from comment #19)
> > (In reply to Jakub Jelinek from comment #18)
> > > (In reply to Gleb Natapov from comment #16)
> > > > Can you suggest an alternative to libgcc patch? Use other TLS model?
> > > > Allocate per thread storage dynamically somehow?
> > > 
> > > If we want to use TLS (which I hope we don't), then e.g. a single __thread
> > > pointer with some registered destructor that would free it on process exit
> > > could do the job, and on the first exception it would try to allocate memory
> > > for the cache and other stuff and use that (otherwise, if memory allocation
> > > fails, just take a lock and be non-scalable).
> > >
> > I see that sjlj uses __gthread_setspecific/__gthread_getspecific. Can we do
> > the same here?
> 
> Can? Yes.  Want?  Nope.  It is worse than TLS.
Got it. If a __thread pointer is an acceptable solution, I will be happy to
implement it.
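
Roughly what I have in mind is sketched below; this is only a sketch, every
name in it is a placeholder I made up rather than an actual libgcc symbol, and
I am using a pthread_key destructor to reclaim the memory when a thread exits:

#include <pthread.h>
#include <stdlib.h>

/* Placeholder for whatever per-thread state the unwinder would cache.  */
struct fde_cache
{
  int dummy;
};

static __thread struct fde_cache *tls_fde_cache;
static pthread_key_t fde_cache_key;
static pthread_once_t fde_cache_once = PTHREAD_ONCE_INIT;

static void
fde_cache_destroy (void *p)
{
  free (p);
}

static void
fde_cache_key_init (void)
{
  pthread_key_create (&fde_cache_key, fde_cache_destroy);
}

/* Return this thread's cache, or NULL if allocation failed, in which
   case the caller falls back to the existing global-lock path.  */
static struct fde_cache *
get_fde_cache (void)
{
  if (tls_fde_cache == NULL)
    {
      pthread_once (&fde_cache_once, fde_cache_key_init);
      tls_fde_cache = calloc (1, sizeof (struct fde_cache));
      if (tls_fde_cache != NULL)
        /* Registering the pointer lets the key destructor free it when
           the thread exits.  */
        pthread_setspecific (fde_cache_key, tls_fde_cache);
    }
  return tls_fde_cache;
}

If the allocation fails we simply keep taking the global lock, so the worst
case is no worse than what we have today.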

> 
> > > Another alternative, perhaps much better, if Torvald is going to improve
> > > rwlocks sufficiently, would be to use rwlock to guard writes to the cache
> > > etc. too, and perhaps somewhat enlarge the cache (either statically, or
> > > allow extending it through allocation).
> > > I'd expect that usually these apps that use exceptions too much only care
> > > about a couple of shared libraries, so writes to the cache ought to be rare.
> > >
> > As I said in my previous reply, I tested the new rwlock and in the congested
> > case it still slows down the system significantly; it is not the
> > implementation's fault, the CPU just does not like locked instructions much.
> > Not having a lock will be significantly better.
> 
> You still need at least one lock, the array of locks is definitely a bad
> idea.
>
I am not sure I agree. 64 locks will take one page of memory, which is a
negligible amount nowadays, and we can drop the array entirely if compiling for
a single-threaded target. Breaking one big lock into many smaller ones is a
common technique to achieve scalability.

An alternative is to make the lock per thread; then the memory consumed is
proportional to the number of threads.
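
To make this concrete, what I mean by breaking the lock up is roughly the
classic big-reader pattern sketched below; every name is mine and the stripe
count is only an illustration, not a proposal for the actual layout:

#include <pthread.h>
#include <stdatomic.h>

#define NUM_LOCKS 64   /* 64 rwlocks is roughly one page on x86-64 */

static pthread_rwlock_t unwind_locks[NUM_LOCKS] = {
  [0 ... NUM_LOCKS - 1] = PTHREAD_RWLOCK_INITIALIZER  /* GCC range extension */
};

/* Each thread sticks to one stripe, so concurrent unwinds in different
   threads mostly touch different locks and cache lines.  */
static pthread_rwlock_t *
my_unwind_lock (void)
{
  static atomic_uint next_stripe;
  static __thread int my_stripe = -1;
  if (my_stripe < 0)
    my_stripe = atomic_fetch_add (&next_stripe, 1) % NUM_LOCKS;
  return &unwind_locks[my_stripe];
}

/* Readers (the unwinder) take only their own stripe.  */
static void unwind_rdlock (void) { pthread_rwlock_rdlock (my_unwind_lock ()); }
static void unwind_rdunlock (void) { pthread_rwlock_unlock (my_unwind_lock ()); }

/* Writers (dlopen/dlclose paths invalidating the cache) take every
   stripe, which is acceptable because those events are rare.  */
static void
unwind_wrlock_all (void)
{
  for (int i = 0; i < NUM_LOCKS; i++)
    pthread_rwlock_wrlock (&unwind_locks[i]);
}

static void
unwind_wrunlock_all (void)
{
  for (int i = NUM_LOCKS - 1; i >= 0; i--)
    pthread_rwlock_unlock (&unwind_locks[i]);
}

The per-thread variant is the same idea with the fixed array replaced by a
lock allocated per thread, trading a bounded footprint for memory proportional
to the thread count.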

> Perhaps if you are worried about using 2 different rwlocks, it would be
> possible to just use the glibc internal one, by adding dl_iterate_phdr
> alternate entrypoint - dl_iterate_phdr would then be documented to only
> allow a single thread in the callback, which it satisfies now and in newer
> libc could wrlock _dl_load_lock, and then dl_iterate_phdr alternate
> entrypoint would be documented to allow multiple threads in the callback
> (i.e. it could rdlock _dl_load_lock).  On the libgcc side then it would call
> dl_iterate_phdr_rd (or whatever name it would have) first, and perform only
> read-only lookup in the cache, and if it wouldn't find anything, it would
> call dl_iterate_phdr afterwards and tweak the cache.
>
Such an interface would make the new dl_iterate_phdr_rd libgcc-specific. Also,
scalability will depend on cache efficiency, so while a benchmark will show a
much better result, real applications will not benefit. Complex C++
applications tend to have deep call chains.
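
Just to be sure I read the proposal correctly, the calling pattern on the
libgcc side would be roughly the following, where dl_iterate_phdr_rd is the
hypothetical new entrypoint and the rest of the names are mine:

#define _GNU_SOURCE
#include <link.h>
#include <stddef.h>

/* Hypothetical alternate entrypoint; it does not exist in glibc today.
   Declared here only to keep the sketch self-contained.  */
extern int dl_iterate_phdr_rd (int (*cb) (struct dl_phdr_info *, size_t, void *),
                               void *data);

struct unwind_ctx
{
  void *pc;            /* address being unwound */
  const void *result;  /* FDE data once found   */
};

/* May run in many threads at once (rdlock on _dl_load_lock), so it is
   only allowed to read the shared cache.  */
static int
callback_readonly (struct dl_phdr_info *info, size_t size, void *data)
{
  struct unwind_ctx *ctx = data;
  /* ... probe the cache for ctx->pc, set ctx->result on a hit ... */
  return ctx->result != NULL;
}

/* Only one thread at a time in the callback (wrlock on _dl_load_lock),
   so it may also insert the answer into the cache.  */
static int
callback_readwrite (struct dl_phdr_info *info, size_t size, void *data)
{
  struct unwind_ctx *ctx = data;
  /* ... full search of the phdrs, then update the cache ... */
  return ctx->result != NULL;
}

static const void *
find_fde (void *pc)
{
  struct unwind_ctx ctx = { pc, NULL };

  /* Fast path: concurrent, read-only cache probe.  */
  dl_iterate_phdr_rd (callback_readonly, &ctx);
  if (ctx.result != NULL)
    return ctx.result;

  /* Slow path: exclusive, allowed to modify the cache.  */
  dl_iterate_phdr (callback_readwrite, &ctx);
  return ctx.result;
}

The whole benefit lives in the first probe hitting the cache, which is exactly
what I do not expect to happen often with deep call chains.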
