Re: Fast sigblock (AKA rtld speedup)

2013-01-14 Thread Alfred Perlstein

On 1/14/13 11:06 AM, John Baldwin wrote:
> On Saturday, January 12, 2013 11:25:47 am Jilles Tjoelker wrote:
>> With that, I think fast sigblock is too much code and complication for a
>> niche case.
>
> It does seem a bit complicated to me as well.
>
>> Most of the extra atomics in multi-threaded applications are conditional
>> on __isthreaded (or can be made so); therefore, performance loss from
>> linking in libthr should be negligible in most cases.
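
To make the "conditional on __isthreaded" point concrete, here is a minimal sketch of a lock wrapper that skips the atomic operations entirely while the process is single-threaded. The names demo_isthreaded, demo_atomic_ops and demo_lock_* are hypothetical stand-ins for illustration, not the real libc symbols, and the counter exists only to make the effect of the branch visible.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for libc's __isthreaded flag (assumption: it is
 * 0 until the first additional thread is created, then stays 1). */
static int demo_isthreaded = 0;

/* Counts atomic operations actually issued, for illustration only. */
static unsigned long demo_atomic_ops = 0;

static atomic_flag demo_lock = ATOMIC_FLAG_INIT;

static void demo_lock_acquire(void)
{
    if (!demo_isthreaded)
        return;                 /* single-threaded: no atomic at all */
    do {
        demo_atomic_ops++;      /* one atomic test-and-set per attempt */
    } while (atomic_flag_test_and_set_explicit(&demo_lock,
                                               memory_order_acquire));
}

static void demo_lock_release(void)
{
    if (!demo_isthreaded)
        return;
    demo_atomic_ops++;          /* one atomic store to drop the lock */
    atomic_flag_clear_explicit(&demo_lock, memory_order_release);
}
```

With demo_isthreaded left at 0, an acquire/release pair issues no atomics; once it is set to 1, the same pair costs two. The price is one extra branch on every call, which is the trade-off being argued about in this thread.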

> Sadly, this is not true.  libstdc++ turns on locking if you merely link
> against libthr, not based on testing __isthreaded.  (It does this by testing
> to see if pthread_once() works during startup, and we have to intentionally
> sabotage the pthread_once() in libc to fail for this to work which annoys me
> for an entirely different set of reasons.)
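
The startup probe described above can be sketched roughly as follows. This is a simplified illustration of the idea (call pthread_once() once and see whether the callback actually ran), not the code under contrib/gcc; the probe_* names are hypothetical, and unlike a real implementation this version makes no attempt to be safe against concurrent first calls.

```c
#include <stddef.h>
#include <pthread.h>

/* Weak reference: the address is NULL if nothing provides pthread_once. */
#pragma weak pthread_once

static volatile int probe_active = -1;   /* -1 means "not probed yet" */

/* Runs only if pthread_once() is a working implementation. */
static void probe_trigger(void)
{
    probe_active = 1;
}

/* Returns 1 if a working pthread_once() was found, otherwise 0.
 * The answer is cached after the first call and never changes. */
static int probe_threads_active(void)
{
    static pthread_once_t once = PTHREAD_ONCE_INIT;

    if (probe_active < 0) {
        if (pthread_once != NULL)
            pthread_once(&once, probe_trigger);
        if (probe_active < 0)
            probe_active = 0;   /* stub or missing: treat as unthreaded */
    }
    return probe_active != 0;
}
```

Because the result is latched on first use, a stub pthread_once() in libc that silently does nothing makes the program look unthreaded for its whole lifetime, while merely linking a real thread library flips the answer to "threaded" even if no second thread is ever created.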

> At work we go to great lengths to avoid linking in libthr for exactly this
> reason (e.g. we have a custom port of boost that builds a separate set of
> boost libraries that are explicitly not linked against libthr), and we also
> care about exception performance (one of my co-workers submitted the PR about
> exception performance).

I get frustrated when people ask me "but why are you doing that?", but I
have to know... why do we/you need fast exception handling?

Are you throwing exceptions at a high rate?  Or is your application so
sensitive to exceptions that even a single slowish one has an impact?


-Alfred
___
freebsd-toolchain@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-toolchain
To unsubscribe, send any mail to freebsd-toolchain-unsubscr...@freebsd.org


Re: Fast sigblock (AKA rtld speedup)

2013-01-14 Thread David Chisnall
On 14 Jan 2013, at 17:47, Jilles Tjoelker wrote:

> The code which does that check is actually under contrib/gcc. Problem
> is, they designed __gthread_active_p() to distinguish threaded and
> unthreaded programming environments -- it must be known in advance and
> cannot be changed later. The code for the unthreaded environment then
> takes advantage of this by not even allocating memory for mutexes in
> some cases.

It's worth taking a step back and asking why this code exists at all, and the 
main reason is that acquiring a mutex used to be really expensive.  It still is 
on some fruit-flavoured operating systems, but elsewhere it's a single atomic 
operation in the uncontended case, and in that case the cache line will already 
be exclusively owned by the calling core in single-threaded code.  

I would much rather that we followed the example of Solaris and made the 
multithreaded case fast and the default than keep piling on hacks that allow 
code to shave off a few clock cycles in the single-threaded case.  In 
particular, the popularity of multicore systems means that it is increasingly 
rare for code to be both single threaded and performance critical, so this 
seems like misplaced optimisation.

I strongly suspect that making it possible to inline the uncontended lock case 
for a pthread mutex and eliminating all of the branches on __isthreaded would 
give us a net speedup in both single and multithreaded cases.
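
As a rough illustration of what an inlined uncontended path could look like, here is a sketch using C11 atomics. Everything in it is an assumption made for the example: struct demo_mutex is a hypothetical non-opaque lock type, the caller passes its own non-zero id, and the slow path simply spins where a real implementation would sleep in the kernel. It is not libthr code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical non-opaque mutex: the lock word is visible to callers,
 * which is what allows the fast path to be inlined at the call site. */
struct demo_mutex {
    atomic_uint owner;          /* 0 = unlocked, otherwise holder's id */
};

/* Uncontended case: a single compare-and-swap, no function call. */
static inline bool demo_mutex_trylock(struct demo_mutex *m, unsigned int self)
{
    unsigned int expected = 0;
    return atomic_compare_exchange_strong_explicit(&m->owner, &expected,
        self, memory_order_acquire, memory_order_relaxed);
}

/* Contended case, kept out of line. This sketch just spins. */
static void demo_mutex_lock_slow(struct demo_mutex *m, unsigned int self)
{
    unsigned int expected;
    do {
        expected = 0;           /* CAS overwrites it on failure */
    } while (!atomic_compare_exchange_weak_explicit(&m->owner, &expected,
        self, memory_order_acquire, memory_order_relaxed));
}

static inline void demo_mutex_lock(struct demo_mutex *m, unsigned int self)
{
    if (!demo_mutex_trylock(m, self))
        demo_mutex_lock_slow(m, self);
}

static inline void demo_mutex_unlock(struct demo_mutex *m)
{
    atomic_store_explicit(&m->owner, 0, memory_order_release);
}
```

In single-threaded code the compare-and-swap in demo_mutex_trylock() always succeeds, so the slow path is never entered. Whether that is actually cheaper than a branch on __isthreaded would have to be measured.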

> This __gthread_active_p() thing is another barrier to bringing in a
> threaded plugin in an unthreaded application. Ports people spend a fair
> amount of time adding -pthread flags to things (such as perl) to work
> around this.


This and the similar checks in libc cause a lot of pain, and it seems that the 
correct fix is ensuring that the performance penalty for linking libthr is so 
small that there is no point in avoiding it.

David





Re: LLVM Image Activator

2013-01-14 Thread Brooks Davis
On Sun, Jan 13, 2013 at 12:24:35PM -0800, John-Mark Gurney wrote:
 Nathan Whitehorn wrote this message on Sun, Jan 13, 2013 at 10:14 -0800:
  On 01/13/13 09:13, Konstantin Belousov wrote:
   On Sun, Jan 13, 2013 at 08:21:37AM -0800, Nathan Whitehorn wrote:
   On 01/13/13 05:20, Konstantin Belousov wrote:
   On Sun, Jan 13, 2013 at 12:41:09PM +0100, Ed Schouten wrote:
   Hi Kostik,
  
   2013/1/7 Konstantin Belousov kostik...@gmail.com:
   I still do remember the buzz about the binary format 0xCAFEBABE, which
   AFAIR gained image activator support on several OSes, to be garbage
   collected.
  
   Maybe it would then be a good idea to add some kind of general
   purpose remapping imgact? Example:
  
   /etc/imgacttab:
  
   cafebabe /usr/local/bin/java
   cffaedfe /usr/local/bin/osx_emulator
   4243c0de /usr/bin/lli
  
   That way we still give people the freedom to play around with mapping
   their own executable formats, but don't need to maintain a bunch of
   imgacts.
  
   A generic module that could be somewhat customized at runtime to map
   offset+signature into the shebang path could be a possibility indeed.
   I strongly prefer to have it as module and not enabled by default.
  
   Asking Nathan for writing the thing is too much, IMHO, esp. in
   the response to the 50-lines hack.
  
  
   I think this is a good idea, since it both prevents a profusion of
   similar activators and works nicely in jails and similar environments. I
   probably won't write it quickly, but it should not take more than about
   50 lines, so I can't imagine it will be that bad. There are some
   complications with this kind of design from the things in the XXX
   comment in imgact_llvm.c about handling argv[0] that I need to think
   some more about.
   Great. I do not believe in the 50 lines, but I am happy that you want
   to work this out.
   
  
   Why are you opposed to having it there by default? I think it's actually
   quite important that it be there by default. Having it not standard
   would be fine, but it should at least be in GENERIC. There are minimal
   security risks since it just munges begin_argv and doesn't even load the
   executable and it's little enough code that there should not be any
   kernel bloat to speak of. If things like this aren't enabled by default,
   no one can depend on them being there, no one will use it, and the point
   is entirely lost.
   All image activators demonstrated a constant stream of security holes.
   Even our ELF activator, and I was guilty there too.
   
   I definitely do not fight over the inclusion of the proposed activator
   into GENERIC, but do insist on the config option + module.
   
  
  OK, that sounds like a plan then. I'll try to code up something
  configurable in the next couple weeks, unless someone else beats me to it.
 
 I'll point out that file already has the magic (pun intended) that we
 are looking for, though I do realize that the code might be a bit much
 to import..

As someone who recently stuffed libmagic into a very constrained sandbox
environment, I can safely assert that you don't want to go there.  The
code isn't written in a way that would make this easy and I definitely
wouldn't want it in the kernel.
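
For comparison with libmagic, the table-driven remapping proposed earlier in the thread needs very little logic. The following is a userland sketch only, under stated assumptions: the table mirrors the example /etc/imgacttab entries quoted above, each hex string is taken to be the file's leading bytes in file order, matching is at offset 0 only, and all demo_* names are hypothetical. It is not kernel code and ignores the argv[0] handling issue raised above.

```c
#include <stddef.h>
#include <string.h>

struct demo_imgact_entry {
    const char *magic_hex;      /* leading file bytes as hex digits */
    const char *interp;         /* interpreter to hand the file to */
};

/* Entries copied from the example table quoted in this thread. */
static const struct demo_imgact_entry demo_imgacttab[] = {
    { "cafebabe", "/usr/local/bin/java" },
    { "cffaedfe", "/usr/local/bin/osx_emulator" },
    { "4243c0de", "/usr/bin/lli" },
};

/* Convert one hex digit to its value, or -1 if it is not a hex digit. */
static int demo_hexval(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Return the interpreter whose magic matches the start of hdr, or NULL. */
static const char *demo_imgact_lookup(const unsigned char *hdr, size_t len)
{
    size_t n = sizeof(demo_imgacttab) / sizeof(demo_imgacttab[0]);

    for (size_t i = 0; i < n; i++) {
        const char *hex = demo_imgacttab[i].magic_hex;
        size_t nbytes = strlen(hex) / 2;
        size_t j;

        if (nbytes == 0 || nbytes > len)
            continue;           /* header too short for this entry */
        for (j = 0; j < nbytes; j++) {
            int hi = demo_hexval(hex[2 * j]);
            int lo = demo_hexval(hex[2 * j + 1]);
            if (hi < 0 || lo < 0 ||
                hdr[j] != (unsigned char)(hi * 16 + lo))
                break;
        }
        if (j == nbytes)
            return demo_imgacttab[i].interp;
    }
    return NULL;
}
```

A header beginning with the bytes 42 43 c0 de would map to /usr/bin/lli here, and anything unlisted falls through to NULL so the next activator can try.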

-- Brooks




Re: Fast sigblock (AKA rtld speedup)

2013-01-14 Thread John Baldwin
On Monday, January 14, 2013 1:24:04 pm David Chisnall wrote:
> On 14 Jan 2013, at 17:47, Jilles Tjoelker wrote:
>
>> The code which does that check is actually under contrib/gcc. Problem
>> is, they designed __gthread_active_p() to distinguish threaded and
>> unthreaded programming environments -- it must be known in advance and
>> cannot be changed later. The code for the unthreaded environment then
>> takes advantage of this by not even allocating memory for mutexes in
>> some cases.
>
> It's worth taking a step back and asking why this code exists at all, and
> the main reason is that acquiring a mutex used to be really expensive.  It
> still is on some fruit-flavoured operating systems, but elsewhere it's a
> single atomic operation in the uncontended case, and in that case the cache
> line will already be exclusively owned by the calling core in
> single-threaded code.
>
> I would much rather that we followed the example of Solaris and made the
> multithreaded case fast and the default than keep piling on hacks that allow
> code to shave off a few clock cycles in the single-threaded case.  In
> particular, the popularity of multicore systems means that it is increasingly
> rare for code to be both single threaded and performance critical, so this
> seems like misplaced optimisation.

We have single-threaded, performance-critical applications that run on
multicore systems (we just run several copies), and if we link in libthr,
pthread_mutex operations (even on uncontended locks) show up among the top
consumers of CPU time when we profile our applications.

> I strongly suspect that making it possible to inline the uncontended lock
> case for a pthread mutex and eliminating all of the branches on __isthreaded
> would give us a net speedup in both single and multithreaded cases.

I'm less certain.  Note that you can't inline mutex ops until you expose
the mutexes themselves to userland (that is, making pthread_mutex_t not
be opaque).

-- 
John Baldwin