Re: Fast sigblock (AKA rtld speedup)
On 1/14/13 11:06 AM, John Baldwin wrote:
> On Saturday, January 12, 2013 11:25:47 am Jilles Tjoelker wrote:
> > With that, I think fast sigblock is too much code and complication for a niche case.
>
> It does seem a bit complicated to me as well.
>
> > Most of the extra atomics in multi-threaded applications are conditional on __isthreaded (or can be made so); therefore, performance loss from linking in libthr should be negligible in most cases.
>
> Sadly, this is not true. libstdc++ turns on locking if you merely link against libthr, not based on testing __isthreaded. (It does this by testing to see if pthread_once() works during startup, and we have to intentionally sabotage the pthread_once() in libc to fail for this to work, which annoys me for an entirely different set of reasons.) At work we go to great lengths to avoid linking in libthr for exactly this reason (e.g. we have a custom port of boost that builds a separate set of boost libraries that are explicitly not linked against libthr), and we also care about exception performance (one of my co-workers submitted the PR about exception performance).

I get frustrated when people ask me "but why are you doing that?", but I have to know... why do we/you need fast exception handling? Are you throwing a high rate of exceptions? Or is it just that your application is so sensitive to exceptions being thrown that a single slowish one has an impact?

-Alfred

___
freebsd-toolchain@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-toolchain
To unsubscribe, send any mail to freebsd-toolchain-unsubscr...@freebsd.org
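For readers following along, here is a minimal sketch of the pattern Jilles describes above, where locking is conditional on a flag in the style of __isthreaded. It is only an illustration: is_threaded, mark_threaded(), counter_increment() and locked_increments are hypothetical names, and the real flag is internal to libc and is not touched here.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for libc's internal __isthreaded flag. */
static bool is_threaded = false;

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;
static long locked_increments = 0;	/* how often the lock was taken */

/* Increment the counter, taking the lock only once threads exist. */
static long
counter_increment(void)
{
	/* Sample the flag once so lock and unlock always pair up. */
	bool locked = is_threaded;
	long value;

	if (locked) {
		pthread_mutex_lock(&counter_lock);
		locked_increments++;
	}
	value = ++counter;
	if (locked)
		pthread_mutex_unlock(&counter_lock);
	return (value);
}

/* Assumed to be called by the thread library before the first thread runs. */
static void
mark_threaded(void)
{
	is_threaded = true;
}
```

In the single-threaded case the only cost is the branch on the flag. John's objection is that libstdc++ does not decide this way, so the sketch shows the libc-style check only, not what libstdc++ does.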
Re: Fast sigblock (AKA rtld speedup)
On 14 Jan 2013, at 17:47, Jilles Tjoelker wrote:
> The code which does that check is actually under contrib/gcc. Problem is, they designed __gthread_active_p() to distinguish threaded and unthreaded programming environments -- it must be known in advance and cannot be changed later. The code for the unthreaded environment then takes advantage of this by not even allocating memory for mutexes in some cases.

It's worth taking a step back and asking why this code exists at all, and the main reason is that acquiring a mutex used to be really expensive. It still is on some fruit-flavoured operating systems, but elsewhere it's a single atomic operation in the uncontended case, and in that case the cache line will already be exclusively owned by the calling core in single-threaded code. I would much rather that we followed the example of Solaris and made the multithreaded case fast and the default than keep piling on hacks that allow code to shave off a few clock cycles in the single-threaded case. In particular, the popularity of multicore systems means that it is increasingly rare for code to be both single threaded and performance critical, so this seems like misplaced optimisation.

I strongly suspect that making it possible to inline the uncontended lock case for a pthread mutex and eliminating all of the branches on __isthreaded would give us a net speedup in both single and multithreaded cases.

> This __gthread_active_p() thing is another barrier to bringing in a threaded plugin in an unthreaded application.

Ports people spend a fair amount of time adding -pthread flags to things (such as perl) to work around this. This and the similar checks in libc cause a lot of pain, and it seems that the correct fix is ensuring that the performance penalty for linking libthr is so small that there is no point in avoiding it.

David
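To make the "single atomic operation in the uncontended case" point concrete, here is a rough sketch using C11 atomics. It is not libthr's implementation: struct fast_lock, fast_lock_try() and fast_lock_release() are hypothetical names, and the contended slow path (sleeping in the kernel) is deliberately left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock word: 0 = unlocked, 1 = locked. */
struct fast_lock {
	atomic_int state;
};

/* Fast path: one compare-and-swap; returns true if the lock was acquired. */
static inline bool
fast_lock_try(struct fast_lock *l)
{
	int expected = 0;

	return (atomic_compare_exchange_strong_explicit(&l->state, &expected,
	    1, memory_order_acquire, memory_order_relaxed));
}

/* Release: one store with release ordering. */
static inline void
fast_lock_release(struct fast_lock *l)
{
	atomic_store_explicit(&l->state, 0, memory_order_release);
}
```

Because both functions are static inline and the struct layout is visible, a compiler can fold the fast path into the caller. Whether that yields a net speedup in practice is exactly what is being debated in this thread.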
Re: LLVM Image Activator
On Sun, Jan 13, 2013 at 12:24:35PM -0800, John-Mark Gurney wrote:
> Nathan Whitehorn wrote this message on Sun, Jan 13, 2013 at 10:14 -0800:
> > On 01/13/13 09:13, Konstantin Belousov wrote:
> > > On Sun, Jan 13, 2013 at 08:21:37AM -0800, Nathan Whitehorn wrote:
> > > > On 01/13/13 05:20, Konstantin Belousov wrote:
> > > > > On Sun, Jan 13, 2013 at 12:41:09PM +0100, Ed Schouten wrote:
> > > > > > Hi Kostik,
> > > > > >
> > > > > > 2013/1/7 Konstantin Belousov kostik...@gmail.com:
> > > > > > > I still do remember the buzz about the binary format 0xCAFEBABE, which AFAIR gained image activator support on several OSes, to be garbage collected.
> > > > > >
> > > > > > Maybe it would then be a good idea then to add some kind of general purpose remapping imgact? Example:
> > > > > >
> > > > > > /etc/imgacttab:
> > > > > > cafebabe /usr/local/bin/java
> > > > > > cffaedfe /usr/local/bin/osx_emulator
> > > > > > 4243c0de /usr/bin/lli
> > > > > >
> > > > > > That way we still give people the freedom to play around with mapping their own executable formats, but don't need to maintain a bunch of imgacts.
> > > > >
> > > > > A generic module that could be somewhat customized at runtime to map offset+signature into the shebang path could be a possibility indeed. I strongly prefer to have it as module and not enabled by default. Asking Nathan for writing the thing is too much, IMHO, esp. in the response to the 50-lines hack.
> > > >
> > > > I think this is a good idea, since it both prevents a profusion of similar activators and works nicely in jails and similar environments. I probably won't write it quickly, but it should not take more than about 50 lines, so I can't imagine it will be that bad. There are some complications with this kind of design from the things in the XXX comment in imgact_llvm.c about handling argv[0] that I need to think some more about.
> > >
> > > Great. I do not believe in the 50 lines, but I am happy that you want to work this out.
> > >
> > > > Why are you opposed to having it there by default? I think it's actually quite important that it be there by default. Having it not standard would be fine, but it should at least be in GENERIC. There are minimal security risks since it just munges begin_argv and doesn't even load the executable and it's little enough code that there should not be any kernel bloat to speak of. If things like this aren't enabled by default, no one can depend on them being there, no one will use it, and the point is entirely lost.
> > >
> > > All image activators demonstrated a constant stream of security holes. Even our ELF activator, and I was guilty there too. I definitely do not fight over the inclusion of the proposed activator into GENERIC, but do insist on the config option + module.
> >
> > OK, that sounds like a plan then. I'll try to code up something configurable in the next couple weeks, unless someone else beats me to it.
>
> I'll point out that file already has the magic (pun intended) that we are looking for, though I do realize that the code might be a bit much to import..

As someone who recently stuffed libmagic into a very constrained sandbox environment, I can safely assert that you don't want to go there. The code isn't written in a way that would make this easy and I definitely wouldn't want it in the kernel.

-- Brooks
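The remapping table Ed proposes earlier in this thread could look roughly like the following user-space sketch. It is only an illustration under stated assumptions: struct imgact_map, imgact_table and imgact_lookup() are hypothetical names, the table is compiled in rather than read from /etc/imgacttab, and every signature is assumed to be 4 bytes at offset 0 (the offset+signature generalisation Konstantin mentions is not handled).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One row of the hypothetical table: magic bytes and interpreter path. */
struct imgact_map {
	uint8_t magic[4];
	const char *interp;
};

/* Entries taken from the /etc/imgacttab example in the thread. */
static const struct imgact_map imgact_table[] = {
	{ { 0xca, 0xfe, 0xba, 0xbe }, "/usr/local/bin/java" },
	{ { 0xcf, 0xfa, 0xed, 0xfe }, "/usr/local/bin/osx_emulator" },
	{ { 0x42, 0x43, 0xc0, 0xde }, "/usr/bin/lli" },
};

/* Return the interpreter for a file header, or NULL if nothing matches. */
static const char *
imgact_lookup(const uint8_t *hdr, size_t len)
{
	size_t i;

	/* Too short to contain a full signature. */
	if (len < sizeof(imgact_table[0].magic))
		return (NULL);
	for (i = 0; i < sizeof(imgact_table) / sizeof(imgact_table[0]); i++) {
		if (memcmp(hdr, imgact_table[i].magic,
		    sizeof(imgact_table[i].magic)) == 0)
			return (imgact_table[i].interp);
	}
	return (NULL);
}
```

A kernel version would then rewrite the argument vector the way the shebang activator does, which is where the argv[0] complications Nathan mentions come in. That part is not sketched here.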
Re: Fast sigblock (AKA rtld speedup)
On Monday, January 14, 2013 1:24:04 pm David Chisnall wrote:
> On 14 Jan 2013, at 17:47, Jilles Tjoelker wrote:
> > The code which does that check is actually under contrib/gcc. Problem is, they designed __gthread_active_p() to distinguish threaded and unthreaded programming environments -- it must be known in advance and cannot be changed later. The code for the unthreaded environment then takes advantage of this by not even allocating memory for mutexes in some cases.
>
> It's worth taking a step back and asking why this code exists at all, and the main reason is that acquiring a mutex used to be really expensive. It still is on some fruit-flavoured operating systems, but elsewhere it's a single atomic operation in the uncontended case, and in that case the cache line will already be exclusively owned by the calling core in single-threaded code. I would much rather that we followed the example of Solaris and made the multithreaded case fast and the default than keep piling on hacks that allow code to shave off a few clock cycles in the single-threaded case. In particular, the popularity of multicore systems means that it is increasingly rare for code to be both single threaded and performance critical, so this seems like misplaced optimisation.

We have single-threaded performance critical applications that run on multicore systems (we just run several copies) and if we link in libthr, then pthread_mutex operations (even on uncontested locks) show up as one of the top consumers of CPU time when we profile our applications.

> I strongly suspect that making it possible to inline the uncontended lock case for a pthread mutex and eliminating all of the branches on __isthreaded would give us a net speedup in both single and multithreaded cases.

I'm less certain. Note that you can't inline mutex ops until you expose the mutexes themselves to userland (that is, making pthread_mutex_t not be opaque).
-- John Baldwin