On Tue, Mar 16, 2010 at 12:15 AM, James Mansion
<ja...@mansionfamily.plus.com> wrote:
> Brandon Black wrote:
>>
>> However, the thread model as typically used scales poorly across
>> multiple CPUs as compared to distinct processes, especially as one
>> scales up from simple SMP to the ccNUMA style we're seeing with
>>
>
> This is just not true.
I think it is, "as typically used".  Again, you can write threaded
software that doesn't use mutexes and doesn't write to the same memory
from two threads, but then you're not really using much of what threads
offer.  One of my major projects right now is structured that way,
actually, so I certainly see your argument and use it :).  However,
using threads this way isn't much different from using processes.  The
advantage of threads over processes in that case is that you don't have
to go around making SysV shm or mmap(MAP_SHARED) areas for the data you
intend to share, and certain other aspects of the OS interface (like how
you cleanly start and stop the groups of processes, how you handle
signals, etc.) are simplified.  With either model you can still take
advantage of pthread mutexes, condition variables, and so on (see
pthread_mutexattr_setpshared(), etc.) once you've established your
explicitly-shared memory regions, if those mechanisms make the most
sense for you.  That's really the crux: explicit sharing where
necessary, versus implicitly sharing everything and relying on
programmer discipline to avoid Bad Things.

>> large-core-count Opteron and Xeon -based machines these days.  This
>> is mostly because of memory access and data caching issues, not
>> because of context switching.  The threads thrash on caching memory
>> that they're both writing to (and/or contend on locks, it's related),
>> and some of the threads are running on a different NUMA node than
>> where the data is (in some cases this is very pathological,
>> especially if you haven't had each thread allocate its own memory
>> with a smart malloc).
>>
>
> This affects processes too.  The key aspect is how much data is
> actively shared between the different threads *and is changing*.  The
> heap is a key point of contention but thread caching heap managers are
> a big help - and the process model only works when you don't need to
> share the data.

We agree on this point up until your last statement.
The process model shares data just as well as the thread model; you
just have to explicitly state what is shared.

> Large core count Opteron and Xeon chips are *reducing* the NUMA
> effects with modest numbers of threads because the number of physical
> sockets is going down and the integration on a socket is higher, and
> when you communicate between your coprocesses you'll have the same
> data copying - it's not free.
>
> Sure - scaling with threads on a 48 core system is a challenge.
> Scaling on a glueless 8 core system (or on a share of a host with
> more cores) is more relevant to what most of us do though.
>
> Clock speed isn't going anywhere at the moment, but core count is -
> and so is the available LAN performance.

I think I see this very differently.  We agree that clock speeds are
going nowhere, leading to higher core counts per die.  Where we're at
now is that a cheap-ish 1U server may well have 2 NUMA nodes with 2-4
cores each, and both the core count per NUMA node and the total number
of NUMA nodes are likely to increase going forward.  NUMA is going to
become a single-server scaling problem, even for small servers.  It's
the only way a small server that costs a fixed $X is going to continue
to scale up in performance over the years now.  Either vendors will
increase the core count slightly and also increase the count of
(smaller) dies in a small system, with each die being a NUMA node, or
they'll stuff 16+ cores in a die and try to stick with just 2 dies in
the system, in which case there will be some NUMA hierarchy inside each
die.  Either way, there's going to be more NUMA involved as we add more
cores to a single system.  We went down the 16 (or more) -way
uniform-memory-access SMP road years ago; it simply doesn't scale.

[... skipped some of the rest, relatively minor points and this debate
is getting long ...]

> are static.  That's a long way different from sharing only a small
> flat memory resource and some pipe IPC.
It's more convenient and can be a lot faster.  Processes *can* do
everything threads do, even the exact same forms of memory sharing and
IPC, with the same efficiencies.  The pthread API itself is usable
across process boundaries, or you can do other equivalent things.

[...]

> Let me ask you - how do you think memcached should have scaled past
> their original single-thread performance?  libevent isn't massively
> slow with modest numbers of connections, and there's a lot of shared
> state.  And that's even before you consider running a secure or
> authenticated connection.

There are two levels of performance problem we're talking about in
general.  One is scaling up on a single core, which means avoiding
blockage.  The other is scaling a good single-CPU solution over
multiple cores.  Threads and event libraries are two ways of
accomplishing the first, and threads and processes are two ways of
accomplishing the second.  Therefore on a multi-CPU system, your
aggregate choices are:

1) process-per-core with an eventloop per process: arguably the most
   efficient
2) process-per-core with threads within each process: slightly less
   efficient
3) thread-per-core with an eventloop per thread: can be nearly as
   efficient as (1) if very carefully designed such that the threads
   greatly resemble the behavior of processes, but is more prone to
   difficult-to-debug bugs and race conditions.
4) all threads: at this point thread count is pretty much orthogonal
   to core count, as you're using them to solve both problems.

So if we're talking about scaling up memcached (or anything of similar
architecture) on a single CPU core, an event library like libev would
have been best.  The next best thing on a single core would've been
threads.  Over multiple cores, you clearly need multiple threads of
execution, and those threads of execution could be either processes or
threads.
With threads, everything would be shared by default, and it would be
very easy for the developers to make mistakes that led to race
conditions, rare bugs, and poor performance.  With processes, you would
have explicitly shared only the pool of memory containing the memcached
data, and the chances of threads-related bugs/slowdowns involving all
the other heap data in memcached would be significantly reduced.
Within the process-per-core you've created, you'd use either events
(preferred) or threads to avoid blocking up on that one core.  Either
model can perform just as well and has the same capabilities for shared
data and synchronization.

-- Brandon
_______________________________________________
libev mailing list
libev@lists.schmorp.de
http://lists.schmorp.de/cgi-bin/mailman/listinfo/libev