On Wed, Nov 25, 2009 at 7:48 AM, Amos Jeffries <squ...@treenet.co.nz> wrote:
> On Tue, 24 Nov 2009 16:13:37 -0700, Alex Rousskov
> <rouss...@measurement-factory.com> wrote:
>> On 11/20/2009 10:59 PM, Robert Collins wrote:
>>> On Tue, 2009-11-17 at 08:45 -0700, Alex Rousskov wrote:
>>>>>> Q1. What are the major areas or units of asynchronous code execution?
>>>>>> Some of us may prefer large areas such as "http_port acceptor" or
>>>>>> "cache" or "server side". Others may root for AsyncJob as the largest
>>>>>> asynchronous unit of execution. These two approaches and their
>>>>>> implications differ a lot. There may be other designs worth considering.
>>
>>> I'd like to let people start writing (and perf testing!) patches, to
>>> unblock people. I think the primary questions are:
>>> - Do we permit multiple approaches inside the same code base? E.g.
>>> OpenMP in some bits, pthreads / windows threads elsewhere, and 'job
>>> queues' or some such abstraction elsewhere?
>>> (I vote yes, but with caution: someone trying something we don't
>>> already do should keep it on a branch and really measure it well until
>>> it's got plenty of buy-in.)
>>
>> I vote for multiple approaches at lower levels of the architecture and
>> against multiple approaches at the highest level of the architecture. My
>> Q1 was only about the highest levels, BTW.
>>
>> For example, I do not think it is a good idea to allow a combination of
>> OpenMP, ACE, and something else as a top-level design. Understanding,
>> supporting, and tuning such a mix would be a nightmare, IMO.
>>
>> On the other hand, using threads within some disk storage schemes while
>> using processes for things like "cache" may make a lot of sense, and we
>> already have examples of some of that working.
>
> OpenMP seems an almost unanimous negative among the people who know it.
OK

>> This is why I believe that the decision of processes versus threads *at
>> the highest level* of the architecture is so important. Yes, we are,
>> can, and will use threads at lower levels. There is no argument there.
>> The question is whether we can also use threads to split Squid into
>> several instances of "major areas" like client side(s), cache(s), and
>> server side(s).
>>
>> See Henrik's email on why it is difficult to use threads at the highest
>> levels. I am not convinced yet, but I do see Henrik's point, and I
>> consider the dangers he cites critical for the right Q1 answer.
>>
>>> - If we do *not* permit multiple approaches, then what approach do we
>>> want for parallelisation? E.g. a number of long-lived threads that take
>>> on work, or many transient threads as particular bits of the code need
>>> threads. I favour the former (long-lived 'worker' threads).
>>
>> For highest-level models, I do not think that "one job per
>> thread/process", "one call per thread/process", or any other "one little
>> short-lived something per thread/process" is a good idea. I do believe
>> we have to parallelize "major areas", and I think we should support
>> multiple instances of some of those "areas" (e.g., multiple client
>> sides). Each "major area" would be a long-lived process/thread, of course.
>
> Agreed, mostly.
>
> As Rob points out, the idea is for one small'ish pathway of the code to
> be run N times, with different state data each time, by a single thread.
>
> Sachin's initial AcceptFD thread proposal would perhaps be the exemplar
> for this type of thread: one thread does the comm layer, from accept()
> through to the scheduling-call hand-off to handlers outside comm, then
> goes back for the next accept().
>
> The only performance issue brought up was by you: that this particular
> case might flood the slower main process if done first. Not all code can
> be done this way.
>
> Overheads are simply moving the state data in/out of the thread.
> IMO starting/stopping threads too often is a fairly bad idea. Most events
> will end up being grouped together into types (perhaps categorized by
> component, perhaps by client request, perhaps by pathway) with a small
> thread dedicated to handling each type of call.
>
>> Again for higher-level models, I am also skeptical that it is a good
>> idea to just split Squid into N mostly non-cooperating, nearly identical
>> instances. It may be the right first step, but I would like to offer
>> more than that in terms of overall performance and tunability.
>
> The answer to that is: of all the SMP models we theorize, that one is the
> only proven model so far.
> Administrators are already doing it on quad+ core machines, with all the
> instance management handled manually, and with a lot of performance
> success.
>
> In last night's discussion on IRC we covered what issues are outstanding
> in making this automatic, and all are resolvable except the cache index,
> which is not easily shareable between instances.
>
>> I hope the above explains why I consider Q1 critical for the meant
>> "highest level" scope and why "we already use processes and threads" is
>> certainly true but irrelevant within that scope.
>>
>> Thank you,
>>
>> Alex.
>
> Thank you for clarifying that. I now think we are all more or less headed
> in the same direction(s), with three models proposed for the overall
> architecture.
>
> In the order they were brought up... (NP: the TODO only applies if we
> work towards that goal)
>
> MODEL: * fully threaded, with some helper child processes
> PROS:
>   smaller memory resource footprint when running.
>
> CONS:
>   potentially larger CPU footprint swapping data between threads.
>   potential problems making threaded paths too small vs the overheads.
>
> TODO:
>   continue polishing the code into distinct calls
>   determine which code is thread-safe
>   determine shared data and add appropriate locking
>   make the above segments into threads.
>   add some way to pass events/calls to existing long-term threads:
>     either ... a super-lock as described by Henrik,
>     or ... a 2-queue alternative as described by Amos
>
>
> MODEL: * process chunks with sub-threads and sometimes helper child
> processes
> PROS:
>   it's known to be very fast, though not amazingly so. (ref: postfix)
>   (ref: squid helpers)
>
> CONS:
>   current code uses a LOT of data sharing between components,
>   particularly of small 1-32 byte chunks of random data (config flags,
>   stats, shared cache data snippets).
>   identifying distinct chunks is a big, time-consuming issue.
>
> TODO:
>   identify the major process chunks and split them out from the main
>   binary
>   add efficient ways to pass data cleanly between processes (at
>   capacity).
>   copy relevant external shared data into the state objects to pass
>   along with the request data
>   plus all the same TODOs from the fully-threaded model, for the
>   sub-threads within each process.
>
>
> MODEL: * separate instances with sub-threads and helper child processes
> PROS:
>   we can almost do the macro change today (sub-threads later).
>   it can scale the base app speed up a reasonable percentage (ref:
>   apache2)
>
> CONS:
>   duplication of data, particularly in the storage, is very wasteful of
>   resources.
>   NP: apache evades this with effectively read-only disk data; all
>   dynamics are in the instance memory.
>
> TODO:
>   the -I option needs porting so the master can open the main ports and
>   the children can share the listening.
>   finish the logging TCP module ideas (for reliable shared logging).
>   some code to make the master process handle multiple children.
>   some alterations to safely handle the shared config file settings
>   (cache_dir etc).
>
>
> MODEL: * status quo,
> where we continue to work on all the above TODOs as time permits and
> needs require, and wait and see which model gets finished first.
>
> PROS:
>   the way forward is already well known.
>
> CONS:
>   it's not fast enough at reaching multi-CPU usage.
>
>
> The easiest way forward seems to be toward separate instances, with
> finer-grained threading and/or process chunking being done later, after
> deeper analysis, for extra gains at each change.

AGREED....

> This makes me think that we are not in fact proposing competing models,
> but simply looking at different levels of the code. Each approach that
> has come up may best be used at a different level: upper (instances),
> middle (processes, threads, jobs), and low (signals, events, cbdata,
> async calls).
>
> It also seems to me that the top instances choice is the most easily
> reversed if it's found to actually be a bad idea, the major support
> change being in the parent main() code setting up several child
> instances. There are possibilities there for configuring it on/off, or
> how many instances.
>
> Amos

-- 
Mr. S. H. Malave
Computer Science & Engineering Department,
Walchand College of Engineering, Sangli.
sachinmal...@wce.org.in