Well, perhaps it would be a good idea to add a distccd flag or environment variable to control the queue length rather than hard-coding 10 or 256?

On 31 Oct 2014 11:37, "Łukasz Tasz" <luk...@tasz.eu> wrote:
> Hi guys,
>
> I'm very, very happy: the reason for my failures has been identified.
> The issue is in:
> --- src/srvnet.c (revision 177)
> +++ src/srvnet.c (working copy)
> @@ -99,7 +99,7 @@
>      rs_log_info("listening on %s", sa_buf ? sa_buf : "UNKNOWN");
>      free(sa_buf);
>
> -    if (listen(fd, 10)) {
> +    if (listen(fd, 256)) {
>          rs_log_error("listen failed: %s", strerror(errno));
>          close(fd);
>          return EXIT_BIND_FAILED;
> Index: src/io.c
>
> The queue for new connections was limited to 10, which is why many
> connections are reset when the cluster is overloaded.
> The aim is to wait even 5 minutes for the cluster to become available,
> and only then compile locally.
>
> @Jarek, thanks for the support!
>
> Let's discuss whether we should fix it or not.
>
> regards
> Lukasz
>
>
> Łukasz Tasz
>
>
> 2014-10-24 10:27 GMT+02:00 Łukasz Tasz <luk...@tasz.eu>:
> > Hi Martin
> >
> > Here is what I have noticed.
> > The client tries to connect to distccd 3 times, with 500 ms delays in
> > between.
> > By default the Linux kernel queues up to 128 pending connections.
> > If the client creates a connection, even when no executors are available,
> > the connection is accepted and queued by the kernel running distccd.
> > This leads to a situation where the client thinks that distccd is
> > reserved, but in fact the connection is still waiting to be accepted by
> > the distccd server.
> > I suspect that the client then starts communicating too early, distcc
> > won't receive the DIST token, and both sides wait. The communication is
> > broken and only the timeouts end it: the client applies its default
> > timeout, while the server has no default.
> >
> > The failure scenario is:
> > one distccd and two distcc users, both of whom try to compile
> > with DISTCC_HOSTS=distccd/1,cpp,lzo. Both users have a lot of big
> > objects, so the cluster is overloaded by a factor of 2.
> > It should still be OK when a third and a fourth user join the cluster.
> >
> > An easy reproducer is to set up a single distccd and set
> > DISTCC_HOSTS=distccd/20. This is a broken configuration, but it simulates
> > an overload by a factor of 20, i.e. 20 developers using the cluster at
> > the same time.
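For anyone following along, the client behaviour described above (three connection attempts, 500 ms apart) can be sketched roughly like this. It is only an illustration in C, not code taken from distcc: `retry_with_delay`, `attempt_fn` and `fail_n_times` are names I made up, and the try count and delay are plain parameters.

```c
#define _POSIX_C_SOURCE 200809L  /* for nanosleep() under strict -std= modes */

#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: returns 0 on success, non-zero on failure. */
typedef int (*attempt_fn)(void *ctx);

/* Sleep for the given number of milliseconds, resuming after signals. */
static void sleep_ms(long ms)
{
    struct timespec ts;

    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000L;
    while (nanosleep(&ts, &ts) != 0 && errno == EINTR)
        ;
}

/*
 * Call attempt() up to max_tries times, sleeping delay_ms after each
 * failed try except the last. Returns the number of tries used on
 * success, or -1 if every try failed.
 */
int retry_with_delay(attempt_fn attempt, void *ctx, int max_tries, long delay_ms)
{
    int i;

    assert(attempt != NULL);
    for (i = 1; i <= max_tries; i++) {
        if (attempt(ctx) == 0)
            return i;
        if (i < max_tries)
            sleep_ms(delay_ms);
    }
    return -1;
}

/* Demo attempt (hypothetical): fails until the counter in ctx reaches zero. */
int fail_n_times(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return -1;
    }
    return 0;
}
```

With `max_tries = 3` and `delay_ms = 500` the caller gives up after only about one second of waiting, which is far short of the 5 minutes mentioned above. That gap is why a short burst of saturation turns into a local fallback.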
> > Please remember that these are exceptional situations, but a developer
> > can start a compilation with -j 1000 from his laptop; the cluster will
> > time out, and then receiving 1000 jobs on the laptop will end with the
> > out-of-memory killer :D
> > These are exceptional situations, and somehow the cluster should handle
> > them.
> >
> > In the attachment, next to some pump changes, you can find a change
> > which moves connection setup to the very beginning: when distcc picks
> > a host, the remote connection is made as well. If that fails, distcc
> > follows the default behaviour, sleeps for one second, and picks a host
> > again. But this requires an additional administrative change on the
> > distccd machine:
> >   iptables -I INPUT -p tcp --dport 3632 -m connlimit --connlimit-above <NUMBER OF DISTCCD> --connlimit-mask 0 -j REJECT --reject-with tcp-reset
> > which accepts only as many connections as there are executors.
> >
> > So far so good!
> > A remark: the patch is made on top of arankine_distcc_issue16-r335, since
> > his pump changes make pump mode work in my environment.
> > I also tested the distccd allocation on the latest official distcc
> > release.
> >
> > Let me know what you think!
> >
> > with best regards
> > Lukasz
> >
> >
> >
> > Łukasz Tasz
> >
> >
> > 2014-10-24 2:42 GMT+02:00 Martin Pool <m...@sourcefrog.net>:
> >> It seems like if there's nowhere to execute the job, we want the client
> >> program to just pause, before using too many resources, until it gets
> >> unqueued by a server ready to do the job. (Or, by a local slot being
> >> available.)
> >>
> >>
> >> On Thu Oct 16 2014 at 2:43:35 AM Łukasz Tasz <luk...@tasz.eu> wrote:
> >>>
> >>> Hi Martin,
> >>>
> >>> Let's assume that you can trigger more compilation tasks than you have
> >>> executors.
> >>> In this scenario you are facing a saturated cluster.
> >>> When such a compilation is triggered by two developers, or by two CI
> >>> (e.g. Jenkins) jobs, the cluster is saturated twice over...
> >>>
> >>> The default behaviour is to lock a slot locally and try to connect three
> >>> times; if that fails, fall back to local compilation; if fallback is
> >>> disabled, CI gets a failed build (fallback is not an option here, since
> >>> the local machine cannot handle -j $(distcc -j)).
> >>>
> >>> Consider a scenario where I have 1000 objects and 500 executors:
> >>> - a clean build on one machine takes
> >>>   1000 * 20 sec (one obj) = 20000 sec / 16 processors = 1250 sec,
> >>> - on the cluster, (1000/500) * 20 sec = 40 sec.
> >>>
> >>> Saturating the cluster was impossible without pump mode, but now, with
> >>> pump mode and after the "warm up" effect, pump can dispatch many tasks,
> >>> and I have hit the situation where a saturated cluster destroys almost
> >>> every compilation.
> >>>
> >>> My expectation is that the cluster won't reject my connection, or that
> >>> the rejection will be handled, either by the client or by the server.
> >>>
> >>> By the server:
> >>> - accept every connection,
> >>> - fork a child if not accepted by a child,
> >>> - in case of pump, prepare the local dir structure and receive headers,
> >>> - --critical section starts here-- multi-value semaphore with value
> >>>   maxchild,
> >>> - execute the job,
> >>> - release the semaphore.
> >>>
> >>>
> >>> What you suggested may be an even better solution, since the client
> >>> would pick the first available executor instead of entering a queue, so
> >>> distcc could already make the connection in the function dcc_lock_one().
> >>>
> >>> I already tried to put DISTCC_DIR on a common NFS share, but when
> >>> you are triggering so many jobs, this became a bottleneck...
> >>> I won't go into locking on NFS, or the scenario where somebody takes
> >>> a lock on NFS and the machine then crashes; that will not work by
> >>> design :)
> >>>
> >>> I know that this scenario does not happen very often, and that it is
> >>> more or less a peak phenomenon, but we should be happy when the distcc
> >>> cluster is saturated, and this case should be handled.
> >>>
> >>> Hope it's clearer now!
> >>> br
> >>> LT
> >>>
> >>>
> >>> Łukasz Tasz
> >>>
> >>>
> >>> 2014-10-16 1:39 GMT+02:00 Martin Pool <m...@sourcefrog.net>:
> >>> > Can you try to explain more clearly what difference in queueing
> >>> > behavior you expect from this change?
> >>> >
> >>> > I think probably the main change that's needed is for the client to
> >>> > ask all masters if they have space, to avoid needing to effectively
> >>> > poll by retrying, or getting stuck waiting for a particular server.
> >>> >
> >>> > On Wed, Oct 15, 2014 at 12:53 PM, Łukasz Tasz <luk...@tasz.eu> wrote:
> >>> >>
> >>> >> Hi guys,
> >>> >>
> >>> >> please correct me if I'm wrong:
> >>> >> - currently distcc tries to connect to the server 3 times, with a
> >>> >>   small delay,
> >>> >> - the server forks x children and all of them try to accept incoming
> >>> >>   connections.
> >>> >> If the server runs out of children (all of them are busy), the client
> >>> >> will fall back, and will not try this machine for the next 60 sec.
> >>> >>
> >>> >> What do you think about redesigning distcc so that the master server
> >>> >> always accepts an incoming connection and forks a child, but only x
> >>> >> of them at a time are able to enter the compilation task
> >>> >> (dcc_spawn_child)? (Maybe preforking could still be used?)
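To make the "always accept, but let only x children compile at once" idea concrete, here is a minimal sketch in C. It deliberately swaps the multi-value semaphore for a pipe preloaded with tokens (the same trick GNU make's jobserver uses), because pipe descriptors are inherited across fork() without any shared-memory setup. Everything here is hypothetical: `struct job_gate`, `job_gate_init`, `job_gate_run`, `count_job` and the 512-slot cap are my own names and choices, not anything in distcc.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Conservative cap (an assumption) so preloading never fills the pipe. */
#define JOB_GATE_MAX 512

/* Hypothetical gate: a pipe holding one byte per free compile slot. */
struct job_gate {
    int read_fd;   /* read one token to take a slot */
    int write_fd;  /* write it back to release the slot */
};

/* Create a gate with max_jobs slots. Returns 0 on success, -1 on error. */
int job_gate_init(struct job_gate *gate, int max_jobs)
{
    int fds[2];
    int i;

    assert(gate != NULL);
    if (max_jobs < 1 || max_jobs > JOB_GATE_MAX)
        return -1;
    if (pipe(fds) != 0)
        return -1;
    for (i = 0; i < max_jobs; i++) {
        if (write(fds[1], "t", 1) != 1) {
            close(fds[0]);
            close(fds[1]);
            return -1;
        }
    }
    gate->read_fd = fds[0];
    gate->write_fd = fds[1];
    return 0;
}

/*
 * Block until a slot is free, run job(arg), then give the slot back.
 * Returns the job's result, or -1 if the gate itself failed.
 */
int job_gate_run(struct job_gate *gate, int (*job)(void *), void *arg)
{
    char token;
    ssize_t n;
    int status;

    assert(gate != NULL && job != NULL);
    do {
        n = read(gate->read_fd, &token, 1);   /* critical section starts */
    } while (n < 0 && errno == EINTR);
    if (n != 1)
        return -1;

    status = job(arg);

    do {
        n = write(gate->write_fd, &token, 1); /* release the slot */
    } while (n < 0 && errno == EINTR);
    return n == 1 ? status : -1;
}

/* Demo job (hypothetical): increments a counter and returns its new value. */
int count_job(void *arg)
{
    int *counter = arg;

    return ++(*counter);
}
```

In a forking server the parent would call `job_gate_init()` once before accepting connections; each child accepts its connection and receives its input immediately, then wraps only the compile step in `job_gate_run()`. Clients queue inside a process that already holds their connection rather than in the kernel's listen backlog.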
> >>> >>
> >>> >> This may create a kind of queue; the client can always decide on its
> >>> >> own whether it can wait some time, with DISTCC_IO_TIMEOUT as the
> >>> >> maximum. It is still faster to wait than to fall back to the local
> >>> >> machine, since on the cluster side it is probably just a peak of
> >>> >> saturation.
> >>> >>
> >>> >> Currently I'm facing a situation where many jobs fall back, and the
> >>> >> local machine is being killed by make's -j calculated for distccd...
> >>> >>
> >>> >> Another trick may be to pick a different machine if the current one
> >>> >> is busy, but this may be much more complex, in my opinion.
> >>> >>
> >>> >> What do you think?
> >>> >> regards
> >>> >> Łukasz Tasz
> >>> >> __
> >>> >> distcc mailing list http://distcc.samba.org/
> >>> >> To unsubscribe or change options:
> >>> >> https://lists.samba.org/mailman/listinfo/distcc
> >>> >
> >>> >
> >>> > --
> >>> > Martin
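For what it's worth, the configurable queue length suggested at the top of this mail could look something like the sketch below. The environment variable name `DISTCCD_LISTEN_BACKLOG` and the helpers `parse_backlog` and `listen_backlog` are invented for illustration; distccd has no such option as far as this thread shows. The default of 10 is simply the value from the quoted diff.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical names: not an existing distccd option. */
#define BACKLOG_ENV     "DISTCCD_LISTEN_BACKLOG"
#define BACKLOG_DEFAULT 10   /* the value currently hard-coded in srvnet.c */

/*
 * Parse a decimal backlog value. Returns fallback when text is missing,
 * empty, not a number, has trailing junk, or is outside 1..INT_MAX.
 */
int parse_backlog(const char *text, int fallback)
{
    char *end = NULL;
    long value;

    assert(fallback > 0);
    if (text == NULL || *text == '\0')
        return fallback;

    errno = 0;
    value = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0' || value < 1 || value > INT_MAX)
        return fallback;
    return (int)value;
}

/* Value to hand to listen(), taken from the (hypothetical) environment. */
int listen_backlog(void)
{
    return parse_backlog(getenv(BACKLOG_ENV), BACKLOG_DEFAULT);
}
```

The call in srvnet.c would then become `listen(fd, listen_backlog())`. One caveat: on Linux the kernel silently caps the backlog at net.core.somaxconn, so asking for 256 only takes full effect if that sysctl is raised as well.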