On 1 Nov 2014 15:08, "Łukasz Tasz" <luk...@tasz.eu> wrote:
>
> Sure, I just made a quick fix to test my test case, and immediately
> shared it with you.
Sure, understood -- that's great, thanks.

> I will try to send a more polished fix :)
> Regards
> lt
>
> On 1 Nov 2014 09:06, "Fergus Henderson" <fer...@google.com> wrote:
>
>> Well, perhaps it would be a good idea to add a distccd flag or
>> environment variable to control the queue length, rather than
>> hard-coding 10 or 256?
>>
>> On 31 Oct 2014 11:37, "Łukasz Tasz" <luk...@tasz.eu> wrote:
>>>
>>> Hi Guys,
>>>
>>> I'm very, very happy: the reasons for my failures have been identified.
>>> The issue is here:
>>>
>>> --- src/srvnet.c (revision 177)
>>> +++ src/srvnet.c (working copy)
>>> @@ -99,7 +99,7 @@
>>>      rs_log_info("listening on %s", sa_buf ? sa_buf : "UNKNOWN");
>>>      free(sa_buf);
>>>
>>> -    if (listen(fd, 10)) {
>>> +    if (listen(fd, 256)) {
>>>          rs_log_error("listen failed: %s", strerror(errno));
>>>          close(fd);
>>>          return EXIT_BIND_FAILED;
>>> Index: src/io.c
>>>
>>> The queue for new connections was limited to 10. That's why, when the
>>> cluster is overloaded, many connections are reset. The aim is to wait
>>> even 5 minutes for cluster availability and only then compile locally.
>>>
>>> @Jarek, thanks for support!
>>>
>>> Let's discuss whether we should fix it or not.
>>>
>>> regards
>>> Lukasz
>>>
>>> Łukasz Tasz
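A minimal sketch of the flag/environment-variable idea, assuming a
hypothetical DISTCC_LISTEN_BACKLOG variable (the name is illustrative,
not an existing distccd option):

/* Hypothetical sketch: read the listen() backlog from the environment
 * instead of hard-coding 10 or 256. DISTCC_LISTEN_BACKLOG is an
 * assumed name, not a real distccd option. */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int dcc_listen_backlog(void)
{
    const char *s = getenv("DISTCC_LISTEN_BACKLOG");
    if (s && *s) {
        char *end;
        long n = strtol(s, &end, 10);
        if (*end == '\0' && n > 0 && n <= 4096)
            return (int) n;
        fprintf(stderr, "ignoring bad DISTCC_LISTEN_BACKLOG '%s'\n", s);
    }
    return 256;  /* the default proposed in the patch above */
}

/* Then, in place of listen(fd, 10) in srvnet.c: */
int dcc_listen(int fd)
{
    if (listen(fd, dcc_listen_backlog())) {
        fprintf(stderr, "listen failed: %s\n", strerror(errno));
        close(fd);
        return -1;
    }
    return 0;
}

Note that the kernel silently caps the backlog at net.core.somaxconn
(often 128 by default on Linux), so values above that only take effect
if the sysctl is raised as well.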
>>>
>>> 2014-10-24 10:27 GMT+02:00 Łukasz Tasz <luk...@tasz.eu>:
>>> > Hi Martin,
>>> >
>>> > Here is what I have noticed. The client tries to connect to distccd
>>> > 3 times, with a 500 ms delay in between; the Linux kernel by default
>>> > accepts 128 connections. If a client creates a connection, even when
>>> > no executors are available, the connection is accepted and queued by
>>> > the kernel running distccd. This leads to a situation where the
>>> > client thinks distccd is reserved for it, while in fact the
>>> > connection is still waiting to be accepted by the distccd server. I
>>> > suspect the client then starts communicating too fast, distccd never
>>> > receives the DIST token, both sides wait, and the communication is
>>> > broken; then the timeouts kick in -- the client has a default, the
>>> > server has none.
>>> >
>>> > The failure scenario is: one distccd and two distcc users, both
>>> > compiling with DISTCC_HOSTS=distccd/1,cpp,lzo, and both with lots of
>>> > big objects, so the cluster is overloaded by a factor of 2. It
>>> > should still be OK for a third and a fourth user to join the
>>> > cluster.
>>> >
>>> > An easy reproducer is to set up one distccd and use
>>> > DISTCC_HOSTS=distccd/20. That is a broken configuration, but it
>>> > simulates an overload factor of 20 -- 20 developers using the
>>> > cluster at the same time. Remember that these are exceptional
>>> > situations, but a developer can start a compilation with -j 1000
>>> > from his laptop, the cluster will time out, and then receiving 1000
>>> > jobs on one laptop ends with the memory killer :D These situations
>>> > are exceptional, but the cluster should handle them somehow.
>>> >
>>> > In the attachment, next to some pump changes, you can find a change
>>> > that moves connection establishment to the very beginning: when
>>> > distcc picks a host, the remote connection is made as well. If that
>>> > fails, distcc follows the default behaviour -- it sleeps for one
>>> > second and picks a host again. But this requires an additional
>>> > administrative change on the distccd machine:
>>> >
>>> > iptables -I INPUT -p tcp --dport 3632 -m connlimit --connlimit-above
>>> > <NUMBER OF DISTCCD> --connlimit-mask 0 -j REJECT --reject-with
>>> > tcp-reset
>>> >
>>> > which accepts only as many connections as there are executors.
>>> >
>>> > So far so good!
>>> > One remark: the patch is done on top of arankine_distcc_issue16-r335,
>>> > since his pump changes make pump mode work in my environment. But I
>>> > have also tested the distccd allocation part on the latest official
>>> > distcc release.
>>> >
>>> > Let me know what you think!
>>> >
>>> > with best regards
>>> > Lukasz
>>> >
>>> > Łukasz Tasz
>>> >
>>> > 2014-10-24 2:42 GMT+02:00 Martin Pool <m...@sourcefrog.net>:
>>> >> It seems like if there's nowhere to execute the job, we want the
>>> >> client program to just pause, before using too many resources,
>>> >> until it gets unqueued by a server ready to do the job. (Or, by a
>>> >> local slot being available.)
>>> >>
>>> >> On Thu Oct 16 2014 at 2:43:35 AM Łukasz Tasz <luk...@tasz.eu> wrote:
>>> >>>
>>> >>> Hi Martin,
>>> >>>
>>> >>> Let's assume that you can trigger more compilation tasks than you
>>> >>> have executors. In this scenario the cluster is saturated. When
>>> >>> such a compilation is triggered by two developers, or by two CI
>>> >>> jobs (e.g. Jenkins), the cluster is saturated twice over...
>>> >>>
>>> >>> The default behaviour is to take a local slot and try to connect
>>> >>> three times; if that fails, fall back -- and if fallback is
>>> >>> disabled, CI gets a failed build. (Fallback is not an option here,
>>> >>> since the local machine cannot handle -j $(distcc -j).)
>>> >>>
>>> >>> Consider a scenario with 1000 objects and 500 executors:
>>> >>> - a clean build on one machine takes
>>> >>>   1000 * 20 sec (one obj) = 20000 sec / 16 processors = 1250 sec,
>>> >>> - on the cluster it takes (1000/500) * 20 sec = 40 sec.
>>> >>>
>>> >>> Saturating the cluster was impossible without pump mode, but now,
>>> >>> after the "warm up" effect, pump mode can dispatch many tasks, and
>>> >>> I have seen a saturated cluster destroy almost every compilation.
>>> >>>
>>> >>> My expectation is that the cluster won't reject my connect, or
>>> >>> that a reject will be handled, either by the client or by the
>>> >>> server.
>>> >>>
>>> >>> By the server:
>>> >>> - accept every connection,
>>> >>> - fork a child if the connection is not accepted by a child,
>>> >>> - in the pump case, prepare the local dir structure and receive
>>> >>>   headers,
>>> >>> - --critical section starts here-- a counting semaphore with value
>>> >>>   maxchild (see the sketch below),
>>> >>> - execute the job,
>>> >>> - release the semaphore.
>>> >>>
>>> >>> Also, what you suggested may be an even better solution, since the
>>> >>> client would pick the first available executor instead of entering
>>> >>> a queue; distcc could make the connection already in the function
>>> >>> dcc_lock_one().
>>> >>>
>>> >>> I already tried to put DISTCC_DIR on a common NFS share, but when
>>> >>> you trigger that many jobs it becomes a bottleneck... not to
>>> >>> mention locking on NFS, or the scenario where somebody takes a
>>> >>> lock on NFS and the machine crashes -- that will not work, by
>>> >>> design :)
>>> >>>
>>> >>> I know this scenario does not happen very often, and it is more or
>>> >>> less a matter of peaks, but we should be glad when the distcc
>>> >>> cluster is saturated, and this case should be handled.
>>> >>>
>>> >>> Hope it's clearer now!
>>> >>> br
>>> >>> LT
>>> >>>
>>> >>> Łukasz Tasz
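A rough sketch of that server-side queue idea -- accept every
connection, fork, and gate actual job execution with a counting
semaphore initialised to maxchild. It assumes POSIX semaphores and a
placeholder run_job() (not a real distccd function):

/* Hypothetical sketch: never refuse a connection; instead let forked
 * children queue on a semaphore so only max_children compile at once. */
#include <netinet/in.h>
#include <semaphore.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int max_children = 4;

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(3632);
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (listen_fd < 0 ||
        bind(listen_fd, (struct sockaddr *) &addr, sizeof addr) ||
        listen(listen_fd, 256))
        exit(1);

    signal(SIGCHLD, SIG_IGN);        /* don't accumulate zombies */

    /* Unnamed semaphore in shared anonymous memory, so it survives
     * fork(); initial value = number of executor slots. */
    sem_t *slots = mmap(NULL, sizeof *slots, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (slots == MAP_FAILED || sem_init(slots, 1, max_children))
        exit(1);

    for (;;) {
        int conn = accept(listen_fd, NULL, NULL);   /* accept everyone */
        if (conn < 0)
            continue;
        if (fork() == 0) {
            sem_wait(slots);   /* child queues here until a slot frees */
            /* run_job(conn): receive the request, compile, send result */
            sem_post(slots);   /* release the executor slot */
            close(conn);
            _exit(0);
        }
        close(conn);           /* parent goes straight back to accept() */
    }
}

The client's existing DISTCC_IO_TIMEOUT would then bound how long it is
willing to sit in that queue before falling back.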
>>> >>>
>>> >>> 2014-10-16 1:39 GMT+02:00 Martin Pool <m...@sourcefrog.net>:
>>> >>> > Can you try to explain more clearly what difference in queueing
>>> >>> > behavior you expect from this change?
>>> >>> >
>>> >>> > I think probably the main change that's needed is for the client
>>> >>> > to ask all masters if they have space, to avoid needing to
>>> >>> > effectively poll by retrying, or getting stuck waiting for a
>>> >>> > particular server.
>>> >>> >
>>> >>> > On Wed, Oct 15, 2014 at 12:53 PM, Łukasz Tasz <luk...@tasz.eu> wrote:
>>> >>> >>
>>> >>> >> Hi Guys,
>>> >>> >>
>>> >>> >> Please correct me if I'm wrong:
>>> >>> >> - currently distcc tries to connect to the server 3 times, with
>>> >>> >>   a small delay in between,
>>> >>> >> - the server forks x children, and all of them try to accept
>>> >>> >>   the incoming connection.
>>> >>> >> If the server runs out of children (all of them are busy), the
>>> >>> >> client falls back, and for the next 60 sec it will not try this
>>> >>> >> machine.
>>> >>> >>
>>> >>> >> What do you think about redesigning distcc so that the master
>>> >>> >> server always accepts an incoming connection and forks a child,
>>> >>> >> but at any given time only x of them can enter the compilation
>>> >>> >> task (dcc_spawn_child)? (Maybe preforking could still be used?)
>>> >>> >>
>>> >>> >> This would create a kind of queue. The client can always decide
>>> >>> >> on its own how long it is willing to wait (at most
>>> >>> >> DISTCC_IO_TIMEOUT), but it is still faster to wait -- on the
>>> >>> >> cluster side it is probably just a peak of saturation -- than
>>> >>> >> to fall back to the local machine.
>>> >>> >>
>>> >>> >> Currently I am facing a situation where many jobs fall back and
>>> >>> >> the local machine is being killed by make's -j calculated for
>>> >>> >> distccd...
>>> >>> >>
>>> >>> >> Another trick might be to pick a different machine if the
>>> >>> >> current one is busy, but that may be much more complex in my
>>> >>> >> opinion.
>>> >>> >>
>>> >>> >> What do you think?
>>> >>> >> regards
>>> >>> >> Łukasz Tasz
>>> >>> >
>>> >>> > --
>>> >>> > Martin
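For the "pick the first server that actually has room" direction, here
is a rough client-side sketch: start a non-blocking connect() to every
candidate host and take whichever completes first, closing the rest.
With the connlimit iptables rule above rejecting surplus connections, a
completed connect really does mean a free executor. The host list and
function name are illustrative; in distcc the real logic would live
somewhere around dcc_lock_one():

/* Hypothetical sketch: race non-blocking connects, keep the winner. */
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_HOSTS 16

static int first_ready(const char **ips, int n, int port, int timeout_ms)
{
    struct pollfd pfds[MAX_HOSTS];
    if (n > MAX_HOSTS)
        n = MAX_HOSTS;

    for (int i = 0; i < n; i++) {
        struct sockaddr_in sa = {0};
        sa.sin_family = AF_INET;
        sa.sin_port = htons(port);
        inet_pton(AF_INET, ips[i], &sa.sin_addr);

        int fd = socket(AF_INET, SOCK_STREAM, 0);
        fcntl(fd, F_SETFL, O_NONBLOCK);
        connect(fd, (struct sockaddr *) &sa, sizeof sa); /* EINPROGRESS */
        pfds[i].fd = fd;
        pfds[i].events = POLLOUT;
    }

    if (poll(pfds, n, timeout_ms) > 0) {
        for (int i = 0; i < n; i++) {
            int err = 0;
            socklen_t len = sizeof err;
            if ((pfds[i].revents & POLLOUT) &&
                getsockopt(pfds[i].fd, SOL_SOCKET, SO_ERROR,
                           &err, &len) == 0 && err == 0) {
                for (int j = 0; j < n; j++)      /* close the losers */
                    if (j != i)
                        close(pfds[j].fd);
                return pfds[i].fd;               /* connected socket */
            }
        }
    }
    for (int i = 0; i < n; i++)
        close(pfds[i].fd);
    return -1;  /* nobody had room: sleep and retry, or build locally */
}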
__
distcc mailing list http://distcc.samba.org/
To unsubscribe or change options:
https://lists.samba.org/mailman/listinfo/distcc