Hi Martin

Here is what I have noticed.
The client tries to connect to distccd 3 times, with 500 ms delays in between.
By default the Linux kernel queues up to 128 pending connections (the
listen backlog).
If a client opens a connection, it is accepted and queued by the kernel
on the distccd machine even if no executors are available.
This leads to a situation where the client thinks that a distccd is
reserved, but in fact the connection is still waiting to be accepted by
the distccd server.
I suspect that the client then starts communicating too early: distccd
never receives the DIST token, both sides wait, and the communication is
broken. After that only the timeouts help: the client has a default
timeout, the server has none.
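
To illustrate the kernel queueing, here is a minimal sketch in C. It is
not distcc code; the function name connect_without_accept is made up for
this example, and it uses a loopback socket on a kernel-chosen port. It
only shows that connect() reports success although nobody ever calls
accept() on the listening socket:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper, not part of distcc.
 * Returns 0 if connect() succeeded although the listening side never
 * called accept(); returns -1 on any error. */
int connect_without_accept(int backlog)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int srv, cli, rc = -1;

    assert(backlog > 0);

    srv = socket(AF_INET, SOCK_STREAM, 0);
    if (srv < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                  /* let the kernel pick a free port */

    if (bind(srv, (struct sockaddr *)&addr, sizeof(addr)) != 0 ||
        listen(srv, backlog) != 0 ||
        getsockname(srv, (struct sockaddr *)&addr, &len) != 0) {
        close(srv);
        return -1;
    }

    /* Note: accept() is never called on srv.  The kernel completes the
     * TCP handshake on its own and parks the connection in the backlog. */
    cli = socket(AF_INET, SOCK_STREAM, 0);
    if (cli >= 0) {
        if (connect(cli, (struct sockaddr *)&addr, sizeof(addr)) == 0)
            rc = 0;                     /* the client believes it is connected */
        close(cli);
    }
    close(srv);
    return rc;
}
```

From the client's point of view such a connection looks the same as one
that a free distccd child has really picked up.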

The failure scenario is:
one distccd and two distcc users, both compiling with
DISTCC_HOSTS=distccd/1,cpp,lzo. Both users have a lot of big objects, so
the cluster is overloaded by a factor of 2.
This should still be OK, and it should stay OK when a third and a fourth
user join the cluster.

An easy reproducer is to set up one distccd and set DISTCC_HOSTS=distccd/20.
This is a broken configuration, but it simulates an overload by a factor
of 20, as if 20 developers were using the cluster at the same time.
Please remember that a developer can start a compilation with -j 1000
from his laptop; the cluster will time out, and then running 1000 jobs
on the laptop will end with the out-of-memory killer :D
These are exceptional situations, but the cluster should handle them somehow.
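
To make the "ratio" explicit, here is a small sketch in C of how I count
it. The function overload_ratio is hypothetical (not distcc code), and I
assume a distccd that runs a single executor in both scenarios above:

```c
#include <assert.h>

/* Hypothetical helper: job slots requested by all clients divided by
 * the executors the cluster really has.  A value above 1.0 means the
 * cluster is overloaded. */
double overload_ratio(int clients, int slots_per_client, int executors)
{
    assert(executors > 0);
    return (double)(clients * slots_per_client) / (double)executors;
}
```

With two users on distccd/1 this gives 2, and with one user on
distccd/20 it gives 20, which is what the reproducer simulates.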

In the attachment, next to some pump changes, you can find a change that
moves connection setup to the very beginning: when distcc picks a host,
it also makes the remote connection. If this fails, distcc follows the
default behaviour: it sleeps for one second and picks a host again.
This requires an additional administrative change on the distccd
machine:
  iptables -I INPUT -p tcp --dport 3632 -m connlimit \
    --connlimit-above <NUMBER OF DISTCCD> --connlimit-mask 0 \
    -j REJECT --reject-with tcp-reset
which accepts only as many connections as there are executors.
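
The idea can be sketched in C as follows. This is a simplified model,
not the patch itself: struct sim_host, sim_connect and pick_and_connect
are made-up names, and sim_connect only imitates what
dcc_remote_connect() would see once the REJECT rule is in place (the
connect fails as soon as all executors are taken):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a distccd machine; not a distcc type. */
struct sim_host {
    const char *name;
    int free_slots;     /* executors not yet holding a connection */
};

/* Stand-in for dcc_remote_connect(): with the REJECT rule the connect
 * only succeeds while the server still has a free executor. */
static int sim_connect(struct sim_host *h)
{
    if (h->free_slots <= 0)
        return -1;      /* connection reset by the firewall rule */
    h->free_slots--;
    return 0;
}

/* Try every host in order; return the index of the first host that
 * accepts the connection, or -1 if all stay busy for max_rounds rounds. */
int pick_and_connect(struct sim_host *hosts, size_t n_hosts, int max_rounds)
{
    int round;
    size_t i;

    assert(hosts != NULL);
    for (round = 0; round < max_rounds; round++) {
        for (i = 0; i < n_hosts; i++) {
            if (sim_connect(&hosts[i]) == 0)
                return (int)i;
        }
        /* The real client would pause here (about one second)
         * before picking a host again. */
    }
    return -1;
}
```

The point of this design is that a busy server is detected while the
host is being picked, so the client can move on to another host instead
of sitting in the kernel queue of a saturated one.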

So far so good!
One remark: the patch is made on top of arankine_distcc_issue16-r335,
since his pump changes make pump mode work in my environment.
I also tested the distccd allocation change on the latest official
distcc release.

Let me know what you think!

with best regards
Lukasz



Łukasz Tasz


2014-10-24 2:42 GMT+02:00 Martin Pool <m...@sourcefrog.net>:
> It seems like if there's nowhere to execute the job, we want the client
> program to just pause, before using too many resources, until it gets
> unqueued by a server ready to do the job. (Or, by a local slot being
> available.)
>
>
> On Thu Oct 16 2014 at 2:43:35 AM Łukasz Tasz <luk...@tasz.eu> wrote:
>>
>> Hi Martin,
>>
>> Let's assume that you can trigger more compilation tasks than you have
>> executors.
>> In this scenario the cluster is saturated.
>> When such a compilation is triggered by two developers, or two CI
>> (e.g. Jenkins) jobs, the cluster is saturated twice over...
>>
>> The default behaviour is to lock a slot locally and try to connect three
>> times; if that fails, fall back. If fallback is disabled, CI gets a failed
>> build (fallback is not an option anyway, since the local machine cannot
>> handle -j $(distcc -j)).
>>
>> Consider a scenario with 1000 objects and 500 executors:
>> - a clean build on one machine takes
>>   1000 * 20 sec (one obj) = 20000 sec / 16 processors = 1250 sec,
>> - on the cluster it takes (1000/500) * 20 sec = 40 sec
>>
>> Saturating the cluster was impossible without pump mode, but now, after
>> the pump mode "warm up" effect, pump can dispatch many tasks, and I faced
>> a situation where a saturated cluster destroys almost every compilation.
>>
>> My expectation is that the cluster won't reject my connection, or that
>> the rejection will be handled, either by the client or by the server.
>>
>> by server:
>> - accept every connection,
>> - fork child if not accepted by child,
>> - in case of pump prepare local dir structure, receive headers
>> - --critical section starts here-- multi value semaphore with value
>> maxchild
>>   - execute job
>> - release semaphore
>>
>>
>> Also, what you suggested may be an even better solution, since the client
>> will pick the first available executor instead of entering a queue, so
>> distcc could already make the connection in the function dcc_lock_one().
>>
>> I already tried to set DISTCC_DIR on a common NFS share, but when you
>> trigger so many jobs this becomes a bottleneck... Not to mention locking
>> on NFS, and the scenario where somebody takes a lock on NFS and the
>> machine crashes - it will not work by design :)
>>
>> I know that this scenario does not happen very often, and that it has
>> more or less the character of peaks, but we should be happy that the
>> distcc cluster is saturated, and this case should be handled.
>>
>> hope it's more clear now!
>> br
>> LT
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> Łukasz Tasz
>>
>>
>> 2014-10-16 1:39 GMT+02:00 Martin Pool <m...@sourcefrog.net>:
>> > Can you try to explain more clearly what difference in queueing behavior
>> > you expect from this change?
>> >
>> > I think probably the main change that's needed is for the client to ask
>> > all masters if they have space, to avoid needing to effectively poll by
>> > retrying, or getting stuck waiting for a particular server.
>> >
>> > On Wed, Oct 15, 2014 at 12:53 PM, Łukasz Tasz <luk...@tasz.eu> wrote:
>> >>
>> >> Hi Guys,
>> >>
>> >> please correct me if I'm wrong,
>> >> - currently distcc tries to connect to the server 3 times, with a small
>> >> delay,
>> >> - the server forks x children and all of them try to accept incoming
>> >> connections.
>> >> If the server runs out of children (all of them are busy), the client
>> >> will fall back, and will not try this machine for the next 60 sec.
>> >>
>> >> What do you think about redesigning distcc so that the master server
>> >> always accepts the incoming connection and forks a child, but at the same
>> >> time only x of them are able to enter the compilation
>> >> task (dcc_spawn_child)? (maybe preforking could still be used?)
>> >>
>> >> This may create a kind of queue. The client can always decide on its own
>> >> whether it can wait some time (the maximum is DISTCC_IO_TIMEOUT), but it
>> >> is still faster to wait than to fall back to the local machine, since on
>> >> the cluster side it is probably just a peak of saturation.
>> >>
>> >> Currently I'm facing a situation where many jobs fall back, and the
>> >> local machine is being killed by make's -j calculated for distccd...
>> >>
>> >> Another trick may be to pick a different machine if the current one is
>> >> busy, but this may be much more complex in my opinion.
>> >>
>> >> what do you think?
>> >> regards
>> >> Łukasz Tasz
>> >> __
>> >> distcc mailing list            http://distcc.samba.org/
>> >> To unsubscribe or change options:
>> >> https://lists.samba.org/mailman/listinfo/distcc
>> >
>> >
>> >
>> >
>> > --
>> > Martin
diff --exclude=.svn -rupN distcc-3.2-arankine/include_server/basics.py distcc-3.2rc2_new/include_server/basics.py
--- distcc-3.2-arankine/include_server/basics.py	2014-09-04 11:44:41.000000000 +0200
+++ distcc-3.2rc2_new/include_server/basics.py	2014-09-05 11:29:48.238561887 +0200
@@ -180,7 +180,7 @@ MAX_EMAILS_TO_SEND = 3
 # importance to builds that involve compilations that distcc-pump does not grok:
 # an amount of time roughly equal to this quota is wasted before CPP is invoked
 # instead.
-USER_TIME_QUOTA = 3.8  # seconds
+USER_TIME_QUOTA = 20  # seconds
 
 # How often the following question is answered: has too much user time been
 # spent in the include handler servicing the current request?
diff --exclude=.svn -rupN distcc-3.2-arankine/Makefile.in distcc-3.2rc2_new/Makefile.in
--- distcc-3.2-arankine/Makefile.in	2014-09-04 11:44:41.000000000 +0200
+++ distcc-3.2rc2_new/Makefile.in	2014-10-21 15:10:09.211453775 +0200
@@ -1072,9 +1072,7 @@ install-include-server: include-server p
 	    cp -f "$(include_server_builddir)/install.log" "$(PYTHON_INSTALL_RECORD)"; \
 	  fi; \
 	  $(mkinstalldirs) "$(DESTDIR)$(bindir)" && \
-	  INCLUDE_SERVER=`grep '/include_server.py$$' "$(include_server_builddir)/install.log"` && \
-	  sed "s,^include_server='',include_server='$$INCLUDE_SERVER'," \
-	    pump > "$(include_server_builddir)/pump" && \
+	  cp pump  "$(include_server_builddir)/pump" && \
 	  $(INSTALL_PROGRAM) "$(include_server_builddir)/pump" "$(DESTDIR)$(bindir)"; \
 	fi
 
diff --exclude=.svn -rupN distcc-3.2-arankine/pump.in distcc-3.2rc2_new/pump.in
--- distcc-3.2-arankine/pump.in	2014-09-04 11:44:41.000000000 +0200
+++ distcc-3.2rc2_new/pump.in	2014-10-21 15:19:05.412323656 +0200
@@ -40,7 +40,12 @@ srcdir=@srcdir@
 # to the installed include_server.py.
 # NOTE: DO NOT CHANGE THE LINE BELOW WITHOUT CHANGING THE SED IN
 #       Makefile.in:install-include-server.
-include_server=''
+
+# taszluk: computed at runtime; include_server.py always sits at a fixed path relative to pump, e.g.
+#   /usr/lib/python2.7/site-packages/include_server
+#   /usr/bin/pump
+cdir=$(readlink -f "$0")
+include_server=$(readlink -f "$(dirname "$cdir")/../lib/python2.7/site-packages/include_server/include_server.py")
 
 
 CheckUsage() {
@@ -497,7 +502,7 @@ DumpEnvironmentVariables() {
   if [ -n "$DISTCC_HOSTS" ]; then
     echo export DISTCC_HOSTS=\'$DISTCC_HOSTS\'
   fi
-  echo export PATH=\'$distcc_location:$PATH\'
+  #echo export PATH=\'$distcc_location:$PATH\'
 }
 
 Main() {
@@ -537,7 +542,7 @@ Main() {
       Announce
       StartIncludeServerAndDetermineHosts || exit 1
       # Now execute the command that is the argument of 'pump'.
-      PATH="$distcc_location:$PATH" \
+      #PATH="$distcc_location:$PATH" \
         "$@"
       # When we exit, the ShutDown function will be called.
       ;;
Binary files distcc-3.2-arankine/.pump.in.swp and distcc-3.2rc2_new/.pump.in.swp differ
Binary files distcc-3.2-arankine/src/.arg.c.swp and distcc-3.2rc2_new/src/.arg.c.swp differ
diff --exclude=.svn -rupN distcc-3.2-arankine/src/compile.c distcc-3.2rc2_new/src/compile.c
--- distcc-3.2-arankine/src/compile.c	2014-09-04 11:44:41.000000000 +0200
+++ distcc-3.2rc2_new/src/compile.c	2014-10-16 14:10:53.842607744 +0200
@@ -504,6 +504,9 @@ dcc_build_somewhere(char *argv[],
     struct dcc_hostdef *host = NULL;
     char *discrepancy_filename = NULL;
     char **new_argv;
+    int to_net_fd = -1, from_net_fd = -1;
+    pid_t ssh_pid = 0;
+    
 
     if ((ret = dcc_expand_preprocessor_options(&argv)) != 0)
         goto clean_up;
@@ -549,7 +552,7 @@ dcc_build_somewhere(char *argv[],
 
     /* Choose the distcc server host (which could be either a remote
      * host or localhost) and acquire the lock for it.  */
-    if ((ret = dcc_pick_host_from_list_and_lock_it(&host, &cpu_lock_fd)) != 0) {
+    if ((ret = dcc_pick_host_from_list_and_lock_it(&host, &cpu_lock_fd, &to_net_fd, &from_net_fd, &ssh_pid)) != 0) {
         /* Doesn't happen at the moment: all failures are masked by
            returning localhost. */
         goto fallback;
@@ -638,7 +641,8 @@ dcc_build_somewhere(char *argv[],
                                   needs_dotd ? deps_fname : NULL,
                                   server_stderr_fname,
                                   cpp_pid, local_cpu_lock_fd,
-                  host, status)) != 0) {
+                                  host, status,
+                                  &to_net_fd, &from_net_fd, &ssh_pid)) != 0) {
         /* Returns zero if we successfully ran the compiler, even if
          * the compiler itself bombed out. */
 
diff --exclude=.svn -rupN distcc-3.2-arankine/src/compile.h distcc-3.2rc2_new/src/compile.h
--- distcc-3.2-arankine/src/compile.h	2014-09-04 11:44:41.000000000 +0200
+++ distcc-3.2rc2_new/src/compile.h	2014-10-16 15:24:28.926644569 +0200
@@ -26,14 +26,22 @@
 int dcc_compile_remote(char **argv,
                        char *input_fname,
                        char *cpp_fname,
-                       char **file_names,
+                       char **files,
                        char *output_fname,
                        char *deps_fname,
                        char *server_stderr_fname,
                        pid_t cpp_pid,
                        int local_cpu_lock_fd,
                        struct dcc_hostdef *host,
-                       int *status);
+                       int *status,
+                       int *to_net_fd, 
+                       int *from_net_fd, 
+                       pid_t *ssh_pid);
+
+static int dcc_remote_connect(struct dcc_hostdef *host,
+                              int *to_net_fd,
+                              int *from_net_fd,
+                              pid_t *ssh_pid);
 
 /* compile.c */
 
diff --exclude=.svn -rupN distcc-3.2-arankine/src/mon.c distcc-3.2rc2_new/src/mon.c
--- distcc-3.2-arankine/src/mon.c	2014-09-04 11:44:41.000000000 +0200
+++ distcc-3.2rc2_new/src/mon.c	2014-09-05 11:28:34.248008554 +0200
@@ -71,7 +71,7 @@
  * take a very long time, and combined with the check that the process
  * exists we can allow this to be reasonably large.
  */
-const int dcc_phase_max_age = 60;
+const int dcc_phase_max_age = 300;
 
 
 /**
diff --exclude=.svn -rupN distcc-3.2-arankine/src/remote.c distcc-3.2rc2_new/src/remote.c
--- distcc-3.2-arankine/src/remote.c	2014-09-04 11:44:41.000000000 +0200
+++ distcc-3.2rc2_new/src/remote.c	2014-10-16 15:29:19.858766340 +0200
@@ -67,10 +67,11 @@ extern gss_ctx_id_t distcc_ctx_handle;
  */
 
 /**
+ * taszluk, moved to where.c
  * Open a connection using either a TCP socket or SSH.  Return input
  * and output file descriptors (which may or may not be different.)
  **/
-static int dcc_remote_connect(struct dcc_hostdef *host,
+/*static int dcc_remote_connect(struct dcc_hostdef *host,
                               int *to_net_fd,
                               int *from_net_fd,
                               pid_t *ssh_pid)
@@ -95,7 +96,7 @@ static int dcc_remote_connect(struct dcc
         rs_log_crit("impossible host mode");
         return EXIT_DISTCC_FAILED;
     }
-}
+}*/
 
 
 static int dcc_wait_for_cpp(pid_t cpp_pid,
@@ -202,11 +203,14 @@ int dcc_compile_remote(char **argv,
                        pid_t cpp_pid,
                        int local_cpu_lock_fd,
                        struct dcc_hostdef *host,
-                       int *status)
+                       int *status,
+                       int *to_net_fd_p, int *from_net_fd_p, pid_t *ssh_pid_p)
 {
-    int to_net_fd = -1, from_net_fd = -1;
+    /* taszluk: the connection is already established by the caller */
+    int to_net_fd = *to_net_fd_p;
+    int from_net_fd = *from_net_fd_p;
+    pid_t ssh_pid = *ssh_pid_p;
     int ret;
-    pid_t ssh_pid = 0;
     int ssh_status;
     off_t doti_size;
     struct timeval before, after;
@@ -223,8 +227,9 @@ int dcc_compile_remote(char **argv,
      * be over pipes, which are one-way connections. */
 
     *status = 0;
-    if ((ret = dcc_remote_connect(host, &to_net_fd, &from_net_fd, &ssh_pid)))
-        goto out;
+    /* taszluk: the connect moved to dcc_pick_host_from_list_and_lock_it():
+     *   if ((ret = dcc_remote_connect(host, &to_net_fd, &from_net_fd, &ssh_pid)))
+     *       goto out; */
 
 #ifdef HAVE_GSSAPI
     /* Perform requested security. */
Binary files distcc-3.2-arankine/src/.state.h.swp and distcc-3.2rc2_new/src/.state.h.swp differ
diff --exclude=.svn -rupN distcc-3.2-arankine/src/where.c distcc-3.2rc2_new/src/where.c
--- distcc-3.2-arankine/src/where.c	2014-09-04 11:44:41.000000000 +0200
+++ distcc-3.2rc2_new/src/where.c	2014-10-21 13:37:18.334543429 +0200
@@ -76,15 +76,24 @@
 #include "lock.h"
 #include "where.h"
 #include "exitcode.h"
+#include "compile.h"
 
 
 static int dcc_lock_one(struct dcc_hostdef *hostlist,
                         struct dcc_hostdef **buildhost,
-                        int *cpu_lock_fd);
+                        int *cpu_lock_fd,
+                        int *to_net_fd,
+                        int *from_net_fd,
+                        pid_t *ssh_pid);
+
+static int dcc_remote_connect(struct dcc_hostdef *host,
+                              int *to_net_fd,
+                              int *from_net_fd,
+                              pid_t *ssh_pid);
 
 
 int dcc_pick_host_from_list_and_lock_it(struct dcc_hostdef **buildhost,
-                            int *cpu_lock_fd)
+                            int *cpu_lock_fd, int *to_net_fd, int *from_net_fd, pid_t *ssh_pid)
 {
     struct dcc_hostdef *hostlist;
     int ret;
@@ -101,7 +110,7 @@ int dcc_pick_host_from_list_and_lock_it(
         return EXIT_NO_HOSTS;
     }
 
-    return dcc_lock_one(hostlist, buildhost, cpu_lock_fd);
+    return dcc_lock_one(hostlist, buildhost, cpu_lock_fd, to_net_fd, from_net_fd, ssh_pid);
 
     /* FIXME: Host list is leaked? */
 }
@@ -145,11 +154,15 @@ static void dcc_lock_pause(void)
  **/
 static int dcc_lock_one(struct dcc_hostdef *hostlist,
                         struct dcc_hostdef **buildhost,
-                        int *cpu_lock_fd)
+                        int *cpu_lock_fd,
+                        int *to_net_fd,
+                        int *from_net_fd,
+                        pid_t *ssh_pid)
 {
     struct dcc_hostdef *h;
     int i_cpu;
     int ret;
+    int dcc_busy;
 
     while (1) {
         for (i_cpu = 0; i_cpu < 50; i_cpu++) {
@@ -160,9 +173,25 @@ static int dcc_lock_one(struct dcc_hostd
                 ret = dcc_lock_host("cpu", h, i_cpu, 0, cpu_lock_fd);
 
                 if (ret == 0) {
-                    *buildhost = h;
-                    dcc_note_state_slot(i_cpu, strcmp(h->hostname, "localhost") == 0 ? DCC_LOCAL : DCC_REMOTE);
-                    return 0;
+                    //rs_log_warning("local lock done");
+                    /* taszluk: make the connection while the local lock is held;
+                     * if the host is busy, let's try another one */
+                    if (strcmp(h->hostname, "localhost") == 0) {
+                        dcc_busy=0;
+                    } else {
+                        rs_log_info("locking remote host '%s' ", h->hostname);
+                        dcc_busy = dcc_remote_connect(h, to_net_fd, from_net_fd, ssh_pid);
+                    }
+                    if (dcc_busy != 0) {
+                        rs_log_error("locking remote host '%s' failed with error code '%d'", h->hostname, dcc_busy);
+                        dcc_unlock(*cpu_lock_fd);
+                        continue;
+                    } else {
+                        //rs_log_warning("locking remote host '%s' success!", h->hostname);
+                        *buildhost = h;
+                        dcc_note_state_slot(i_cpu, strcmp(h->hostname, "localhost") == 0 ? DCC_LOCAL : DCC_REMOTE);
+                        return 0;
+                    }
                 } else if (ret == EXIT_BUSY) {
                     continue;
                 } else {
@@ -186,14 +215,45 @@ int dcc_lock_local(int *cpu_lock_fd)
 {
     struct dcc_hostdef *chosen;
 
-    return dcc_lock_one(dcc_hostdef_local, &chosen, cpu_lock_fd);
+    return dcc_lock_one(dcc_hostdef_local, &chosen, cpu_lock_fd, NULL, NULL, NULL);
 }
 
 int dcc_lock_local_cpp(int *cpu_lock_fd)
 {
     int ret;
     struct dcc_hostdef *chosen;
-    ret = dcc_lock_one(dcc_hostdef_local_cpp, &chosen, cpu_lock_fd);
+    ret = dcc_lock_one(dcc_hostdef_local_cpp, &chosen, cpu_lock_fd, NULL, NULL, NULL);
     dcc_note_state(DCC_PHASE_CPP, NULL, chosen->hostname, DCC_LOCAL);
     return ret;
 }
+
+/**
+ * Open a connection using either a TCP socket or SSH.  Return input
+ * and output file descriptors (which may or may not be different.)
+ **/
+static int dcc_remote_connect(struct dcc_hostdef *host,
+                              int *to_net_fd,
+                              int *from_net_fd,
+                              pid_t *ssh_pid)
+{
+    int ret;
+
+    if (host->mode == DCC_MODE_TCP) {
+        *ssh_pid = 0;
+        if ((ret = dcc_connect_by_name(host->hostname, host->port,
+                                       to_net_fd)) != 0)
+            return ret;
+        *from_net_fd = *to_net_fd;
+        return 0;
+    } else if (host->mode == DCC_MODE_SSH) {
+        if ((ret = dcc_ssh_connect(NULL, host->user, host->hostname,
+                                   host->ssh_command,
+                                   from_net_fd, to_net_fd,
+                                   ssh_pid)))
+            return ret;
+        return 0;
+    } else {
+        rs_log_crit("impossible host mode");
+        return EXIT_DISTCC_FAILED;
+    }
+}
diff --exclude=.svn -rupN distcc-3.2-arankine/src/where.h distcc-3.2rc2_new/src/where.h
--- distcc-3.2-arankine/src/where.h	2014-09-04 11:44:41.000000000 +0200
+++ distcc-3.2rc2_new/src/where.h	2014-10-16 14:01:10.858266311 +0200
@@ -22,8 +22,8 @@
  */
 
 /* where.c */
-int dcc_pick_host_from_list_and_lock_it(struct dcc_hostdef **,
-                                        int *cpu_lock_fd);
+int dcc_pick_host_from_list_and_lock_it(struct dcc_hostdef **buildhost,
+                            int *cpu_lock_fd, int *to_net_fd, int *from_net_fd, pid_t *ssh_pid);
 
 int dcc_lock_local(int *cpu_lock_fd);
 