[ https://issues.jenkins-ci.org/browse/JENKINS-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=162622#comment-162622 ]

David Reiss commented on JENKINS-3922:
--------------------------------------

This issue affected us in a big way once we moved our slaves to a remote 
datacenter.  From the descriptions, it seems like not everyone has the same 
problem that we did, but I'll explain how we fixed it.

h3. Diagnostics

- Make sure you can log into your master and run "scp slave:somefile ." and get 
the bandwidth that you expect.  If not, Jenkins is not your problem.  Check out 
http://wwwx.cs.unc.edu/~sparkst/howto/network_tuning.php if you are on a 
high-latency link.
- Compute your bandwidth-delay product (BDP).  This is the bandwidth you get 
from a raw scp, in bytes per second, times the round-trip time you get from 
ping.  In my case, this was about 4,000,000 bytes/s (4 MB/s) * 0.06 s (60 ms) 
= 240,000 bytes (240 kB).
- *If you are using ssh slaves and your BDP is greater than 16 kB, you are 
definitely having the same problem that we were.*  This is the trilead ssh 
window problem.
- If you are using any type of slave and your BDP is greater than 128 kB, then 
you are also affected by the jenkins remoting pipe window problem.
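
The arithmetic in the BDP step is trivial, but here it is as a sketch (Java, with my numbers hard-coded; substitute your own scp and ping measurements):

```java
public class Bdp {
    public static void main(String[] args) {
        // Bandwidth measured with a raw scp, in bytes per second (here: 4 MB/s).
        double bandwidthBytesPerSec = 4_000_000;
        // Round-trip time measured with ping, in seconds (here: 60 ms).
        double rttSeconds = 0.060;
        // Bandwidth-delay product: how many bytes must be in flight to fill the pipe.
        double bdpBytes = bandwidthBytesPerSec * rttSeconds;
        System.out.printf("BDP = %.0f bytes (%.0f kB)%n", bdpBytes, bdpBytes / 1000);
        // prints: BDP = 240000 bytes (240 kB)
    }
}
```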

h3. trilead ssh window problem

The ssh-slaves-plugin uses the trilead ssh library to connect to the slaves.  
Unfortunately, that library uses a hard-coded 30,000-byte receive buffer, which 
limits the amount of in-flight data to 30,000 bytes.  In practice, the 
algorithm it uses for updating its receive window rounds that down to a power 
of two, so you only get 16 kB.
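
The effect of that rounding is easy to demonstrate; Integer.highestOneBit here is my stand-in for "round down to a power of two", not the library's actual code:

```java
public class WindowRounding {
    public static void main(String[] args) {
        int configuredWindow = 30_000; // trilead's hard-coded receive buffer
        // Rounding the window down to the nearest power of two, as the
        // window-update algorithm effectively does:
        int effectiveWindow = Integer.highestOneBit(configuredWindow);
        System.out.println(effectiveWindow); // prints 16384, i.e. 16 kB
    }
}
```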

I created a pull request at https://github.com/jenkinsci/trilead-ssh2/pull/1 to 
make this configurable at JVM startup time.  Making this window large increased 
our bandwidth by a factor of almost 8.  Note that two of these buffers are 
allocated for each slave, so turning this up can consume memory quickly if you 
have several slaves.  In our case, we have memory to spare, so it wasn't a 
problem.  It might be useful to switch to another ssh library that allocates 
window memory dynamically. 

Fixing this will get your BDP up to almost 128 kB, but beyond that, you run into 
another problem.

h3. jenkins remoting pipe window problem

The archiving process uses a hudson.remoting.Pipe object to send the data back. 
 This object uses flow control to avoid overwhelming the receiver.  By default, 
it only allows 128kB of in-flight data.  There is already a system property 
that controls this constant, but it has a space in its name, which makes it a 
bit complicated to set.  I created a pull request at 
https://github.com/jenkinsci/remoting/pull/4 to fix the name.

Note that this property must be set on the *slave*'s JVM, not the master's.  
Therefore, to set it, you must go into your ssh slave configuration, click the 
Advanced button, find the "JVM Options" input, and enter "-Dclass\ 
hudson.remoting.Channel.pipeWindowSize=1234567" (no quotes, change the number 
to whatever is appropriate for your environment).  If my pull request is 
accepted, this will change to 
"-Dhudson.remoting.Channel.pipeWindowSize=1234567".  Note that this window is 
not preallocated, so you can make this number fairly large and excess memory 
will not be consumed unless the master is unable to keep up with data from the 
slave.
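
On the slave's JVM, a property like this is normally read with Integer.getInteger.  A minimal sketch of that pattern (the fixed property name and the 128 kB default are taken from the description above; the actual remoting code may differ in detail):

```java
public class PipeWindowConfig {
    // Default in-flight limit described above: 128 kB.
    static final int DEFAULT_WINDOW = 128 * 1024;

    public static void main(String[] args) {
        // Picks up -Dhudson.remoting.Channel.pipeWindowSize=... if it was set
        // on this JVM's command line, otherwise falls back to the default.
        int windowSize = Integer.getInteger(
                "hudson.remoting.Channel.pipeWindowSize", DEFAULT_WINDOW);
        System.out.println("pipe window = " + windowSize + " bytes");
    }
}
```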

Increasing both of these windows increased our bandwidth by a factor of about 15, 
matching the 4MB/s we were getting from raw scp.
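
That factor falls straight out of the window/RTT ceiling: with a fixed window W, throughput can never exceed W divided by the round-trip time.  A quick check with the numbers above:

```java
public class WindowCeiling {
    public static void main(String[] args) {
        double rttSeconds = 0.060;     // 60 ms round trip
        double window = 16 * 1024;     // trilead's effective 16 kB window
        double linkRate = 4_000_000;   // 4 MB/s measured with raw scp
        // Best case with a fixed window: one window per round trip.
        double ceiling = window / rttSeconds;
        System.out.printf("ceiling = %.0f B/s, slowdown = %.1fx%n",
                ceiling, linkRate / ceiling);
    }
}
```

Roughly 270 kB/s, a factor of about 15 below the 4 MB/s the link can actually do.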

h3. Good luck!

                
> Slave is slow copying maven artifacts to master
> -----------------------------------------------
>
>                 Key: JENKINS-3922
>                 URL: https://issues.jenkins-ci.org/browse/JENKINS-3922
>             Project: Jenkins
>          Issue Type: Bug
>          Components: master-slave
>    Affects Versions: current
>         Environment: Platform: All, OS: All
>            Reporter: John McNair
>            Assignee: Kohsuke Kawaguchi
>            Priority: Critical
>         Attachments: pom.xml
>
>
> The artifact transfer is currently a 3-4x penalty for the project that I am
> working on.  I have reproduced the issue with a simple test pom that does
> nothing but jar hudson.war.  I performed this test on a heterogeneous
> environment.  Both master and slave are running Fedora 10, but the master is a
> faster machine.  Still, it highlights the issue.
> Here are some stats (all stats are after caching dependencies in the local 
> repos):
> Master build through Hudson: 19s
> Master build from command line (no Hudson): 9s
> Slave build through Hudson: 1m46s
> Slave build from command line (no Hudson): 16s
> To be fair we should at least add time to do a straight scp of the artifact
> from slave to master.  The two nodes share a 100 Mbit switch:
> $ scp target/slow-rider-1.0.0-SNAPSHOT.jar master_node:
> slow-rider-1.0.0-SNAPSHOT.jar  100%   25MB  12.7MB/s   00:02
> Of course this example exaggerates the issue to make it more clear but not by
> too much.  I originally noticed this in a completely separate environment that
> was all virtual.  I reproduced this on two physical machines using a different
> switch and different ethernet drivers (both virtual and physical).  The
> reproducibility plus the comparison against command line + scp leads me to
> suspect eager flushing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.jenkins-ci.org/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
