Benchmarking the parallel downloader
====================================

Test transaction: download the F14 packages matching the pattern 'a[a-f]'.
That's 117 packages, about 56 MB in total.

Downloader 'modes': every test setup was run in 4 different downloader
configs:

compat = old urlgrabber code (single process, blocking)
direct = single-process parallel downloads using curlMulti.
extern = curlMulti in a single separate process.
pooled = many-process model, no curlMulti.

To get results as close to reality as possible, I've not used any network
shaping.  Network performance inherently varies over time, so every test
'batch' was repeated 3 times.

Test setup 1
============

This is the default config.  No fastest-mirror, no mirror shuffling, just use
the metalink.

compat:    1.672   1.668   1.222
direct:    1.741   1.777   1.240
extern:    1.888   1.790   1.517
pooled:    1.512   1.720   1.656

'download.englab.brq.redhat.com' is at the top of the list.  It's about 200
times faster than any of the other mirrors.  Download is very fast in all
modes, CPU being the bottleneck.

Test setup 2
============

Removed download.englab.brq.redhat.com from metalink.xml to get more
real-world conditions.

compat:  486.189 232.338 136.590
direct:  355.605 210.751 132.595
extern:  208.853 205.023 153.765
pooled:  194.219 227.433 199.849

Download times do not vary much wrt 'mode', because all downloads use the same
mirror, 'mirror.its.uidaho.edu', which was 2nd in the list.  It allows only 1
simultaneous connection, so the 'parallel' downloader uses just 1 connection
anyway.

Test setup 3
============

Added a simple 'mirror sweep' to urlgrabber/mirror.py that advances the master
mirror index after each async urlgrab() is issued.  Only the first 5 mirrors
are cycled, because preference/speed (as assigned by the metalink server)
decreases down the list, and I don't want to hit the 'slow' ones unless the
'fast' ones fail.
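
The sweep idea can be sketched roughly like this.  It is only an illustration,
not the actual urlgrabber/mirror.py change: the class 'MirrorSweeper' and its
method names are made up here, and only the "advance the master index over the
first N mirrors" behaviour comes from the description above.

```python
class MirrorSweeper:
    """Hypothetical helper: round-robin over the first `sweep` mirrors."""

    def __init__(self, mirrors, sweep=5):
        if not mirrors:
            raise ValueError("need at least one mirror")
        self.mirrors = list(mirrors)
        # Never sweep past the end of a short mirror list.
        self.sweep = max(1, min(sweep, len(self.mirrors)))
        self.master = 0

    def next_mirror(self):
        """Return the mirror for the next async request, then advance."""
        mirror = self.mirrors[self.master]
        self.master = (self.master + 1) % self.sweep
        return mirror


if __name__ == "__main__":
    sweeper = MirrorSweeper(["m%d" % i for i in range(8)], sweep=5)
    # Mirrors m5..m7 are never picked by the sweep itself.
    print([sweeper.next_mirror() for _ in range(7)])
```

The mirrors beyond the sweep window stay in the list, so a failover path could
still reach them if the 'fast' ones fail.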

compat:  358.594 429.342 291.456
direct:  108.089 119.010 142.701
extern:   83.734 138.664 179.561
pooled:   89.178 122.972 133.171

In this case the parallel downloaders are roughly 3 times faster than the
compat code, averaged over the three runs.
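
The ratio can be checked directly from the table above.  A small sketch that
averages the three runs per mode and divides the compat mean by each parallel
mode's mean (the names 'RUNS' and 'speedups' are just for this example):

```python
from statistics import mean

# Timings copied from the 'Test setup 3' table (three runs per mode).
RUNS = {
    "compat": [358.594, 429.342, 291.456],
    "direct": [108.089, 119.010, 142.701],
    "extern": [83.734, 138.664, 179.561],
    "pooled": [89.178, 122.972, 133.171],
}


def speedups(runs, baseline="compat"):
    """Return mean(baseline) / mean(mode) for every non-baseline mode."""
    base = mean(runs[baseline])
    return {
        mode: base / mean(times)
        for mode, times in runs.items()
        if mode != baseline
    }


if __name__ == "__main__":
    for mode, factor in sorted(speedups(RUNS).items()):
        print("%-7s %.2fx" % (mode, factor))
```

With only three runs per mode and this much run-to-run variation, the factors
are a rough indication rather than a precise measurement.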

Conclusions
===========

1) There's not much difference in the performance of direct, extern and pooled
modes.  I don't think the user should control this in config.  Using 'pooled'
as default (and possibly removing all other modes from urlgrabber) is probably
the way to go.

2) I don't like two constants that the code uses atm.  They are quite important
but probably don't warrant a config option.

DEFAULT_MAX_CONNECTIONS = 3

metalink.xml specifies the maximum number of simultaneous connections for a
particular mirror, but mirrorlist does not.  We need a suitable default.

MIRROR_SWEEP = 5

Instead of using the 'first' mirror for all urlgrab requests, cycle the master
index through the first N mirrors.  A value too small limits the total number
of connections, while a large value instructs the downloader to use 'slow',
distant mirrors.
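
To make the trade-off concrete, here is a rough sketch of how the two constants
could bound the total number of parallel connections.  It assumes the total is
simply the sum of the per-mirror limits over the swept mirrors, with the
default filling in where no limit is known; that formula and the function name
'connection_budget' are assumptions for illustration, not taken from the code.

```python
DEFAULT_MAX_CONNECTIONS = 3
MIRROR_SWEEP = 5


def connection_budget(mirror_limits, sweep=MIRROR_SWEEP,
                      default=DEFAULT_MAX_CONNECTIONS):
    """Upper bound on simultaneous connections (assumed model).

    `mirror_limits` holds one entry per mirror, in preference order:
    an int from metalink.xml, or None when the source (e.g. a plain
    mirrorlist) does not specify a limit.
    """
    total = 0
    for limit in mirror_limits[:sweep]:
        total += default if limit is None else limit
    return total


if __name__ == "__main__":
    limits = [1, None, 2, None, None, 4]
    print(connection_budget(limits, sweep=1))
    print(connection_budget(limits, sweep=5))
```

With a sweep of 1 and a first mirror that allows a single connection, the
budget collapses to 1, which matches what test setup 2 ran into.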

3) Also noticed a possible problem with the 'direct' downloader (uses 
curlMulti right within the 'yum' process).  It brings CPU to 100%, as it
likely busy-loops in the 'perform' method.  Seems easy to fix though.