Benchmarking the parallel downloader
====================================

Test transaction: Download packages from F14 matching the pattern 'a[a-f]'.
It's 117 packages, about 56 MB in total.

Downloader 'modes': Every test setup ran in 4 different downloader configs:
  compat = old urlgrabber code (single process, blocking).
  direct = single-process parallel downloads using curlMulti.
  extern = curlMulti in a single separate process.
  pooled = many-process model, no curlMulti.

To get results as close to reality as possible, I've not used any network
shaping. Network performance inherently varies over time, so every test
'batch' was repeated 3 times.

Test setup 1
============

This is the default config. No fastest-mirror, no mirror shuffling, just use
the metalink.

  compat: 1.672 1.668 1.222
  direct: 1.741 1.777 1.240
  extern: 1.888 1.790 1.517
  pooled: 1.512 1.720 1.656

'download.englab.brq.redhat.com' is at the top of the list. It's about 200
times faster than any of the other mirrors. Download is very fast in all
modes, CPU being the bottleneck.

Test setup 2
============

Removed download.englab.brq.redhat.com from metalink.xml to get more
real-world conditions.

  compat: 486.189 232.338 136.590
  direct: 355.605 210.751 132.595
  extern: 208.853 205.023 153.765
  pooled: 194.219 227.433 199.849

Download times do not vary much with respect to 'mode', because all downloads
use the same mirror, 'mirror.its.uidaho.edu', which was 2nd in the list. It
allows only 1 simultaneous connection, so the 'parallel' downloader uses just
1 connection anyway.

Test setup 3
============

Added a simple 'mirror sweep' to urlgrabber/mirror.py that advances the
master mirror index after an async urlgrab() is issued. Only the first 5
mirrors are cycled, as their preference/speed (as assigned by the metalink
server) is decreasing, and I don't want to hit the 'slow' ones unless the
'fast' ones fail.

  compat: 358.594 429.342 291.456
  direct: 108.089 119.010 142.701
  extern:  83.734 138.664 179.561
  pooled:  89.178 122.972 133.171

In this case the parallel downloader is about 3 times faster than the compat
code.

Conclusions
===========

1) There's not much difference in the performance of the direct, extern and
pooled modes. I don't think the user should control this in config.
Using 'pooled' as the default (and possibly removing all other modes from
urlgrabber) is probably the way to go.

2) I don't like two constants that the code uses at the moment. They are
quite important but probably don't warrant a config option.

   DEFAULT_MAX_CONNECTIONS = 3

   metalink.xml specifies the maximum number of simultaneous connections for
   a particular mirror, but mirrorlist does not. We need a suitable default.

   MIRROR_SWEEP = 5

   Instead of using the 'first' mirror for all urlgrab requests, cycle the
   master index through the first N mirrors. A value that is too small limits
   the total number of connections, while a large value instructs the
   downloader to use 'slow', distant mirrors.

3) I also noticed a possible problem with the 'direct' downloader (it uses
curlMulti right within the 'yum' process). It brings the CPU to 100%, as it
likely busy-loops in the 'perform' method. Seems easy to fix though.

_______________________________________________
Yum-devel mailing list
http://lists.baseurl.org/mailman/listinfo/yum-devel
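
For illustration, the MIRROR_SWEEP idea from conclusion 2 could be sketched roughly as follows. This is a minimal sketch only, not the actual change made to urlgrabber/mirror.py: the class name SweepingMirrorList, the helper assign(), and the mirror names in the usage note are hypothetical, and failure handling (falling back to the 'slow' mirrors) is left out.

```python
MIRROR_SWEEP = 5  # value quoted in the message; everything else here is assumed


class SweepingMirrorList:
    """Hypothetical helper; this is NOT urlgrabber's real mirror API."""

    def __init__(self, mirrors, sweep=MIRROR_SWEEP):
        if not mirrors:
            raise ValueError("need at least one mirror")
        self.mirrors = list(mirrors)
        # Never sweep past the end of a short mirror list.
        self.sweep = max(1, min(sweep, len(self.mirrors)))
        self.index = 0  # the 'master' mirror index

    def next_mirror(self):
        """Return the mirror for the next request, then advance the index."""
        mirror = self.mirrors[self.index]
        # Wrap around within the first `sweep` mirrors only, so the
        # lower-preference mirrors further down the list are not used.
        self.index = (self.index + 1) % self.sweep
        return mirror


def assign(mirrors, paths, sweep=MIRROR_SWEEP):
    """Pair each requested path with the mirror the sweep would pick."""
    group = SweepingMirrorList(mirrors, sweep)
    return [(group.next_mirror(), path) for path in paths]
```

With seven mirrors named m0..m6 and seven requests, assign() would hand out m0, m1, m2, m3, m4 and then wrap back to m0 and m1, so m5 and m6 are never touched. A smaller sweep value concentrates requests on fewer mirrors, which is the trade-off described for MIRROR_SWEEP.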

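
As a rough cross-check of the "Test setup 3" numbers, the speedup of each parallel mode over compat can be computed from the mean of the three runs. This is only a sketch: the choice of the mean as the summary statistic and the names SETUP_3 and speedups() are assumptions, and the message does not state the unit of the timings (seconds seems likely).

```python
from statistics import mean

# Timings copied from "Test setup 3"; unit not stated (presumably seconds).
SETUP_3 = {
    "compat": [358.594, 429.342, 291.456],
    "direct": [108.089, 119.010, 142.701],
    "extern": [83.734, 138.664, 179.561],
    "pooled": [89.178, 122.972, 133.171],
}


def speedups(results, baseline="compat"):
    """Return mean(baseline) / mean(mode) for every non-baseline mode."""
    base = mean(results[baseline])
    return {
        mode: base / mean(times)
        for mode, times in results.items()
        if mode != baseline
    }


if __name__ == "__main__":
    for mode, ratio in speedups(SETUP_3).items():
        print(f"{mode}: {ratio:.2f}x faster than compat")
```

On these figures the ratios come out between roughly 2.7x and 3.1x. With only three runs per mode and this much run-to-run variation, the differences between direct, extern and pooled are not significant.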