Ian Clarke (ian at locut.us) wrote:

> Using VMStat is a cop-out.  If we can't find a correlation between any
> of the intuitive measurable factors (such as threads active or network
> load) then it suggests that something is wrong - and we should figure
> out why.

There are two bottlenecks which affect a node's performance: CPU and
network bandwidth.  (Sure, there's also memory, disk I/O throughput and
latency, and possibly others, but for the purposes of this discussion
we're only concerned with the factors that affect the majority of
nodes/users.  To the best of my knowledge, these are CPU and network.)

Consider case 1: a Pentium 120 running Fred in a university dorm that
hasn't been firewalled yet.  It's probably on something like a T1 or a
T3.

Now consider case 2: a Duron 1.3 running Fred on ADSL with a 256 kbps
upstream.

Each of these machines is capable of handling a certain number of
requests per hour.  In case 1, the network is never going to be
saturated by Fred alone; the CPU will be the bottleneck.  In case 2,
the CPU may be overwhelmed from time to time (e.g. FEC decoding) but in
general the network will be the bottleneck.  (I could have given a
"case 2" with even more bias in favor of the network, but my case 2 is
a real machine.)

> What are we going to do with this load measurement anyway?  We will
> reduce the number of concurrent requests - yet the whole rationale for
> this is that the number of concurrent requests doesn't directly affect
> load - so the whole rationale for this falls apart.

No, the problem is that the number of requests is a useful measurement,
but only if we know how many requests the node is *capable* of.  Some
nodes can handle 1000 requests per hour and some can handle 100,000
requests per hour.

There's no way you can know what the working capacity of a node is.
But you can measure the CPU load (with vmstat or other per-platform
tools), and if a threshold has been exceeded, you can refuse requests
until the load drops to an acceptable level.  Unix daemons have been
doing this for two decades now (I'm thinking primarily of sendmail).

This leads us to the other variable: network bandwidth.
The only reasonable measure of this that I'm aware of is the "Bytes
waiting to be sent" field in the OCM page
(/servlet/nodeinfo/networking/ocm).  If that value becomes sufficiently
large, then once again, you can refuse requests until the outgoing
queue is a bit leaner.

> The solution to this problem is to do profiling to understand why some
> things affect the load more than others, and fortunately some people
> have already started work in this area.

No, profiling is a solution to a *different* problem, namely: how do we
make Fred more efficient?  That's a long-term problem, and certainly an
important one, but it's separate from what we're talking about right
now.

Matthew's vmstat proposal is a partial solution to the immediate
problem, which is: why are some nodes overloaded and others
underutilized?

My proposal (building on Matthew's) to address this problem is:

 1) Use vmstat or other per-platform tools to determine the CPU load,
    and throttle back (refuse requests, etc.) if it's too high.

 2) Use the "Bytes waiting to be sent" field to determine the network
    load, and throttle back (refuse requests, etc.) if it's too high.

I'm not as familiar with the code as you and Matthew are.  I'm willing
to accept that my proposal might have flaws, even deadly ones which
would make it totally unsuitable.  But so far I haven't seen an
acceptably powerful rebuttal of our basic position.

-- 
Greg Wooledge                  |   "Truth belongs to everybody."
greg at wooledge.org           |    - The Red Hot Chili Peppers
http://wooledge.org/~greg/     |
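
The two-step proposal above can be sketched roughly as follows.  This
is only an illustration, not Fred's code: the names Throttle and
accept_request are hypothetical, the threshold numbers are made up, and
the use of separate "high" and "low" marks is one assumed reading of
"refuse requests until the load drops to an acceptable level".

```python
from dataclasses import dataclass


@dataclass
class Throttle:
    """Refuse requests while a measured value is too high (hypothetical helper)."""
    high: float            # start refusing when the measurement exceeds this
    low: float             # resume accepting once it drops to this or below
    refusing: bool = False

    def update(self, value: float) -> bool:
        """Record a new measurement; return True if requests should be refused."""
        if value > self.high:
            self.refusing = True
        elif value <= self.low:
            self.refusing = False
        # Between low and high, keep the previous state so the node does
        # not flap between accepting and refusing on every sample.
        return self.refusing


def accept_request(cpu: Throttle, net: Throttle,
                   cpu_load: float, bytes_waiting: int) -> bool:
    """Accept only if neither the CPU nor the outgoing queue is overloaded."""
    cpu_busy = cpu.update(cpu_load)
    net_busy = net.update(bytes_waiting)
    return not (cpu_busy or net_busy)


# Illustrative thresholds only (assumptions, not tuned recommendations).
cpu = Throttle(high=90.0, low=70.0)               # percent CPU busy
net = Throttle(high=512 * 1024, low=128 * 1024)   # bytes waiting to be sent
```

In a real node, cpu_load would presumably come from vmstat or another
per-platform tool, and bytes_waiting from whatever counter feeds the
"Bytes waiting to be sent" field; both are simply passed in here so the
decision logic can be read on its own.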