On Mon, Sep 7, 2009 at 9:17 AM, Matthew
Toseland<toad at amphibian.dyndns.org> wrote:
> On Sunday 06 September 2009 23:51:48 Evan Daniel wrote:
>> I've been giving some thought to a plan for how to measure the
>> performance of Freenet in a statistically valid fashion, with enough
>> precision that we can assess whether a change helped or hurt, and by
>> how much. Previous changes, even fairly significant ones like FOAF
>> routing and variable peer counts, have proved difficult to assess.
>> These are my current thoughts on measuring Freenet; comments, whether
>> general or specific, would be much appreciated. The problem is hard,
>> and my knowledge of statistics is far from perfect. I'll be writing
>> another email asking for volunteers to collect data shortly, but I
>> want to do a little more with my stats collection code first.
>>
>> Measuring Freenet is hard. The common complaint is that the data is
>> too noisy. This isn't actually that problematic; extracting low-level
>> signals from lots of noise just requires lots of data and an
>> appropriate statistical test or two. What makes testing Freenet
>> really hard is that not only is the data noisy, collecting it well is
>> difficult. For starters, we have good reason to believe that there
>> are strong effects of both time of day and day of week. Node uptime
>> may matter, both session uptime and past history. Local node usage is
>> likely to vary, and probably causes variations in performance with
>> respect to remote requests as well. Because of security concerns, we
>> can't collect data from all nodes or even a statistically valid sample
>> of nodes.
>>
>> At present, my plan is to collect HTL histograms of request counts and
>> success rates, and log the histograms hourly, along with a few other
>> stats like datastore size, some local usage info, and uptime. My
>> theory is that although the data collection nodes do not represent a
>> valid sample, the requests flowing into them should. Specifically,
>> node locations and request locations are well distributed, in a manner
>> that should be entirely uncorrelated with whether a node is a test
>> node or whether a request gets routed to a test node. Higher
>> bandwidth nodes route more requests overall, and node bandwidth
>> probably shows sampling bias, but that should impact requests equally,
>> independent of what key is being requested. There may be some bias in
>> usage patterns, and available bandwidth may create a bias among peers
>> chosen that correlates with usage patterns and with being a test node.
>> In order to reduce these effects, I currently plan to use only the
>> data from HTL 16 and below; in my experiments so far, on my node, the
>> htl 18 and 17 data exhibits far more variation between sampling
>> intervals.
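
For reference, the hourly record I have in mind looks roughly like this
(Python sketch; the field names are illustrative, not what my stats
collection code actually uses):

# Illustrative shape of one hourly sample; field names are hypothetical.
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class HourlySample:
    hour_start: int        # unix timestamp truncated to the hour
    session_uptime: int    # seconds this session
    datastore_size: int    # bytes
    local_requests: int    # rough local-usage indicator
    incoming: Counter = field(default_factory=Counter)   # htl -> incoming requests
    successes: Counter = field(default_factory=Counter)  # htl -> successful requests

    def success_rate(self, htl):
        n = self.incoming[htl]
        return self.successes[htl] / n if n else None
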
>>
>> My current plan for data collection goes like this. Collect data from
>> before a change, binned hourly. When a new build is released, first
>> give the network a day or three to upgrade and stabilize, ignoring the
>> data during the upgrade period. Then, collect some more data. For
>> each participating node, take the data from the set of hours of the
>> week during which the node was running both before and after the
>> change, and ignore other hours. (If node A was gathering data for the
>> 09:00 hour on Monday both before and after the change, but only
>> gathered data for Monday's 10:00 hour during one of the two periods,
>> then we only look at the 09:00 hour data.)
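
Concretely, the hour-matching step might look something like this (rough
sketch, not the real code; it assumes each hourly sample carries an
hour_start unix timestamp):

# Keep only the (weekday, hour-of-day) slots for which a node has samples
# both before and after the change.
import time

def hour_of_week(sample):
    t = time.gmtime(sample.hour_start)
    return (t.tm_wday, t.tm_hour)

def paired_hours(before, after):
    common = ({hour_of_week(s) for s in before}
              & {hour_of_week(s) for s in after})
    return ([s for s in before if hour_of_week(s) in common],
            [s for s in after if hour_of_week(s) in common])
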
>>
>> Then, I need to perform some sort of non-parametric test on the data
>> to see whether the 'before' data is different from the 'after' data.
>> Currently I'm looking at one of Kruskal-Wallis one-way ANOVA, Wilcoxon
>> signed-rank, or MWW. I'm not yet sure which is best, and I may try
>> several approaches. I'll probably apply the tests to each distinct
>> htl separately, with appropriate multiple-tests corrections to the
>> p-values.
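
To make that concrete: scipy has all three tests, so trying them side by
side is cheap. A rough, untested sketch (success_rates() is a hypothetical
helper returning one per-hour success rate per sample, and Bonferroni is
just a placeholder for the multiple-tests correction):

from scipy.stats import kruskal, mannwhitneyu, wilcoxon

HTLS = range(1, 17)  # htl 16 and below, per the sampling plan above

def compare(before, after, success_rates):
    n_tests = len(HTLS)
    results = {}
    for htl in HTLS:
        b = success_rates(before, htl)
        a = success_rates(after, htl)
        pvals = {
            'kruskal-wallis': kruskal(b, a).pvalue,
            'mww': mannwhitneyu(b, a, alternative='two-sided').pvalue,
            # Wilcoxon signed-rank needs paired data (same hour-of-week slots):
            'wilcoxon': wilcoxon(b, a).pvalue if len(b) == len(a) else None,
        }
        # Crude Bonferroni correction across the htls tested.
        results[htl] = {name: (None if p is None else min(1.0, p * n_tests))
                        for name, p in pvals.items()}
    return results
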
>>
>> I also need to determine exactly what changes I expect to see. For
>> example, if a change makes the network better at finding data, then we
>> expect more requests that are sent to succeed. This may mean that
>> success rates go up at all htls. Or, it may mean that requests
>> succeed earlier, meaning that the low-htl requests contain fewer
>> requests for 'findable' data. So an improvement to the network might
>> result in a decrease in low-htl success rates. Roughly speaking, a
>> change that reduces the number of hops required to find data should
>> improve success rates at high htl and decrease them at low htl, but a
>> change that means more data becomes findable should improve them at
>> all htls. I expect that most changes would be a mix of the two.
>> Furthermore, I have to decide on how to treat local vs remote success
>> rates. The local success rate exhibits a strong bias with things like
>> node age and datastore size. However, the bias carries over into
>> remote success rates as well -- more local successes means that
>> requests that don't succeed will tend to be 'harder' requests. Taking
>> the global success rate is probably still heavily biased.
>>
>> One approach would be to look only at the incoming request counts.
>> Incoming request counts are only influenced by effects external to the
>> node, and therefore less subject to sampling bias. Averaged across
>> the network, the decrease in incoming requests from one htl to the
>> next (for the non-probabilistic drop htls, or with appropriate
>> corrections) represents the number of requests that succeeded at the
>> higher htl. However, this does not account for rejected incoming
>> requests, which decrement the htl at the sending node without
>> performing a useful function. (This will get even more complicated
>> with bug 3368 changes.)
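
In back-of-the-envelope form, for an htl where the decrement is
deterministic (sketch, ignoring rejections):

# Network-wide estimate of the fraction of requests that succeed at htl h,
# from incoming-request counts alone.  Requests that don't succeed at htl h
# get forwarded and show up elsewhere as incoming requests at htl h - 1.
def est_success_rate(incoming, h):
    return (incoming[h] - incoming[h - 1]) / incoming[h]
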
>>
>> My current plan is to look at global success rates, as they combine
>> whether the request has been routed to the right node (where it
>> results in a local success) and whether it gets routed properly in the
>> future (remote success). As we expect new nodes to become better at
>> serving requests as their cache and store fill up, I plan to only make
>> use of data from established nodes (for some undecided definition of
>> established).
>
> Very interesting! We should have tried to tackle this problem a long time 
> ago...
>
> We have a few possible changes queued that might be directly detectable:
> - A routing level change related to loop detection. Should hopefully increase 
> success rates / reduce hops taken, but only fractionally. This may however be 
> detectable given the above...

I hope it will be detectable; I'm less certain that we'll know how to
interpret it, or be confident that we haven't missed some other cause
for a change (if we see one).

> - Bloom filter sharing. Hopefully this will increase success rates at all 
> levels, making more content available, however there is a small overhead.
>
> There is also work to be done on the client level:
> - MHKs (multiple top blocks).
> - Various changes to splitfiles: add some extra check blocks for non-full 
> segments, split them evenly, etc.
> - Reinserting the top block on fetching a splitfile where it took some time 
> to find it.
>
> These should not have much of an effect on the routing level - if they have 
> an effect it is probably negative. However they should have a significant 
> positive effect on success rates.

If they make a dramatic change to user-level file success rates,
they'll have an indirect impact at the routing level.  Right now
people leave files queued for a long time, making many requests for
unfindable keys.  If the file success rates improve, that will happen
less, and people will queue newer files instead.  Both effects should
improve routing-level success rates.

>
> So far the only tool we have for measuring this is the LongTermPushPullTest. 
> This involves inserting a 64KB splitfile to an SSK and then pulling after 
> (2^n)-1 days for various n (the same key is never requested twice).
>
> This is currently giving significantly worse results for 3 days than for 7 
> days.
>
> 0 days (= push-pull test): 13 samples, all success. (100% success)
> 1 days: 11 success, 1 Splitfile error (meaning RNFs/DNFs). (92% success)
> 3 days: 4 success, 3 Data not found, 3 Not enough data found (40% success)
> 7 days: 6 success, 3 Data not found (2 DNFs not eligible because nothing was 
> inserted 7 days prior) (66.7% success)
> 15 days: 1 success, 1 Not enough data found (50% success)
>
> There isn't really enough data here, but it is rather alarming.
>
> Any input? How much data is enough data? Other ways to measure?

I don't think you have enough data to distinguish the 3/7/15 day
numbers from each other, but I haven't run the tests yet.  I think you
can say that the 0 / 1 day results do better than the longer term ones
(but again, no tests yet).  I'll run the tests and get back to you.

In general, it's difficult to establish a difference without at least
3 entries in each category (3+ successes, 3+ fails) for one of the
groups, and ~10 samples in each group.  Sufficiently extreme data is
an exception, of course.
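
For counts this small, the test I'd probably reach for is Fisher's exact
test on a 2x2 table of successes and failures; e.g. comparing the 3-day
and 7-day numbers above (scipy sketch):

# Fisher's exact test on the 3-day vs 7-day pull results quoted above:
# 3 days: 4 successes, 6 failures; 7 days: 6 successes, 3 failures.
from scipy.stats import fisher_exact

table = [[4, 6],   # 3 days: successes, failures
         [6, 3]]   # 7 days: successes, failures
odds_ratio, p_value = fisher_exact(table)
print(odds_ratio, p_value)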

I'll think about other things to measure and send a follow-up to this email.

Evan Daniel
