I've followed up my previous crude estimates of node churn with some
more detailed numbers.  (See my mail in re: "data persistence again"
on 20100122 for previous version and more detailed explanation.)

Again, some brief caveats: the following basically assumes that all
samples are independent.  This is quite incorrect, because of
time-of-day effects.  Nonetheless, I think it's useful.  Many of the
obvious uses for this data ("If an insert is stored on 3 nodes, how
likely is it one of them will be online later?") are strongly
affected by this non-independence.  Use appropriate caution in
analysis.  Also, I have a few missing
samples; for each sample, I looked at the previous set of 24 samples
that I did have, whether or not those were contiguous.

What I did: for each of the probe request samples, I computed how many
nodes appeared in exactly i of the previous 24 samples, for i = 1..24
(24 samples at 5 hour intervals is a 5 day window).  I then averaged
these counts across samples.  If an average sample has N_i nodes
appearing in i of the previous 24 samples, then the average sample
size over those 24 is sum(N_i*(i/24)).  Over the 387 samples (ignoring
the first 23, where there isn't a "most recent 24 samples"), I have an
average sample size of 5757.1 nodes.  If we assume that each node is
online with
probability i/24, and all nodes are independent (see previous caveat
about this assumption being incorrect), then the expected number of
nodes that are online in both of two different sampling intervals is
sum(N_i*(i/24)^2).  For this number, I get 3511.5 nodes.  That is, if
you select a random online node at some time t_1, the probability that
it will be online at some later time t_2 is about 3511.5/5757.1 =
0.610.
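
A minimal sketch of this calculation in Python, assuming the averaged
counts from the 24-sample table at the end of this mail.  The names
COUNTS_24 and persistence_estimate are illustrative only (they are not
from the original spreadsheet), and the same independence assumption
applies:

```python
# Averaged counts N_i for i = 1..24, copied from the 24-sample table.
COUNTS_24 = [
    2743.1, 1783.0, 1409.7, 1187.3, 1030.5, 908.8, 792.4, 700.2,
    616.1, 552.7, 488.4, 434.6, 388.3, 348.4, 315.9, 274.7,
    239.9, 212.3, 196.4, 190.9, 192.1, 218.8, 257.8, 1257.7,
]


def persistence_estimate(counts):
    """Return (average sample size, expected nodes in both samples, ratio).

    counts[i - 1] is the average number of nodes seen in i of the
    window's samples.  Assumes each such node is online with probability
    i / window, independently of everything else (known to be inaccurate).
    """
    window = len(counts)
    # sum(N_i * (i / window)): expected size of one sample.
    avg_size = sum(n * i / window for i, n in enumerate(counts, start=1))
    # sum(N_i * (i / window)^2): expected nodes present in two samples.
    in_both = sum(n * (i / window) ** 2 for i, n in enumerate(counts, start=1))
    return avg_size, in_both, in_both / avg_size


avg_size, in_both, p_later = persistence_estimate(COUNTS_24)
print(f"{avg_size:.1f} {in_both:.1f} {p_later:.3f}")  # 5757.1 3511.5 0.610
```

The window length is taken from the length of the list, so the same
function should work unchanged on the 72-sample table.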

I then repeated the above using the most recent 72 samples (15 days).
The distributions were roughly similar.  Average sample size was
5824.1, expected nodes online in both of two samples is 3106.8, or a
probability of 0.533 that a randomly chosen node will be online later.

Nodes online in 24 of 24 samples make up 21.9% of an average sample.
Nodes online in 70, 71, or 72 samples make up 13.6%.  Low-uptime nodes
(< 40% according to sink logic; here taken as <= 9 samples of 24 or <=
27 of 72 (to make the 24/72 numbers directly comparable)) are 30.8% on
the 24-sample data, and 37.7% on the 72-sample data.  I believe both
of these discrepancies (fewer always-on nodes and more low-uptime
nodes in the longer window) result from join/leave churn, whether
permanent or over medium time periods (i.e. users who use Freenet for
a couple of hours or days every few weeks).
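
The low-uptime share can be checked the same way.  The sketch below
assumes the percentages are shares of an average sample, i.e. each
N_i is weighted by i/24; that reading reproduces the 30.8% figure from
the rounded counts.  The name share_of_sample is hypothetical:

```python
# Averaged counts N_i for i = 1..24, copied from the 24-sample table.
COUNTS_24 = [
    2743.1, 1783.0, 1409.7, 1187.3, 1030.5, 908.8, 792.4, 700.2,
    616.1, 552.7, 488.4, 434.6, 388.3, 348.4, 315.9, 274.7,
    239.9, 212.3, 196.4, 190.9, 192.1, 218.8, 257.8, 1257.7,
]


def share_of_sample(counts, lo, hi):
    """Fraction of an average sample made up of nodes seen in lo..hi samples.

    Each N_i is weighted by i, since a node seen in i samples of the
    window contributes i/window to the average sample; the common
    1/window factor cancels in the ratio.
    """
    weighted = [n * i for i, n in enumerate(counts, start=1)]
    return sum(weighted[lo - 1:hi]) / sum(weighted)


# Low-uptime nodes: seen in <= 9 of the 24 samples.
low_uptime = share_of_sample(COUNTS_24, 1, 9)
print(f"{low_uptime:.1%}")  # 30.8%
```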

Evan Daniel

(If you want the full spreadsheet or raw data, ask.  The spreadsheet
was nearly 0.5 MiB, so I didn't attach it.  The averaged counts are
below; this is enough to reproduce my calculations assuming samples
are independent.)

Data summaries:

24 samples:

n_samples n_nodes_average
1       2743.1
2       1783.0
3       1409.7
4       1187.3
5       1030.5
6       908.8
7       792.4
8       700.2
9       616.1
10      552.7
11      488.4
12      434.6
13      388.3
14      348.4
15      315.9
16      274.7
17      239.9
18      212.3
19      196.4
20      190.9
21      192.1
22      218.8
23      257.8
24      1257.7


72 samples:

n_samples n_nodes_average
1       3309.0
2       1780.6
3       1353.9
4       1141.8
5       994.1
6       855.5
7       770.1
8       700.5
9       643.2
10      597.4
11      553.1
12      517.5
13      482.2
14      453.8
15      431.6
16      411.5
17      387.7
18      368.8
19      352.9
20      330.5
21      317.9
22      301.3
23      290.1
24      269.5
25      254.9
26      241.8
27      231.8
28      225.7
29      214.7
30      204.7
31      195.2
32      184.0
33      176.6
34      169.2
35      160.2
36      150.4
37      145.6
38      140.8
39      133.9
40      130.5
41      127.3
42      121.4
43      117.4
44      112.4
45      105.5
46      101.4
47      97.5
48      94.5
49      91.8
50      86.9
51      82.8
52      78.5
53      75.9
54      73.9
55      70.5
56      67.6
57      67.0
58      67.9
59      63.7
60      62.1
61      62.5
62      64.2
63      67.4
64      66.6
65      69.1
66      68.5
67      74.7
68      84.6
69      95.6
70      111.4
71      159.1
72      527.9
