I've followed up my previous crude estimates of node churn with some more detailed numbers. (See my mail in re: "data persistence again" on 20100122 for the previous version and a more detailed explanation.)
Again, some brief caveats. The following basically assumes that all samples are independent. This is quite incorrect, because of time-of-day effects. Nonetheless, I think it's useful. Many of the obvious uses for this data ("If an insert is stored on 3 nodes, how likely is it that one of them will be online later?") are strongly impacted by this. Use appropriate caution in analysis. Also, I have a few missing samples; for each sample, I looked at the previous set of 24 samples that I did have, whether or not those were contiguous.

What I did: for each of the probe request samples, I computed how many nodes appeared in n of the previous 24 samples (24 samples at 5 hour intervals is a 5 day window). I then averaged these counts across samples.

If an average sample has N_i nodes appearing in i of the previous 24 samples, then the average sample size over those 24 is sum(N_i*(i/24)). Over the 387 samples (ignoring the first 23, for which there is no full set of "most recent 24 samples"), I have an average sample size of 5757.1 nodes.

If we assume that each node is online with probability i/24, and all nodes are independent (see previous caveat about this assumption being incorrect), then the number of nodes that are online in both of two different sampling intervals is sum(N_i*(i/24)^2). For this number, I get 3511.5 nodes. That is, if you select a random online node at some time t_1, the probability that it will be online at some later time t_2 is about 3511.5/5757.1 = 0.610.

I then repeated the above using the most recent 72 samples (15 days). The distributions were roughly similar. Average sample size was 5824.1 and the expected number of nodes online in both of two samples was 3106.8, for a probability of 0.533 that a randomly chosen online node will be online later.

Nodes online in 24 of 24 samples make up 21.9% of an average sample. Nodes online in 70, 71, or 72 of 72 samples make up 13.6%.
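The two sums described above (average sample size and expected overlap) can be checked with a short script. This is only a minimal sketch, not the spreadsheet actually used: the function names are made up for illustration, the counts are the averaged 24-sample figures copied from the data summary in this message, and the overlap figure inherits the independence assumption that is already flagged as incorrect.

```python
# Averaged 24-sample counts, copied from the "Data summaries" section of this
# message. COUNTS_24[i - 1] is N_i: the average number of nodes that appeared
# in exactly i of the previous 24 samples.
COUNTS_24 = [
    2743.1, 1783.0, 1409.7, 1187.3, 1030.5, 908.8, 792.4, 700.2,
    616.1, 552.7, 488.4, 434.6, 388.3, 348.4, 315.9, 274.7,
    239.9, 212.3, 196.4, 190.9, 192.1, 218.8, 257.8, 1257.7,
]


def average_sample_size(counts):
    """sum(N_i * (i/w)), where w is the window length in samples."""
    w = len(counts)
    return sum(n * (i / w) for i, n in enumerate(counts, start=1))


def expected_overlap(counts):
    """sum(N_i * (i/w)^2): nodes expected online in both of two samples.

    ASSUMPTION: each node is online with probability i/w, independently in
    every sample. Time-of-day effects make this wrong in practice.
    """
    w = len(counts)
    return sum(n * (i / w) ** 2 for i, n in enumerate(counts, start=1))


def return_probability(counts):
    """Chance that a randomly chosen online node is online in another sample."""
    return expected_overlap(counts) / average_sample_size(counts)


if __name__ == "__main__":
    print(f"average sample size: {average_sample_size(COUNTS_24):.1f}")  # 5757.1
    print(f"expected overlap:    {expected_overlap(COUNTS_24):.1f}")     # 3511.5
    print(f"return probability:  {return_probability(COUNTS_24):.3f}")   # 0.610
```

The same functions work on the 72-sample counts, since the window length is taken from the length of the list.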
Low-uptime nodes (< 40% according to sink logic; here taken as <= 9 samples of 24 or <= 27 of 72, to make the 24/72 numbers directly comparable) are 30.8% on the 24-sample data, and 37.7% on the 72-sample data.

I believe both of these discrepancies between the 24-sample and 72-sample figures result from join/leave churn, whether permanent or over medium time periods (i.e. users who use Freenet for a couple of hours or days every few weeks).

Evan Daniel

(If you want the full spreadsheet or raw data, ask. The spreadsheet was nearly 0.5 MiB, so I didn't attach it. The averaged counts are below; this is enough to reproduce my calculations assuming samples are independent.)

Data summaries:

24 samples:
n_samples  n_nodes_average
1    2743.1
2    1783.0
3    1409.7
4    1187.3
5    1030.5
6    908.8
7    792.4
8    700.2
9    616.1
10   552.7
11   488.4
12   434.6
13   388.3
14   348.4
15   315.9
16   274.7
17   239.9
18   212.3
19   196.4
20   190.9
21   192.1
22   218.8
23   257.8
24   1257.7

72 samples:
n_samples  n_nodes_average
1    3309.0
2    1780.6
3    1353.9
4    1141.8
5    994.1
6    855.5
7    770.1
8    700.5
9    643.2
10   597.4
11   553.1
12   517.5
13   482.2
14   453.8
15   431.6
16   411.5
17   387.7
18   368.8
19   352.9
20   330.5
21   317.9
22   301.3
23   290.1
24   269.5
25   254.9
26   241.8
27   231.8
28   225.7
29   214.7
30   204.7
31   195.2
32   184.0
33   176.6
34   169.2
35   160.2
36   150.4
37   145.6
38   140.8
39   133.9
40   130.5
41   127.3
42   121.4
43   117.4
44   112.4
45   105.5
46   101.4
47   97.5
48   94.5
49   91.8
50   86.9
51   82.8
52   78.5
53   75.9
54   73.9
55   70.5
56   67.6
57   67.0
58   67.9
59   63.7
60   62.1
61   62.5
62   64.2
63   67.4
64   66.6
65   69.1
66   68.5
67   74.7
68   84.6
69   95.6
70   111.4
71   159.1
72   527.9
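The uptime-band percentages quoted above (the always-online share and the low-uptime share) can be re-derived from the averaged counts in the same way. The following is a rough sketch using the 72-sample table; `band_share` is a hypothetical helper name, and the band boundaries (70 to 72 for "nearly always online", 1 to 27 for "low uptime") are the ones stated in this message. Because the table is rounded to one decimal place, results may differ from the spreadsheet in the last digit.

```python
# Averaged 72-sample counts, copied from the table above. COUNTS_72[i - 1] is
# the average number of nodes seen in exactly i of the previous 72 samples.
COUNTS_72 = [
    3309.0, 1780.6, 1353.9, 1141.8, 994.1, 855.5, 770.1, 700.5,
    643.2, 597.4, 553.1, 517.5, 482.2, 453.8, 431.6, 411.5,
    387.7, 368.8, 352.9, 330.5, 317.9, 301.3, 290.1, 269.5,
    254.9, 241.8, 231.8, 225.7, 214.7, 204.7, 195.2, 184.0,
    176.6, 169.2, 160.2, 150.4, 145.6, 140.8, 133.9, 130.5,
    127.3, 121.4, 117.4, 112.4, 105.5, 101.4, 97.5, 94.5,
    91.8, 86.9, 82.8, 78.5, 75.9, 73.9, 70.5, 67.6,
    67.0, 67.9, 63.7, 62.1, 62.5, 64.2, 67.4, 66.6,
    69.1, 68.5, 74.7, 84.6, 95.6, 111.4, 159.1, 527.9,
]


def band_share(counts, lo, hi):
    """Fraction of an average sample made up of nodes that appeared in
    between lo and hi (inclusive) of the window's samples.

    A node seen in i of w samples contributes i/w of a node to the
    average sample, so each count is weighted by i/w before summing.
    """
    w = len(counts)
    weighted = [n * (i / w) for i, n in enumerate(counts, start=1)]
    return sum(weighted[lo - 1:hi]) / sum(weighted)


if __name__ == "__main__":
    print(f"online in 70-72 of 72: {band_share(COUNTS_72, 70, 72):.1%}")  # 13.6%
    print(f"low uptime (<= 27):    {band_share(COUNTS_72, 1, 27):.1%}")   # 37.7%
```

Calling `band_share` on the 24-sample counts with the bands 24 to 24 and 1 to 9 gives the corresponding 24-sample figures.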