On Thu, Jan 21, 2010 at 12:38 PM, Matthew Toseland
<toad at amphibian.dyndns.org> wrote:
> On Wednesday 20 January 2010 15:44:22 Evan Daniel wrote:
>> On Wed, Jan 20, 2010 at 8:54 AM, Matthew Toseland
>> <toad at amphibian.dyndns.org> wrote:
>>
>> > 4) Capacity. IMHO if Freenet is working well we should not need insert on 
>> > demand: Its capacity should be much greater than it is now, and we should 
>> > be able to just insert and fetch the data.
>>
>> Actually, I'm not completely convinced this is a problem.  At present,
>> I believe most of our data persistence issues stem from node churn,
>> not blocks falling out of individual stores.  Right now that's just a
>> (somewhat justified) hunch, but I have sufficient data to investigate
>> in more detail.
>
> Okay, let's consider your old data (from flog):
>
> ==
> That says to me that approximately 38% of users are "occasional" users, who 
> either only run their node some of the time, or install, run briefly, and 
> then uninstall.  A further 23% are dedicated users -- they have their nodes on 
> all the time.  The remaining 38% (in 1 or 2 samples) I'll call "regular" 
> users -- they frequently have their node running, but not always.
>
> Obviously, these classifications are very rough.  I'd say a 1:2:2 ratio is 
> probably a reasonable guess, but it could still be rather far off.  I need to 
> take data more regularly and for a longer period of time before any serious 
> conclusions can be drawn.  However, I am comfortable saying the following: 
> Freenet has at least 4000 semi-regular or regular users (probably 
> meaningfully more).  Freenet probably has between 8000 and 12000 total users 
> (the upper bound I'm far less certain of -- if a lot of people only run 
> Freenet for an hour or two per day, it could be far higher).  At most, about 
> a third of users run their node 24/7; the actual number is probably well 
> under that.
>
> I think this has several practical implications.  First, we need to be 
> working on data retention more, with a focus on retention despite low-uptime 
> nodes.  (See bugs 3495, 3514 for a start on that.  2933 should also help. 
> 3637/3639 and the like address more general routing issues; that should help 
> as well.)  Second, we need to figure out how to get these low-uptime nodes 
> back onto the network, and connected usefully, so that the data they have can 
> be found (and to improve the performance for such users).  (See 3583 and 
> related bugs for one approach.)  And, finally, we have the general problem of 
> getting (and keeping!) more users.
> ==
>
> So let's say it's reasonable to assume that 40% are 24/7, 20% are newbies, and 
> 40% are maybe an average of 50% uptime? Can you give a rough figure for the 
> average uptime in the "regular" group?

Unfortunately, that's overly optimistic, I think.

Executive summary: of total users, it's more like 14%:21%:65%
(dedicated:regular:occasional/newbie).  For nodes online at any one
time, 37%:33%:30%.  Uptime averages are 96%:57%:17% for the three
groups.

Detailed analysis:

This is all made harder when you start looking at the data in detail.
There is no clear boundary between groups that are always on, mostly
on, sometimes on, occasionally on, etc.

First, let's define the term "users".  I don't think the precise
definition matters much, so long as we're consistent.  I happen to
like using the count of unique nodes that appear in more than one of
the most recent 24 samples (5 days) of my probe data.  That gives
typical user counts of ~14000.  Typical instantaneous network size is
~6000, for an average uptime among repeat nodes of about 43%.
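
For concreteness, here's roughly how those numbers fall out of the
probe data (a sketch only, not my actual analysis script; samples_seen
is a hypothetical dict of node identity -> number of the last 24
samples the node appeared in):

    # Sketch: user count and average uptime from probe samples.
    # samples_seen: {node identity: count of the last 24 samples (5 days)
    # in which that node appeared}.  Hypothetical input, illustrative only.
    NUM_SAMPLES = 24

    def summarize(samples_seen):
        # "Users" = nodes seen in more than one sample.
        users = dict((n, c) for n, c in samples_seen.items() if c > 1)
        user_count = len(users)                                       # ~14000
        # Expected number online at any instant = sum of per-node uptimes.
        online = sum(c / float(NUM_SAMPLES) for c in users.values())  # ~6000
        return user_count, online, online / user_count                # ~0.43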

Looking at the data from the most recent sample, here are counts of
how many nodes appeared in n of the last 24 samples:

24  1353
23  275
22  228
21  173
20  169
19  202
18  218
17  246
16  273
15  314
14  318
13  390
12  418
11  480
10  522
9   599
8   731
7   790
6   930
5   978
4   1143
3   1315
2   1595
1   2385

Clearly, there is a significant core of users who leave their nodes
on most of the time.  It's tempting to classify the nodes appearing in
24/24 samples as one group, and everyone else as one (or more) other
groups.  However, if you extend the sample window further back, it
becomes clear that there are 99% uptime nodes, and 98% uptime nodes,
in nonzero numbers.  (Some of that may be a sampling artifact -- I
suspect I don't get a complete census on every sampling interval.)

The following *includes* the single-sample users, even though I don't
normally count them in network size estimates.  I think the math makes
more sense this way.

We have 16045 users total (as of that sample); 1353 of them, or 8.4%,
appear in 24/24 samples.  There's a minimum in the curve at 20 samples.
Counting all nodes appearing in 20 or more samples (83% uptime or
more), we have 2198 users, or 13.7%.  That seems a reasonable group to
call "dedicated".  They average 96% uptime.

The next obvious group is single-sample users (2385, or 14.9%).
Again, this isn't as clear a grouping as it looks at first glance.
So, let's instead group at <40% uptime -- this is the breakpoint used
in the sink determination code, iirc.  It's not exactly analogous,
since this is over 5 days instead of 2, but it's close.  That works
out to 9 or fewer samples (9/24 = 37.5%), or 10466 users (65.2%).
They average 16.5% uptime.

The remaining users (10-19 samples, 3381 users, 21.1%) average 56.8% uptime.

Taken together, that says the network has an average of 5759 nodes
online at any given moment, of which 36.8% are high-uptime nodes,
33.3% medium-uptime, and 29.9% low-uptime.
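
If you want to check the arithmetic, this reproduces it from the
histogram above (thresholds as just described; sketch, not my actual
script):

    # hist[n] = number of nodes appearing in exactly n of the last 24 samples.
    hist = {24: 1353, 23: 275, 22: 228, 21: 173, 20: 169, 19: 202, 18: 218,
            17: 246, 16: 273, 15: 314, 14: 318, 13: 390, 12: 418, 11: 480,
            10: 522, 9: 599, 8: 731, 7: 790, 6: 930, 5: 978, 4: 1143,
            3: 1315, 2: 1595, 1: 2385}

    total = sum(hist.values())                                   # 16045

    def group(lo, hi):
        # Users and average uptime for nodes seen in lo..hi samples.
        count = sum(hist[n] for n in range(lo, hi + 1))
        uptime = sum(n * hist[n] for n in range(lo, hi + 1)) / (24.0 * count)
        return count, uptime

    dedicated  = group(20, 24)   # (2198, ~0.96)  -> 13.7% of users
    regular    = group(10, 19)   # (3381, ~0.57)  -> 21.1%
    occasional = group(1, 9)     # (10466, ~0.165) -> 65.2%

    # Expected nodes online at once = sum of (group size * group uptime).
    online = [c * u for c, u in (dedicated, regular, occasional)]
    print(sum(online))                        # ~5759
    print([x / sum(online) for x in online])  # ~37% : 33% : 30%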

>
> Clearly 3639 needs to be fixed. Most of the other bugs you mention have been 
> fixed already. However, what it comes down to really is we need more (or 
> better) redundancy if we want stuff to be available immediately (after it has 
> been on the network for long enough that it is only in stores and not caches).

That's important, but so is getting low-uptime nodes back on the
network *and with good, routable connections* rapidly.  Occasional
users represent a large fraction of our user base.  Time they spend
with Freenet running but either not connected or poorly connected is
time during which the data they hold is unavailable when it needn't
be.  (One could argue that they only run it long enough to do what
they want -- that is, for a fixed period after the startup transient,
so the transient costs us nothing.  I would counter that if it starts
faster, they'll like the experience better and therefore do more with
it.)

>
> That means:
> - Considering more store-level block-level redundancy. IMHO we have largely 
> exhausted this once we have fixed 3639: We don't want excessive block-level 
> redundancy because there are other options which are better.

There are a few other possibilities, like the queuing / acceptance
changes discussed in the other thread.  Also, from this breakdown,
Bloom filter sharing belongs in this category and might be
significant.  If a node was a sink, and then some new nodes join or
come online with better locations, the old sink might no longer see
the request, despite being properly connected and therefore very close
to the route the request takes.  Similarly, on darknet, if the sink's
location has changed slightly then it might be near but not on the
request path.
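
To spell out why that helps, here's the shape of the idea (a sketch
only -- the names and API are invented, this is not the actual
proposal or any freenet code):

    # Each peer periodically sends a Bloom filter summarizing its datastore.
    # When a request arrives we check those filters first, so a nearby former
    # sink can still serve the block even if routing would no longer reach it.
    def handle_request(key, peers, route_onward):
        for peer in peers:
            # Bloom filters have no false negatives, only false positives,
            # so a hit is worth one cheap direct probe.
            if peer.store_filter.might_contain(key):
                data = peer.fetch_directly(key)   # may be None on a false positive
                if data is not None:
                    return data
        return route_onward(key)                  # otherwise, route as usual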

I also think we should give thought to the definition of low-uptime.
For example, https://bugs.freenetproject.org/view.php?id=2292 suggests
reporting absolute uptime, or at least a longer-term average, to our
peers.
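
For instance, even a simple smoothed average over recent days would
capture "this node is usually around" rather than just the current
session (sketch; the smoothing constant and the once-a-day hook are
made up, nothing like this exists in the codebase):

    ALPHA = 0.1  # weight given to the most recent day (illustrative)

    class LongTermUptime:
        def __init__(self):
            self.average = 0.0          # smoothed fraction of time online

        def record_day(self, seconds_online):
            fraction = seconds_online / 86400.0
            self.average = ALPHA * fraction + (1 - ALPHA) * self.average

        def reported(self):
            return self.average         # what we'd advertise to peers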

> - Considering more or better splitfile-level redundancy. Fixing the splitting 
> problems, possibly increasing the FEC codes, possibly introducing an 
> interlocking code as on CDs. Wuala uses 517% FEC and they still have to have 
> backup servers in practice; we have block level redundancy, but it's still 
> worth seriously considering ...

517% seems excessive, but 100% might be too low.  Of course, our 2x
splitfile redundancy combined with ~3x block-level redundancy
(approximate sink count) is similar in total.  Interlocking FEC codes
are tricky, both because getting the math right is nontrivial and
because it's a patent minefield.  I think we can navigate the patents
by duplicating *exactly* a strategy that is fully described in
*expired* patents.  That's not too hard, but it requires a fair amount
of research.  Most of the original math on high-reliability
Reed-Solomon coding (what we use) dates to the 70s or so.
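
To put rough numbers on that comparison (my reading: Wuala's 517% is
check data relative to the original, and our ~3 sinks per block act as
a storage multiplier; both figures are approximations, not
measurements):

    wuala_total   = 1 + 5.17   # original + 517% FEC check data        = ~6.2x
    freenet_total = 2 * 3      # ~100% splitfile FEC (2x) * ~3 sink copies = 6x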

There are also other options, such as delayed multiple-insert (insert
the block again several hours later, when a different portion of the
network is online; doing this securely, or at all from a low-uptime
node, might be a challenge).
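
A naive sketch of what that could look like (purely illustrative; the
insert function is a placeholder, the delay is arbitrary, and making
the two inserts unlinkable is the real problem):

    # Re-insert the same block hours later, when a different slice of the
    # (low-uptime-heavy) network is online.  insert_block is a placeholder
    # callable, not a real client API.
    import random, threading

    def insert_with_delayed_repeat(insert_block, block, delay_hours=8.0):
        insert_block(block)                                       # insert now
        delay = (delay_hours + random.uniform(-1.0, 1.0)) * 3600  # jittered
        threading.Timer(delay, insert_block, args=(block,)).start()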

> - More or better top block redundancy. I need to look at the MHK tester 
> results: Is inserting the same block 3 times better or worse than inserting 3 
> different blocks? See my other mail!
>
> IMHO improving data persistence is *THE* way we are going to get Freenet to 
> the point where it is really usable. It should be a priority, although 
> clearly we can't do everything before 0.8.0. Of course there are other issues 
> - ease of use, fixing filesharing, etc. But IMHO make it work and they will 
> come.
>

I would say the Windows installer, low-uptime performance, and data
persistence are all very important.  The installer is the highest
priority; the other two are closely related and of similar priority.

Evan Daniel
