On Thursday 21 January 2010 19:33:29 Evan Daniel wrote:
> On Thu, Jan 21, 2010 at 12:38 PM, Matthew Toseland
> <t...@amphibian.dyndns.org> wrote:
> > On Wednesday 20 January 2010 15:44:22 Evan Daniel wrote:
> >> On Wed, Jan 20, 2010 at 8:54 AM, Matthew Toseland
> >> <t...@amphibian.dyndns.org> wrote:
> >>
> >> > 4) Capacity. IMHO if Freenet is working well we should not need insert 
> >> > on demand: Its capacity should be much greater than it is now, and we 
> >> > should be able to just insert and fetch the data.
> >>
> >> Actually, I'm not completely convinced this is a problem.  At present,
> >> I believe most of our data persistence issues stem from node churn,
> >> not blocks falling out of individual stores.  Right now that's just a
> >> (somewhat justified) hunch, but I have sufficient data to investigate
> >> in more detail.
> >
> > Okay, let's consider your old data (from your flog):
> >
> > ==
> > That says to me that approximately 38% of users are "occasional" users, who 
> > either only run their node some of the time, or install, run briefly, and 
> > then uninstall.  A further 23% are dedicated users — they have their nodes 
> > on all the time.  The remaining 38% (in 1 or 2 samples) I'll call "regular" 
> > users — they frequently have their node running, but not always.
> >
> > Obviously, these classifications are very rough.  I'd say a 1:2:2 ratio is 
> > probably a reasonable guess, but it could still be rather far off.  I need 
> > to take data more regularly and for a longer period of time before any 
> > serious conclusions can be drawn.  However, I am comfortable saying the 
> > following: Freenet has at least 4000 semi-regular or regular users 
> > (probably meaningfully more).  Freenet probably has between 8000 and 12000 
> > total users (the upper bound I'm far less certain of — if a lot of people 
> > only run Freenet for an hour or two per day, it could be far higher).  At 
> > most, about a third of users run their node 24/7; the actual number is 
> > probably well under that.
> >
> > I think this has several practical implications.  First, we need to be 
> > working on data retention more, with a focus on retention despite 
> > low-uptime nodes.  (See bugs 3495, 3514 for a start on that.  2933 should 
> > also help.  3637/3639 and the like address more general routing issues; 
> > that should help as well.)  Second, we need to figure out how to get these 
> > low-uptime nodes back onto the network, and connected usefully, so that the 
> > data they have can be found (and to improve the performance for such 
> > users).  (See 3583 and related bugs for one approach.)  And, finally, we 
> > have the general problem of getting (and keeping!) more users.
> > ==
> >
> > So let's say it's reasonable to assume that 40% are 24/7, 20% are newbies, 
> > and 40% are maybe an average of 50% uptime? Can you give a rough figure for 
> > the average uptime in the "regular" group?
> 
> Unfortunately, that's overly optimistic, I think.
> 
> Executive summary: of total users, it's more like 14%:21%:65%
> (dedicated:regular:occasional/newbie).  

Ouch.

> For nodes online at any one 
> time, 37%:33%:30%.  Uptime averages are 96%:57%:17% for the three
> groups.

Would the following question be a useful way to quantify this?

"If we pick a random node online at time T, what is the probability of it being 
online at time T1, averaged over all interesting T1" ?

Maybe we can get an idea of the required redundancy from this. Although the 40% 
criterion may make things a bit better than this approach would suggest?
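
Something like this is one way of actually computing that number from the probe 
samples (a rough sketch only; the online[node][sample] matrix is hypothetical and 
would have to be built from your probe data):

public final class ChurnEstimate {
    /**
     * Estimate P(node online at T1 | node online at T), averaged over all
     * ordered pairs of samples (T, T1) with T != T1.
     */
    public static double conditionalOnlineProbability(boolean[][] online, int samples) {
        double sum = 0.0;
        int pairs = 0;
        for (int t = 0; t < samples; t++) {
            for (int t1 = 0; t1 < samples; t1++) {
                if (t == t1) continue;
                int onlineAtT = 0;
                int onlineAtBoth = 0;
                for (boolean[] node : online) {
                    if (node[t]) {
                        onlineAtT++;
                        if (node[t1]) onlineAtBoth++;
                    }
                }
                if (onlineAtT > 0) {
                    sum += (double) onlineAtBoth / onlineAtT;
                    pairs++;
                }
            }
        }
        return pairs == 0 ? 0.0 : sum / pairs;
    }
}

If that comes out at, say, 0.4, then roughly speaking only ~40% of the nodes that 
held copies at insert time are online at a random later time, which puts a floor 
under how much redundancy we need.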
> 
> Detailed analysis:
> 
> This is all made harder when you start looking at the data in detail.
> There is no clear boundary between groups that are always on, mostly
> on, sometimes on, occasionally on, etc.
> 
> First, let's define the term "users".  I don't think the precise
> definition matters much, so long as we're consistent.  I happen to
> like using the count of unique nodes that appear in more than one of
> the most recent 24 samples (5 days) of my probe data.  That gives
> typical user counts of ~14000.  Typical instantaneous network size is
> ~6000, for an average uptime among repeat nodes of about 43%.
> 
> Looking at the data from the most recent sample, here are counts of
> how many nodes appeared in n of the last 24 samples:
> 
> 24  1353
> 23  275
> 22  228
> 21  173
> 20  169
> 19  202
> 18  218
> 17  246
> 16  273
> 15  314
> 14  318
> 13  390
> 12  418
> 11  480
> 10  522
> 9   599
> 8   731
> 7   790
> 6   930
> 5   978
> 4   1143
> 3   1315
> 2   1595
> 1   2385
> 
> Clearly, there is a significant core of users that leaves their node
> on most of the time.  It's tempting to classify the nodes appearing in
> 24/24 samples as one group, and everyone else as one (or more) other
> groups.  However, if you extend the number of samples backward, it
> becomes clear that there are 99% uptime nodes, and 98% uptime nodes,
> in nonzero numbers.  (Some of that may be a sampling artifact -- I
> suspect I don't get a complete census on every sampling interval.)
> 
> The following *includes* the single-sample users, even though I don't
> normally count them in network size estimates.  I think the math makes
> more sense this way.
> 
> We have 16045 users total (as of that sample); 1353 of them, or 8.4%,
> in 24/24 samples.  There's a minimum in the curve at 20 samples.
> Counting all nodes appearing in 20 or more samples (83% or more), we
> have 2198 users, or 13.7%.  That seems a reasonable group to call
> "regular".  They average 96% uptime.
> 
> The next obvious group is single-sample users (2385, or 14.9%).
> Again, this isn't as clear a grouping as it looks at first glance.
> So, let's instead group at <40% uptime -- this is the breakpoint used
> in the sink determination code, iirc.  It's not exactly analogous,
> since this is over 5 days instead of 2, but it's close.  So that's 9
> or fewer samples, or 10466 users (65.2%).  They average 16.5% uptime.

Ouch.
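
As a quick sanity check, that does follow from the histogram you posted; a 
throwaway calculation like this reproduces the 10466 users / 16.5% figures for 
the <40% group:

public final class UptimeGroups {
    public static void main(String[] args) {
        // counts[n] = nodes appearing in n of the last 24 samples, from the table above
        int[] counts = { 0, 2385, 1595, 1315, 1143, 978, 930, 790, 731, 599,
                522, 480, 418, 390, 318, 314, 273, 246, 218, 202,
                169, 173, 228, 275, 1353 };
        int users = 0;
        long samplesSeen = 0;
        for (int n = 1; n <= 9; n++) { // <40% uptime, i.e. 9 or fewer of 24 samples
            users += counts[n];
            samplesSeen += (long) n * counts[n];
        }
        double avgUptime = (double) samplesSeen / (users * 24.0);
        // Prints: 10466 users, average uptime 16.5%
        System.out.printf("%d users, average uptime %.1f%%%n", users, 100 * avgUptime);
    }
}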
> 
> The remaining users (10-19 samples, 3381 users, 21.1%) average 56.8% uptime.
> 
> Taken together, that says an average network has 5759 nodes online,
> with 36.8% high-uptime nodes, 33.3% medium-uptime nodes, and 29.9%
> low-uptime nodes.
> 
> >
> > Clearly 3639 needs to be fixed. Most of the other bugs you mention have 
> > been fixed already. However, what it really comes down to is that we need more 
> > (or better) redundancy if we want stuff to be available immediately (after 
> > it has been on the network for long enough that it is only in stores and 
> > not caches).
> 
> That's important, but so is getting low-uptime nodes back on the
> network *and with good, routable connections* rapidly.  Occasional
> users represent a large fraction of our user base.  Time they spend
> with Freenet running, but either not connected or poorly connected, is
> time that the data they have is unavailable but could be.  (One could
> say that they'll run it for long enough to do what they want, so they
> run it for a specific time period after the startup transient.  I
> would counter that if it starts faster, they'll like the experience
> better and therefore do more with it.)

Yes, it is logical that improving opennet re-assimilation will have a 
significant impact on churn-related data persistence performance ...

And of course, it has several other benefits:
- Users are happier.
- Newbies are happier.
- Hit and run is much easier. Of course hit and run is not viable long-term as 
it depends on opennet.

However, it does run the risk of resulting in even more churn, with fewer users 
running at reasonable uptimes.
> 
> > That means:
> > - Considering more store-level block-level redundancy. IMHO we have largely 
> > exhausted this once we have fixed 3639: We don't want excessive block-level 
> > redundancy because there are other options which are better.
> 
> There are a few other possibilities, like the queuing / acceptance
> changes discussed in the other thread.  

Well, there is more and there is better.

The fascinating thing is that, since I started writing that email, the new insert 
data has shown that *better* is possible even at the block level.

I do think we will need MHKs eventually, even if we get massive improvements 
from other areas, but it can wait until we have improved retrievability for the 
splitfiles under the top keys.
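
To put rough numbers on that (purely illustrative): if a single stored copy is 
still retrievable with probability p, and three copies inserted under *different* 
keys fail more or less independently, the top block survives with probability 
1 - (1-p)^3 -- about 97% for p = 0.7. Three inserts of the *same* key route to 
the same location and so tend to land on the same sinks, so the failures are 
correlated and the gain should be much smaller. That is exactly what the MHK 
tester ought to show, one way or the other.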

> Also, from this breakdown, 
> Bloom filter sharing belongs in this category and might be
> significant.  If a node was a sink, and then some new nodes join or
> come online with better locations, the old sink might no longer see
> the request, despite being properly connected and therefore very close
> to the route the request takes.  Similarly, on darknet, if the sink's
> location has changed slightly then it might be near but not on the
> request path.

Good point, so Bloom filter sharing should definitely be on the to-do list, 
after the easier stuff is out of the way.
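
For reference, the data structure itself is trivial -- something like this 
(a minimal sketch; the names and parameters are made up and nothing to do with 
the actual datastore code). All the real work is in sizing it, keeping it up to 
date as the store rolls over, and deciding when to divert a request on a hit:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;

/**
 * Minimal Bloom filter sketch for advertising which routing keys a store
 * holds.  A true answer means "possibly in the store", false means
 * "definitely not".  Assumes hashCount <= 8 (4 bytes of SHA-256 per hash).
 */
public final class StoreKeyFilter {

    private final BitSet bits;
    private final int sizeBits;
    private final int hashCount;

    public StoreKeyFilter(int sizeBits, int hashCount) {
        this.sizeBits = sizeBits;
        this.hashCount = hashCount;
        this.bits = new BitSet(sizeBits);
    }

    public void add(byte[] routingKey) {
        for (int index : indexesFor(routingKey)) bits.set(index);
    }

    public boolean mightContain(byte[] routingKey) {
        for (int index : indexesFor(routingKey)) {
            if (!bits.get(index)) return false;
        }
        return true;
    }

    private int[] indexesFor(byte[] routingKey) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(routingKey);
            int[] indexes = new int[hashCount];
            for (int i = 0; i < hashCount; i++) {
                int v = ((digest[4 * i] & 0xff) << 24) | ((digest[4 * i + 1] & 0xff) << 16)
                        | ((digest[4 * i + 2] & 0xff) << 8) | (digest[4 * i + 3] & 0xff);
                indexes[i] = Math.abs(v % sizeBits);
            }
            return indexes;
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e);
        }
    }
}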
> 
> I also think we should give thought to the definition of low-uptime.
> For example, https://bugs.freenetproject.org/view.php?id=2292 suggests
> reporting absolute uptime, or at least a longer-term average, to our
> peers.

Possibly.
> 
> > - Considering more or better splitfile-level redundancy. Fixing the 
> > splitting problems, possibly increasing the FEC codes, possibly introducing 
> > an interlocking code as on CDs. Wuala uses 517% FEC and they still have to 
> > have backup servers in practice; we have block level redundancy, but it's 
> > still worth seriously considering ...
> 
> 517% seems excessive, but 100% might be too low.  Of course, our 2x
> redundancy combined with 3x block-level redundancy (approximate sink
> count) is similar in total.  Interlocking FEC codes are tricky, both
> because getting the math right is nontrivial and because it's a patent
> minefield.  

We can improve our redundancy significantly here too. For many corner cases, 
better splitting will help, as will a few extra check blocks for smaller 
splitfiles. We could also consider 16-bit codes, although we are limited by 
seeking, memory usage, and whatever the minimum usable packet size for the 
codec is (I think it's pretty small, I did some tests a while back).
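
To put rough numbers on "similar in total": 100% data plus 100% check blocks, 
each stored on roughly three sinks, is about 6x the original data, which is in 
the same ballpark as Wuala's ~5x -- the difference being that our extra copies 
sit on nodes we don't control, with the uptimes above. On 16-bit codes: an 8-bit 
Reed-Solomon code is limited to 255 blocks per codeword, whereas a 16-bit code 
raises that to 65535, so much larger segments (or whole large files) become 
possible -- at the cost of the seeking and memory issues I mentioned.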

> I think we can navigate the patents by duplicating 
> *exactly* a strategy that is fully described in *expired* patents.
> That's not too hard, but it requires a fair amount of research.  Most
> of the original math on high-reliability Reed-Solomon coding (what we
> use) dates to the 70s or so.

I'm sure we can avoid the patents, if only by implementing something as close 
to the CD system as possible.
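
The structural idea is easy enough to sketch, even if the coding theory around 
it isn't (this is only an illustration of the layout; the names are made up, and 
the real CD scheme -- CIRC -- also interleaves in time so that burst errors look 
scattered to the inner code):

/**
 * Toy illustration of an interlocking ("product") code layout: data blocks
 * are arranged in a grid and check blocks are computed along both rows and
 * columns.  A block that cannot be repaired within its row segment may still
 * be recovered via its column segment, and vice versa.  The FEC encoding
 * itself is not shown; this is only about which blocks protect which.
 */
public final class ProductCodeLayout {

    /** Row segment r covers blocks { r*cols, ..., r*cols + cols - 1 }. */
    public static int[] rowSegment(int row, int cols) {
        int[] seg = new int[cols];
        for (int c = 0; c < cols; c++) seg[c] = row * cols + c;
        return seg;
    }

    /** Column segment c covers blocks { c, c + cols, c + 2*cols, ... }. */
    public static int[] columnSegment(int col, int rows, int cols) {
        int[] seg = new int[rows];
        for (int r = 0; r < rows; r++) seg[r] = r * cols + col;
        return seg;
    }
}

The cost is the obvious one: more check blocks, and a decoder that has to 
iterate between the two dimensions.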
> 
> There are also other options, such as delayed multiple-insert (insert
> the block again several hours later, when a different portion of the
> network is online; doing this securely, or at all from a low-uptime
> node, might be a challenge).

Delaying inserts on the network (for privacy as much as for persistence) is a 
good idea, but hard to implement if most nodes are low uptime, unless it's 
acceptable to have long random delays from such nodes.
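
Concretely, something like this would do for the simple case (a sketch; 
Reinsertable is a hypothetical callback, and a real implementation would have 
to persist the schedule, since a low-uptime node will almost certainly restart 
before the delay expires -- which is exactly where the long random delays come 
from):

import java.util.Random;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Sketch of a delayed re-insert at a random point several hours later,
 *  when a different portion of the network is likely to be online. */
public final class DelayedReinsert {

    public interface Reinsertable {
        void insertAgain();
    }

    private static final ScheduledExecutorService SCHEDULER =
            Executors.newSingleThreadScheduledExecutor();
    private static final Random RANDOM = new Random();

    /** Re-insert at a uniformly random point between 6 and 24 hours from now. */
    public static void schedule(Reinsertable block) {
        long delayMinutes = 6 * 60 + RANDOM.nextInt(18 * 60);
        SCHEDULER.schedule(block::insertAgain, delayMinutes, TimeUnit.MINUTES);
    }
}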
> 
> > - More or better top block redundancy. I need to look at the MHK tester 
> > results: Is inserting the same block 3 times better or worse than inserting 
> > 3 different blocks? See my other mail!
> >
> > IMHO improving data persistence is *THE* way we are going to get Freenet to 
> > the point where it is really usable. It should be a priority, although 
> > clearly we can't do everything before 0.8.0. Of course there are other 
> > issues - ease of use, fixing filesharing, etc. But IMHO make it work and 
> > they will come.
> 
> I would say the Windows installer, low-uptime performance, and data
> persistence are all very important.  The installer is the highest priority; the
> other two are closely related and of similar priority.
> 
> Evan Daniel
