Hi Zooko,

Thanks for your comments - they are extremely interesting and suggest
that there is increasing convergence between the two projects.  I look
forward to continuing to work together to our mutual benefit; however,
before continuing, I do want to say that as the former Nigerian Minister
for Penis Enlargement I think it is quite a shame that this email won't
make it to the Mnet mailing list.

Anyway, one high-level point that I think is worth making at the outset
is that it is important to draw a distinction between NGrouting, which
is more of a general principle, and the specifics of our current
implementation.

The general principle of NGrouting is that routing decisions be made on
the basis of which node has the lowest estimated time to return the data
based on historically collected data. The manner in which these
estimates are produced will be implementation specific, and will likely
go unspecified if we ever get around to writing a POSIX-like
specification for what is and is not a Freenet node.

This means that, for example, if we used some form of Machine Learning
algorithm, as has been proposed and which looks promising (although I
have concerns about CPU and memory requirements), then it is still
NGrouting (as I see it) - and still Freenet, albeit an alternative
implementation of Freenet. 
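To make the principle concrete, here is a minimal sketch of the routing
decision itself.  The names (Peer, estimate_retrieval_time) are my own
illustrative placeholders, not Freenet's actual API - the point is only
that the decision reduces to a minimum over per-peer time estimates:

```python
class Peer:
    """Illustrative stand-in for a Freenet peer with collected statistics."""

    def __init__(self, name, avg_time):
        self.name = name
        self.avg_time = avg_time

    def estimate_retrieval_time(self, key):
        # A real node would consult per-key historical statistics;
        # here we return a fixed average purely for illustration.
        return self.avg_time


def route(peers, key):
    # The NGrouting principle: route to whichever peer has the lowest
    # estimated time to return the data, based on historical data.
    return min(peers, key=lambda p: p.estimate_retrieval_time(key))
```

Any implementation that produces the estimates differently (e.g. via
Machine Learning) still fits this framework, as described above.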

Perhaps this is a rather self-servingly broad definition, but I think
that the basic idea creates a very useful framework within which future
research and development can be evaluated conveniently and objectively
(by measuring the accuracy of routing estimates - optimality would be
where all routing estimates were completely accurate).  

Another way to look at it is that it allows us to apply a reductionist
philosophy to P2P network design, as we have now broken down the goal 
into something an individual node must aspire to - whereas previously 
the goal was network-wide.

> Some questions:
>  * What does "must be efficiently serializable" mean?

Basically I mean that the data structure that represents the "running
average" should be compact as it will need to get passed around the
network (realistically it should be limited to a few hundred bytes at
most).
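As a sketch of what "efficiently serializable" could look like in
practice - the field layout here is my own invention, not Freenet's wire
format - a running average can be packed into a fixed 12 bytes, well
under the few-hundred-byte budget:

```python
import struct


class RunningAverage:
    """A running average that serializes to a fixed 12 bytes
    (4-byte sample count + 8-byte mean, network byte order)."""

    def __init__(self, count=0, mean=0.0):
        self.count = count
        self.mean = mean

    def report(self, value):
        # Incremental mean update: no need to store individual samples.
        self.count += 1
        self.mean += (value - self.mean) / self.count

    def serialize(self):
        return struct.pack("!Id", self.count, self.mean)

    @classmethod
    def deserialize(cls, data):
        count, mean = struct.unpack("!Id", data)
        return cls(count, mean)
```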

>  * How are queries routed *alternatively* in order to attempt to
>  bypass a Data Not Found ("DNF") event?  (I'm especially interested in
>  this question, as I haven't figured out how I want to do it in Mnet
>  yet. ;-))

Well, I am not exactly sure what you mean by this - but perhaps your 
question will be answered by the fact that we plan to introduce some 
random salt in the decision as to where a request should be routed, and 
also the fact that the more DNFs a node returns - the worse future 
estimates for its routing time will be.
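The two mechanisms just mentioned could be sketched roughly as follows.
Both constants are invented for illustration - the actual magnitudes of
the salt and the DNF penalty are implementation details:

```python
import random

# Hypothetical tuning constants, not Freenet's actual values.
DNF_PENALTY = 5.0   # seconds added to the estimate per DNF returned
SALT_RANGE = 0.1    # fraction of the estimate randomly perturbed


def salted_estimate(base_estimate, dnf_count, rng=random.random):
    # Past DNFs make this peer's future estimates worse...
    penalized = base_estimate + dnf_count * DNF_PENALTY
    # ...and a small random salt keeps repeated requests from always
    # taking the same path, giving alternative routes a chance.
    return penalized * (1.0 + SALT_RANGE * rng())
```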

>  * When do Freenet nodes establish links to newly discovered peers
>  (discovered through DataSource information)?  Are Freenet nodes
>  expected to use NGrouting measurements to choose when to link to new
>  peers?

That is a good question - the NGrouting proposal as it stands now 
inherits Freenet's current approach for new peer discovery - in that new 
peers are acquired from the DataSource fields of DataReply messages, 
although we plan to, along with the address of the peer, also include 
any statistical information about that peer that has been collected by 
an upstream node.  It is quite likely that we will want a more 
"NGrouting-ish" approach to new-peer discovery in the future.

> It seems like there are two components of the algorithm which are
> separable from one another, both in Mnet and in Freenet.  One is how
> to collapse the various kinds of measurements of performance --
> latency, throughput, hit rate (== 1-DNF rate), connection-failure-rate
> -- into a single scalar.

Yes, we do this by having quite a well refined definition of what we are
estimating, namely "The time required between routing to this node and
getting the data, assuming the data is of average size".  This means,
for example, that the cost of a Data Not Found message, assuming that
the data *is* in Freenet, is the time required for this node to fail,
plus the time required for another node to fetch the data.  This 
requires a few simplifying assumptions, but all have a rational basis.
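Written out as an expected value - with the probabilities and times
below invented purely for illustration - the definition above amounts
to:

```python
def expected_retrieval_time(p_success, t_success, t_fail, t_fallback):
    """Expected "time between routing to this node and getting the
    data", assuming the data *is* in Freenet somewhere: a DNF costs the
    time for this node to fail plus the time for another node to fetch
    the data."""
    return p_success * t_success + (1.0 - p_success) * (t_fail + t_fallback)
```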

> The other is how to combine multiple such
> scalars collected through history into a single scalar prediction of
> which of your peers will likely perform best on the query that you are
> about to send.

Indeed, we basically use standardized running average algorithms for
this (along with our RoutingTimeEstimator algorithm which is a form of
running average).  It is an open question as to how forgetful these
running averages should be (ie. how sensitive to more recent data versus
older data), but it is my suspicion that overall performance won't be
that significantly affected by the exact degree of forgetfulness
provided it is reasonably sane.
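One common way to express that forgetfulness knob is an exponentially
weighted running average; the sketch below is generic, and the alpha
value is an arbitrary placeholder rather than a tuned Freenet
parameter:

```python
class ForgetfulAverage:
    """Exponentially weighted running average.  Higher alpha weights
    recent samples more heavily (more forgetful of old data)."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.value = None

    def report(self, sample):
        if self.value is None:
            self.value = float(sample)
        else:
            # Move a fraction alpha of the way toward the new sample.
            self.value += self.alpha * (sample - self.value)
        return self.value
```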

> As to the first -- summarizing measurements into a single scalar --
> Both Mnet and Freenet do this, although it isn't the only option --
> one could have a more complicated system that keeps these measurements
> as separate scalars. 

Having separate scalars really just delays a decision that has to be
made at some point - since at the end of the day one needs to do a
binary comparison between sets of scalars for a given node to determine
which is better - and doing that requires combining the scalars somehow
- even if it is implicit.  Far better to have a rational basis for how
those scalars are combined than plucking something out of the air.

> Now the way these two values are combined is the observation that the
> inverse of latency is rate.  That is, if X is the average number of
> seconds you wait to get a block of data, then 1/X is the average
> number of blocks of data you get per second.  If Y is the average hit
> rate -- the number of requests which yielded the correct data divided
> by the total number of requests -- then (1/X) * Y is the average rate
> of blocks you received per request you sent.  (It was my wife, Amber
> Wilcox-O'Hearn, who pointed this out to me.)
> 
> So Mnet combines all kinds of failure (stored in hit rate) and
> performance (stored in average turnaround time) into a single scalar:
> (1/latency) * hitrate.

Very interesting indeed, and quite a different way to look at it.  If 
you were to describe what this number is an estimate of - how would you 
characterize it?  Something like "the average amount of data retrieved 
assuming continuous requests for data"?  I realize that this isn't it - 
but something similar?
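For reference, the quoted Mnet scalar written out directly - the numbers
in the test are invented, and this is just a restatement of Zooko's
formula above, not Mnet's actual code:

```python
def mnet_score(avg_latency_seconds, hit_rate):
    # (1/latency) * hitrate: if latency is average seconds per request
    # and hit_rate the fraction of requests that returned the correct
    # data, this is blocks of data received per second of waiting.
    return (1.0 / avg_latency_seconds) * hit_rate
```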

> One potential problem with the way it does that is that it rests on
> the assumption that true DNFs -- queries for data that is absent from
> Freenet -- is evenly spread throughout the keyspace.  I can imagine
> cases where that assumption becomes untrue, even for extended periods
> of time or for significant fractions of the keyspace.

This is interesting, I would assume that since keys are chosen for data
pseudo-randomly, that - on average - there would be a random (and
therefore pretty even) distribution of "true" DNFs across the Freenet
keyspace.  Are you saying that a random distribution would not be an 
even distribution?  I am not sure I understand.
 
> Obviously this can work only because the job of partitioning the key
> space is separated from the which-peer-to-route-to question in Mnet. 
> (That job will be performed, in Mnet v0.7, by a separate mechanism as
> part of the DHT-like routing.)

Yes, comparisons of Freenet, particularly post-NGrouting, with DHT
routing algorithms are *very* interesting to me - since they represent
different approaches, which rely on a similar vague principle (get
closer to the data at each hop), but where Freenet's approach is more
"organic" or "heuristic", whereas DHTs are quite rigorous.  This gives
DHTs the benefit of being able to make performance guarantees, which are
essential in many situations, but my suspicion is that a more organic
approach might be better-suited to tolerate the unreliability and
unpredictability of the Internet (particularly when relying on peers
that can vanish at any moment).

> The two-dimensional approach Freenet uses is interesting, and as far
> as I can tell it is a reasonable way to combine the keyspace-related
> performance measurements that Freenet wants.

Bear in mind that we also plan to test Machine Learning algorithms 
here too.

> I'm eager to find out how it works in practice. 

Join the club ;-)

> Unfortunately I don't
> think it will be easy to make any solid observations of how it works,
> especially the sophisticated and complex bits like the influence of
> the minimal-DNF subtraction and the two-dimensional
> history/combination technique.  It will require some really good
> simulations or really good self-monitoring mechanism to provide a
> clear enough picture for us to judge the impact of those details, in
> my opinion.

Well, I am not so sure - we can observe the accuracy of the estimates
produced by this NGrouting implementation for a single node, and based
on that accuracy we should be able to make inferences about the
network's performance as a whole.
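One simple way to observe that accuracy at a single node - purely an
illustration of the kind of self-monitoring I have in mind, not an
existing mechanism - is to track the mean absolute error between
predicted and actual retrieval times:

```python
def mean_absolute_error(predictions, actuals):
    """Average absolute gap between estimated and observed times;
    optimal routing would drive this toward zero."""
    assert len(predictions) == len(actuals) and predictions
    return sum(abs(p - a) for p, a in zip(predictions, actuals)) / len(predictions)
```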

> This isn't necessarily bad, but it was a little surprising to me when
> I realized it, and it is a way in which NGrouting is not scale-free. 
> That is: suppose there is a small Freenet where one node receives
> requests which are evenly distributed across 0.4 of the keyspace. 
> Suppose there is a large Freenet where one node receives requests
> which are evenly distributed across 0.004 of the keyspace.  The former
> will move 5 of the points after a few dozen new samples have come in. 
> The latter will move only 2.

I am not sure why this means that the RoutingTimeEstimator algorithm 
implementation described in the paper wouldn't be "scale-free", can you 
elaborate?

In the case where a node received most of its requests within 0.004 of
its keyspace - then isn't it desirable that most of the points be within
that segment to more accurately represent the variations in request
times within that segment of keyspace for which the rest of the network 
obviously considers this node to be an expert?

Perhaps we are working to different definitions of "scale-free".

Ian.

-- 
Ian Clarke                                                  [EMAIL PROTECTED]
Coordinator, The Freenet Project              http://freenetproject.org/
Weblog                               http://slashdot.org/~sanity/journal
