Serguei Osokine wrote:
        Ian, Oskar, that was wonderful. I rarely see anything that has
that much aesthetic value, but the analytical proof that the graph
rewiring caused by the random search actions actually makes it a
small-world graph with log^2(N) path length was certainly worth the
one-year wait since last April, which is when Ian preannounced this
paper.

It should be noted that the analytic proof in fact does not establish this. I did come up with the proof that appears in the thesis last April, and I have since (with little success) attempted to generalize it so that it holds for the rewiring algorithm. The problem is that the proof, like the Kleinberg proofs, assumes that edges are chosen independently at each point, but in fact under our algorithm they will depend on one another (in a good way, but one that is hard to prove).

In fact, I haven't even managed to show that the class of distributions for which I have a bound - those which are solutions to the system implied by equations (2.3) and (2.4) in the text - is nonempty. Until this is done the theorem as it stands is interesting, but from a strict mathematical viewpoint quite meaningless. (Which is why I have held off on publication).

<>
1. What is the difference between Freenet[1] and Sandberg/Dijjer[2]
   models that makes Dijjer easier to analyze? This is mentioned a few
   times in [2] and in other Freenet- and Dijjer-related places, but
   I did not notice an explicit explanation of the differences
   anywhere. Was it the directed links in the graph in [2]? Something
   else? I honestly tried to figure this out by myself, but couldn't;
   sorry... Maybe if I'd spent more time looking at both [1] and [2],
   I'd figure this out, but since you guys are already here, I thought
   that maybe you could answer that?

What makes Freenet (by which I mean the old Freenet routing, not the one they are using now) much more difficult to analyze is that it is the data, rather than the nodes, that have addresses. Data bounces around the network dynamically, and so the implied Markov chain becomes a lot more complicated. For one thing, Freenet suffers from load balancing issues, while point-to-point routing cannot have such issues.

2. Whatever this difference is, how relevant do you think are the
   Sandberg/Dijjer results[2] for the original Freenet paper[1]?

   For example:

        a. section 5.2 of [1] says about Fig. 3: "We can see that the
           pathlength scales approximately logarithmically, with a change
           of slope near 40,000 nodes."

           Given what you guys know now, do you think that Fig. 3 shows
           log^2(N) instead, or Freenet simply has different scaling
           properties due to its different algorithms?

Paths in Freenet are affected by things other than just point-to-point routing. Caching, for instance, will play a big role. I also don't remember exactly how we were scaling the routing table size in those simulations (the log^2 bound is for a constant routing table size).

I'm guessing that a lot of what was seen in those simulations was artifacts of other factors.

        b. Fig. 5 shows a link number distribution that awfully
           resembles the power-law one, whereas Oskar mentioned in
           another mail that Kleinberg is constant out-degree and
           Poisson in-degree. Is my assumption that the Sandberg/Dijjer
           model is supposed to mimic the Kleinberg distribution
           incorrect, and in fact both Sandberg/Dijjer and Freenet
           are power-law? Or is just one of them power law due to
           the different algorithms used? Or maybe Fig. 5 in [1] is
           actually showing a part of the Poisson distribution that can
           be easily mistaken for the power law?

The model presented in the paper has fixed out-degree and (roughly) Poisson in-degree, like Kleinberg's. Freenet has a fixed out-degree but possibly a power-law in-degree, which is the cause of the aforementioned load balancing issues (I don't have any good evidence for this, just the feeling that there is some sort of preferential attachment going on).
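As a toy illustration of why a fixed out-degree with independently chosen targets gives a roughly Poisson in-degree (the network size and out-degree below are arbitrary, and the targets are drawn uniformly just to keep the sketch short; the picture is similar with harmonically chosen targets):

import random
from collections import Counter

# Toy sketch: every node picks a fixed number of outgoing links to
# independently chosen targets; the resulting in-degrees come out
# approximately Poisson(out_degree), i.e. with a thin tail rather than
# the heavy tail of a power law. Parameters are arbitrary.
n = 10_000
out_degree = 3

in_degree = Counter()
for node in range(n):
    for _ in range(out_degree):
        in_degree[random.randrange(n)] += 1

# Empirical in-degree distribution.
hist = Counter(in_degree[v] for v in range(n))
for k in sorted(hist):
    print(k, hist[k] / n)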

        c. Section 5.4 of [1] says: "In a small-world network, the
           majority of nodes have only relatively few, local, connections
           to other nodes, while a small number of nodes have large,
           wide-ranging sets of connections. Small-world networks permit
           efficient short paths between arbitrary points because of the
           shortcuts provided by the well-connected nodes..."

           This does not sound like anything I've been able to see in
           Oskar's thesis[2] - it has left me with a distinct impression
           that the connectivity in the small world (at least in the one
           analyzed there) was provided by exactly the right (Kleinberg's)
           proportion of long-distance links, which the Sandberg/Dijjer
           rewiring model was designed to achieve, and not by means of any
           special nodes with "wide-ranging sets of connections".
           Is it because Freenet and Dijjer used different algorithms, or
           was the original assumption about the reasons for the Freenet
           connectivity wrong?

The original paper is based on less sophisticated small-world models than the later stuff. I think most of the small-world references in there are hand-waving really.

If you pick an arbitrary cutoff and say that connections shorter than it are local and the others are long distance, then Kleinberg's model (and ours) will reflect the somewhat fuzzy statement you quote above.
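A quick sketch of that (the ring size, sample count and cutoff are arbitrary choices of mine): sample long-range link lengths with probability proportional to 1/d and split them at the cutoff.

import random

# Sketch: draw long-range link lengths on an n-node ring with
# Pr(length = d) proportional to 1/d (Kleinberg's harmonic
# distribution), then split them by an arbitrary local/long-distance
# cutoff. All parameters are illustrative.
n = 100_000
samples = 10_000
cutoff = 100  # links shorter than this we call "local"

distances = list(range(1, n // 2 + 1))
weights = [1.0 / d for d in distances]

lengths = random.choices(distances, weights=weights, k=samples)
local = sum(1 for d in lengths if d < cutoff)
print("local:", local, "long distance:", samples - local)
# Most sampled links are short and a minority span large distances,
# which is the picture the quoted passage paints, with no need for
# special highly-connected nodes.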

3. Why would one want to use the system with O(log^2(N)) path length
   when pretty much every DHT gives a path length of O(log(N))? At
   today's P2P nets scale (millions of nodes) the difference between
   log and log^2 is non-trivial. I understand that creating the small-
   world network by a simple and natural evolutionary process is very
   elegant, simple, and attractive, but is this elegance worth the
   increased path length? Why not simply use a DHT?

In real life, the order is much less interesting than the actual route length. One shouldn't stare blindly at the order when the constants could be dominating in real-world situations. Also, the log^2 is with a constant routing table size; if you scale up the routing table with the size of the network, you can roughly divide the order by the routing table size (most log n DHTs scale the routing table with log n as well).
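To put rough numbers on that (a back-of-the-envelope calculation, not a measurement of any real system):

import math

# Back-of-the-envelope comparison: O(log^2 n) with a constant routing
# table, roughly log^2(n)/k with a table of size k, and the O(log n) of
# a typical DHT (whose routing table itself usually grows like log n).
# The table size k is an illustrative choice.
k = 20
for n in (10**4, 10**6, 10**8):
    log_n = math.log2(n)
    print(f"n = {n:>9}: log^2 = {log_n**2:6.1f}, "
          f"log^2 / {k} = {log_n**2 / k:5.1f}, log = {log_n:5.1f}")
# At a million nodes log^2(n) is about 400 hops with a single long link
# per node, but dividing by a modest routing table size brings it into
# the same range as log(n); the constants matter more than the order.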

4. Speaking of DHTs: Oskar recently said something curious here that I did not understand: "Actually, while the frequency of Chord links
   falls exponentially with the "level" (not sure what the Chord term
   is) the length of such links increases exponentially as well, so in
   fact the frequency of links with certain lengths do fall
   harmonically. One could see Chord as some sort of "mean field"
   version of the same dynamics as Kleinberg's model."

   What exactly does this mean? DHTs (including Chord) have log(N)
   diameter, and Kleinberg has log^2(N), right? This looks fairly
   different to me, and I totally missed the meaning of the "mean
   field", and why Chord and Kleinberg would have the same dynamics;
   I did not even understand what is this dynamics that you're talking
   about. Oskar, could you please elaborate on this?

Again, the routing tables grow in Chord. If each node had only one "chord" link (every (log n)-th node having a chord of the same length), then Chord would be roughly log^2 as well.
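One way to see the "mean field" remark (my own illustration, not anything from the Chord paper): a finger table has one link at each distance of roughly 2^i, and each level covers a distance band twice as wide as the previous one, so the number of links per unit of distance falls off like 1/d, the same harmonic decay as Kleinberg's d^(-1) distribution.

import math

# Illustration: on an n-node ring, level i of a Chord-style finger
# table contributes one link covering the distance band [2^i, 2^(i+1)).
# The band doubles in width at every level, so links per unit of
# distance fall off like 1/d, matching Kleinberg's harmonic density.
n = 2**20
for i in range(int(math.log2(n))):
    band_width = 2**i
    print(f"level {i:2d}: one link in band [{2**i}, {2**(i + 1)}), "
          f"links per unit distance ~ {1 / band_width:.6f}")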

<>
6. Both the simulation in the Freenet paper[1] and the Sandberg/Dijjer
   paper[2] use random requests to rewire the networks. In reality,
   there will be all kinds of hotspots and uneven distributions of
   requests due to the different popularity of content. Do you guys
   think it would affect the results of the analysis performed in [2],
   and if yes, how? For example, if one particular file is extremely
   popular, one could imagine that its storage node[s] will have many
   more links than the average, and the search for the other (rarely
   requested) data items might become ineffective, because these path
   lengths will become very high.

I would imagine that the large number of files stored per node, and the random distribution implied by using a secure hash, would help even out the distribution. I'm not sure what an uneven distribution does under our algorithm, but intuitively it should handle it just fine (it would mean more links to the destinations of popular queries, but that is a good thing, since such links are popular...)
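A rough sketch of the "many files per node plus a secure hash" intuition (node and document counts are arbitrary; this only looks at where documents land, not at how often they are requested):

import hashlib
from collections import Counter

# Sketch: assign documents to nodes by hashing their names and look at
# how uneven the per-node load gets. With many documents per node the
# loads concentrate around the mean (the standard concentration
# argument), regardless of how the names themselves are distributed.
n_nodes = 1_000
n_docs = 100_000

load = Counter()
for i in range(n_docs):
    digest = hashlib.sha1(f"document-{i}".encode()).hexdigest()
    load[int(digest, 16) % n_nodes] += 1

mean = n_docs / n_nodes
worst = max(load.values())
print(f"mean load {mean:.0f}, max load {worst}, ratio {worst / mean:.2f}")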


   Do you think it is a valid concern? Are there any results that would
   show that this is not an issue?

Yes, and no (besides standard concentration results showing that it should even out if there are sufficiently many documents per node).

7. While Sandberg [2] assumes that the requests are passing between
   random points, both Freenet and Dijjer seem to allow the length
   of the path to be shortened when the requested content is already
   cached on an intermediate node. If this is really so, what effect
   will that have on Oskar's analysis results? Fewer links will
   be rewired, so this should have some effect on the graph dynamics,
   right? Any idea what it would be?

This is part of the answer to your first question about why these situations are harder to analyze.

8. The answer to '7' above might also depend on the cache size and
   replacement policies on the nodes, and on the content popularity
   distribution (which I already mentioned in '6' above). All these
   factors are hard to predict for the real system in advance, and I'm
   wondering what is the feeling that you guys have about the overall
   stability of the rewiring algorithm?

No idea.

   I mean, the original Kleinberg d(x, y)^(-1) approach is extremely
   sensitive to the value of this "-1". Any other value, and you either
   do not have enough long distance links, or you do not have enough
   short-range links once you arrive into the general vicinity of your
   destination. In any case, your path length becomes polynomial instead
   of polylogarithmic once you have the smallest deviation from this
   "-1" number.

It isn't actually that sensitive. It is sensitive asymptotically, but for any given network size varying the exponent a little will make rather little difference. In fact, one can calculate numerically (I know of no proof) that there are better exponents than -1 for any given n (though the best exponent approaches -1 as the network grows towards infinity).
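For anyone who wants to poke at this, a hedged simulation sketch (the ring size, trial count and exponents tested are all my own choices, not results from the thesis): greedy routing on a ring where each node has its two ring neighbours plus one long-range link drawn with probability proportional to d^(-alpha).

import random

# Sketch: greedy routing on an n-node ring. Each node has its two ring
# neighbours plus one long-range link whose length d is drawn with
# Pr(d) proportional to d**(-alpha). At a fixed, finite n, varying
# alpha a little around 1 changes the mean route length only modestly,
# even though asymptotically only alpha = 1 is polylogarithmic.
n = 20_000
trials = 2_000

def make_links(alpha):
    dists = list(range(1, n // 2 + 1))
    weights = [d ** (-alpha) for d in dists]
    lengths = random.choices(dists, weights=weights, k=n)
    # each node x gets one long link to x + length (mod n)
    return [(x + lengths[x]) % n for x in range(n)]

def ring_dist(a, b):
    d = abs(a - b)
    return min(d, n - d)

def greedy_route(links, src, dst):
    hops, cur = 0, src
    while cur != dst:
        # move to whichever neighbour (ring or long link) is closest to dst
        cur = min(((cur + 1) % n, (cur - 1) % n, links[cur]),
                  key=lambda c: ring_dist(c, dst))
        hops += 1
    return hops

for alpha in (0.8, 0.9, 1.0, 1.1, 1.2):
    links = make_links(alpha)
    total = sum(greedy_route(links, random.randrange(n), random.randrange(n))
                for _ in range(trials))
    print(f"alpha = {alpha}: mean hops ~ {total / trials:.1f}")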

   And this makes me a bit uneasy; I do understand that the catastrophic
   failure is unlikely, and the detrimental effect of all these changes
   (if any) is likely to be tolerable, but still... Look at it this
   way: Chapter 1 of Oskar's thesis is dedicated to proving that
   "...family given by (1.1) allows for polylogarithmic routing at, and
   only at, one value of the [alpha]...", then Chapter 2 proves
   that this is exactly what happens with random-request rewiring, and
   then Dijjer's practical implementation and deployment arrive and
   throw these random requests out of the window with their caching,
   different content popularity, and whatnot. So given the strict
   requirements that were proven to be vital in Chapter 1, I cannot
   help wondering: is Dijjer still polylogarithmic, or not?

It is certainly not proved.

   I'd feel much better about all this if there were some results
   showing that the system is stable in the presence of such changes
   (in a sense that reasonably small algorithm input changes should not
   cause any catastrophically large changes in the algorithm results).
   Do you have any results that would point in that direction? Would
   you say that the data displayed in Figures 2.4-2.6 already shows
   that your results are not exactly Kleinberg, but this does not
   cause any catastrophic consequences, or do you also have some other
   sources of optimism?

I would feel better about it too if more were proven. I am optimistic that things actually do work, but I am not very optimistic about how much I will ever be able to show (and it isn't just me: though a better mathematician may of course have gotten further, I have put these problems in front of many good researchers who seem to agree they are difficult problems).

9. And finally, if I'm not mistaken, the Dijjer description seems to
   imply that the links are rewired after every search, whereas Oskar
   suggests that this should happen only with probability p, and in
   section 2.3 says: "...there are simple heuristic arguments for why p
   should reasonably be on the order of one over the expected length of
   the greedy walks". Any particular reason for that Dijjer behaviour,
   or have I missed something and Dijjer does, in fact, rewire with
   this probability p? (Would be interesting to hear these simple
   heuristic arguments too, but that's another story...)

It won't matter so much if the routing table is large (remember the mathematical model has one long link), though I would recommend that Dijjer update less often.
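For reference, a sketch of the rewiring step being discussed (my own framing of the thesis description, not Dijjer code): after a successful greedy walk, each node on the path re-points its single long link at the walk's destination with probability p.

import random

# Sketch of probabilistic rewiring: after a greedy walk from src to dst
# succeeds, every node on the path independently replaces its long-range
# link with a link to dst with probability p. Section 2.3's heuristic
# suggests p on the order of one over the expected walk length; the
# numbers below are placeholders for illustration.
def rewire_path(long_link, path, dst, p):
    """long_link: dict mapping node -> its single long-range link."""
    for node in path:
        if node != dst and random.random() < p:
            long_link[node] = dst
    return long_link

expected_walk_length = 10           # assumed, for illustration only
p = 1.0 / expected_walk_length
links = {0: 7, 3: 9, 5: 2, 8: 1}    # toy routing state
rewire_path(links, path=[0, 3, 5, 8], dst=42, p=p)
print(links)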

// oskar