Send Devl mailing list submissions to
        devl at freenetproject.org

To subscribe or unsubscribe via the World Wide Web, visit
        http://www.uprizer.com/mailman/listinfo/devl
or, via email, send a message with subject or body 'help' to
        devl-request at freenetproject.org

You can reach the person managing the list at
        devl-admin at freenetproject.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Devl digest..."


Today's Topics:

   1. Re: Proposal: algorithm for forgetting documents in datastore (Tavin Cole)
   2. Re: Proposal: algorithm for forgetting documents in datastore (Tavin Cole)
   3. Re: Proposal: algorithm for forgetting documents in datastore (Tavin Cole)
   4. Re: Proposal: algorithm for forgetting documents in datastore (Tavin Cole)
   5. Re: Proposal: algorithm for forgetting documents in datastore (Tavin Cole)
   6. Re: Proposal: algorithm for forgetting documents in datastore (Tavin Cole)
   7. Re: Proposal: algorithm for forgetting documents in datastore (Tavin Cole)
   8. Re: Proposal: algorithm for forgetting documents in datastore (Oskar Sandberg)
   9. Re: Proposal: algorithm for forgetting documents in datastore (Tavin Cole)
  10. Re: Proposal: algorithm for forgetting documents in datastore (Oskar Sandberg)
  11. Re: Proposal: algorithm for forgetting documents in datastore (Oskar Sandberg)

--__--__--

Message: 1
Date: Sun, 28 Jan 2001 19:06:28 -0500
From: Tavin Cole <[email protected]>
To: devl at freenetproject.org
Subject: Re: [freenet-devl] Proposal: algorithm for forgetting documents in 
datastore
Reply-To: devl at freenetproject.org

On Sun, Jan 28, 2001 at 05:23:12PM -0500, Scott G. Miller wrote:
> On Sun, Jan 28, 2001 at 10:24:42PM +0100, Oskar Sandberg wrote:
> > On Sun, Jan 28, 2001 at 12:08:04PM -0500, Scott G. Miller wrote:
> > > On Sun, Jan 28, 2001 at 04:38:49PM +0000, Theodore Hong wrote:
> > <>
> > > > Limiting the past time interval would require picking an arbitrary 
> > > > cutoff.
> > > > The best way is to apply a decay factor to old requests, something like:
> > > > 
> > > > P = #requests in the past hour
> > > >         + (#requests in the previous 2 hrs)/2
> > > >         + (#requests in the previous 4 hrs)/4
> > > >         + (#requests in the previous 8 hrs)/8
> > > >         + ...
> > > 
> > > Every two hours you halve P.  When a request comes in you add a constant.
> > 
> > That is the same thing, Theo just put it as a formula. I think I sense a
> > cultural conflict here...
> I was suggesting an implementation of his formula.    Not disagreeing in
> any way.

Not quite, your algorithm divides by 8 after 6 hours, not after 8.. but anyway..
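
To make the difference concrete, here's a throwaway sketch in Python (the
bucket boundaries are just my reading of Theo's formula, nothing official):

    # Rough sketch only: weight of a single request of a given age (hours)
    # under Theo's bucketed formula vs. Scott's halve-P-every-2-hours scheme.

    def theo_weight(age):
        # Buckets as I read them: [0,1) -> 1, [1,3) -> 1/2, [3,7) -> 1/4,
        # [7,15) -> 1/8, and so on (each bucket twice as wide as the last).
        n, upper = 0, 1
        while age >= upper:
            n += 1
            upper = 2 ** (n + 1) - 1
        return 0.5 ** n

    def scott_weight(age, halflife=2.0):
        # P is halved every `halflife` hours, so a request this old has been
        # halved floor(age / halflife) times.
        return 0.5 ** int(age // halflife)

    for age in (0.5, 2, 6, 10):
        print(age, theo_weight(age), scott_weight(age))
    # At age 6 the bucket formula still gives 1/4 while the halving scheme
    # already gives 1/8, which is all I meant above.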

-- 

// Tavin Cole


--__--__--

Message: 2
Date: Sun, 28 Jan 2001 19:17:58 -0500
From: Tavin Cole <[email protected]>
To: devl at freenetproject.org
Subject: Re: [freenet-devl] Proposal: algorithm for forgetting documents in 
datastore
Reply-To: devl at freenetproject.org

On Sun, Jan 28, 2001 at 12:05:33PM -0500, Scott G. Miller wrote:
> On Sun, Jan 28, 2001 at 02:28:26AM -0500, Tavin Cole wrote:
> > On Sat, Jan 27, 2001 at 05:14:55PM -0500, Benjamin Coates wrote:
> > > >It seems totally reasonable for a node to reject caching documents
> > > >greater in size than a certain fraction of the total datastore size
> > > >(1% or less I should think).  That prevents your worst-case example,
> > > >but saves us from making relative value judgments using file size as
> > > >the metric..
> > > 
> > > 1% is pretty tiny, that's 1MB on the default datastore size, and also
> > > leads to the bizarre situation where all the insert nodes pass on the
> > > request and return success, even though none of them store it.  Maybe
> > > nodes should return failure instead of passing through an insert they
> > > can't cache...
> > 
> > OTOH, it's probably not good if a datastore can only hold a handful of keys
> > b/c they are all a considerable fraction of its max size.  1% may not be
> > right, tho.
> 1% is definitely too small.  I don't want any assumptions made unless they
> include datastores that are on average 100-200mb in size (a size I
> consider optimal in a large freenet consisting of 'normal', i.e., not
> dedicated nodes).

hmm.. 2^-5?

> I think it's probably reasonable to store 5mb files in a datastore, as
> well.  This all comes down to the file splitting issue, where there are a
> number of questions:
> 
> Max file size

We may not need to explicitly enforce this:  if nodes reject files greater than
storesize*2^-5 or whatever, it effectively implements a max file size.  OTOH
we could make it a matter of policy, which would result in the minimum allowable
datastore size being Max file size * 2^5 (or whatever).

> Minimum file size (under which we pad)

Probably 512 or 1024 bytes..  so that all control documents look alike and are
indistinguishable from small pieces of real content.

> Continuous or discrete file sizes (ie only powers of 2, or any size).

Definitely only powers of 2.  Not only good for defeating traffic analysis,
but it will make it a lot easier to analyze & simulate caching/forgetting
algorithms.
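
Taken together, the size rules we're kicking around might look something
like this (a sketch only; the 1024-byte floor and the 2^-5 cap are just the
numbers from this thread, not anything decided):

    # Sketch of the size rules under discussion: pad small files, round
    # sizes up to a power of two, and reject anything bigger than a fixed
    # fraction of the datastore.  All constants are illustrative.

    MIN_FILE_SIZE = 1024          # pad anything smaller up to this
    MAX_STORE_FRACTION = 2 ** -5  # reject files bigger than storesize/32

    def padded_size(size):
        # Round a raw size up to the next power of two, with a floor.
        padded = MIN_FILE_SIZE
        while padded < size:
            padded *= 2
        return padded

    def node_accepts(store_size, file_size):
        # Would a node with this store size cache a file of this size?
        return padded_size(file_size) <= store_size * MAX_STORE_FRACTION

    # e.g. padded_size(700) == 1024, padded_size(3 * 2**20) == 4 * 2**20,
    # and a 256MB store still takes a 5MB file once it's padded to 8MB:
    # node_accepts(256 * 2**20, 5 * 2**20) == True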

> Number of parts into which a file can be split but still be retrieved

Is there any reason to make this policy?  Certainly a best practice will
evolve from empirical evidence.

> Redundancy? (I'm with Ian, this is not really a good idea in Freenet)

We don't need to worry about this now.. it's probably strictly a client
issue anyway.


*sigh* .. all these arbitrary decisions we can't avoid..


-- 

// Tavin Cole


--__--__--

Message: 3
Date: Sun, 28 Jan 2001 19:33:25 -0500
From: Tavin Cole <[email protected]>
To: devl at freenetproject.org
Subject: Re: [freenet-devl] Proposal: algorithm for forgetting documents in 
datastore
Reply-To: devl at freenetproject.org

On Sun, Jan 28, 2001 at 04:38:49PM +0000, Theodore Hong wrote:
> Tavin Cole <tavin at mailandnews.com> wrote:
> > It's clear to me now that the only sane mechanism for deciding what files
> > to keep in the document store is one that consistently preserves the
> > files that receive the greatest number of requests.  So let's define P as
> > the measure of a file's request history.  Simplistically,
> >
> >  P = total no. of requests for this file / time since file was stored
> >
> > In practice, though, I think we'd want to limit the past time interval to
> > what is "recent".  Also, the units in the denominator should not be too
> > granular, to keep P from blowing out of proportion for brand new files.
> > Something in the neighborhood of hours or minutes, probably.
> 
> Limiting the past time interval would require picking an arbitrary cutoff.
> The best way is to apply a decay factor to old requests, something like:

Yes, I was going to suggest that :)   But any decay model is just as arbitrary
as an arbitrary cutoff...


> P = #requests in the past hour
>       + (#requests in the previous 2 hrs)/2
>       + (#requests in the previous 4 hrs)/4
>       + (#requests in the previous 8 hrs)/8
>       + ...
> 
> (I'm sure there's a nicer way to express this as a continuous function 
> using an integral, but I can't think of it at the moment.  Anyway, you
> get the idea.)

Sure, although I don't think we should be decaying P until something is
at least a day or two old.


> > When a new file is added to the store, as a result of either a request or
> > an insert, it could be allocated 1 request in the above equation.  When
> > the store is full, you basically want to discard the entries with lowest
> > P.
> >
> > This is trivial if all files are the same size.  The question is, if a
> > file is going to displace >1 smaller files, do we compare its P to the
> > summed P of the smaller files?  Do we compare its P to some function of
> > the P of the smaller files, where f(P1..n) < P1 + .. + Pn ?  Or do we
> > ignore the size difference and just look at the highest P of the smaller
> > files?
> 
> I think the most straightforward way is simply to compare the one file's P
> to the sum of all the P's it would displace.  This maximizes the number of
> successful requests.  For example, one 100k file getting 11 requests/day
> should be ranked higher than ten 10k files getting 1 request/day each.

Perhaps...  but we don't want to make larger files non-viable either.  They
can only be split into so many pieces.

Suppose the choice is between a big file with P=100 and 5 small files with
P=25 each.  I'd rather keep the big one with P=100.. the other files can find
homes in other caches..  otherwise this super-popular big file wouldn't
ever be cached anywhere, since you could probably always find enough small
files with P=1 to add up to more than 100!
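
In code, the two rules being debated come down to something like this
(hypothetical helper names, obviously):

    # Sketch of the two displacement rules: admit the incoming file if its P
    # beats the *sum* of the Ps it would displace (Theo's suggestion), or
    # only if it beats the *highest* single P among them.

    def admit_by_sum(new_p, displaced_ps):
        return new_p > sum(displaced_ps)

    def admit_by_max(new_p, displaced_ps):
        return new_p > max(displaced_ps)

    big_p, small_ps = 100, [25] * 5     # the example above
    admit_by_sum(big_p, small_ps)       # False: 100 < 125, big file stays out
    admit_by_max(big_p, small_ps)       # True:  100 > 25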

> The difficulty is deciding how to assess P for a new file.  If it's too
> low, new files may never make it in.  If it's too high, we just have the
> current situation (new files always have highest priority).

Well, you either assign the new file one request in the equation for P,
or you assign it some fraction <=1 of the mean P in the document store.
It's the same as choosing where to position it in the stack.

-- 

// Tavin Cole


--__--__--

Message: 4
Date: Sun, 28 Jan 2001 19:34:36 -0500
From: Tavin Cole <[email protected]>
To: devl at freenetproject.org
Subject: Re: [freenet-devl] Proposal: algorithm for forgetting documents in 
datastore
Reply-To: devl at freenetproject.org

On Sun, Jan 28, 2001 at 01:04:23PM -0800, Mr. Bad wrote:
> Well, it seems that size is -too- much of a factor, and that it's hard
> to get large files (like, MP3-size files) to stay in the system for
> any reasonable length of time.
> 
> Does anyone else have the same experience? Or am I once again smoking
> crack?

And not sharing!

-- 

// Tavin Cole


--__--__--

Message: 5
Date: Sun, 28 Jan 2001 19:39:49 -0500
From: Tavin Cole <[email protected]>
To: devl at freenetproject.org
Subject: Re: [freenet-devl] Proposal: algorithm for forgetting documents in 
datastore
Reply-To: devl at freenetproject.org

On Sun, Jan 28, 2001 at 04:34:51PM +0000, Theodore Hong wrote:
> > > What units we deal with doesn't really matter since you are only looking
> > > at ratios anyways.
> >
> > No, the units don't matter for computations, but they are important when
> > you try to assess the *meaning* of the math.  For example, I'm pretty
> > skeptical of a quantity measured in byte-seconds, and don't see any a
> > priori reason why it deserves a principle of least action... if you'll
> > excuse the metaphor :)
> 
> What's intrinsically wrong with byte-seconds?  It has quite a clear meaning
> -- it is a volume in spacetime.  Over the lifetime of a node, its datastore
> has a certain total volume, so this is just a packing problem: how do we
> allocate space in this volume so that the total value stored is maximized?

Okay, this makes sense, thanks.  However, it seems very difficult to account
for the importance of popularity when you look at things this way.  Our
phase space should include popularity as well as bytesize and time.
E[T_i]*size_i = k assumes every file is equally popular.

> The fact that the units may look strange is irrelevant -- you might as well
> ask why anyone would care about the quantity kilogram-meters^2/seconds^2?
> Or measure electrical quantities in centimeters?

I'm not afraid of how they look, as long as they have a good rationale.
kg*m^2/s^2 has a good rationale.

-- 

// Tavin Cole


--__--__--

Message: 6
Date: Sun, 28 Jan 2001 19:51:16 -0500
From: Tavin Cole <[email protected]>
To: devl at freenetproject.org
Subject: Re: [freenet-devl] Proposal: algorithm for forgetting documents in 
datastore
Reply-To: devl at freenetproject.org

On Sun, Jan 28, 2001 at 04:37:04PM +0100, Oskar Sandberg wrote:
> > > In theory the equation is very simple: E[T_i]*size_i = k, where k is a
> > > constant, T_i is the random variable giving the length of time a
> > > document i stays in the cache, and size_i is the size of document i.
> > > Figuring out the formula for T_i and thus the expected value from an
> > > algorithm (or in this situation, actually the opposite - finding an
> > > algorithm that gives E[T_i] = k / size_i) is of course the difficult
> > > part, but it's neither impossible nor fuzzy in any way.
> > 
> > So, I'm guessing E[] denotes summation and you are meaning to write
> > E[T_i*size_i] = k .. am I right?  Is this summation based on a snapshot
> > of the datastore at a given moment, where the T_i is the time for each
> > document to live from that moment?  Is k a function of the total
> > number of documents (I hope)?
> 
> No, E[X] means the expected value of stochastic (random) variable X. It
> is given by:
> 
> E[X] = Sum(x*P(X=x)) 
> 
> over all possible outcomes x for X, where P(X=x) is the probability that X
> takes that value (for continuous values like ours you integrate instead
> of course).

Ok, got it.  I was overly hung up on summation b/c of something you said about
doing an integral.. but of course you meant integrating over x there.

I still see no a priori basis for asserting E[T_i]*size_i = k.  The only way
it might make sense is if every file were equally popular.  Even then,
ten 100-byte files would live 10 times as long as one 1000-byte file..
even if they constituted the 10 split components of that 1000-byte file..

> And as you may have noticed I am not arguing for this model, I have been
> arguing against it...

Thank you for presenting the arguments of the other side, nevertheless.
Of course, you make an excellent advocate of the devil, Oskar.

-- 

// Tavin Cole


--__--__--

Message: 7
Date: Sun, 28 Jan 2001 20:00:52 -0500
From: Tavin Cole <[email protected]>
To: devl at freenetproject.org
Subject: Re: [freenet-devl] Proposal: algorithm for forgetting documents in 
datastore
Reply-To: devl at freenetproject.org

On Sun, Jan 28, 2001 at 01:27:54AM -0800, Ian Clarke wrote:
> On Sun, Jan 28, 2001 at 03:14:30AM -0500, Tavin Cole wrote:
> > The only legitimate metric we have for the value of any file on Freenet is
> > its request history.  To my understanding, even Ian was not arguing that
> > smaller files are intrinsically more valuable, but rather that we need to
> > be concerned about how the value of a large file compares to the aggregate
> > value of the number of small files it potentially replaces.
> 
> What do you mean *EVEN* Ian? ;-)

You know, good ol' even-keeled Ian..

> > It's clear to me now that the only sane mechanism for deciding what files
> > to keep in the document store is one that consistently preserves the files
> > that receive the greatest number of requests.  So let's define P as the
> > measure of a file's request history.  Simplistically,
> > 
> >   P = total no. of requests for this file / time since file was stored
> > 
> > In practice, though, I think we'd want to limit the past time interval
> > to what is "recent".  Also, the units in the denominator should not
> > be too granular, to keep P from blowing out of proportion for brand
> > new files.  Something in the neighborhood of hours or minutes, probably.
> 
> The problem here, and the issue that led me to initially choose a Least
> Recently Used (LRU) cache, is what units should be used for the time
> since the file was last accessed/stored/whatever.  Deciding on this
> requires making an arbitrary decision on how quickly requests for a file
> should be "forgotten", and I don't like arbitrary decisions.  Employing
> an LRU cache avoided this issue since the actual times involved are not
> relevant, just the order in which events happened.

Believe me, I sympathize, but eventually you've got to face the music and
make the arbitrary decisions that are unavoidable as a result of choosing
the best algorithm.  And we don't necessarily have to have an algorithm
that explicitly keeps track of P and decays it appropriately;  we just need
an algorithm that produces this effect.

If we agree that the correct way to measure P *is* the way I defined it
above (leaving aside the specific details of decay/cutoff), and we agree
that the best algorithm is the one that maximizes P over the whole store,
then I'm afraid the LRU cache will not do very well.

Fortunately, if we can agree on the first 2 points above, this can be
simulated and proven one way or the other.
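
For what it's worth, the simulation I have in mind is roughly this shape
(toy numbers, a made-up Zipf-ish workload, and equal-sized files to keep it
simple):

    # Toy sketch: feed one request stream to an LRU store and to a
    # "keep the highest-P documents" store of equal capacity, and count
    # successful requests.  Everything here is illustrative, not Freenet code.
    import random
    from collections import OrderedDict

    CAPACITY = 100    # documents per store (all the same size, for simplicity)
    KEYS = 1000       # distinct documents in the workload
    REQUESTS = 50000

    def zipf_stream(n):
        weights = [1.0 / (k + 1) for k in range(KEYS)]
        return random.choices(range(KEYS), weights=weights, k=n)

    def run_lru(stream):
        store, hits = OrderedDict(), 0
        for key in stream:
            if key in store:
                hits += 1
                store.move_to_end(key)
            else:
                store[key] = True
                if len(store) > CAPACITY:
                    store.popitem(last=False)   # evict least recently used
        return hits

    def run_highest_p(stream):
        # P here is just a running request count per cached key, a crude
        # stand-in for the decayed request rate discussed above.
        store, hits = {}, 0
        for key in stream:
            if key in store:
                hits += 1
                store[key] += 1
            else:
                store[key] = 1
                if len(store) > CAPACITY:
                    # Evict the lowest P.  Note the victim may be the newcomer
                    # itself, which is exactly Theo's "new files may never
                    # make it in" worry.
                    victim = min(store, key=store.get)
                    del store[victim]
        return hits

    stream = zipf_stream(REQUESTS)
    print("LRU hits:      ", run_lru(stream))
    print("highest-P hits:", run_highest_p(stream))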

-- 

// Tavin Cole


--__--__--

Message: 8
Date: Mon, 29 Jan 2001 02:04:40 +0100
From: Oskar Sandberg <[email protected]>
To: devl at freenetproject.org
Subject: Re: [freenet-devl] Proposal: algorithm for forgetting documents in 
datastore
Reply-To: devl at freenetproject.org

On Sun, Jan 28, 2001 at 07:39:49PM -0500, Tavin Cole wrote:
<> 
> Okay, this makes sense, thanks.  However, it seems very difficult to account
> for the importance of popularity when you look at things this way.  Our
> phase space should include popularity as well as bytesize and time.
> E[T_i]*size_i = k assumes every file is equally popular.

Yeah, I meant that all along. I.e., k is not a constant but a function of
the number of requests, shared by all i. Sorry.

> > The fact that the units may look strange is irrelevant -- you might as well
> > ask why anyone would care about the quantity kilogram-meters^2/seconds^2?
> > Or measure electrical quantities in centimeters?
> 
> I'm not afraid of how they look, as long as they have a good rationale.
> kg*m^2/s^2 has a good rationale.

That a node's capacity in Freenet is measured in the space available * the
period of time it is available seems quite obvious to me...


-- 
'DeCSS would be fine. Where is it?'
'Here,' Montag touched his head.
'Ah,' Granger smiled and nodded.

Oskar Sandberg
md98-osa at nada.kth.se


--__--__--

Message: 9
Date: Sun, 28 Jan 2001 20:04:27 -0500
From: Tavin Cole <[email protected]>
To: devl at freenetproject.org
Subject: Re: [freenet-devl] Proposal: algorithm for forgetting documents in 
datastore
Reply-To: devl at freenetproject.org

On Sun, Jan 28, 2001 at 12:17:11PM -0800, Ian Clarke wrote:
> The choice of units is essentially arbitrary but *will* affect the
> behaviour of the system.  The need to make arbitrary decisions is a
> symptom of a flawed design.  This was one of the advantages of using an
> LRU cache.

Choosing an inferior algorithm to run away from the arbitrary decision
necessitated by a better algorithm is equally flawed.


> Before we start proposing replacements for the current mechanism,
> someone should actually figure out if there is anything wrong with it!
> If it isn't broken, don't fix it.

I am going to run some simulations..  I think there's adequate reason to
be suspicious of the current mechanism :)

-- 

// Tavin Cole


--__--__--

Message: 10
Date: Mon, 29 Jan 2001 02:09:54 +0100
From: Oskar Sandberg <[email protected]>
To: devl at freenetproject.org
Subject: Re: [freenet-devl] Proposal: algorithm for forgetting documents in 
datastore
Reply-To: devl at freenetproject.org

On Sun, Jan 28, 2001 at 07:51:16PM -0500, Tavin Cole wrote:
<> 
> I still see no a priori basis for asserting E[T_i]*size_i = k.  The only way
> it might make sense is if every file were equally popular.  Even then,
> ten 100-byte files would live 10 times as long as one 1000-byte file..
> even if they constituted the 10 split components of that 1000-byte file..

Right, that was my argument. E[T_i]*size_i = k(R_i) is based on the
assumption that all files are equally valuable, but that assumption leads
to a lot of weird artifacts when you realize that one file could contain
exactly the same thing as all of 100 others...

<>

-- 
'DeCSS would be fine. Where is it?'
'Here,' Montag touched his head.
'Ah,' Granger smiled and nodded.

Oskar Sandberg
md98-osa at nada.kth.se


--__--__--

Message: 11
Date: Mon, 29 Jan 2001 02:32:11 +0100
From: Oskar Sandberg <[email protected]>
To: devl at freenetproject.org
Subject: Re: [freenet-devl] Proposal: algorithm for forgetting documents in 
datastore
Reply-To: devl at freenetproject.org

On Sun, Jan 28, 2001 at 07:17:58PM -0500, Tavin Cole wrote:
> On Sun, Jan 28, 2001 at 12:05:33PM -0500, Scott G. Miller wrote:
<>
> > 1% is definitely too small.  I don't want any assumptions made unless they
> > include datastores that are on average 100-200mb in size (a size I
> > consider optimal in a large freenet consisting of 'normal', i.e., not
> > dedicated nodes).
> 
> hmm.. 2^-5?

That could work, though I don't think this needs to be a mandated value.
It really depends on the size of the store.

> > I think it's probably reasonable to store 5mb files in a datastore, as
> > well.  This all comes down to the file splitting issue, where there are a
> > number of questions:
> > 
> > Max file size
> 
> We may not need to explicitly enforce this:  if nodes reject files greater
> than storesize*2^-5 or whatever, it effectively implements a max file size.
> OTOH we could make it a matter of policy, which would result in the minimum
> allowable datastore size being Max file size * 2^5 (or whatever).

It's not enforceable anyways, nodes can always lie. These rules are for
the benefit of the node operator.

> > Minimum file size (under which we pad)
> 
> Probably 512 or 1024 bytes..  so that all control documents look alike and are
> indistinguishable from small pieces of real content.

1024 bytes at least. 

> > Continuous or discrete file sizes (ie only powers of 2, or any size).
> 
> Definitely only powers of 2.  Not only good for defeating traffic analysis,
> but it will make it a lot easier to analyze & simulate caching/forgetting
> algorithms.

I agree, though for the first reason rather than the second.

> > Number of parts into which a file can be split but still be retrieved
> 
> Is there any reason to make this policy?  Certainly a best practice will
> evolve from empirical evidence.

Making policy on something like this is self-destructive. Having the
client refuse to download more parts against a user's wishes is
completely unethical, basically on the same level as our worst enemies
(we could start suing people who make clients that don't do it!)

<>
-- 
'DeCSS would be fine. Where is it?'
'Here,' Montag touched his head.
'Ah,' Granger smiled and nodded.

Oskar Sandberg
md98-osa at nada.kth.se



--__--__--

_______________________________________________
Devl mailing list
Devl at freenetproject.org
http://www.uprizer.com/mailman/listinfo/devl


End of Devl Digest
