Send Devl mailing list submissions to
devl at freenetproject.org
To subscribe or unsubscribe via the World Wide Web, visit
http://www.uprizer.com/mailman/listinfo/devl
or, via email, send a message with subject or body 'help' to
devl-request at freenetproject.org
You can reach the person managing the list at
devl-admin at freenetproject.org
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Devl digest..."
Today's Topics:
1. Re: Proposal: algorithm for forgetting documents in datastore (Adam Langley)
2. Re: Proposal: algorithm for forgetting documents in datastore (Theodore Hong)
3. Re: Proposal: algorithm for forgetting documents in datastore (Oskar Sandberg)
4. Re: Proposal: algorithm for forgetting documents in datastore (Oskar Sandberg)
5. Re: Proposal: algorithm for forgetting documents in datastore (Oskar Sandberg)
6. Re: Proposal: algorithm for forgetting documents in datastore (Oskar Sandberg)
7. Re: Suggestion for solving the small vs. large file debate (Oskar Sandberg)
8. Re: Proposal: algorithm for forgetting documents in datastore (Theodore Hong)
9. Re: Suggestion for solving the small vs. large file debate (Theodore Hong)
10. Re: Proposal: algorithm for forgetting documents in datastore (Theodore Hong)
11. Re: Proposal: algorithm for forgetting documents in datastore (Scott G. Miller)
--__--__--
Message: 1
Date: Sun, 28 Jan 2001 10:49:43 +0000
From: Adam Langley <[email protected]>
To: devl at freenetproject.org
Subject: Re: [freenet-devl] Proposal: algorithm for forgetting documents in
datastore
Reply-To: devl at freenetproject.org
On Sat, Jan 27, 2001 at 11:48:11PM +0100, Oskar Sandberg wrote:
> I haven't really been able to make up my mind on this. Should I reply
> with a RequestFailed on Inserts for data larger than whatever the nodes
> limit is?
Doing a Socket->Socket transfer isn't going to break the code any
(I expect) so I'd say pass it on. It helps the nodes to find their
keyspace quickly. Otherwise you're fuzzying (bad word, I hope you
know what I mean) the routing.
AGL
--
Whenever anyone says, "theoretically," they really mean, "not really."
--__--__--
Message: 2
From: Theodore Hong <[email protected]>
Subject: Re: [freenet-devl] Proposal: algorithm for forgetting documents in
datastore
To: devl at freenetproject.org
Date: Sun, 28 Jan 2001 14:16:21 +0000 (GMT)
Reply-To: devl at freenetproject.org
"Mark J. Roberts" <mjr at statesmean.com> wrote:
> On Sat, 27 Jan 2001, Oskar Sandberg wrote:
> > Either way is just as screwed up, and at least the former is the natural
> > default and doesn't allow strategic techniques for making files live
> > longer.
> <>
> > Perhaps a compromise that sets VAD = 1/f(size) where f is some less than
> > linear function is a possible compromise.
>
> I tend to favor Ian's model because I'm a proponent of segmentation (even
> slightly redundant segmentation), and Ian's model encourages such
> segmentation. However, I certainly wouldn't want to see it pushed to
> extremes, where data must be inserted as tiny segments to survive. But I
> think we should discourage large (>1M) files. Looks like a compromise is
> necessary.
Historically, this tension between large and small resulted in the Great
Compromise of 1787 in the United States. The rights of small states were
preserved by equal representation by state in the Senate, while those of
large states were preserved by proportional representation by size in the
House. Perhaps we should adopt something similar.
theo
--__--__--
Message: 3
Date: Sun, 28 Jan 2001 16:12:20 +0100
From: Oskar Sandberg <[email protected]>
To: devl at freenetproject.org
Subject: Re: [freenet-devl] Proposal: algorithm for forgetting documents in
datastore
Reply-To: devl at freenetproject.org
On Sat, Jan 27, 2001 at 11:08:46PM -0500, Scott G. Miller wrote:
> > > the bizarre situation where all the insert nodes pass on the request and
> > > return success, even though none of them store it. Maybe nodes should
> > > return failure instead of passing through an insert they can't cache...
> >
> > I haven't really been able to make up my mind on this. Should I reply
> > with a RequestFailed on Inserts for data larger than whatever the nodes
> > limit is?
> No, because won't the downstream node try its next choice? If it does,
> then there's no way the request will follow that same path.
Well, if the routing is that fragile then we are fucked all the same. The
intention would be that the routes through the node in question and the
route not through it would converge...
--
'DeCSS would be fine. Where is it?'
'Here,' Montag touched his head.
'Ah,' Granger smiled and nodded.
Oskar Sandberg
md98-osa at nada.kth.se
--__--__--
Message: 4
Date: Sun, 28 Jan 2001 16:22:53 +0100
From: Oskar Sandberg <[email protected]>
To: devl at freenetproject.org
Subject: Re: [freenet-devl] Proposal: algorithm for forgetting documents in
datastore
Reply-To: devl at freenetproject.org
On Sun, Jan 28, 2001 at 02:31:20AM -0500, Tavin Cole wrote:
> On Sat, Jan 27, 2001 at 11:48:11PM +0100, Oskar Sandberg wrote:
> > On Sat, Jan 27, 2001 at 05:14:55PM -0500, Benjamin Coates wrote:
> > <>
> > > 1% is pretty tiny, that's 1MB on the default datastore size, and also
> > > leads to the bizarre situation where all the insert nodes pass on the
> > > request and return success, even though none of them store it. Maybe
> > > nodes should return failure instead of passing through an insert they
> > > can't cache...
> >
> > I haven't really been able to make up my mind on this. Should I reply
> > with a RequestFailed on Inserts for data larger than whatever the nodes
> > limit is?
>
> But even if the node can't store the data b/c it is too large, it could
> stream it to an upstream node which can store the data, right?
On a Request we have no way to know the size of the data until it is
really too late to reject the request, so for that case we would simply have
the node tunnel the data but not cache it. We could do the same for
Inserts, but people are concerned that someone would insert a large
document and not know that all the nodes on the route failed to cache it,
and on Inserts we do have the opportunity to reject the InsertRequest right
away if the data is too big (we would have to add an InsertSize field to it
and then make sure that it is >= the actual DataLength, but that is trivial).
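For concreteness, the check could look something like this (only a sketch;
InsertSize is the hypothetical new field and none of these names are in the
actual tree):

    // Sketch of the size check on an incoming InsertRequest.
    // InsertSize is the proposed new field; nodeLimit is this node's cap.
    public final class InsertSizeCheck {
        /** True if the insert should be rejected with a RequestFailed. */
        public static boolean reject(long insertSize, long dataLength,
                                     long nodeLimit) {
            // Reject right away if the declared size is over our limit, or
            // if the sender lied and DataLength exceeds the declared InsertSize.
            return insertSize > nodeLimit || dataLength > insertSize;
        }

        public static void main(String[] args) {
            // e.g. a 5 MB insert hitting a node that only caches up to 1 MB
            System.out.println(reject(5000000L, 5000000L, 1000000L)); // true
        }
    }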
> How would we address this?
Streaming but not caching would be handled by a circular file buffer, i.e.
we give it a couple of hundred K and then overwrite the beginning when we
reach the end. This works better on Requests than Inserts, since if the
sending of the data fails on an Insert the node needs to be able to start
over from the beginning with another node, but a node that tries to
restart and has already overwritten the beginning of the data could just
send back a RequestFailed, so...
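Roughly what I have in mind, as a sketch (class and method names made up,
nothing here is in the actual tree):

    // Fixed-size circular buffer for tunnelling data without caching it.
    public class CircularTunnelBuffer {
        private final byte[] buf;
        private long written = 0;   // total bytes ever written

        public CircularTunnelBuffer(int size) {   // e.g. a couple of hundred K
            buf = new byte[size];
        }

        // Append bytes arriving from the sending node; this may overwrite
        // data near the start of the stream once we wrap around.
        public synchronized void write(byte[] b, int off, int len) {
            for (int i = 0; i < len; i++)
                buf[(int) ((written + i) % buf.length)] = b[off + i];
            written += len;
        }

        // (Re)read from an absolute stream offset, e.g. when restarting an
        // Insert with another node.  Returns -1 if that part of the stream
        // has already been overwritten -- the caller should then give up
        // and send back a RequestFailed, as described above.
        public synchronized int read(long offset, byte[] b, int off, int len) {
            if (offset < written - buf.length)
                return -1;                 // beginning already overwritten
            int n = (int) Math.min(len, written - offset);
            for (int i = 0; i < n; i++)
                b[off + i] = buf[(int) ((offset + i) % buf.length)];
            return n;
        }
    }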
--
'DeCSS would be fine. Where is it?'
'Here,' Montag touched his head.
'Ah,' Granger smiled and nodded.
Oskar Sandberg
md98-osa at nada.kth.se
--__--__--
Message: 5
Date: Sun, 28 Jan 2001 16:37:04 +0100
From: Oskar Sandberg <[email protected]>
To: devl at freenetproject.org
Subject: Re: [freenet-devl] Proposal: algorithm for forgetting documents in
datastore
Reply-To: devl at freenetproject.org
On Sun, Jan 28, 2001 at 02:51:41AM -0500, Tavin Cole wrote:
> On Sat, Jan 27, 2001 at 05:23:52PM +0100, Oskar Sandberg wrote:
> > > > Even a model that perfectly matched the size integral over expected time
> > > > would be based on a flawed premise.
> > >
> > > It's also based on a pseudomath unless someone can formulate this as an
> > > equation. For example, what are the units of this thing?
> > > megabytes/sec * dt doesn't make a heck of a lot of sense..
> >
> > I think a lot of probability theorists would be a rather insulted at
> > calling their field pseudo math (I, OTOH...).
>
> Hey, that's a compliment. It's almost math ;^p -- actually what makes
> you think I was referring to probability theorists anyway?
Because with probability theory you can set up an equation without knowing
the exact values of when data enters and leaves the node.
> > In theory the equation is very simple: E[T_i]*size_i = k, where k is a
> > constant, T_i is the random variable giving the length of time a
> > document i stays in the cache, and size_i is the size of document i.
> > Figuring out the formula for T_i and thus the expected value from an
> > algorithm (or in this situation, actually the opposite - finding an
> > algorithm that gives E[T_i] = k / size_i) is of course the difficult part,
> > but it's neither impossible nor fuzzy in any way.
>
> So, I'm guessing E[] denotes summation and you are meaning to write
> E[T_i*size_i] = k .. am I right? Is this summation based on a snapshot
> of the datastore at a given moment, where the T_i is the time for each
> document to live from that moment? Is k a function of the total
> number of documents (I hope)?
No, E[X] means the expected value of the stochastic (random) variable X. It
is given by:
E[X] = Sum( x * P(X=x) )
over all possible outcomes x of X, where P(X=x) is the probability that X
takes that value (for continuous variables like ours you integrate instead,
of course).
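(A concrete example with made-up numbers: if a document stays in the cache
one hour with probability 3/4 and ten hours with probability 1/4, then
E[T] = 1*(3/4) + 10*(1/4) = 3.25 hours.)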
> > What units we deal with doesn't really matter since you are only looking
> > at ratios anyways.
>
> No, the units don't matter for computations, but they are important when
> you try to assess the *meaning* of the math. For example, I'm pretty
> skeptical of a quantity measured in byte-seconds, and don't see any a
> priori reason why it deserves a principle of least action... if you'll
> excuse the metaphor :) But I'll table that argument until you can
> set me straight on the specifics of the above equation.
The term "byte-seconds" is of course not important, it's the fact that we
have space-time which is important. "k" in the above equation is measured
in spacetime since E[Ti] is a unit of time and spacei is a measure of
time.
And as you may have noticed I am not arguing for this model, I have been
arguing against it...
--
'DeCSS would be fine. Where is it?'
'Here,' Montag touched his head.
'Ah,' Granger smiled and nodded.
Oskar Sandberg
md98-osa at nada.kth.se
--__--__--
Message: 6
Date: Sun, 28 Jan 2001 16:40:29 +0100
From: Oskar Sandberg <[email protected]>
To: devl at freenetproject.org
Subject: Re: [freenet-devl] Proposal: algorithm for forgetting documents in
datastore
Reply-To: devl at freenetproject.org
On Sun, Jan 28, 2001 at 02:16:21PM +0000, Theodore Hong wrote:
> "Mark J. Roberts" <mjr at statesmean.com> wrote:
> <>
> > I tend to favor Ian's model because I'm a proponent of segmentation (even
> > slightly redundant segmentation), and Ian's model encourages such
> > segmentation. However, I certainly wouldn't want to see it pushed to
> > extremes, where data must be inserted as tiny segments to survive. But I
> > think we should discourage large (>1M) files. Looks like a compromise is
> > necessary.
>
> Historically, this tension between large and small resulted in the Great
> Compromise of 1787 in the United States. The rights of small states were
> preserved by equal representation by state in the Senate, while those of
> large states were preserved by proportional representation by size in the
> House. Perhaps we should adopt something similar.
Because the whole American democracy thing worked out sooo well....
--
'DeCSS would be fine. Where is it?'
'Here,' Montag touched his head.
'Ah,' Granger smiled and nodded.
Oskar Sandberg
md98-osa at nada.kth.se
--__--__--
Message: 7
Date: Sun, 28 Jan 2001 16:44:16 +0100
From: Oskar Sandberg <[email protected]>
To: devl at freenetproject.org
Subject: Re: [freenet-devl] Suggestion for solving the small vs. large file
debate
Reply-To: devl at freenetproject.org
On Sun, Jan 28, 2001 at 01:29:41AM -0800, Ian Clarke wrote:
> On Sun, Jan 28, 2001 at 01:11:00AM -0800, Dev Random wrote:
> > Yeah, I have proposed this awhile ago with each bucket storing files
> > twice as large as the previous bucket. The algorithm for retiring files
> > was to retire based on age in seconds multiplied by the size of the file.
> > The file with the largest such "weight" would be retired if the datastore
> > was out of space.
>
> But why seconds, and what units would you choose for the size of the
> file? This requires making an arbitrary decision, and arbitrary
> decisions are bad ("hey, let's have a memory limit of 640k, nobody will
> ever need more than that!").
Yeah, because the oldest file in seconds will so not be the oldest file in
minutes, and a file twice as large in bytes will so not be twice as
large in kB...
I don't think that Dev's idea is a good one at all, but while your
argument against arbitrary decisions made a lot of sense against what Tavin
proposed, it makes no sense here.
--
'DeCSS would be fine. Where is it?'
'Here,' Montag touched his head.
'Ah,' Granger smiled and nodded.
Oskar Sandberg
md98-osa at nada.kth.se
--__--__--
Message: 8
From: Theodore Hong <[email protected]>
Subject: Re: [freenet-devl] Proposal: algorithm for forgetting documents in
datastore
To: devl at freenetproject.org
Date: Sun, 28 Jan 2001 16:34:51 +0000 (GMT)
Reply-To: devl at freenetproject.org
Tavin Cole <tavin at mailandnews.com> wrote:
> On Sat, Jan 27, 2001 at 05:23:52PM +0100, Oskar Sandberg wrote:
> > > > Even a model that perfectly matched the size integral over expected
> > > > time would be based on a flawed premise.
>
> > In theory the equation is very simple: E[T_i]*size_i = k, where k is a
> > constant, T_i is the random variable giving the length of time a
> > document i stays in the cache, and size_i is the size of document i.
> > Figuring out the formula for T_i and thus the expected value from an
> > algorithm (or in this situation, actually the opposite - finding an
> > algorithm that gives E[T_i] = k / size_i) is of course the difficult part,
> > but it's neither impossible nor fuzzy in any way.
>
> So, I'm guessing E[] denotes summation and you are meaning to write
> E[T_i*size_i] = k .. am I right? Is this summation based on a snapshot of
> the datastore at a given moment, where the T_i is the time for each
> document to live from that moment? Is k a function of the total number
> of documents (I hope)?
I think Oskar means expectation value. So under Ian's model, the product
of expected-time-in-store * size is the same (k) for all files.
> > What units we deal with doesn't really matter since you are only looking
> > at ratios anyways.
>
> No, the units don't matter for computations, but they are important when
> you try to assess the *meaning* of the math. For example, I'm pretty
> skeptical of a quantity measured in byte-seconds, and don't see any a
> priori reason why it deserves a principle of least action... if you'll
> excuse the metaphor :)
What's intrinsically wrong with byte-seconds? It has quite a clear meaning
-- it is a volume in spacetime. Over the lifetime of a node, its datastore
has a certain total volume, so this is just a packing problem: how do we
allocate space in this volume so that the total value stored is maximized?
The fact that the units may look strange is irrelevant -- you might as well
ask why anyone would care about the quantity kilogram-meters^2/seconds^2?
Or measure electrical quantities in centimeters?
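(Concretely, with made-up numbers: a 1 MB file kept for one hour occupies
3600 MB-seconds of datastore spacetime, exactly the same volume as a 100 kB
file kept for ten hours. Under the E[T_i]*size_i = k model those two are
interchangeable as far as the store is concerned.)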
theo
--__--__--
Message: 9
From: Theodore Hong <[email protected]>
Subject: Re: [freenet-devl] Suggestion for solving the small vs. large file
debate
To: devl at freenetproject.org
Date: Sun, 28 Jan 2001 16:37:03 +0000 (GMT)
Reply-To: devl at freenetproject.org
Ian Clarke <lists at octayne.com> wrote:
> On Sun, Jan 28, 2001 at 01:11:00AM -0800, Dev Random wrote:
> > Yeah, I have proposed this awhile ago with each bucket storing files
> > twice as large as the previous bucket. The algorithm for retiring
> > files was to retire based on age in seconds multiplied by the size of
> > the file. The file with the largest such "weight" would be retired if
> > the datastore was out of space.
>
> But why seconds, and what units would you choose for the size of the
> file? This requires making an arbitrary decision, and arbitrary
> decisions are bad ("hey, let's have a memory limit of 640k, nobody will
> ever need more than that!").
Not at all. As Oskar said, the units don't matter, since you are only
interested in relative weights.
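For what it's worth, here is a sketch of the rule Dev described (names are
hypothetical; note that switching seconds to minutes or bytes to kB only
rescales every weight by the same factor, so the ordering is unchanged):

    import java.util.List;

    // Retirement by largest weight = ageSeconds * sizeBytes.
    final class StoredFile {
        final String key;
        final long sizeBytes;
        final long storedAtMillis;
        StoredFile(String key, long sizeBytes, long storedAtMillis) {
            this.key = key;
            this.sizeBytes = sizeBytes;
            this.storedAtMillis = storedAtMillis;
        }
        double weight(long nowMillis) {
            double ageSeconds = (nowMillis - storedAtMillis) / 1000.0;
            return ageSeconds * sizeBytes;
        }
    }

    final class Retirement {
        // Pick the file to drop when the datastore is out of space.
        static StoredFile pickVictim(List<StoredFile> store, long nowMillis) {
            StoredFile victim = null;
            for (StoredFile f : store)
                if (victim == null || f.weight(nowMillis) > victim.weight(nowMillis))
                    victim = f;
            return victim;
        }
    }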
theo
--__--__--
Message: 10
From: Theodore Hong <[email protected]>
Subject: Re: [freenet-devl] Proposal: algorithm for forgetting documents in
datastore
To: devl at freenetproject.org
Date: Sun, 28 Jan 2001 16:38:49 +0000 (GMT)
Reply-To: devl at freenetproject.org
Tavin Cole <tavin at mailandnews.com> wrote:
> It's clear to me now that the only sane mechanism for deciding what files
> to keep in the document store is one that consistently preserves the
> files that receive the greatest number of requests. So let's define P as
> the measure of a file's request history. Simplistically,
>
> P = total no. of requests for this file / time since file was stored
>
> In practice, though, I think we'd want to limit the past time interval to
> what is "recent". Also, the units in the denominator should not be too
> granular, to keep P from blowing out of proportion for brand new files.
> Something in the neighborhood of hours or minutes, probably.
Limiting the past time interval would require picking an arbitrary cutoff.
The best way is to apply a decay factor to old requests, something like:
P = #requests in the past hour
+ (#requests in the previous 2 hrs)/2
+ (#requests in the previous 4 hrs)/4
+ (#requests in the previous 8 hrs)/8
+ ...
(I'm sure there's a nicer way to express this as a continuous function
using an integral, but I can't think of it at the moment. Anyway, you
get the idea.)
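(One continuous way to get that effect without any cutoff is an
exponentially decaying counter -- a sketch, with a half-life picked out of
the air and names that are purely hypothetical:)

    // Per-file request counter whose old hits decay away exponentially.
    final class DecayingRequestCounter {
        private static final double HALF_LIFE_MS = 3600000.0; // 1 hour, arbitrary
        private double value = 0.0;     // the popularity estimate P
        private long lastUpdateMillis;

        DecayingRequestCounter(long nowMillis) { lastUpdateMillis = nowMillis; }

        private void decayTo(long nowMillis) {
            double dt = nowMillis - lastUpdateMillis;
            value *= Math.pow(0.5, dt / HALF_LIFE_MS); // halve once per half-life
            lastUpdateMillis = nowMillis;
        }

        /** Call on every request for the file. */
        void recordRequest(long nowMillis) {
            decayTo(nowMillis);
            value += 1.0;
        }

        /** Current value of P for this file. */
        double current(long nowMillis) {
            decayTo(nowMillis);
            return value;
        }
    }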
> When a new file is added to the store, as a result of either a request or
> an insert, it could be allocated 1 request in the above equation. When
> the store is full, you basically want to discard the entries with lowest
> P.
>
> This is trivial if all files are the same size. The question is, if a
> file is going to displace >1 smaller files, do we compare its P to the
> summed P of the smaller files? Do we compare its P to some function of
> the P of the smaller files, where f(P1..n) < P1 + .. + Pn ? Or do we
> ignore the size difference and just look at the highest P of the smaller
> files?
I think the most straightforward way is simply to compare the one file's P
to the sum of all the P's it would displace. This maximizes the number of
successful requests. For example, one 100k file getting 11 requests/day
should be ranked higher than ten 10k files getting 1 request/day each.
The difficulty is deciding how to assess P for a new file. If it's too
low, new files may never make it in. If it's too high, we just have the
current situation (new files always have highest priority).
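A sketch of that comparison (hypothetical types; the P values would come
from whatever request-history estimate we settle on):

    import java.util.Comparator;
    import java.util.List;

    // Accept an incoming file only if its P beats the summed P of the
    // lowest-P files it would have to displace to make room.
    final class Displacement {
        // entries are {P, sizeBytes} pairs; the list is sorted in place.
        static boolean shouldAccept(double newP, long newSize,
                                    List<double[]> entries, long freeSpace) {
            entries.sort(Comparator.comparingDouble(e -> e[0]));
            double displacedP = 0.0;
            long freed = freeSpace;
            for (double[] e : entries) {
                if (freed >= newSize) break;
                displacedP += e[0];
                freed += (long) e[1];
            }
            return freed >= newSize && newP > displacedP;
        }
    }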
theo
--__--__--
Message: 11
Date: Sun, 28 Jan 2001 11:59:47 -0500
To: devl at freenetproject.org
Subject: Re: [freenet-devl] Proposal: algorithm for forgetting documents in
datastore
From: "Scott G. Miller" <[email protected]>
Reply-To: devl at freenetproject.org
On Sun, Jan 28, 2001 at 04:12:20PM +0100, Oskar Sandberg wrote:
> On Sat, Jan 27, 2001 at 11:08:46PM -0500, Scott G. Miller wrote:
> > > > the bizarre situation where all the insert nodes pass on the request and
> > > > return success, even though none of them store it. Maybe nodes should
> > > > return failure instead of passing through an insert they can't cache...
> > >
> > > I haven't really been able to make up my mind on this. Should I reply
> > > with a RequestFailed on Inserts for data larger than whatever the nodes
> > > limit is?
> > No, because won't the downstream node try its next choice? If it does,
> > then there's no way the request will follow that same path.
>
> Well, if the routing is that fragile then we are fucked all the same. The
> intention would be that the routes through the node in question and the
> route not through it would converge...
Well yeah, that's of course what we'd hope. But it means that large files
are likely to be less reliably found.
--__--__--
_______________________________________________
Devl mailing list
Devl at freenetproject.org
http://www.uprizer.com/mailman/listinfo/devl
End of Devl Digest