On Tue, 06 Nov 2001, toad wrote:

> On Tue, Nov 06, 2001 at 12:01:43AM +0000, toad wrote:
> > On Mon, Nov 05, 2001 at 07:01:36PM -0500, Scott Young wrote:
> > > On Monday 05 November 2001 05:28 pm, you wrote:
> > > > On Mon, 05 Nov 2001, Scott Young wrote:
> > > > I think this idea is just wack.  It's just not going to happen.
> > > >
> > > > Thelema
> > > 
> > > What's wack about it?

Sorry, I was just responding in proper freenet-devl fashion.  It's the way
we respond to any new idea.  I still think it's a bad idea, so I'll go into
more detail.
> > > I came up with a solution to the partial block problem that this has, and
> > > otherwise it uses splitfiles in a very efficient way.  What if someone
> > > inserts a 600 MB ISO of Linux Mandrake 8.0, then wants to insert version
> > > 8.1?
Mandrake ISOs are a bad example because there isn't anywhere near this much
overlap between releases, but I'll assume that there's some sort of huge
file out there which has a lot of overlap with some other file, and talk
about this hypothetical thing for the rest of this response.  (The more I
think about it, the less likely it seems that two files that large with
that much overlap actually exist.)

> > > If 80% of the segments of the first ISO match up to the second one, then a
> > > node could store BOTH of the 600 MB ISOs with about 720 MB of storage (not
> > > including FEC, which even at 50% would bring it down to a total of 1.3 GB
> > > instead of 1.8 GB).  If the 8.0 ISO was well-founded in freenet, then the
> > > 8.1 ISO would use this, and requests for it would also improve the
> > > overlapping blocks on the 8.0 ISO.
This is already what happens when you insert CHKs; when you insert two
chunks of data that are the same, you get a CHK collision, and you can
request that part of either file with the same key.
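Here's a minimal sketch of that collision behaviour, treating a CHK as just
a hash of the chunk's contents (the in-memory store, part size, and
function names are made up for illustration):

    import hashlib

    CHUNK = 256 * 1024  # assumed splitfile part size

    def chk(data):
        # stand-in for real CHK derivation: the key is a hash of the content
        return hashlib.sha1(data).hexdigest()

    store = {}

    def insert(data):
        key = chk(data)
        store[key] = data  # identical chunks map to the same key: stored once
        return key

    # a chunk shared by two ISOs inserts to the same key both times
    shared = b"\x00" * CHUNK
    assert insert(shared) == insert(shared)
    assert len(store) == 1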

> > > If this is implemented in a manner that allows use of a list of splitfiles,
> > > then a program in an ISO of Debian could use the same blocks for the
> > > program as an ISO of Mandrake and the same blocks as the program itself
> > > listed separately.  The overlapping blocks could be obtained, if
> > > unavailable, by using the redundant blocks of any of the sharing files to
> > > reconstruct those files and then calculate the blocks; their redundancy
> > > may be shared.
Sharing redundancy would happen exactly when content would be shared.
There's no problem with this, except that if you're going to try to force
CHK collisions by finding substring matches between two files with huge
overlap, most FEC mechanisms require all the blocks to be the same size, so
you're either wasting space (in the check blocks at least), or you're going
to get a different set of check blocks for the two sets of data blocks
(because order is important).
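A toy illustration of the order problem, using a made-up position-weighted
check block as a stand-in for a real FEC code like Reed-Solomon (this is
not Freenet's actual codec):

    def check_block(blocks):
        # toy "check block": XOR of each data block's bytes scaled by the
        # block's index, mimicking FEC coefficients that depend on position
        out = bytearray(len(blocks[0]))
        for i, blk in enumerate(blocks, start=1):
            for j, byte in enumerate(blk):
                out[j] ^= (byte * i) & 0xFF
        return bytes(out)

    one = [b"AAAA", b"BBBB", b"CCCC"]
    two = [b"BBBB", b"AAAA", b"CCCC"]  # same blocks, different order
    print(check_block(one) == check_block(two))  # False: order matters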

> > > The biggest problem is in calculating the checksum at every offset.  This
> > > could be alleviated by including rolling checksums, such as the one used
> > > in rsync.  The rolling checksum could also be used to quickly validate the
> > > contents of the downloaded file so the client can tell if it needs to
> > > download some redundant blocks.

You don't really need rolling checksums; you're going to have to have
access to both files locally in order to do this anyway.  The checksum that
you need to match isn't a rolling checksum but the SHA-1 hash used for
CHKs.
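A sketch of that local comparison, assuming fixed-size parts and using
SHA-1 directly in place of full CHK computation (the file names and part
size are just examples):

    import hashlib

    BLOCK = 256 * 1024  # assumed splitfile part size

    def block_hashes(path):
        # hash each fixed-size part; parts with equal hashes would insert
        # to the same CHK
        hashes = {}
        with open(path, "rb") as f:
            index = 0
            while chunk := f.read(BLOCK):
                hashes.setdefault(hashlib.sha1(chunk).hexdigest(), []).append(index)
                index += 1
        return hashes

    old = block_hashes("image-8.0.iso")
    new = block_hashes("image-8.1.iso")
    print(len(set(old) & set(new)), "parts already inserted with the old image")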

> > > 
> > > I got this idea when I was looking at the rsync algorithm and wondering how
> > > that type of functionality could benefit Freenet.  The rsync algorithm
> > > itself would only allow someone to, for example, synchronize their Mandrake
> > > 8.0 image with the newer 8.1 image and only have to transfer a little more
> > > than the changed blocks.  This functionality could also be used in freenet
> > > (although not as efficiently, because typical rsync block sizes range from
> > > 300 bytes to 8 kb and freenet now ranges from 256 kb to 1 mb).  Actually,
> > > if the ideas above are implemented and the user has previously downloaded
> > > the 8.0 image, the overlapping blocks would already have been downloaded.
> > > If it was downloaded off the web, then it could still do it in the classic
> > > rsync-like way (download the checksums, see what's the same, and download
> > > the differences).

I agree rsync is an awesome program.  I just don't see any use for the
concepts behind it in a freenet context: rsync is for when you don't have
both copies of the file in one place, and an rsync process runs on each
copy.  In freenet, you can't run an rsync process on the copy that's in
freenet, so you have to have both files locally, and then you can use more
traditional techniques to find similarities.
> > > 
> > > Also, next time try and point out why you think ideas won't work.  It makes
> > > your argument stronger and allows the ideas to improve (especially if they
> > > had some significant bad points in the first place).
> > Because you'd have to compute the checksum of every possible range. Sure,
> > rolling checksums help greatly, but you still end up with O(n^2) execution.
Rolling checksums aren't useful here because you need to match CHKs, and
there's no way we're dumbing down CHKs to the level needed to do a rolling
checksum.  The only way to take advantage of existing data in freenet would
be to calculate the CHK (decently expensive) of every section of the file
that's the target size long (offset 0, offset 1, offset 2, etc.).  That is
thoroughly infeasible for 600 MB files.
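A back-of-envelope calculation, assuming a 256 kb part size:

    file_size = 600 * 1024 * 1024      # 600 MB ISO
    block     = 256 * 1024             # assumed CHK part size
    windows   = file_size - block + 1  # one SHA-1 per byte offset
    print(windows)                     # ~629 million hash computations
    print(windows * block / 1e12)      # ~165 TB fed through SHA-1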

> Also, if these are ISO images, you could insert the splitfile with a 2 kb
> granularity? Of course, that means 600*500 = 300,000 files :) More realistic
> would be to insert the individual files as CHKs... since files will normally
> be packed on a CD-ROM, you can use a custom insert client that inserts like
> this:
> CHK 1 per-CD header info, directory, etc
>     2 FILE1.DEB
>     3 short buffer to get to the next file
>     4 FILE2.DEB
>     5 another short buffer
>     6 FILE3.DEB
> 
300,000 files is no good.  Even with redundancy, freenet would take forever
to download all those.  The average request (excluding transfer time) takes
10 seconds to complete.  Even if you saturated all 50 connections your node
could handle, it'd still take 60,000 seconds to download all 300,000
pieces, nearly 17 hours.
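The arithmetic behind that estimate, using the figures above:

    pieces      = 300_000  # 600 MB at 2 kb granularity
    latency     = 10       # seconds per request, excluding transfer time
    connections = 50       # simultaneous requests a node can handle
    seconds = pieces * latency / connections
    print(seconds, seconds / 3600)  # 60000.0 seconds, about 16.7 hours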

> Etc. The individual DEB packages would match up with anywhere else they have
> been inserted (for example as part of apt-get-over-freenet), giving better
> efficiency, and would also match up with later/earlier/other distributions'
> images. And it would all work with current clients (which allow variable
> part sizes), with non-redundant splitting. The buffers would all be less
> than 2k, so unlikely to go missing. One problem remains: if the packages
> are large, they may require redundant splitfile insertion themselves. So,
> can we have a non-redundant splitfile containing segments which are
> redundant splitfiles? Is this a good idea?
I don't agree that the small pieces are unlikely to go missing, but
anyway...
It's perfectly fine to have a non-redundant splitfile containing segments
which are redundant splitfiles.  It'd probably be better to insert all the
.deb files and then have a recipe file that contains all the "short
buffers" and enough other info to piece together the original ISO given all
the .debs.
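A sketch of what such a recipe might contain; the format and field names
here are invented for illustration, not an existing Freenet metadata
format:

    # one entry per segment of the reconstructed ISO, in order
    recipe = [
        {"type": "chk",    "key": "CHK@<hash of file1.deb>", "length": 1482304},
        {"type": "inline", "data": b"\x00" * 1024},  # short buffer kept in the recipe
        {"type": "chk",    "key": "CHK@<hash of file2.deb>", "length": 2097152},
        {"type": "inline", "data": b"\x00" * 512},
    ]

    def rebuild_iso(out, fetch):
        # fetch(key) stands in for retrieving a CHK's contents from freenet
        for seg in recipe:
            out.write(fetch(seg["key"]) if seg["type"] == "chk" else seg["data"])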
> > > -Scott Young

Thelema
-- 
E-mail: thelema314@bigfoot.com          If you love something, set it free.
GPG 1536g/B9C5D1F7 fpr:075A A3F7 F70B 1397 345D  A67E 70AA 820B A806 F95D