On Mon, Nov 05, 2001 at 07:31:26PM -0600, thelema wrote:
> On Tue, 06 Nov 2001, toad wrote:
>
> > On Tue, Nov 06, 2001 at 12:01:43AM +0000, toad wrote:
> > > On Mon, Nov 05, 2001 at 07:01:36PM -0500, Scott Young wrote:
> > > > On Monday 05 November 2001 05:28 pm, you wrote:
> > > > > On Mon, 05 Nov 2001, Scott Young wrote:
> > > > > I think this idea is just wack. It's just not going to happen.
> > > > >
> > > > > Thelema
> > > >
> > > > What's wack about it? I came up with a solution to the partial block problem
>
> Sorry, I was just responding in proper freenet-devl fashion; it's the way we respond to any new idea. I still think it's a bad idea, so I'll go into more detail.
>
> > > > that this has, and otherwise it uses splitfiles in a very efficient way. What if someone inserts a 600 MB ISO of Linux Mandrake 8.0, then wants to insert version 8.1? If 80% of the segments of the first ISO match up to the
>
> Mandrake ISOs are a bad example because there isn't anywhere near this much overlap, but I'll assume that there's some sort of huge file out there which has a lot of overlap with some other file, and talk about this hypothetical thing for the rest of this response. (The more I think about it, the less likely it seems that two files that large would have such huge overlap.)
>
> > > > second one, then a node could store BOTH of the 600 MB ISOs with less than 700 MB of storage (not including FEC, which even at 50% would bring it down to a total of 1.3 GB instead of 1.8 GB). If the 8.0 ISO was well-founded in freenet, then the 8.1 ISO would use this, and requests for it would also improve the overlapping blocks of the 8.0 ISO. If this is implemented in a
>
> This is already what happens when you insert CHKs; when you insert two chunks of data that are the same, you get a CHK collision, and you can request that part of either file with the same key.
>
> > > > manner that allows use of a list of splitfiles, then a program in an ISO of Debian could use the same blocks as an ISO of Mandrake, and the same blocks as the program itself listed separately. The overlapping blocks could be obtained, if unavailable, by using the redundant blocks of any of the sharing files to reconstruct those files and then calculate the blocks; their redundancy may be shared. The biggest problem is in calling
>
> Sharing redundancy would happen exactly when content would be shared. There's no problem with this, except that if you're going to try to force CHK collisions by finding substring matches between two files with huge overlap, most FEC mechanisms require all the blocks to be the same size, so you're either wasting space (in the check blocks at least), or you're going to get a different set of check blocks for the two sets of data blocks (because order is important).
>
> > > > the checksum at every offset. This could be alleviated by including rolling checksums, such as the one used in rsync. The rolling checksum could also be used to quickly validate the contents of the downloaded file so the client can tell if it needs to download some redundant blocks.
>
> You don't really need rolling checksums; you're going to have to have access to both files locally in order to do this. The checksum that you need to match isn't a rolling checksum but the SHA-1 hash used for CHKs.
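For reference, the "rolling" trick rsync uses looks roughly like this. It is a quick Python sketch for illustration, not rsync's or Freenet's actual code, and the 16-bit modulus is just the conventional choice; the point is that the checksum can be slid forward one byte at a time in O(1), which is impossible for a SHA-1-based CHK.

    # Sketch of an rsync-style weak rolling checksum (illustration only).
    # This is NOT the SHA-1 hash that CHKs are built on; it only shows why
    # "a checksum at every offset" is cheap for rsync but not for CHKs.

    M = 1 << 16

    def weak_checksum(block):
        """Two running sums over an initial window of bytes."""
        a = sum(block) % M
        b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
        return a, b

    def roll(a, b, out_byte, in_byte, window_len):
        """Slide the window one byte: drop out_byte, append in_byte, in O(1)."""
        a = (a - out_byte + in_byte) % M
        b = (b - window_len * out_byte + a) % M
        return a, b

    # Sanity check: rolling [10, 20, 30] forward gives the checksum of [20, 30, 40].
    a, b = weak_checksum(bytes([10, 20, 30]))
    assert roll(a, b, out_byte=10, in_byte=40, window_len=3) == weak_checksum(bytes([20, 30, 40]))

There is no comparable update rule for SHA-1, so matching arbitrary offsets against existing CHKs really does mean rehashing a full block-sized window at every position.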
> > > > I got this idea when I was looking at the rsync algorithm and wondering how that type of functionality could benefit Freenet. The rsync algorithm itself would only allow someone to, for example, synchronize their Mandrake 8.0 image with the newer 8.1 image and only have to transfer a little more than the changed blocks. This functionality could also be used in freenet (although not as efficiently, because typical rsync block sizes range from 300 bytes to 8 kb, and freenet now ranges from 256 kb to 1 mb). Actually, if the ideas above are implemented and the user has previously downloaded the 8.0 image, the overlapping blocks would already have been downloaded. If it was downloaded off the web, then it could still be done in the classic rsync-like way (download the checksums, see what's the same, and download the differences).
>
> I agree rsync is an awesome program. I don't see any use for the concepts behind it in a freenet context, as rsync is for when you don't have both copies of the file, and there's an rsync process running on each copy of the file. In freenet, you can't have an rsync process running on the copy that's in freenet, so you have to have it locally, and then you can use more traditional techniques to find similarities.
>
> > > > Also, next time try and point out why you think ideas won't work. It makes your argument stronger and allows the ideas to improve (especially if they had some significant bad points in the first place).
>
> > > Because you'd have to compute the checksum of every possible range. Sure, rolling checksums help greatly, but you still end up with O(n^2) execution.
>
> Rolling checksums aren't useful because you need to match CHKs, and there's no way we're dumbing down CHKs to the level needed to do a rolling checksum. The only way to take advantage of existing data in freenet would be to calculate the CHK (decently expensive) of every section of the file that's the target size long (offset 0, offset 1, offset 2, etc.). That is simply not feasible for 600 MB files.
>
> > Also, if these are ISO images, you could insert the splitfile with a 2 kb granularity? Of course, that means 600*500 = 300,000 files :) More realistic would be to insert the individual files as CHKs... since files will normally be packed on a CD-ROM, you can use a custom insert client that inserts like this:
> >
> >   CHK 1  per-CD header info, directory, etc.
> >       2  FILE1.DEB
> >       3  short buffer to get to the next file
> >       4  FILE2.DEB
> >       5  another short buffer
> >       6  FILE3.DEB
>
> 300,000 files is no good. Even with redundancy, freenet would take forever to download all of those. The average request (excluding transfer time) takes 10 seconds to complete. Even if you saturated all 50 connections your node could handle, it'd still take 60,000 seconds to download all 300,000 pieces, i.e. about 16 hours.
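Spelling out that arithmetic (a throwaway Python calculation, using the request-time and connection figures from the paragraph above, plus the current 256 kb block size mentioned earlier, for comparison):

    # Back-of-the-envelope fetch time for a 600 MB ISO, using the figures
    # quoted above: ~500 2 kb blocks per MB, 10 s per request (excluding
    # transfer), 50 concurrent connections.
    blocks_2kb   = 600 * 500          # = 300,000 pieces at 2 kb granularity
    request_time = 10                 # seconds per request
    connections  = 50

    seconds = blocks_2kb * request_time / connections
    print(seconds / 3600)             # ~16.7 hours

    # For comparison, at the 256 kb block size splitfiles use now:
    blocks_256kb = 600 * 1024 // 256  # = 2,400 pieces
    print(blocks_256kb * request_time / connections / 3600)   # ~0.13 hours (~8 minutes)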
> > Etc. The individual DEB packages would match up with anywhere else they have been inserted (for example as part of apt-get-over-freenet), giving better efficiency, and would also match up with later/earlier/other distributions' images. And it would all work with current clients (which allow variable part sizes), with non-redundant splitting. The buffers would all be less than 2k, so unlikely to go missing. One problem remains: if the packages are large, they may require redundant splitfile insertion themselves. So, can we have a non-redundant splitfile containing segments which are redundant splitfiles? Is this a good idea?
>
> I don't agree that the small pieces are unlikely to go missing, but anyway... It's perfectly fine to have a non-redundant splitfile containing segments which are redundant splitfiles. It'd probably be better to insert all the .deb files and then have a recipe file that contains all the "short buffers" and enough other info to piece together the original ISO given all the .debs.

How does this differ from the short buffers thing? The missing data in the short buffers is all NULLs, or can be changed to all NULLs with no damage. We don't need to insert separate short buffers for each inter-file gap; we can just use blocks of 1, 2, 4, 8, 16, 32, 64, 128, 256, 512 and 1024 null chars.

Incidentally, this also allows us to insert MP3s efficiently: ID3 tags are in a header, not in the main stream, so have a client split the file up into headers and the main stream, and we avoid duplication caused by different ID3 tags on the same track, while not requiring a custom client to retrieve them. Probably many other examples could benefit from such a scheme.
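Concretely, the gap handling would be something like this (a rough sketch only; since the gaps are all under 2k, eleven pre-inserted all-null blocks cover every possible gap, and the recipe just lists which of them to drop in between the per-file CHKs):

    # Sketch: express an inter-file gap as a sum of power-of-two runs of null
    # bytes, so a recipe can reference a fixed set of eleven pre-inserted
    # all-null blocks (1..1024 bytes) instead of inserting a fresh "short
    # buffer" for every gap.
    NULL_BLOCK_SIZES = [1024, 512, 256, 128, 64, 32, 16, 8, 4, 2, 1]

    def gap_as_null_blocks(gap_len):
        """Binary decomposition of gap_len into the fixed null-block sizes."""
        assert 0 <= gap_len < 2048
        pieces = []
        for size in NULL_BLOCK_SIZES:
            if gap_len >= size:
                pieces.append(size)
                gap_len -= size
        return pieces

    # e.g. a 1300-byte gap -> [1024, 256, 16, 4]; the recipe would list the CHK
    # of FILE1.DEB, then these four null-block CHKs, then the CHK of FILE2.DEB.
    print(gap_as_null_blocks(1300))

The same recipe idea covers the MP3 case: one entry for the ID3 header block and one for the shared audio stream.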
> > > > -Scott Young
>
> Thelema
> --
> E-mail: thelema314 at bigfoot.com          If you love something, set it free.
> GPG 1536g/B9C5D1F7 fpr:075A A3F7 F70B 1397 345D A67E 70AA 820B A806 F95D
-- 
The road to Tycho is paved with good intentions

_______________________________________________
Devl mailing list
Devl at freenetproject.org
http://lists.freenetproject.org/mailman/listinfo/devl
