David -- Can you give a quick, layman's-terms overview of how you do the
Rabin fingerprinting?  I'm not familiar with the technique.  Do you
establish some "goal" (like "break this file up into about n roughly-equal
pieces") and somehow it comes up with an optimal separator?

Also, on the topic of similarity, have you determined why 15% of movie
translations are identical?  At best I'd imagine the portions that have no
dialog could be the same (and maybe that accounts for 15% right there), but
due to audio/video interleaving the rest would seem to have very low
similarity.

-david

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:p2p-hackers-
> [EMAIL PROTECTED] On Behalf Of David Andersen
> Sent: Saturday, April 14, 2007 12:14 AM
> To: [EMAIL PROTECTED]; theory and practice of decentralized computer
> networks
> Subject: Re: [p2p-hackers] Computer scientists develop P2P system
> that promises faster music, movie downloads
> 
> On Apr 14, 2007, at 2:29 AM, Serguei Osokine wrote:
> 
> > On Friday, April 13, 2007 David Andersen wrote:
> >> I'm not on the list long-term, but will hang out for a sec if there
> >> are questions.
> >
> >     Actually, there are. Thank you for visiting!
> >
> >     The biggest issue that I've seen mentioned so far seems to
> > be this one: how much of the MP3 similarity is due to the different
> > metadata in the otherwise identical files?
> 
> We haven't examined it in enough detail to answer this
> quantitatively, but almost all of the differences appear in the first
> or last ~16KB chunk of data.  The ones we've manually inspected have
> been ID3 tag differences at the end or the beginning.
> 
> >     Did you try comparing your approach to the one where the MP3 file
> > hash is calculated without the metadata, on the data block alone? If
> > the files with the same data block and different metadata would be
> > considered by a P2P system to be the same file (as they really should
> > be - just like the identical files with different names), what will
> > happen to the transfer speed performance improvement numbers quoted
> > in the article?
> 
> Let me translate your question a bit:
> 
> If the transfer system just ignored MP3 metadata, would it get most
> of the benefits that we found for MP3s?
> 
> I'd wager it would.  Based upon what we saw with some of the videos
> being changed at weird places in the middle of the file, there is
> probably a smaller percentage of MP3s out there with weird "mutations",
> but most of the differences are probably just in the metadata.
> 
> (But ignoring metadata wouldn't help with software, or video, or
> documents.  I find myself leaning towards techniques that are general
> across any file type, as SET is, instead of tweaks for a particular
> media type, but that may just be me.)
> 
> The patterns of similarity in those other file types are extremely
> different.  Video files were primarily language differences *or*
> changes in the middle of the file that we haven't explained yet.
> Software is big chunks of unchanged code between versions.  Documents
> are fairly obvious - small, localized changes (with no particular
> pattern to where in the document) between versions.
> 
> >     Sorry if I missed something in the article and that is exactly
> > how you came up with these speed improvement numbers to begin with.
> 
> No, you didn't miss anything in the article - the one that was
> forwarded here didn't contain a lot of detail.  The original paper
> has more of the details.
> 
> > But what I'm getting to, I'm trying to figure out how much of that
> > speed increase can be gained 'easily' - without any complicated code
> > changes, just by switching to the proper hashing technique.
> 
> Probably most of it for MP3s, almost none of it for other file types.
> 
>    -Dave
> 
> 
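For anyone following the thread, here is a rough sketch of the "hash the
data block alone" idea Serguei raises above.  It is not SET and not code
from the paper; the names strip_id3 and audio_hash are made up for
illustration, and it assumes the only metadata is an ID3v2 tag at the
start of the file and/or an ID3v1 tag in the last 128 bytes.  Other tag
formats (APE tags, for example) are not handled.

```python
import hashlib


def strip_id3(data: bytes) -> bytes:
    """Return data without a leading ID3v2 tag or a trailing ID3v1 tag.

    Hypothetical helper: only these two tag layouts are recognized.
    """
    start, end = 0, len(data)

    # ID3v2 header is 10 bytes: "ID3", version (2), flags (1), size (4).
    # The size is "syncsafe": 7 useful bits per byte, header excluded.
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for b in data[6:10]:
            size = (size << 7) | (b & 0x7F)
        start = 10 + size
        if data[5] & 0x10:  # "footer present" flag adds another 10 bytes
            start += 10

    # ID3v1 is a fixed 128-byte block at the very end, starting with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128

    # Guard against a malformed size field pointing past the end of file.
    return data[min(start, end):end]


def audio_hash(data: bytes) -> str:
    """SHA-1 over the payload only, so tag edits do not change the hash."""
    return hashlib.sha1(strip_id3(data)).hexdigest()
```

Two copies of a file that differ only in these tags get the same
audio_hash, which is the "easy" case discussed above.  It does nothing
for the mid-file changes seen in video, or for software and documents.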


