David -- Can you give a quick, layman's-terms overview of how you do the Rabin fingerprinting? I'm not familiar with the technique. Do you establish some "goal" (like "break this file up into about n roughly-equal pieces") and somehow it comes up with an optimal separator?
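As background for the question above: Rabin fingerprints are commonly used for content-defined chunking. A hash is rolled over a small sliding window, and a chunk boundary is declared wherever the low bits of that hash are all zero. The "goal" is therefore only a target average chunk size, set by how many bits must match, not a fixed number of pieces. The sketch below is illustrative only and is not taken from SET. It swaps in a simple polynomial rolling hash for a true Rabin fingerprint over GF(2), and every constant and name in it (WINDOW, BASE, MOD, MASK, chunk_boundaries, chunk_digests) is an assumption made for the example.

```python
import hashlib

WINDOW = 48            # bytes in the sliding window (assumed value)
BASE = 257             # polynomial base for the rolling hash (assumed value)
MOD = (1 << 61) - 1    # large prime modulus (assumed value)
MASK = (1 << 13) - 1   # cut when the low 13 bits are zero: roughly 8 KiB average


def chunk_boundaries(data: bytes, min_size: int = 2048, max_size: int = 65536):
    """Yield (start, end) offsets of content-defined chunks of `data`."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    start = 0
    h = 0
    for i, b in enumerate(data):
        pos = i - start  # bytes already consumed in the current chunk
        if pos >= WINDOW:
            # Window is full: drop the oldest byte before adding the new one.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + b) % MOD
        size = pos + 1
        # The boundary depends only on the last WINDOW bytes of content,
        # so it moves along with the data when bytes are inserted earlier.
        if (size >= min_size and (h & MASK) == 0) or size >= max_size:
            yield (start, i + 1)
            start = i + 1
            h = 0
    if start < len(data):
        yield (start, len(data))  # final partial chunk


def chunk_digests(data: bytes):
    """Return the SHA-1 hex digest of each content-defined chunk."""
    return [hashlib.sha1(data[s:e]).hexdigest()
            for s, e in chunk_boundaries(data)]
```

Because boundaries are chosen by content rather than by offset, inserting a few bytes near the start of a file usually changes only the chunk containing the edit. Later chunks keep the same digests, which is what lets two similar files be recognized as sharing most of their data.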
Also, on the topic of similarity, have you determined why 15% of movie translations are identical? At best I'd imagine the portions that have no dialog could be the same (and maybe that accounts for 15% right there), but due to audio/video interleaving the rest would seem to have very low similarity.

-david

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:p2p-hackers-
> [EMAIL PROTECTED] On Behalf Of David Andersen
> Sent: Saturday, April 14, 2007 12:14 AM
> To: [EMAIL PROTECTED]; theory and practice of decentralized computer
> networks
> Subject: Re: [p2p-hackers] Computer scientists develop P2P system
> that promises faster music, movie downloads
>
> On Apr 14, 2007, at 2:29 AM, Serguei Osokine wrote:
>
> > On Friday, April 13, 2007 David Andersen wrote:
> >> I'm not on the list long-term, but will hang out for a sec if there
> >> are questions.
> >
> > Actually, there are. Thank you for visiting!
> >
> > The biggest issue that I've seen mentioned so far seems to
> > be this one: how much of the MP3 similarity is due to the different
> > metadata in the otherwise identical files?
>
> We haven't examined it in enough detail to answer this
> quantitatively, but almost all of the differences appear in the first
> or last ~16KB chunk of data. The ones we've manually inspected have
> been ID3 tag differences at the end or the beginning.
>
> > Did you try comparing your approach to the one where the MP3 file
> > hash is calculated without the metadata, on the data block alone? If
> > the files with the same data block and different metadata would be
> > considered by a P2P system to be the same file (as they really should
> > be - just like the identical files with different names), what will
> > happen to the transfer speed performance improvement numbers quoted
> > in the article?
>
> Let me translate your question a bit:
>
> If the transfer system just ignored MP3 metadata, would it get most
> of the benefits that we found for MP3s?
>
> I'd wager it would. Based upon what we saw with some of the videos
> being changed at weird places in the middle of the file, there are
> probably a smaller %age of MP3s out there with weird "mutations", but
> most of the differences are probably just in the metadata.
>
> (But it wouldn't do it with software, or video, or documents. I find
> myself leaning towards techniques that are general across any file
> type, as SET is, instead of tweaks for a particular media type, but
> that may just be me.)
>
> The patterns of similarity in those other file types are extremely
> different. Video files were primarily language differences *or*
> changes in the middle of the file that we haven't explained yet.
> Software is big chunks of unchanged code between versions. Documents
> are fairly obvious - small, localized changes (with no particular
> pattern to where in the document) between versions.
>
> > Sorry if I missed something in the article and that is exactly
> > how you came up with these speed improvement numbers to begin with.
>
> No, you didn't miss anything in the article - the one that was
> forwarded here didn't contain a lot of detail. The original paper
> has more of the details.
>
> > But what I'm getting to, I'm trying to figure out how much of that
> > speed increase can be gained 'easily' - without any complicated code
> > changes, just by switching to the proper hashing technique.
>
> Probably most of it for MP3s, almost none of it for other file types.
>
> -Dave

_______________________________________________
p2p-hackers mailing list
[EMAIL PROTECTED]
http://lists.zooko.com/mailman/listinfo/p2p-hackers
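
The "hash the data block alone" idea raised in the quoted thread can be sketched roughly as follows. This is a minimal illustration, not what SET or any particular P2P client does. It assumes the only metadata is an ID3v2 tag at the start of the file and/or an ID3v1 tag in the last 128 bytes, and it ignores other tag formats (APEv2, Lyrics3, ID3v2 tags appended at the end). The function names audio_payload and content_hash are made up for the example.

```python
import hashlib


def audio_payload(data: bytes) -> bytes:
    """Return `data` without a leading ID3v2 tag or a trailing ID3v1 tag."""
    start, end = 0, len(data)

    # ID3v2 header: "ID3", 2 version bytes, 1 flags byte, 4 size bytes.
    # The size is "synchsafe": only the low 7 bits of each byte are used.
    if end >= 10 and data[:3] == b"ID3":
        size = 0
        for byte in data[6:10]:
            size = (size << 7) | (byte & 0x7F)
        start = 10 + size
        if data[5] & 0x10:      # ID3v2.4 footer flag: 10 more bytes follow
            start += 10
        start = min(start, end)  # guard against a truncated file

    # ID3v1 tag: exactly the last 128 bytes, beginning with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128

    return data[start:end]


def content_hash(data: bytes) -> str:
    """SHA-1 of the audio payload only, so retagged copies hash the same."""
    return hashlib.sha1(audio_payload(data)).hexdigest()
```

Two copies of the same audio with different tags then produce the same content_hash, so a transfer system could treat them as one file. As the quoted reply notes, this only helps for MP3s. It does nothing for video, software, or documents, where the differences are not confined to a metadata block.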
