On Apr 14, 2007, at 2:29 AM, Serguei Osokine wrote:

On Friday, April 13, 2007 David Andersen wrote:
I'm not on the list long-term, but will hang out for a sec if there
are questions.

        Actually, there are. Thank you for visiting!

        The biggest issue that I've seen mentioned so far seems to
be this one: how much of the MP3 similarity is due to the different
metadata in the otherwise identical files?

We haven't examined it in enough detail to answer this quantitatively, but almost all of the differences appear in the first or last ~16KB chunk of data. The ones we've manually inspected have been ID3 tag differences at the end or the beginning.
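
As a rough illustration of the observation above, here is a minimal sketch that hashes two files in fixed-size chunks and reports which chunk indices differ. The 16 KB chunk size mirrors the "~16KB" figure mentioned; the use of SHA-1 and the function names `chunk_hashes` and `differing_chunks` are assumptions made for this example, not details taken from SET. Fixed-offset chunking only lines up when the differing regions have the same length (for example, a trailing 128-byte ID3v1 tag); a leading tag of a different length would shift every later chunk.

```python
import hashlib

# Assumption: fixed 16 KB chunks, chosen to mirror the "~16KB" figure above.
CHUNK_SIZE = 16 * 1024


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Return the SHA-1 hex digest of each fixed-size chunk of data."""
    return [
        hashlib.sha1(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def differing_chunks(a: bytes, b: bytes, chunk_size: int = CHUNK_SIZE) -> list[int]:
    """Return the indices of chunks that differ between a and b.

    A chunk index that exists in only one of the inputs counts as differing.
    """
    ha = chunk_hashes(a, chunk_size)
    hb = chunk_hashes(b, chunk_size)
    longest = max(len(ha), len(hb))
    return [
        i for i in range(longest)
        if i >= len(ha) or i >= len(hb) or ha[i] != hb[i]
    ]
```

With two inputs that share the same audio bytes and differ only in a trailing tag, only the final chunk index should be reported.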

        Did you try comparing your approach to the one where the MP3 file
hash is calculated without the metadata, on the data block alone? If
the files with the same data block and different metadata would be
considered by a P2P system to be the same file (as they really should
be - just like the identical files with different names), what will
happen to the transfer speed performance improvement numbers quoted
in the article?

Let me translate your question a bit:

If the transfer system just ignored MP3 metadata, would it get most of the benefits that we found for MP3s?

I'd wager it would. Given that some of the videos we saw were changed at odd places in the middle of the file, there is probably a smaller percentage of MP3s out there with similar "mutations", but most of the differences are probably just in the metadata.
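
To make the "ignore the metadata" idea concrete, here is a minimal sketch of hashing only the audio payload of an MP3. It assumes the only metadata present is a leading ID3v2 tag and/or a trailing 128-byte ID3v1 tag; other tag formats (APEv2, Lyrics3) are not handled, and a payload whose last 128 bytes happen to begin with "TAG" would be misread. The function names `strip_id3` and `audio_hash` and the choice of SHA-1 are hypothetical, not taken from any existing P2P client.

```python
import hashlib


def strip_id3(data: bytes) -> bytes:
    """Return data without a leading ID3v2 tag or a trailing ID3v1 tag."""
    start, end = 0, len(data)

    # ID3v2 header: "ID3", 2 version bytes, 1 flags byte, 4-byte synchsafe size
    # (7 significant bits per byte). The size excludes the 10-byte header.
    if end >= 10 and data[:3] == b"ID3":
        size = (
            ((data[6] & 0x7F) << 21)
            | ((data[7] & 0x7F) << 14)
            | ((data[8] & 0x7F) << 7)
            | (data[9] & 0x7F)
        )
        start = 10 + size
        if data[5] & 0x10:  # footer-present flag adds another 10 bytes
            start += 10
        start = min(start, end)

    # ID3v1: fixed 128-byte trailer that begins with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128

    return data[start:end]


def audio_hash(data: bytes) -> str:
    """SHA-1 hex digest of the file contents with ID3 tags removed."""
    return hashlib.sha1(strip_id3(data)).hexdigest()
```

Two copies of a song that differ only in their tags would then get the same `audio_hash` while their whole-file hashes differ. As noted below, this is specific to MP3 and does nothing for other file types.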

(But ignoring MP3 metadata wouldn't help with software, video, or documents. I find myself leaning towards techniques that are general across any file type, as SET is, instead of tweaks for a particular media type, but that may just be me.)

The patterns of similarity in those other file types are extremely different. Video files were primarily language differences *or* changes in the middle of the file that we haven't explained yet. Software is big chunks of unchanged code between versions. Documents are fairly obvious - small, localized changes (with no particular pattern to where in the document) between versions.

        Sorry if I missed something in the article and that is exactly
how you came up with these speed improvement numbers to begin with.

No, you didn't miss anything in the article - the one that was forwarded here didn't contain a lot of detail. The original paper has more of the details.

But what I'm getting at: I'm trying to figure out how much of that
speed increase can be gained 'easily' - without any complicated code
changes, just by switching to the proper hashing technique.

Probably most of it for MP3s, almost none of it for other file types.

  -Dave




_______________________________________________
p2p-hackers mailing list
[EMAIL PROTECTED]
http://lists.zooko.com/mailman/listinfo/p2p-hackers
