Re: Variable length pieces

Hampus Wessman Fri, 07 May 2010 10:05:46 -0700

I think it's an interesting idea. It would probably be best to create a new element similar to pieces, so that existing applications don't try to interpret them wrong. Less confusing too.

You could add your own <pieces> element in another namespace, but I would suggest something like

<extendedpieces type="SHA-1">
  <hash length="875583">...</hash>
  <hash length="100000">...</hash>
  ...
</extendedpieces>

Or perhaps <flexiblepieces> or <varpieces> or similar.
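
A client could check a file against such a list roughly as follows. This is only a sketch: <extendedpieces> and its length attribute are the hypothetical element suggested above, not part of any metalink spec, the function names are made up for illustration, and only SHA-1 is handled.

```python
import hashlib
import xml.etree.ElementTree as ET


def parse_extended_pieces(xml_text):
    """Parse the hypothetical <extendedpieces> element.

    Returns (hash_type, [(length, hex_digest), ...]) in document order.
    """
    root = ET.fromstring(xml_text)
    pieces = [
        (int(h.get("length")), (h.text or "").strip().lower())
        for h in root.findall("hash")
    ]
    return root.get("type"), pieces


def verify_pieces(data, pieces):
    """Check consecutive pieces of data against their SHA-1 digests.

    Pieces are assumed to follow each other without gaps, starting at
    offset 0. Returns one bool per piece.
    """
    results = []
    offset = 0
    for length, digest in pieces:
        chunk = data[offset:offset + length]
        ok = len(chunk) == length and hashlib.sha1(chunk).hexdigest() == digest
        results.append(ok)
        offset += length
    return results
```

With pieces laid out like this, a change confined to the first piece (say, a rewritten tag block) would fail only that piece, as long as its length stays the same.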

It's probably a good idea to still require that the whole file is covered by the hashes (i.e. no gaps, the first one starts at the beginning of the file and the last one ends at the end of the file). To get complete flexibility one could specify an offset (from the beginning of the file), a length and a hash type for each hash instead... Not sure that would be very useful, though.
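
If the offset-plus-length variant were used, the coverage requirement could be checked along these lines. A sketch only: the function name is made up, and rejecting overlapping pieces is my own assumption (the requirement above only mentions gaps).

```python
def covers_whole_file(ranges, file_size):
    """Return True if the (offset, length) ranges tile [0, file_size).

    The first range must start at 0, each range must start where the
    previous one ended, and the last must end at file_size. Overlaps
    and zero-length pieces are rejected as well (an assumption).
    """
    expected = 0
    for offset, length in sorted(ranges):
        if length <= 0 or offset != expected:
            return False
        expected = offset + length
    return expected == file_size
```

With length-only pieces, as in the <extendedpieces> example, the same check reduces to the lengths summing to the file size.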

If space efficiency is important, I think it's better to create a binary (or at least more compact) format and use some kind of compression on it. XML is quite verbose.

Would be cool to have this in the metalink standard (in the future) or in a documented extension of some kind.


Just my 1/50th of a dollar.

/ Hampus


On 05/07/2010 01:03 PM, petero wrote:

What are the thoughts on adding an optional attribute to the hash
element so that each piece can express its own length?


hi Peter,

I had thought something like this would be nice for things like music,
where if you edit the ID3 tags of an mp3, changing the artist or song,
you change the whole file's checksum, while not really changing the
important data at all.


Hi Anthony

Thanks! Interesting idea. If the apps creating the metalink pieces
further agreed on where to make those piece boundaries, in common
types of content (e.g. mp3): other apps could identify content that is
similar apart from its header and/or footer. They could do this very
efficiently by just comparing piece info from the metalinks, rather
than by re-chunking and hashing each file's content themselves.

Once pieces have been identified as being the same across different
files, apps could identify more potential sources for particular
pieces, identify duplication within a distributed collection, find the
richest metadata/tags for particular content etc.
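
As a rough sketch of that comparison, assuming each metalink has already been parsed into a list of (length, digest) pairs (the function name and data layout here are made up for illustration):

```python
from collections import defaultdict


def shared_pieces(files):
    """Find pieces that occur in more than one file.

    files maps a file name to its list of (length, digest) pieces.
    Returns a dict mapping each shared piece to the set of file names
    that contain it.
    """
    owners = defaultdict(set)
    for name, pieces in files.items():
        for piece in pieces:
            owners[piece].add(name)
    return {piece: names for piece, names in owners.items() if len(names) > 1}
```

Only the piece info is compared here; no file content is read or re-hashed.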

The pieces in the particular app I was originally referring to are
more similar to this:
http://www.hpl.hp.com/techreports/2005/HPL-2005-42R1.pdf
Finding Similar Files in Large Document Repositories
See 2.2 Chunking

"Content-based chunking, as introduced in [7], is a way of breaking a
file into a sequence of chunks so that chunk boundaries are determined
by the local contents of the file. This is in contrast to using fixed
size chunks, where chunk boundaries are determined by the distance
from the beginning of the file; inserting a single byte at the
beginning would change every chunk."

As the chunks could be small and many, it would be good if each of the
hashed pieces could express their own length in a space efficient
way...
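
To illustrate the idea, here is a minimal sketch of content-defined chunking that produces (length, SHA-1) pairs. It uses a simple polynomial rolling hash over a sliding window, which is not the algorithm from the paper above, and the window size, mask and names are arbitrary choices for illustration.

```python
import hashlib

WINDOW = 16           # bytes of local context that decide a boundary
MASK = (1 << 6) - 1   # a cut roughly every 64 bytes on random data
BASE = 257
MOD = (1 << 61) - 1


def chunk_boundaries(data):
    """Return the end offsets of the chunks of data.

    A boundary is placed after byte i when the rolling hash of the last
    WINDOW bytes has its low bits equal to zero, so boundaries depend
    only on local content, not on the distance from the file start.
    """
    if not data:
        return []
    cuts = []
    h = 0
    top = pow(BASE, WINDOW - 1, MOD)
    for i, b in enumerate(data):
        if i >= WINDOW:
            # Drop the byte that just left the window.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + b) % MOD
        if i + 1 >= WINDOW and (h & MASK) == 0:
            cuts.append(i + 1)
    if not cuts or cuts[-1] != len(data):
        cuts.append(len(data))  # the last chunk ends at end of file
    return cuts


def variable_pieces(data):
    """Return [(length, sha1_hex), ...] for the content-defined chunks."""
    pieces = []
    start = 0
    for end in chunk_boundaries(data):
        pieces.append((end - start, hashlib.sha1(data[start:end]).hexdigest()))
        start = end
    return pieces
```

Inserting a byte at the beginning of the data changes only the piece or pieces at the start; every later piece keeps its length and hash. A real implementation would probably also enforce minimum and maximum chunk sizes.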

I didn't quite follow the extension elements spec. Would you lean
towards extending the hash element to have an optional length
attribute? Or have a new element that is an alternative to pieces,
e.g. chunks, which has a list of hashes + lengths? It may be good if
examples of potential extensions esp variable-length pieces or chunks
were hinted at in the spec to gain interest in their standardization
and adoption?

