I think it's an interesting idea. It would probably be best to create a
new element similar to pieces, so that existing applications don't
misinterpret it. It would be less confusing, too.
You could add your own <pieces> element in another namespace, but I
would suggest something like
<extendedpieces type="SHA-1">
<hash length="875583">...</hash>
<hash length="100000">...</hash>
...
</extendedpieces>
The element could also be called <flexiblepieces>, <varpieces> or similar.
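For illustration, here is a minimal Python sketch of how a tool might
produce such an element. The element and attribute names are only the
suggestion above and are not part of any Metalink spec, and
build_extended_pieces is a hypothetical helper, not an existing API.

```python
import hashlib
import xml.etree.ElementTree as ET


def build_extended_pieces(data: bytes, lengths: list[int]) -> ET.Element:
    """Hash consecutive pieces of `data`, one <hash> per entry in `lengths`.

    The <extendedpieces> element and its `length` attribute are hypothetical.
    """
    if any(length <= 0 for length in lengths):
        raise ValueError("piece lengths must be positive")
    if sum(lengths) != len(data):
        raise ValueError("piece lengths must cover the whole file")

    root = ET.Element("extendedpieces", {"type": "SHA-1"})
    offset = 0
    for length in lengths:
        piece = data[offset:offset + length]
        hash_el = ET.SubElement(root, "hash", {"length": str(length)})
        hash_el.text = hashlib.sha1(piece).hexdigest()
        offset += length
    return root


# Two pieces of different sizes over a tiny stand-in "file".
element = build_extended_pieces(b"abcdefghij", [7, 3])
xml_text = ET.tostring(element, encoding="unicode")
```

Because the lengths are consecutive and must sum to the file size, no
offsets need to be stored.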
It's probably a good idea to still require that the whole file is
covered by the hashes (i.e. no gaps: the first one starts at the
beginning of the file and the last one ends at the end of the file). To get
complete flexibility one could specify an offset (from the beginning of
the file), a length and a hash type for each hash instead... Not sure
that would be very useful, though.
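If one did go the offset + length route, a client would have to check
the coverage itself. A rough Python sketch of that check follows;
covers_whole_file and the (offset, length) tuple layout are assumptions
made for the example, not anything defined by Metalink.

```python
def covers_whole_file(pieces: list[tuple[int, int]], file_size: int) -> bool:
    """Return True if the (offset, length) pieces tile [0, file_size)
    exactly, with no gaps and no overlaps."""
    expected = 0
    for offset, length in sorted(pieces):
        # Each piece must be non-empty and start where the previous one ended.
        if length <= 0 or offset != expected:
            return False
        expected = offset + length
    return expected == file_size


# The two piece sizes from the XML example above, given in arbitrary order.
complete = covers_whole_file([(875583, 100000), (0, 875583)], 975583)
with_gap = covers_whole_file([(0, 875583), (875600, 99983)], 975583)
```

With plain consecutive lengths this check reduces to comparing the sum
of the lengths with the file size.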
If space efficiency is important, I think it's better to create a binary
(or at least more compact) format and use some kind of compression on
it. XML is quite verbose.
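To give an idea of what a more compact format could look like, here is
a small Python sketch that stores each piece as a 4-byte length plus
the raw 20-byte SHA-1 digest and then runs zlib over the result. The
record layout is made up for this example. Note that the digests
themselves are effectively random, so compression mostly helps with
the lengths.

```python
import hashlib
import struct
import zlib

# Hypothetical record layout: 4-byte big-endian length + raw 20-byte SHA-1.
RECORD = struct.Struct(">I20s")


def pack_pieces(pieces: list[tuple[int, bytes]]) -> bytes:
    """Serialize (length, sha1_digest) pairs and compress them."""
    raw = b"".join(RECORD.pack(length, digest) for length, digest in pieces)
    return zlib.compress(raw)


def unpack_pieces(blob: bytes) -> list[tuple[int, bytes]]:
    """Inverse of pack_pieces."""
    raw = zlib.decompress(blob)
    return [RECORD.unpack_from(raw, pos) for pos in range(0, len(raw), RECORD.size)]


pieces = [
    (875583, hashlib.sha1(b"first piece").digest()),
    (100000, hashlib.sha1(b"second piece").digest()),
]
blob = pack_pieces(pieces)
restored = unpack_pieces(blob)
```

Each record is 24 bytes before compression, compared with 40 hex
characters plus markup per hash in the XML form.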
Would be cool to have this in the metalink standard (in the future) or
in a documented extension of some kind.
Just my 1/50th of a dollar.
/ Hampus
On 05/07/2010 01:03 PM, petero wrote:
What are the thoughts on adding an optional attribute to the hash
element so that each piece can express its own length?
Hi Peter,
I had thought something like this would be nice for things like music,
where if you edit the ID3 tags of an mp3, changing the artist or song,
you change the whole file's checksum, while not really changing the
important data at all.
Hi Anthony,
Thanks! Interesting idea. If the apps creating the metalink pieces
also agreed on where to place those piece boundaries in common
types of content (e.g. mp3), other apps could identify content that is
the same apart from its header and/or footer. They could do this very
efficiently by just comparing piece info from the metalinks, rather
than by re-chunking and hashing each file's content themselves.
Once pieces have been identified as being the same across different
files, apps could identify more potential sources for particular
pieces, identify duplication within a distributed collection, find the
richest metadata/tags for particular content etc.
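As a rough illustration of that comparison, the Python sketch below
treats each piece as a (length, hash) pair and intersects the piece
lists of two files. The byte strings are stand-ins for real tag and
audio data, and piece and shared_pieces are hypothetical helper names.

```python
import hashlib


def piece(data: bytes) -> tuple[int, str]:
    """Describe a piece by its length and SHA-1 hash."""
    return len(data), hashlib.sha1(data).hexdigest()


def shared_pieces(pieces_a, pieces_b) -> set:
    """Return the (length, hash) pairs present in both piece lists."""
    return set(pieces_a) & set(pieces_b)


audio = b"\x00" * 1000  # stand-in for identical audio data
file_a = [piece(b"tags: artist A"), piece(audio)]
file_b = [piece(b"tags: artist B, remastered"), piece(audio)]

# Only the audio piece matches; the differing tag pieces do not.
common = shared_pieces(file_a, file_b)
```

No file content is read here: the comparison works purely on the piece
info that the metalinks would already carry.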
The pieces in the particular app I was originally referring to are
more similar to this:
http://www.hpl.hp.com/techreports/2005/HPL-2005-42R1.pdf
Finding Similar Files in Large Document Repositories
See 2.2 Chunking
"Content-based chunking, as introduced in [7], is a way of breaking a
file into a sequence of chunks so that chunk boundaries are determined
by the local contents of the file. This is in contrast to using fixed
size chunks, where chunk boundaries are determined by the distance
from the beginning of the file; inserting a single byte at the
beginning would change every chunk."
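To make the idea concrete, here is a small Python sketch of
content-based chunking. It is not the algorithm from the paper: it
swaps in a simple polynomial rolling hash over a sliding window, and
the window size, mask and chunk_lengths name are arbitrary choices for
the example. A boundary is placed wherever the low bits of the window
hash are zero.

```python
import hashlib
from itertools import accumulate


def chunk_lengths(data: bytes, window: int = 16, mask: int = 0x3F) -> list[int]:
    """Split `data` at positions chosen by a rolling hash of the last
    `window` bytes and return the resulting chunk lengths."""
    base, mod = 257, (1 << 31) - 1
    top = pow(base, window - 1, mod)  # weight of the byte leaving the window
    lengths, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % mod  # drop the oldest byte
        h = (h * base + byte) % mod                 # add the newest byte
        if (h & mask) == 0:                         # content-defined boundary
            lengths.append(i + 1 - start)
            start = i + 1
    if start < len(data):
        lengths.append(len(data) - start)           # trailing chunk
    return lengths


# 4000 pseudo-random bytes as sample input.
sample = b"".join(hashlib.sha1(bytes([i])).digest() for i in range(200))
original = chunk_lengths(sample)
shifted = chunk_lengths(b"X" + sample)  # same data with one byte prepended

# Absolute end offsets of the chunks in each version.
ends_original = set(accumulate(original))
ends_shifted = set(accumulate(shifted))
```

Past the first window of bytes, every boundary in the original shows
up one byte later in the shifted version, so the chunks after the
insertion point keep their content and hashes. A real chunker would
normally also enforce minimum and maximum chunk sizes.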
As the chunks could be small and numerous, it would be good if each
hashed piece could express its own length in a space-efficient
way...
I didn't quite follow the extension elements spec. Would you lean
towards extending the hash element to have an optional length
attribute? Or have a new element that is an alternative to pieces,
e.g. chunks, which has a list of hashes + lengths? It might also help
if examples of potential extensions, especially variable-length pieces
or chunks, were hinted at in the spec to build interest in their
standardization and adoption.