Re: [Standards] [XEP-0234] File checking in JingleFT

Waqas Hussain Thu, 08 Dec 2011 12:11:07 -0800

On Thu, Dec 8, 2011 at 6:48 AM, Jefry Lagrange <jefry.re...@gmail.com> wrote:
> So far what I've read in the XEP-0234 doesn't say anything about file
> checking. I suppose, it is implied that once the comparing of the
> hashes takes place, the program has to inform the user of a possible
> file corruption and then restart the file transfer manually.
>
> This is very inefficient, and awkward (for the user). Instead of
> resending the whole file, we should be able to detect in which part of
> the file there is corruption and resend that part using ranged
> transfers (as defined in
> http://xmpp.org/extensions/xep-0234.html#range).
>
> There are various issues with this. In the current implementation of
> jingleFT, range file transfer have to initiate a new session in order
> to transfer the file from a given offset. Ideally, we would want to
> use the current session.
>
> Example:
>
> * initiator sends file
> * responder receives it
> * initiator communicates the hash
> * responder checks, and notices that the file is corrupt. Finds where
> the corruption took place (using techniques explained later)
> * responder ask for ranged file transfer over the same session (note:
> we haven't terminated the session).
> * initiator resends the requested part of the file
>
> I'm not sure that only providing an offset for as a range can
> accomplish this, it would be probably be a better idea to provide a
> range as a range. I might be wrong about this, I'm not sure.
>
> Now, the question is, how do we know which part of the file is corrupted?
>
> One way to go about this is using hash-lists. Which is the way
> bittorrent makes sure there isn't corruption in the files being
> transfered.
>
> Basically, what we do with this is that we send the file in parts. For
> each part the initiator will send an IQ with the hash of that file
> part. The responder will respond with a result IQ or an error. In case
> there is an error from part of the responder, the initiator will
> resend that part.
>
> Easy right? The problem with this algorithm is that it is too
> expensive. It makes sense using it in p2p programs, because a file may
> be share to other peers before it is completed. It doesn't make very
> much sense using this in XMPP, where there is usually just one sender
> and one receiver. But this technique is very illustrative and easy to
> understand.
>
> Another alternative technique that we can use is binary-hashing (I
> made the name up, I don't know what this is called). Using this
> technique we will wait until the file is transfered as usual, then
> when we check the hashes if there is a mismatch we will splice the
> file in two parts, take the hash of both parts and ask the initiator
> for confirmation on those hashes, if one mismatches, that should tell
> us that the corrupt part is, and we continue repeating this process
> until the parts are small enough for them to be reasonable resent. If
> the file is small enough to being with, then we can skip this.
>
> The cost (in terms of computation) of this technique might be greater
> than hash-list. But the overall cost is many times smaller than
> hash-list, because we only need to use this if we find that the file
> is corrupt, which is an unlikely event.
>
> I would love to know what you guys think of this.
>
> --
> Jefry Lagrange

I have always been interested in ranged transfers, and verification.
I'm not sure how much interest there is in general XMPP community
though.

The way bittorrent and other file sharing protocols have been doing
this for ages is quite interesting, and efficient.

Here's is the specification of data contained in a bittorrent file:

http://www.bittorrent.org/beps/bep_0003.html#metainfo-files-are-bencoded-dictionaries-with-the-following-keys

And what you called 'binary-hashing' are Merkel trees, or Hash trees.
http://en.wikipedia.org/wiki/Hash_tree

A bittorrent extension for merkel trees was defined two years ago:
http://www.bittorrent.org/beps/bep_0030.html

In bittorrent, the size of a block of data is defined in the torrent
file, specific to that torrent. A common size is 512KB.

For a 1GB transfer, this would be 2048 blocks of 512KB each. Given
SHA-1 as hash, this would be 20*2048 = 40KB of hashes. 53.3KB with
base-64 encoding. Your suggested hash tree may be twice that size, at
106.67KB, or it can be just 53.3KB by keeping hashes in all levels of
the tree, not just leaves. See how BEP-0030 does it.

A 1TB file though would have 53.3MB of hashes.. and just think how
many IQs it would take if you are requesting subsection hashes, and
how much CPU and I/O time it would take if you are calculating hashes
of parts independently, and not the way bittorrent is doing it...
given that we are defining a protocol here, we have to keep scaling in
mind.

The simplest and most scalable solution would be to simply make a
merkel hash file available as a separate file transfer, and only
transfer the root hash in the jingle negotiation. This also allows
ranged transfer of only necessary parts of the hash file, and doesn't
need base64 encoding.

Do we have anything like Jingle URIs? Something like XEP-0231: Bits of
Binary, but for Jingle. That would be useful in general, and for this
specific solution.

--
Waqas Hussain

Re: [Standards] [XEP-0234] File checking in JingleFT

Reply via email to