I was thinking that maybe we propose a separate XEP for file checking.
Still, some changes should have to be done to JingleFT such as
allowing range transfer reuse the session and add a "limit" attribute
in the range file transfer. So that it can send chunks of data (with
offset and limit).

On Thu, Dec 8, 2011 at 3:10 PM, Waqas Hussain <waqa...@gmail.com> wrote:
> On Thu, Dec 8, 2011 at 6:48 AM, Jefry Lagrange <jefry.re...@gmail.com> wrote:
>> So far what I've read in the XEP-0234 doesn't say anything about file
>> checking. I suppose, "it is implied that once the comparing of the
>> hashes takes place, the program has to inform the user of a possible
>> file corruption and then restart the file transfer manually.
>>
>> This is very inefficient, and awkward (for the user). Instead of
>> resending the whole file, we should be able to detect in which part of
>> the file there is corruption and resend that part using ranged
>> transfers (as defined in
>> http://xmpp.org/extensions/xep-0234.html#range).
>>
>> There are various issues with this. In the current implementation of
>> jingleFT, range file transfer have to initiate a new session in order
>> to transfer the file from a given offset. Ideally, we would want to
>> use the current session.
>>
>> Example:
>>
>> * initiator sends file
>> * responder receives it
>> * initiator communicates the hash
>> * responder checks, and notices that the file is corrupt. Finds where
>> the corruption took place (using techniques explained later)
>> * responder ask for ranged file transfer over the same session (note:
>> we haven't terminated the session).
>> * initiator resends the requested part of the file
>>
>> I'm not sure that only providing an offset for as a range can
>> accomplish this, it would be probably be a better idea to provide a
>> range as a range. I might be wrong about this, I'm not sure.
>>
>> Now, the question is, how do we know which part of the file is corrupted?
>>
>> One way to go about this is using hash-lists. Which is the way
>> bittorrent makes sure there isn't corruption in the files being
>> transfered.
>>
>> Basically, what we do with this is that we send the file in parts. For
>> each part the initiator will send an IQ with the hash of that file
>> part. The responder will respond with a result IQ or an error. In case
>> there is an error from part of the responder, the initiator will
>> resend that part.
>>
>> Easy right? The problem with this algorithm is that it is too
>> expensive. It makes sense using it in p2p programs, because a file may
>> be share to other peers before it is completed. It doesn't make very
>> much sense using this in XMPP, where there is usually just one sender
>> and one receiver. But this technique is very illustrative and easy to
>> understand.
>>
>> Another alternative technique that we can use is binary-hashing (I
>> made the name up, I don't know what this is called). Using this
>> technique we will wait until the file is transfered as usual, then
>> when we check the hashes if there is a mismatch we will splice the
>> file in two parts, take the hash of both parts and ask the initiator
>> for confirmation on those hashes, if one mismatches, that should tell
>> us that the corrupt part is, and we continue repeating this process
>> until the parts are small enough for them to be reasonable resent. If
>> the file is small enough to being with, then we can skip this.
>>
>> The cost (in terms of computation) of this technique might be greater
>> than hash-list. But the overall cost is many times smaller than
>> hash-list, because we only need to use this if we find that the file
>> is corrupt, which is an unlikely event.
>>
>> I would love to know what you guys think of this.
>>
>> --
>> Jefry Lagrange
>
> I have always been interested in ranged transfers, and verification.
> I'm not sure how much interest there is in general XMPP community
> though.
>
> The way bittorrent and other file sharing protocols have been doing
> this for ages is quite interesting, and efficient.
>
> Here's is the specification of data contained in a bittorrent file:
>  http://www.bittorrent.org/beps/bep_0003.html#metainfo-files-are-bencoded-dictionaries-with-the-following-keys
>
> And what you called 'binary-hashing' are Merkel trees, or Hash trees.
>  http://en.wikipedia.org/wiki/Hash_tree
>
> A bittorrent extension for merkel trees was defined two years ago:
>  http://www.bittorrent.org/beps/bep_0030.html
>
> In bittorrent, the size of a block of data is defined in the torrent
> file, specific to that torrent. A common size is 512KB.
>
> For a 1GB transfer, this would be 2048 blocks of 512KB each. Given
> SHA-1 as hash, this would be 20*2048 = 40KB of hashes. 53.3KB with
> base-64 encoding. Your suggested hash tree may be twice that size, at
> 106.67KB, or it can be just 53.3KB by keeping hashes in all levels of
> the tree, not just leaves. See how BEP-0030 does it.
>
> A 1TB file though would have 53.3MB of hashes.. and just think how
> many IQs it would take if you are requesting subsection hashes, and
> how much CPU and I/O time it would take if you are calculating hashes
> of parts independently, and not the way bittorrent is doing it...
> given that we are defining a protocol here, we have to keep scaling in
> mind.
>
> The simplest and most scalable solution would be to simply make a
> merkel hash file available as a separate file transfer, and only
> transfer the root hash in the jingle negotiation. This also allows
> ranged transfer of only necessary parts of the hash file, and doesn't
> need base64 encoding.
>
> Do we have anything like Jingle URIs? Something like XEP-0231: Bits of
> Binary, but for Jingle. That would be useful in general, and for this
> specific solution.
>
> --
> Waqas Hussain



-- 
Jefry Lagrange

Reply via email to