I was thinking that maybe we propose a separate XEP for file checking. Still, some changes should have to be done to JingleFT such as allowing range transfer reuse the session and add a "limit" attribute in the range file transfer. So that it can send chunks of data (with offset and limit).
On Thu, Dec 8, 2011 at 3:10 PM, Waqas Hussain <waqa...@gmail.com> wrote: > On Thu, Dec 8, 2011 at 6:48 AM, Jefry Lagrange <jefry.re...@gmail.com> wrote: >> So far what I've read in the XEP-0234 doesn't say anything about file >> checking. I suppose, "it is implied that once the comparing of the >> hashes takes place, the program has to inform the user of a possible >> file corruption and then restart the file transfer manually. >> >> This is very inefficient, and awkward (for the user). Instead of >> resending the whole file, we should be able to detect in which part of >> the file there is corruption and resend that part using ranged >> transfers (as defined in >> http://xmpp.org/extensions/xep-0234.html#range). >> >> There are various issues with this. In the current implementation of >> jingleFT, range file transfer have to initiate a new session in order >> to transfer the file from a given offset. Ideally, we would want to >> use the current session. >> >> Example: >> >> * initiator sends file >> * responder receives it >> * initiator communicates the hash >> * responder checks, and notices that the file is corrupt. Finds where >> the corruption took place (using techniques explained later) >> * responder ask for ranged file transfer over the same session (note: >> we haven't terminated the session). >> * initiator resends the requested part of the file >> >> I'm not sure that only providing an offset for as a range can >> accomplish this, it would be probably be a better idea to provide a >> range as a range. I might be wrong about this, I'm not sure. >> >> Now, the question is, how do we know which part of the file is corrupted? >> >> One way to go about this is using hash-lists. Which is the way >> bittorrent makes sure there isn't corruption in the files being >> transfered. >> >> Basically, what we do with this is that we send the file in parts. For >> each part the initiator will send an IQ with the hash of that file >> part. The responder will respond with a result IQ or an error. In case >> there is an error from part of the responder, the initiator will >> resend that part. >> >> Easy right? The problem with this algorithm is that it is too >> expensive. It makes sense using it in p2p programs, because a file may >> be share to other peers before it is completed. It doesn't make very >> much sense using this in XMPP, where there is usually just one sender >> and one receiver. But this technique is very illustrative and easy to >> understand. >> >> Another alternative technique that we can use is binary-hashing (I >> made the name up, I don't know what this is called). Using this >> technique we will wait until the file is transfered as usual, then >> when we check the hashes if there is a mismatch we will splice the >> file in two parts, take the hash of both parts and ask the initiator >> for confirmation on those hashes, if one mismatches, that should tell >> us that the corrupt part is, and we continue repeating this process >> until the parts are small enough for them to be reasonable resent. If >> the file is small enough to being with, then we can skip this. >> >> The cost (in terms of computation) of this technique might be greater >> than hash-list. But the overall cost is many times smaller than >> hash-list, because we only need to use this if we find that the file >> is corrupt, which is an unlikely event. >> >> I would love to know what you guys think of this. >> >> -- >> Jefry Lagrange > > I have always been interested in ranged transfers, and verification. > I'm not sure how much interest there is in general XMPP community > though. > > The way bittorrent and other file sharing protocols have been doing > this for ages is quite interesting, and efficient. > > Here's is the specification of data contained in a bittorrent file: > http://www.bittorrent.org/beps/bep_0003.html#metainfo-files-are-bencoded-dictionaries-with-the-following-keys > > And what you called 'binary-hashing' are Merkel trees, or Hash trees. > http://en.wikipedia.org/wiki/Hash_tree > > A bittorrent extension for merkel trees was defined two years ago: > http://www.bittorrent.org/beps/bep_0030.html > > In bittorrent, the size of a block of data is defined in the torrent > file, specific to that torrent. A common size is 512KB. > > For a 1GB transfer, this would be 2048 blocks of 512KB each. Given > SHA-1 as hash, this would be 20*2048 = 40KB of hashes. 53.3KB with > base-64 encoding. Your suggested hash tree may be twice that size, at > 106.67KB, or it can be just 53.3KB by keeping hashes in all levels of > the tree, not just leaves. See how BEP-0030 does it. > > A 1TB file though would have 53.3MB of hashes.. and just think how > many IQs it would take if you are requesting subsection hashes, and > how much CPU and I/O time it would take if you are calculating hashes > of parts independently, and not the way bittorrent is doing it... > given that we are defining a protocol here, we have to keep scaling in > mind. > > The simplest and most scalable solution would be to simply make a > merkel hash file available as a separate file transfer, and only > transfer the root hash in the jingle negotiation. This also allows > ranged transfer of only necessary parts of the hash file, and doesn't > need base64 encoding. > > Do we have anything like Jingle URIs? Something like XEP-0231: Bits of > Binary, but for Jingle. That would be useful in general, and for this > specific solution. > > -- > Waqas Hussain -- Jefry Lagrange