Re: [Standards] [XEP-0234] File checking in JingleFT

2012-02-08 Thread Peter Saint-Andre
On 12/12/11 11:32 AM, Jefry Lagrange wrote:
 I was thinking that maybe we propose a separate XEP for file checking.
 Still, some changes should have to be done to JingleFT such as
 allowing range transfer reuse the session and add a limit attribute
 in the range file transfer. So that it can send chunks of data (with
 offset and limit).

Well, XEP-0234 is not finished, so should have been done is a bit
pessimistic. I have just updated the spec to track the latest changes to
XEP-0300 and I'd be happy to make further changes so that we can support
future checking algorithms. I agree that we also need to provide clearer
documentation of ranged transfers (e.g., reusing the Jingle session ID).

Peter

-- 
Peter Saint-Andre
https://stpeter.im/




Re: [Standards] [XEP-0234] File checking in JingleFT

2011-12-12 Thread Jefry Lagrange
I was thinking that maybe we propose a separate XEP for file checking.
Still, some changes should have to be done to JingleFT such as
allowing range transfer reuse the session and add a limit attribute
in the range file transfer. So that it can send chunks of data (with
offset and limit).

On Thu, Dec 8, 2011 at 3:10 PM, Waqas Hussain waqa...@gmail.com wrote:
 On Thu, Dec 8, 2011 at 6:48 AM, Jefry Lagrange jefry.re...@gmail.com wrote:
 So far what I've read in the XEP-0234 doesn't say anything about file
 checking. I suppose, it is implied that once the comparing of the
 hashes takes place, the program has to inform the user of a possible
 file corruption and then restart the file transfer manually.

 This is very inefficient, and awkward (for the user). Instead of
 resending the whole file, we should be able to detect in which part of
 the file there is corruption and resend that part using ranged
 transfers (as defined in
 http://xmpp.org/extensions/xep-0234.html#range).

 There are various issues with this. In the current implementation of
 jingleFT, range file transfer have to initiate a new session in order
 to transfer the file from a given offset. Ideally, we would want to
 use the current session.

 Example:

 * initiator sends file
 * responder receives it
 * initiator communicates the hash
 * responder checks, and notices that the file is corrupt. Finds where
 the corruption took place (using techniques explained later)
 * responder ask for ranged file transfer over the same session (note:
 we haven't terminated the session).
 * initiator resends the requested part of the file

 I'm not sure that only providing an offset for as a range can
 accomplish this, it would be probably be a better idea to provide a
 range as a range. I might be wrong about this, I'm not sure.

 Now, the question is, how do we know which part of the file is corrupted?

 One way to go about this is using hash-lists. Which is the way
 bittorrent makes sure there isn't corruption in the files being
 transfered.

 Basically, what we do with this is that we send the file in parts. For
 each part the initiator will send an IQ with the hash of that file
 part. The responder will respond with a result IQ or an error. In case
 there is an error from part of the responder, the initiator will
 resend that part.

 Easy right? The problem with this algorithm is that it is too
 expensive. It makes sense using it in p2p programs, because a file may
 be share to other peers before it is completed. It doesn't make very
 much sense using this in XMPP, where there is usually just one sender
 and one receiver. But this technique is very illustrative and easy to
 understand.

 Another alternative technique that we can use is binary-hashing (I
 made the name up, I don't know what this is called). Using this
 technique we will wait until the file is transfered as usual, then
 when we check the hashes if there is a mismatch we will splice the
 file in two parts, take the hash of both parts and ask the initiator
 for confirmation on those hashes, if one mismatches, that should tell
 us that the corrupt part is, and we continue repeating this process
 until the parts are small enough for them to be reasonable resent. If
 the file is small enough to being with, then we can skip this.

 The cost (in terms of computation) of this technique might be greater
 than hash-list. But the overall cost is many times smaller than
 hash-list, because we only need to use this if we find that the file
 is corrupt, which is an unlikely event.

 I would love to know what you guys think of this.

 --
 Jefry Lagrange

 I have always been interested in ranged transfers, and verification.
 I'm not sure how much interest there is in general XMPP community
 though.

 The way bittorrent and other file sharing protocols have been doing
 this for ages is quite interesting, and efficient.

 Here's is the specification of data contained in a bittorrent file:
  http://www.bittorrent.org/beps/bep_0003.html#metainfo-files-are-bencoded-dictionaries-with-the-following-keys

 And what you called 'binary-hashing' are Merkel trees, or Hash trees.
  http://en.wikipedia.org/wiki/Hash_tree

 A bittorrent extension for merkel trees was defined two years ago:
  http://www.bittorrent.org/beps/bep_0030.html

 In bittorrent, the size of a block of data is defined in the torrent
 file, specific to that torrent. A common size is 512KB.

 For a 1GB transfer, this would be 2048 blocks of 512KB each. Given
 SHA-1 as hash, this would be 20*2048 = 40KB of hashes. 53.3KB with
 base-64 encoding. Your suggested hash tree may be twice that size, at
 106.67KB, or it can be just 53.3KB by keeping hashes in all levels of
 the tree, not just leaves. See how BEP-0030 does it.

 A 1TB file though would have 53.3MB of hashes.. and just think how
 many IQs it would take if you are requesting subsection hashes, and
 how much CPU and I/O time it would take if you are calculating hashes
 of parts independently, 

Re: [Standards] [XEP-0234] File checking in JingleFT

2011-12-08 Thread Waqas Hussain
On Thu, Dec 8, 2011 at 6:48 AM, Jefry Lagrange jefry.re...@gmail.com wrote:
 So far what I've read in the XEP-0234 doesn't say anything about file
 checking. I suppose, it is implied that once the comparing of the
 hashes takes place, the program has to inform the user of a possible
 file corruption and then restart the file transfer manually.

 This is very inefficient, and awkward (for the user). Instead of
 resending the whole file, we should be able to detect in which part of
 the file there is corruption and resend that part using ranged
 transfers (as defined in
 http://xmpp.org/extensions/xep-0234.html#range).

 There are various issues with this. In the current implementation of
 jingleFT, range file transfer have to initiate a new session in order
 to transfer the file from a given offset. Ideally, we would want to
 use the current session.

 Example:

 * initiator sends file
 * responder receives it
 * initiator communicates the hash
 * responder checks, and notices that the file is corrupt. Finds where
 the corruption took place (using techniques explained later)
 * responder ask for ranged file transfer over the same session (note:
 we haven't terminated the session).
 * initiator resends the requested part of the file

 I'm not sure that only providing an offset for as a range can
 accomplish this, it would be probably be a better idea to provide a
 range as a range. I might be wrong about this, I'm not sure.

 Now, the question is, how do we know which part of the file is corrupted?

 One way to go about this is using hash-lists. Which is the way
 bittorrent makes sure there isn't corruption in the files being
 transfered.

 Basically, what we do with this is that we send the file in parts. For
 each part the initiator will send an IQ with the hash of that file
 part. The responder will respond with a result IQ or an error. In case
 there is an error from part of the responder, the initiator will
 resend that part.

 Easy right? The problem with this algorithm is that it is too
 expensive. It makes sense using it in p2p programs, because a file may
 be share to other peers before it is completed. It doesn't make very
 much sense using this in XMPP, where there is usually just one sender
 and one receiver. But this technique is very illustrative and easy to
 understand.

 Another alternative technique that we can use is binary-hashing (I
 made the name up, I don't know what this is called). Using this
 technique we will wait until the file is transfered as usual, then
 when we check the hashes if there is a mismatch we will splice the
 file in two parts, take the hash of both parts and ask the initiator
 for confirmation on those hashes, if one mismatches, that should tell
 us that the corrupt part is, and we continue repeating this process
 until the parts are small enough for them to be reasonable resent. If
 the file is small enough to being with, then we can skip this.

 The cost (in terms of computation) of this technique might be greater
 than hash-list. But the overall cost is many times smaller than
 hash-list, because we only need to use this if we find that the file
 is corrupt, which is an unlikely event.

 I would love to know what you guys think of this.

 --
 Jefry Lagrange

I have always been interested in ranged transfers, and verification.
I'm not sure how much interest there is in general XMPP community
though.

The way bittorrent and other file sharing protocols have been doing
this for ages is quite interesting, and efficient.

Here's is the specification of data contained in a bittorrent file:
  
http://www.bittorrent.org/beps/bep_0003.html#metainfo-files-are-bencoded-dictionaries-with-the-following-keys

And what you called 'binary-hashing' are Merkel trees, or Hash trees.
  http://en.wikipedia.org/wiki/Hash_tree

A bittorrent extension for merkel trees was defined two years ago:
  http://www.bittorrent.org/beps/bep_0030.html

In bittorrent, the size of a block of data is defined in the torrent
file, specific to that torrent. A common size is 512KB.

For a 1GB transfer, this would be 2048 blocks of 512KB each. Given
SHA-1 as hash, this would be 20*2048 = 40KB of hashes. 53.3KB with
base-64 encoding. Your suggested hash tree may be twice that size, at
106.67KB, or it can be just 53.3KB by keeping hashes in all levels of
the tree, not just leaves. See how BEP-0030 does it.

A 1TB file though would have 53.3MB of hashes.. and just think how
many IQs it would take if you are requesting subsection hashes, and
how much CPU and I/O time it would take if you are calculating hashes
of parts independently, and not the way bittorrent is doing it...
given that we are defining a protocol here, we have to keep scaling in
mind.

The simplest and most scalable solution would be to simply make a
merkel hash file available as a separate file transfer, and only
transfer the root hash in the jingle negotiation. This also allows
ranged transfer of only necessary parts of the hash file, and