On Jun 30, 2009, at 11:22 AM, Noah Slater wrote:

On Tue, Jun 30, 2009 at 07:12:07AM -0400, Damien Katz wrote:
I'm not sure I understand why we can't just calculate and send the MD5
header for the content range.

We could, but are you not proposing that we use this value for the document revision? If that is the case, when you do range requests, the hash sent back doesn't actually correspond to anything. If I used the hash from the final range
request of a document to post an update, it would presumably fail.

To clarify, the point of deterministic rev ids is only to avoid unnecessary conflicts when identical edits are made on two different replicas. If the content was identical when editing the same revision, it should not be a conflict. If we had a canonical representation of the document, we could also use the deterministic rev ids for integrity checking, but we don't have a canonical representation, and creating one is very difficult to get right.
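
As a rough sketch of the idea only (this is not the actual algorithm): hash the parent revision together with some encoding of the new content, so the same edit to the same revision yields the same id on any replica. The function name rev_id, the placeholder parent ids, and the use of sorted-key JSON as a stand-in for the Erlang term format are all assumptions made for illustration. Sorted-key JSON is also not a true canonical form, which is exactly why this id should not double as an integrity check.

```python
import hashlib
import json


def rev_id(parent_rev: str, body: dict) -> str:
    """Hypothetical deterministic rev id: hash of parent rev plus content."""
    # Stand-in encoding (assumption): sorted-key JSON instead of the
    # Erlang term format the server would really hash.
    encoded = json.dumps(body, sort_keys=True, separators=(",", ":"))
    hasher = hashlib.md5()
    hasher.update(parent_rev.encode("utf-8"))
    hasher.update(b"\x00")  # separator between parent rev and content
    hasher.update(encoded.encode("utf-8"))
    return hasher.hexdigest()


# The same edit made on two replicas against the same parent revision
# produces the same id, so replication sees no conflict.
replica_a = rev_id("parent-rev", {"title": "hello", "count": 2})
replica_b = rev_id("parent-rev", {"count": 2, "title": "hello"})
print(replica_a == replica_b)
```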

What I'm proposing is that we only use Content-MD5 for payload integrity checking. It will not be used for security, and it cannot be validated against the rev id because the two will always be different. The rev id will be generated based on the Erlang term format of the document, not the UTF-8 JSON string that gets sent to the client.

So the server will send its responses (perhaps optionally) with an MD5 hash to detect packet corruption. Clients, when they send docs and attachments, can send the payload with a Content-MD5 header, and the server will check it to make sure the payload is uncorrupted. As it writes the data to disk, the server will compute the MD5 hash for its own integrity checking later.

So for example, the replicator will check the MD5 sig from the server and send its own MD5 sig when writing data. This prevents network problems from corrupting data as it replicates.
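
A minimal sketch of that check, assuming the header carries the base64-encoded MD5 digest of the body bytes as Content-MD5 is defined in RFC 1864. The helper names content_md5 and verify_content_md5 are made up for illustration and are not part of any existing API.

```python
import base64
import hashlib


def content_md5(payload: bytes) -> str:
    """Return the Content-MD5 header value for a payload."""
    digest = hashlib.md5(payload).digest()  # raw 16-byte digest
    return base64.b64encode(digest).decode("ascii")


def verify_content_md5(payload: bytes, header_value: str) -> bool:
    """True if the received payload matches the Content-MD5 header.

    This detects accidental corruption only; it is not a security check.
    """
    return content_md5(payload) == header_value.strip()


# Sender side: compute the header over the exact bytes being sent.
body = b'{"_id": "doc1", "value": 42}'
header = content_md5(body)

# Receiver side: recompute over the bytes received and compare.
print(verify_content_md5(body, header))
print(verify_content_md5(body[:-1] + b"!", header))
```

The hash covers the bytes on the wire, so it says nothing about the rev id, which is derived from a different representation of the document.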

-Damien


Best,

--
Noah Slater, http://tumbolia.org/nslater
