On Jun 30, 2009, at 11:22 AM, Noah Slater wrote:
On Tue, Jun 30, 2009 at 07:12:07AM -0400, Damien Katz wrote:
I'm not sure I understand why we can't just calculate and send the MD5
header for the content range.
We could, but are you not proposing that we use this value for the document
revision? If that is the case, when you do range requests, the hash sent back
doesn't actually correspond to anything. If I used the hash from the final range
request of a document to post an update, it would presumably fail.
To clarify, the point of deterministic rev ids is only to avoid
unnecessary conflicts when identical edits are made on two different
replicas. If the content is identical after editing the same revision,
it should not be a conflict. If we had a canonical representation of
the document, we could also use the deterministic rev ids for
integrity checking, but we don't have a canonical representation, and
creating one is very difficult to get right.
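As a rough sketch of the idea only (this is not CouchDB's actual algorithm): a deterministic rev id can be derived by hashing the parent revision together with the new document body. The function name deterministic_rev_id and the use of sorted-key JSON are assumptions made for illustration; sorted-key JSON stands in here for the Erlang term encoding and is not a true canonical representation.

```python
import hashlib
import json


def deterministic_rev_id(parent_rev: str, body: dict) -> str:
    """Hypothetical helper: hash the parent revision plus the new body.

    Sorted-key JSON is only a stand-in serialization for this sketch.
    """
    payload = json.dumps([parent_rev, body], sort_keys=True, separators=(",", ":"))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


parent = "1-abc"  # made-up parent revision

# The same edit made independently on two replicas (key order differs).
rev_a = deterministic_rev_id(parent, {"title": "hello", "count": 2})
rev_b = deterministic_rev_id(parent, {"count": 2, "title": "hello"})

# A genuinely different edit to the same parent revision.
rev_c = deterministic_rev_id(parent, {"title": "hello", "count": 3})

print(rev_a == rev_b)  # identical edits agree, so no conflict is needed
print(rev_a == rev_c)  # different content yields a different rev id
```

Because both replicas arrive at the same id for the same edit, replication between them has nothing to reconcile; only the differing edit would surface as a conflict.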
What I'm proposing is that we only use Content-MD5 for payload
integrity checking. It will not be used for security, and it cannot
be validated against the rev id because the two will always be different.
The rev id will be generated based on the Erlang term format of the
document, not the UTF-8 JSON string that gets sent to the client.
So the server will send its responses (perhaps optionally) with an MD5
hash to detect packet corruption. Clients, when they send docs and
attachments, can send the payload with a Content-MD5 header and the
server will check it to make sure it's uncorrupted. As it writes the
data to disk, the server will compute the MD5 hash for its own
integrity checking later.
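A minimal sketch of that check, assuming the header value follows the usual Content-MD5 convention (RFC 1864: the base64 encoding of the raw 16-byte MD5 digest of the payload). The helper names content_md5 and verify_content_md5 and the sample body are made up for illustration; this is not code from the server.

```python
import base64
import hashlib


def content_md5(payload: bytes) -> str:
    """Base64 of the raw MD5 digest, the form used in a Content-MD5 header."""
    return base64.b64encode(hashlib.md5(payload).digest()).decode("ascii")


def verify_content_md5(payload: bytes, header_value: str) -> bool:
    """Recompute the digest of the received bytes and compare to the header."""
    return content_md5(payload) == header_value


# Sender side: compute the header over the exact bytes being sent.
body = b'{"_id":"doc1","value":42}'
header = content_md5(body)

# Receiver side: an intact payload passes, a damaged one does not.
intact_ok = verify_content_md5(body, header)
damaged_ok = verify_content_md5(body[:-1] + b"!", header)

print(intact_ok, damaged_ok)
```

The digest covers only the bytes on the wire, which is why it says nothing about the rev id: it detects accidental corruption in transit and is not a security measure.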
So, for example, the replicator will check the MD5 sig from the server
and send its own MD5 sig when writing data. This prevents network
problems from introducing corruption into data as it replicates.
-Damien
Best,
--
Noah Slater, http://tumbolia.org/nslater