Thanks Jean-Claude. A couple of questions (below):

----- Original Message -----
From: "Jean-Claude Wippler" <[EMAIL PROTECTED]>
To: "Metakit list" <[EMAIL PROTECTED]>
Sent: Friday, January 02, 2004 1:49 PM
Subject: Re: [Metakit] Question
chris mollis wrote:
> I have a question about the best way to validate information on
> reads/writes to the db. For example, I'd like to make sure that data
> that is written out during a particular commit (by calculating a hash
> of data written, perhaps) can be verified again when the database is
> re-opened at a later date (possibly calculating the hash again and
> then checking this against the previous hash). What do you recommend
> to be the best way to do something like this? Should I override
> DataWrite/DataRead methods of c4_FileStrategy to calculate hashes on
> read and write operations?

Good questions. There are several aspects to consider. The first one is really what sort of validation you are after: if you need to verify storage in general, then one could argue that there really is no other option than full file checksums, and even then it'll depend on the sort of validation as to when and how often you need to do it. Such checksums could be done outside MK, i.e. after commits and before opens.

CM: Agreed. This was my first thought, but it seemed a bit much if I just wanted to check "deltas" between reads and writes...

Another point to be made is that MK is a database: it does not read or write all data each time the datafile is used. By validating writes on commit, you'll be checking only what the commit changed, not the entire datafile. Due to the way data is stored, the data written can be all over the datafile; it's not necessarily contiguous (though individual columns are).

CM: Right, this is what I thought. I had a feeling it was going to be more difficult than I first expected :)

It gets worse: MK usually loads data by mapping a file into memory. That means no "read" system calls take place at all in most cases: the data is mapped to a range of addresses and paged in via O/S page faults when accessed, which is a matter of following pointers.

CM: OK, I think I will need to disable memory-mapped I/O (as you point out below)...
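To make the "checksums outside MK" idea concrete, here is a minimal standalone sketch: hash the whole datafile after a commit, keep the digest, and recompute/compare before re-opening. The FNV-1a hash and the function names (`hashFile`, `fnv1a`) are illustrative assumptions, not Metakit API; any stronger digest would work the same way.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// 64-bit FNV-1a folded over a byte buffer, starting from a prior state.
inline uint64_t fnv1aUpdate(uint64_t h, const unsigned char* p, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        h ^= p[i];
        h *= 1099511628211ULL;              // FNV-1a 64-bit prime
    }
    return h;
}

inline uint64_t fnv1a(const unsigned char* p, size_t n) {
    return fnv1aUpdate(14695981039346656037ULL, p, n);  // offset basis
}

// Hash an entire datafile in 4 KB chunks; returns false if unreadable.
inline bool hashFile(const char* path, uint64_t& out) {
    std::FILE* fp = std::fopen(path, "rb");
    if (!fp) return false;
    uint64_t h = 14695981039346656037ULL;
    unsigned char buf[4096];
    size_t n;
    while ((n = std::fread(buf, 1, sizeof buf, fp)) > 0)
        h = fnv1aUpdate(h, buf, n);
    std::fclose(fp);
    out = h;
    return true;
}
```

Usage would be: after `storage.Commit()`, call `hashFile("data.mk", digest)` and record the digest somewhere safe; before the next open, recompute and compare. As noted above, this validates the whole file, not just the commit's deltas.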
If you really insist on doing this in some sort of fine-grained manner, my suggestion would be to use a custom c4_Strategy class, as you mention yourself, in combination with a *second* MK datafile. The invariant is that MK always writes entire columns - I suspect that it is possible to detect the column boundaries written by intercepting DataWrite(). The main call comes from column.cpp line 1532. Or it may be necessary to introduce two extra strategy members which get called once in each call to c4_Column::SaveNow():

  - strategy_.DataInit(pos_)
  - unmodified while (iter...) loop
  - strategy_.DataDone(_size)

The DataInit would reset a checksum field in the strategy object (and remember pos_), the DataWrite calls would incrementally update the checksum, and the DataDone call would save a <pos,size,check> triple in the second MK datafile. It'll take some extra logic to make this work across multiple commits, i.e. when space gets re-used, but that ought to be doable. You may want to use hashed views for the secondary MK file, to make it snappy.

CM: I'll have to work through an example of how this gets done and get back to you with questions. I'm a bit confused about what actually gets written in the second file and how it's verified during access... (I'll have to test out some stuff before I can ask an intelligent question here :)

The most important problem to deal with is *when* to verify such saved checksums. If it has to be done during access, then I can't think of any other way than to disable memory mapping (by overriding c4_Strategy::ResetFileMapping with a dummy which does nothing). That makes MK slower and makes it use considerably more temp memory, however - so you'll have to think hard about whether that is really what you want. If you just want to checksum occasionally, then you could iterate through all triples in the secondary MK file and verify each of the ranges.

CM: Again, I'm a bit confused as to what the "triples" are...
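A standalone sketch of the proposed hook scheme may help. Note the hedge: DataWrite exists on the real c4_Strategy, but DataInit/DataDone are the two *proposed* extra members, so the class below is a self-contained stand-in (it does not derive from the real c4_Strategy), and `ChecksumTriple` models a row of the secondary MK datafile.

```cpp
#include <cstdint>
#include <vector>

// One <pos,size,check> triple per column save; in the real scheme each
// of these would be a row in the secondary MK datafile.
struct ChecksumTriple { long pos; int size; uint64_t check; };

// Stand-in for a c4_Strategy subclass. DataInit/DataDone are the two
// proposed hooks, called once at the start and end of each
// c4_Column::SaveNow; DataWrite is called per chunk in between.
class ChecksumStrategy {
public:
    void DataInit(long pos) {               // proposed: start of SaveNow
        start_ = pos;
        check_ = 14695981039346656037ULL;   // FNV-1a 64-bit offset basis
    }
    void DataWrite(long pos, const void* buf, int len) {
        // A real subclass would forward to the base DataWrite first,
        // then fold the written bytes into the running checksum.
        const unsigned char* p = static_cast<const unsigned char*>(buf);
        for (int i = 0; i < len; ++i) {
            check_ ^= p[i];
            check_ *= 1099511628211ULL;     // FNV-1a 64-bit prime
        }
        (void)pos;
    }
    void DataDone(int size) {               // proposed: end of SaveNow
        ChecksumTriple t = { start_, size, check_ };
        triples_.push_back(t);              // row for the second file
    }
    const std::vector<ChecksumTriple>& Triples() const { return triples_; }
private:
    long start_;
    uint64_t check_;
    std::vector<ChecksumTriple> triples_;
};
```

Occasional verification then means iterating the triples, re-reading each <pos,size> byte range from the main datafile, recomputing the checksum, and comparing. The "extra logic across commits" mentioned above would replace, rather than append, any earlier triple whose range gets overwritten when space is re-used.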
I assume that these are just file markers (file position, size, along with the hash)... I'll try it out. Thanks for the advice!

Another idea would be to save checksums per fixed-size block, say 4 KB. That means DataWrite would track checksums, but it may need to read some data off the disk to deal with writes which are not exactly on block boundaries. This needs some thought to optimize, since most DataWrite calls will not be aligned nicely. Then again, DataWrite does get called in "mostly sequential" order, since it writes entire columns most of the time.

-jcw

_____________________________________________
Metakit mailing list - [EMAIL PROTECTED]
http://www.equi4.com/mailman/listinfo/metakit
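As a postscript, the per-block idea above can be sketched standalone. Everything here is an illustrative assumption: the in-memory `file_` vector stands in for the actual datafile, and re-hashing a whole block models the extra disk read needed when a write doesn't land on a 4 KB boundary.

```cpp
#include <cstdint>
#include <cstring>
#include <map>
#include <vector>

// Per-block checksums over a stand-in datafile. Any block a write
// touches is re-hashed in full, including partially covered edge
// blocks (the unaligned-write case discussed above).
class BlockChecksummer {
    static const size_t kBlock = 4096;
public:
    void Write(size_t pos, const void* buf, size_t len) {
        if (len == 0) return;
        if (pos + len > file_.size()) file_.resize(pos + len, 0);
        std::memcpy(&file_[pos], buf, len);
        for (size_t b = pos / kBlock; b <= (pos + len - 1) / kBlock; ++b)
            checks_[b] = HashBlock(b);
    }
    bool Verify() const {                   // occasional full check
        for (std::map<size_t, uint64_t>::const_iterator it =
                 checks_.begin(); it != checks_.end(); ++it)
            if (HashBlock(it->first) != it->second) return false;
        return true;
    }
private:
    uint64_t HashBlock(size_t b) const {    // FNV-1a over one block
        uint64_t h = 14695981039346656037ULL;
        size_t beg = b * kBlock;
        size_t end = beg + kBlock < file_.size() ? beg + kBlock
                                                 : file_.size();
        for (size_t i = beg; i < end; ++i) {
            h ^= file_[i];
            h *= 1099511628211ULL;
        }
        return h;
    }
    std::vector<unsigned char> file_;
    std::map<size_t, uint64_t> checks_;
};
```

Since DataWrite is "mostly sequential", consecutive writes usually extend the same block, so an optimized version could keep the current block's running hash open instead of re-reading it each time.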
