Thanks, Jean-Claude.

A couple of questions (below):
----- Original Message -----
From: "Jean-Claude Wippler" <[EMAIL PROTECTED]>
To: "Metakit list" <[EMAIL PROTECTED]>
Sent: Friday, January 02, 2004 1:49 PM
Subject: Re: [Metakit] Question


chris mollis wrote:

> I have a question about the best way to validate information on
> reads/writes to the db. For example, I'd like to make sure that data
> that is written out during a particular commit (by calculating a hash
> of data written, perhaps) can be verified again when the database is
> re-opened at a later date (possibly calculating the hash again and
> then checking this against the previous hash). What do you recommend
> to be the best way to do something like this? Should I override
> DataWrite/DataRead methods of c4_FileStrategy to calculate hashes on
> read and write operations?

Good questions.  There are several aspects to consider.  The first one
is really what sort of validation you are after: if you need to verify
storage in general, then one could argue that there really is no other
option than full file checksums, and even then it'll depend on the sort
of validation as to when and how often you need to do it.  Such
checksums could be done outside MK, i.e. after commits and before
opens.

CM:  Agreed. This was my first thought, but it seemed a bit much if I just
wanted to check "deltas" between reads and writes...
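A minimal sketch of the whole-file approach, run after a commit and compared
before the next open. FNV-1a is used here purely as a stand-in for brevity; a
cryptographic hash (e.g. SHA-256) would be the safer choice if tampering is a
concern:

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Compute a checksum over the entire datafile, outside MK.
// FNV-1a: simple and incremental, but not cryptographically strong.
uint64_t ChecksumFile(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    uint64_t h = 14695981039346656037ULL;   // FNV-1a offset basis
    char c;
    while (in.get(c)) {
        h ^= static_cast<unsigned char>(c);
        h *= 1099511628211ULL;              // FNV prime
    }
    return h;
}
```

The stored result would be compared against a fresh `ChecksumFile()` call each
time the datafile is re-opened.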

Another point to be made is that MK is a database: it does not read or
write all data each time the datafile is used.  By validating writes on
commit, you'll be checking only what has changed, not the entire
datafile.  Due to the way data is stored, the data written can be all
over the datafile; it's not necessarily contiguous (though individual
columns are).

CM:  Right, this is what I thought. I had a feeling it was going to be
more difficult than I first expected :)

It gets worse: MK usually loads data by mapping a file into memory.
That means no "read" system calls take place at all in most cases: the
data is mapped to a range of addresses and paged in via O/S page faults
when "accessed", which is a matter of following pointers.

CM:  OK, I think I will need to disable memory-mapped I/O (as you point
out below)..

If you really insist on doing this in some sort of fine-grained manner,
my suggestion would be to use a custom c4_Strategy class as you mention
yourself, in combination with a *second* MK datafile.  The invariant is
that MK always writes entire columns - I suspect that it is possible to
detect the column boundaries written by intercepting DataWrite().  The
main call comes from column.cpp line 1532.  Or it may be necessary to
introduce two extra strategy members which get called once in each call
to c4_Column::SaveNow():
- strategy_.DataInit(pos_)
- unmodified while (iter...) loop
- strategy_.DataDone(_size)
The DataInit would reset a checksum field in the strategy object (and
remember pos_), the DataWrite calls would incrementally update the
checksum, and the DataDone call would save a <pos,size,check> triple in
the second MK datafile.  It'll take some extra logic to make this work
across multiple commits, i.e. when space gets re-used, but that ought
to be doable.  You may want to use hashed views for the secondary MK
file, to make it snappy.

CM: I'll have to work through an example of how this gets done and get back
to you with questions. I'm a bit confused about what actually gets written
to the second file and how it's verified during access... (I'll have to
test some things before I can ask an intelligent question here :)
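A rough sketch of the bookkeeping suggested above, assuming the two extra
strategy members (`DataInit`/`DataDone`) exist as proposed, and using a plain
vector as a stand-in for the second MK datafile. The FNV-1a checksum is an
arbitrary choice:

```cpp
#include <cstdint>
#include <vector>

// One <pos, size, check> entry, as would be stored as a row in the
// secondary MK datafile (a view with pos/size/check properties, say).
struct ColumnCheck {
    long     pos;    // file offset where the column was written
    long     size;   // number of bytes written
    uint64_t check;  // incremental checksum of those bytes
};

// Stand-in for the checksum state a custom c4_Strategy subclass would
// keep. DataInit/DataDone are the two hypothetical extra members
// described above; DataWrite mirrors the existing strategy call.
class ChecksumTracker {
public:
    void DataInit(long pos) {               // start of c4_Column::SaveNow
        pos_ = pos;
        check_ = 14695981039346656037ULL;   // FNV-1a offset basis
        size_ = 0;
    }
    void DataWrite(const void* buf, int len) {  // incremental update
        const unsigned char* p = static_cast<const unsigned char*>(buf);
        for (int i = 0; i < len; ++i) {
            check_ ^= p[i];
            check_ *= 1099511628211ULL;     // FNV prime
        }
        size_ += len;
    }
    void DataDone(std::vector<ColumnCheck>& table) {  // end of SaveNow
        // In the real setup this row would replace any earlier triple
        // covering the same range, to cope with space re-use.
        table.push_back(ColumnCheck{pos_, size_, check_});
    }
private:
    long pos_ = 0;
    long size_ = 0;
    uint64_t check_ = 0;
};
```

Because the checksum is updated byte by byte, several `DataWrite` calls for one
column produce the same result as a single write of the whole column.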



The most important problem to deal with is *when* to verify such saved
checksums.  If it has to be done during access, then I can't think of
any other way than to disable memory mapping (by overriding
c4_Strategy::ResetFileMapping with a dummy which does nothing).  That
makes MK slower and considerably increases its temporary memory use,
however, so you'll have to think hard about whether that is really what
you want.


If you just want to checksum occasionally, then you could iterate
through all triples in the secondary MK file and verify each of the
ranges.

CM:  Again, I'm a bit confused as to what the "triples" are... I assume
these are just file markers (file position and size, along with the
hash)... I'll try it out. Thanks for the advice!
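To make the triples concrete: each records where a column was written (pos),
how many bytes (size), and the checksum of those bytes (check). An occasional
verification pass would loop over every stored triple and re-check its range,
roughly like this sketch (FNV-1a again as an assumed stand-in):

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Recompute the checksum of one <pos,size> range of the datafile and
// compare it against the stored value from the secondary MK file.
bool VerifyRange(const std::string& path, long pos, long size,
                 uint64_t expected) {
    std::ifstream in(path, std::ios::binary);
    in.seekg(pos);
    uint64_t h = 14695981039346656037ULL;   // FNV-1a offset basis
    for (long i = 0; i < size; ++i) {
        int c = in.get();
        if (c == std::ifstream::traits_type::eof())
            return false;                   // range extends past EOF
        h ^= static_cast<unsigned char>(c);
        h *= 1099511628211ULL;              // FNV prime
    }
    return h == expected;
}
```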

Another idea would be to save checksums per fixed-size block, say 4 KB.
That means DataWrite would track checksums, but it may need to read
some data off the disk to deal with writes which are not exactly on
block boundaries.  This needs some thought to optimize, since most
DataWrite calls will not be aligned nicely.  Then again, DataWrite does
get called in "mostly sequential" order, since it writes entire columns
most of the time.
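A sketch of the per-block idea, with the datafile modeled as an in-memory byte
vector so the re-read of partially touched blocks is visible; in a real
c4_Strategy subclass that re-read would be an actual disk read. Block size and
checksum choice are both assumptions:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstddef>
#include <vector>

// Per-block checksum bookkeeping over a growable "datafile".
class BlockChecksums {
public:
    BlockChecksums(std::vector<unsigned char>& file, size_t blockSize)
        : file_(file), blockSize_(blockSize) {}

    // Apply a write at an arbitrary offset, then refresh the checksum
    // of every block the write touched - including blocks only
    // partially covered, whose untouched bytes must be re-read.
    void Write(size_t pos, const void* buf, size_t len) {
        if (len == 0) return;
        const unsigned char* p = static_cast<const unsigned char*>(buf);
        if (pos + len > file_.size())
            file_.resize(pos + len);
        std::copy(p, p + len, file_.begin() + pos);
        size_t first = pos / blockSize_;
        size_t last = (pos + len - 1) / blockSize_;
        if (sums_.size() <= last)
            sums_.resize(last + 1);
        for (size_t b = first; b <= last; ++b)
            sums_[b] = ChecksumBlock(b);    // the "read back" step
    }

    // FNV-1a over one block (the final block may be short).
    uint64_t ChecksumBlock(size_t b) const {
        size_t start = b * blockSize_;
        size_t end = std::min(start + blockSize_, file_.size());
        uint64_t h = 14695981039346656037ULL;
        for (size_t i = start; i < end; ++i) {
            h ^= file_[i];
            h *= 1099511628211ULL;
        }
        return h;
    }

    const std::vector<uint64_t>& Sums() const { return sums_; }

private:
    std::vector<unsigned char>& file_;
    size_t blockSize_;
    std::vector<uint64_t> sums_;
};
```

The cost of the read-back is why alignment matters: a write that starts and
ends on block boundaries never has to re-read anything.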

-jcw

_____________________________________________
Metakit mailing list  -  [EMAIL PROTECTED]
http://www.equi4.com/mailman/listinfo/metakit


