Re: [Dbmail-dev] header storage schema changes. (W)Here we (may) go.

Matthew T. O'Connor Mon, 3 Jan 2005 10:43:42 +0100 (CET)

Matthew T. O'Connor wrote:

header_list:  ( Contains an exhaustive list of all headers from all
messages in the database. )


[...]

header_values: ( Contains the values from all the headers in all the
messages in database )


[...]

This structure will make it very easy to query all the headers from a
given message or find all the messages with a given header, or a
given header value.  It also leaves our current structure intact
which will make it easier to phase in.


It will work, but you still require a lookup on an extra table to do
header searches. There has to be hidden
cost there. Also the added db-interaction at insertion time is very
significant.


The cost of looking up one row in a small table (100-200 rows?) is almost
zero vs the cost of doing a string comparison several million times.

What I like about the yukatan approach is that it tries to make searches
on very common headers as cheap and
fast as possible. It does this by going one step beyond separate header
storage: preparse certain headers for
common attributes: the in-reply-to and references headers will be used for
threading. These headers contain
one or more message-id header values, which are stored separately. With
this approach building message-threads
can be done with fully indexed, single-table queries! Of course a union
with the msgids from the opened
mailbox is still required, but you can't beat such a setup wrt threading.


Perhaps I'm missing something here, but can't my two table solution do
basically the same thing?  In fact by having the header_list table we are
in effect pre-parsing the headers already.  It's easy to write a query
that grabs all the messages in a given thread using those two headers
without splitting them into separate tables.  The relevant argument is
will it be fast enough, and I think it will.

What do you think?  I don't think we need to special case any headers
not even sendername or subject.


The more I think about this, the more convinced I am that we do need to
treat common headers differently in
the end. And your suggestion to preseed the header_names table and
hardcode their ids tells me we're agreed here.


Somewhat agreed  :-)  I do agree that specialized solution designed to

handle certain headers should be somewhat faster then a generic solution.I don't necessarily agree (yet) that we need that speed. I think the

solution I'm suggesting is going to be very fast already.

Don't forget the yukatan database design does have a generic headers table
much like I'm suggesting.  If we add the generic headers table, we can
always add the specific header tables later if and when it's proven that
we need them.

Basically I'm arguing to start with a simple clean generic design and see
how far it takes us before we start adding specialized workarounds.  Also
I don't think we are wasting any effort since you want a generic table
anyway.

[ snip the MIME entities discussion as I want to stay focused on the
header caching for now.... ]

So in summary, yes the Yukatan model is nice, and has a lot of
advantages over ours, but I still think we are best served by starting
with the two table design I described earlier.  This should be
sufficiently fast and flexible that we can go a long way before we have
to special case anything.


I'll keep your idea in mind as I start building test-cases for the
insertion phase. But I still think that
having to choose between a convoluted 'if not select header_name then
insert header_name; insert
header_value;' and a simple 'insert header_name, header_value' for each
header of each message inserted, the
latter seems to make more sense for now.


That query is not convoluted, yes it's not quite as simple as blinding
inserting header_name and header_value.  However that nominal amount of
complexity reduces I/O since the header_list table will be populated with
99% of the headers that a mail server is going to see after it's been up
and running for a while.  It also significantly cuts down on storage space
required to store the header data, and it greatly increases the search
speed.  I don't see a downside here.

But I'm always open to arguments and willing to change my mind (as-if my

wife would say :-)

:-)

Thanks for helping me think this through a little.


No problem, I'm enjoying the debate.  I just want DBMail to be the best it
can be, and I want start with good clean simple design.  We already have

enough code that is impossible to manage :-)

Matthew

ps: BTW, I am willing to write some code to help make this happen, or work
with you on the code you are writing.  Just let me know.

Re: [Dbmail-dev] header storage schema changes. (W)Here we (may) go.

Reply via email to