> >>>CREATE TABLE headers (
> >>>   id SERIAL PRIMARY KEY,
> >>>   name TEXT NOT NULL
> >>>);
> >>>CREATE UNIQUE INDEX headers_name_udx ON headers(name);
> >>>
> >>>CREATE TABLE header_msg_map (
> >>>   id SERIAL8 PRIMARY KEY,
> >>>   physmessage_id INT NOT NULL,
> >>>   header_id INT NOT NULL,
> >>>   value TEXT NOT NULL,
> >>>   hdr_order INT2 NOT NULL,
> >>>   FOREIGN KEY (header_id) REFERENCES headers(id)
> >>>);
> >>>CREATE INDEX header_msg_map_idx ON header_msg_map(physmessage_id, 
> >>>header_id);

> Also, remember that you add several bytes to the database by adding
> the id in both tables (8 bytes each) plus the separate indexes on
> both of id. (Btw, I ussume you meant SERIAL8 in both tables).

Hmm, I see I was wrong on the SERIAL8. You use id(4) and header_id(4)
for the join. That halves the extra size of course.

> So unless you can save _on_average_ atleast
> (16 bytes + index + table overhead) then it's actually going to increase
> database size. And how many header names are 16+ bytes? Not many..
Actually it'd only be 4bytes (double think'o).


Doing some calculations:
(no numbers for indexes, excluding header_msg_map.id addition
and hdr_order and disregarding compression and table overheads)

Lets say we have 100 mails like a random I picked in my mailbox
now, each has:
recieved
recieved
recieved
from
to
subject
date
organization
message-id
mime-version
content-type
content-transfer-encoding
x-priority
x-msmail-priority
x-mailer
x-mimeole
importance
xrbl-warning
x-note
x-spam-tests-failed
x-rcpt-to

-- For 1 message:

 With one table:
  212 bytes for name
  +index on (physmessage_id,name)

 With two tables:
  196 bytes for headers.name
  76 bytes for headers.id
  +index on (name,id)
  4 bytes for header_id
  +index on (physmessage_id,header_id)

 So, for one message this is a clear loss.

-- For 100 messages:

 With one table:
  21200 bytes for name
  +index on (physmessage_id,name)

 With two tables:
  196 bytes for headers.name
  76 bytes for headers.id
  +index on (name,id)

  400 bytes
  +index on (physmessage_id,header_id)

 So this is a very clear gain.

This makes a clear case against my earlier suggestion
and in favor of yours. So lets go with yours.

But I'd remove the id in header_msg_map as this adds
a whole lot to each header and is of no real use as far
as I can see. There would likely be no need to ever update
or delete a single header entry. Deletion would be done
referencing physmessage_id instead. This further reduces
the database size.

hdr_order is a nice addition although I don't think it's
of much use unless we remove the raw header, and that would
kill backwards-compatability for sure.

If you add the indexes I mentioned in the example above
and remove the id I think we have a winner.

Sean: Sorry for mis-reading and questioning your work.


-Dead2 who really needs to stop using strong painkillers


Reply via email to