I beat on mairix 0.20 to get rid of some of the bugs, adjusted the shift
bits to give myself a 24-bit message count, and managed to make it index
most of my 1.6 million emails in mbox form:

Wrote 1641908 messages (32838160 bytes tables, 0 bytes text)
Wrote 51 mbox headers (816 bytes tables, 2223 bytes paths)
Wrote 26270528 bytes of mbox message checksums
To: Wrote 329442 tokens (2635536 bytes tables, 4917712 bytes of text, 13753058 bytes of hit encoding)
Cc: Wrote 98270 tokens (786160 bytes tables, 1394159 bytes of text, 2555590 bytes of hit encoding)
From: Wrote 449652 tokens (3597216 bytes tables, 7126797 bytes of text, 14880802 bytes of hit encoding)
Subject: Wrote 173835 tokens (1390680 bytes tables, 1833769 bytes of text, 14841880 bytes of hit encoding)
Body: Wrote 6543874 tokens (52350992 bytes tables, 108914569 bytes of text, 304525979 bytes of hit encoding)
Attachment Name: Wrote 8061 tokens (64488 bytes tables, 142214 bytes of text, 66916 bytes of hit encoding)
(Threading): Wrote 1190729 tokens (9525832 bytes tables, 54171674 bytes of text, 15753681 bytes of hit encoding)

I had to split one mbox into two 1GB halves to avoid out-of-memory errors.
The mairix program hit over a million page faults on my 1.5GB system.

top - 00:38:05 up 37 days, 23:01,  7 users,  load average: 8.86, 6.39, 5.98
Tasks: 253 total,   2 running, 246 sleeping,   5 stopped,   0 zombie
Cpu(s):  5.6%us,  3.3%sy,  0.7%ni,  0.0%id, 88.5%wa,  0.7%hi,  1.3%si,  0.0%st
Mem:   1555952k total,  1502380k used,    53572k free,     3844k buffers
Swap:  2939884k total,  1548100k used,  1391784k free,   233388k cached

  PID USER      NI  VIRT  RES S %CPU %MEM    TIME+  SWAP nFLT COMMAND          
 3983 idallen    0 2119m 1.2g D  1.0 80.9  16:20.76 891m 1.1m mairix            

I have some old email that has a "From " line but no "From:" or "Date:"
lines.  I don't see an easy way to make mairix use the values from the
"From " line in place of the missing "From:" and "Date:".  A topic for
future work.
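
For what it's worth, here is a rough sketch of the kind of fallback I have
in mind.  It is not mairix code: parse_from_line() and its parameters are
names I made up, and it assumes the "From " line has already been copied
into a NUL-terminated buffer in the usual "From sender date" layout.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of mairix: split an mbox "From " separator
 * line into its sender and date fields so they could stand in for missing
 * "From:" and "Date:" headers.
 * Returns 0 on success, -1 if the line is malformed or a field won't fit. */
int parse_from_line(const char *line,
                    char *sender, size_t sender_size,
                    char *date, size_t date_size)
{
    assert(line != NULL && sender != NULL && date != NULL);

    static const char prefix[] = "From ";
    const size_t prefix_len = sizeof prefix - 1;   /* drop the NUL */

    if (strncmp(line, prefix, prefix_len) != 0)
        return -1;

    /* The sender runs up to the next whitespace. */
    const char *addr = line + prefix_len;
    size_t addr_len = strcspn(addr, " \t\r\n");
    if (addr_len == 0 || addr_len >= sender_size)
        return -1;
    memcpy(sender, addr, addr_len);
    sender[addr_len] = '\0';

    /* Everything after the separating blanks, up to end of line, is the date. */
    const char *rest = addr + addr_len;
    rest += strspn(rest, " \t");
    size_t date_len = strcspn(rest, "\r\n");
    if (date_len == 0 || date_len >= date_size)
        return -1;
    memcpy(date, rest, date_len);
    date[date_len] = '\0';
    return 0;
}
```

The date comes back as raw text; turning it into a timestamp would be a
separate step.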

I see there is an updated version.  I'll roll forward to that at some point.

The new nvp.c still doesn't do bounds checking when filling the name/value
arrays.  I have at least one message that faulted there.
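
To illustrate the kind of check I mean, here is a minimal sketch.  It is
not the actual nvp.c code: the struct layout, the MAX_PAIRS and MAX_TEXT
limits, and the function names are all invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PAIRS 4    /* hypothetical capacity, kept small for illustration */
#define MAX_TEXT  32   /* hypothetical size of each name/value buffer */

struct pair {
    char name[MAX_TEXT];
    char value[MAX_TEXT];
};

struct pair_list {
    struct pair pairs[MAX_PAIRS];
    size_t count;
};

/* Copy src into dst only if it fits, including the terminating NUL.
 * Returns 0 on success, -1 if src is too long (dst is left empty). */
static int copy_bounded(char *dst, size_t dst_size, const char *src)
{
    size_t len = strlen(src);
    if (len >= dst_size) {
        dst[0] = '\0';
        return -1;
    }
    memcpy(dst, src, len + 1);
    return 0;
}

/* Append one name/value pair.
 * Returns 0 on success, -1 if the list is full or a field is too long. */
int pair_list_add(struct pair_list *list, const char *name, const char *value)
{
    assert(list != NULL && name != NULL && value != NULL);

    if (list->count >= MAX_PAIRS)
        return -1;                  /* refuse rather than write past the array */

    struct pair *p = &list->pairs[list->count];
    if (copy_bounded(p->name, sizeof p->name, name) != 0 ||
        copy_bounded(p->value, sizeof p->value, value) != 0)
        return -1;                  /* slot is not committed on failure */

    list->count++;
    return 0;
}
```

A caller that gets -1 back can skip the oversized header field instead of
faulting on it.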

split_and_splice_header() will walk past the end of the message while
looking for the blank line that marks the start of the message body, and
that can cause a fault.  I think the search needs to be limited to the
passed-in message length.  (That "got null character" is probably caused
by walking off the end of the file mapping?)
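
Something along these lines is what I mean by a length-limited search.
This is only a sketch, not the real split_and_splice_header(): the name
find_body_offset() and its return convention are my own assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: find where the body starts in msg[0..len).
 * Returns the offset of the first body byte (just past the blank line),
 * or len if no blank line exists inside the message.  Never reads
 * msg[len] or anything beyond it. */
size_t find_body_offset(const char *msg, size_t len)
{
    assert(msg != NULL || len == 0);

    /* A message that starts with a blank line has no headers at all. */
    if (len > 0 && msg[0] == '\n')
        return 1;

    for (size_t i = 0; i < len; i++) {
        if (msg[i] != '\n')
            continue;
        /* LF LF: Unix-style blank line */
        if (i + 1 < len && msg[i + 1] == '\n')
            return i + 2;
        /* LF CR LF: blank line in a CRLF-terminated message */
        if (i + 2 < len && msg[i + 1] == '\r' && msg[i + 2] == '\n')
            return i + 3;
    }
    return len;   /* headers only: caller treats the body as empty */
}
```

Every index is compared against len before the byte is read, so a message
with no blank line ends the search at its own boundary rather than running
into the next message or off the mapping.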

In several places sizeof("string") is used as if it returned the same
length as strlen("string").  It doesn't: sizeof counts the NUL byte at
the end, so you have to subtract one to get the string length.
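
A small example of the difference.  The LITERAL_LEN macro and the
has_from_prefix() function are names made up for this illustration; they
are not taken from the mairix source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Compile-time length of a string literal, excluding the trailing NUL.
 * Only valid for literals and char arrays, never for a char pointer
 * (sizeof a pointer is the pointer size, not the string length). */
#define LITERAL_LEN(s) (sizeof(s) - 1)

/* Hypothetical prefix test: does line[0..len) start with "From "?
 * Returns 1 if so, 0 otherwise. */
int has_from_prefix(const char *line, size_t len)
{
    assert(line != NULL);

    /* sizeof("From ") is 6 (five characters plus the NUL), while
     * strlen("From ") is 5.  Comparing sizeof("From ") bytes would also
     * compare the NUL against the sixth byte of the line and so reject
     * every real "From " line. */
    return len >= LITERAL_LEN("From ") &&
           memcmp(line, "From ", LITERAL_LEN("From ")) == 0;
}
```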

valgrind wasn't too happy with the program.  I don't know enough about
what it says to comment or fix it yet.  A sample on a small run:

==1593== LEAK SUMMARY:
==1593==    definitely lost: 98,885 bytes in 4,921 blocks.
==1593==    indirectly lost: 1,401,842 bytes in 3,066 blocks.

This is a great program.

-- 
| Ian! D. Allen  -  [EMAIL PROTECTED]  -  Ottawa, Ontario, Canada
| Home Page: http://idallen.com/   Contact Improv: http://contactimprov.ca/
| College professor (Open Source / Linux) via: http://teaching.idallen.com/
| Defend digital freedom:  http://eff.org/  and have fun:  http://fools.ca/

_______________________________________________
Mairix-users mailing list
Mairix-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/mairix-users
