I beat on mairix 0.20 to get rid of some of the bugs, adjusted the shift bits to give myself a 24-bit message count, and managed to make it index most of my 1.6 million emails in mbox form:
Wrote 1641908 messages (32838160 bytes tables, 0 bytes text)
Wrote 51 mbox headers (816 bytes tables, 2223 bytes paths)
Wrote 26270528 bytes of mbox message checksums
To: Wrote 329442 tokens (2635536 bytes tables, 4917712 bytes of text, 13753058 bytes of hit encoding)
Cc: Wrote 98270 tokens (786160 bytes tables, 1394159 bytes of text, 2555590 bytes of hit encoding)
From: Wrote 449652 tokens (3597216 bytes tables, 7126797 bytes of text, 14880802 bytes of hit encoding)
Subject: Wrote 173835 tokens (1390680 bytes tables, 1833769 bytes of text, 14841880 bytes of hit encoding)
Body: Wrote 6543874 tokens (52350992 bytes tables, 108914569 bytes of text, 304525979 bytes of hit encoding)
Attachment Name: Wrote 8061 tokens (64488 bytes tables, 142214 bytes of text, 66916 bytes of hit encoding)
(Threading): Wrote 1190729 tokens (9525832 bytes tables, 54171674 bytes of text, 15753681 bytes of hit encoding)

I had to split one mbox into two 1GB halves to avoid out-of-memory errors. The mairix program hit over a million page faults on my 1.5GB system.

top - 00:38:05 up 37 days, 23:01, 7 users, load average: 8.86, 6.39, 5.98
Tasks: 253 total, 2 running, 246 sleeping, 5 stopped, 0 zombie
Cpu(s): 5.6%us, 3.3%sy, 0.7%ni, 0.0%id, 88.5%wa, 0.7%hi, 1.3%si, 0.0%st
Mem: 1555952k total, 1502380k used, 53572k free, 3844k buffers
Swap: 2939884k total, 1548100k used, 1391784k free, 233388k cached

 PID USER    NI  VIRT  RES S %CPU %MEM    TIME+  SWAP nFLT COMMAND
3983 idallen  0 2119m 1.2g D  1.0 80.9 16:20.76  891m 1.1m mairix

I have some old email that has a "From " line but no "From:" or "Date:" lines. I don't see an easy way to make mairix use the values from the "From " line in place of the missing "From:" and "Date:". A topic for future work.

I see there is an updated version. I'll roll forward to that at some point. The new nvp.c still doesn't do bounds checking when filling the name/value arrays. I have at least one message that faulted there.
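A minimal sketch of the kind of bounds check I mean. This is not the actual nvp.c code: the struct layout, the sizes MAX_PAIRS and MAX_FIELD, and the function names are all made up for illustration. The idea is only that a pair that will not fit gets refused instead of written past the end of the array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PAIRS 4   /* hypothetical capacity, kept small for illustration */
#define MAX_FIELD 16  /* hypothetical per-field buffer size, including the NUL */

struct pair {
    char name[MAX_FIELD];
    char value[MAX_FIELD];
};

struct pair_list {
    struct pair items[MAX_PAIRS];
    size_t count;
};

/* Copy src_len bytes of src into dst and NUL-terminate the result.
 * Returns 0 on success, -1 if the bytes plus the NUL would not fit. */
static int copy_bounded(char *dst, size_t dst_size,
                        const char *src, size_t src_len)
{
    assert(dst != NULL && src != NULL && dst_size > 0);
    if (src_len >= dst_size)
        return -1;
    memcpy(dst, src, src_len);
    dst[src_len] = '\0';
    return 0;
}

/* Append one name/value pair. Returns -1, leaving count unchanged, if the
 * list is full or either field is too long for its buffer. */
static int pair_list_add(struct pair_list *list,
                         const char *name, size_t name_len,
                         const char *value, size_t value_len)
{
    assert(list != NULL);
    if (list->count >= MAX_PAIRS)
        return -1;
    struct pair *p = &list->items[list->count];
    if (copy_bounded(p->name, sizeof p->name, name, name_len) != 0)
        return -1;
    if (copy_bounded(p->value, sizeof p->value, value, value_len) != 0)
        return -1;
    list->count++;
    return 0;
}
```

A caller that gets -1 back can skip the oversized parameter or give up on that header, which seems better than faulting on one odd message.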
The split_and_splice_header() will walk outside the message boundary looking for a blank line and the start of the message body, and that can cause a fault. I think it needs to be limited to look within the passed-in message length. (That "got null character" is probably caused by walking off the end of the file mapping?)

In several places sizeof("string") is used thinking it returns the same length as strlen("string") - it doesn't, and you have to subtract one for the NUL byte at the end.

valgrind wasn't too happy with the program. I don't know enough about what it says to comment or fix it yet. A sample on a small run:

==1593== LEAK SUMMARY:
==1593==    definitely lost: 98,885 bytes in 4,921 blocks.
==1593==    indirectly lost: 1,401,842 bytes in 3,066 blocks.

This is a great program.

-- 
| Ian! D. Allen - [EMAIL PROTECTED] - Ottawa, Ontario, Canada
| Home Page: http://idallen.com/ Contact Improv: http://contactimprov.ca/
| College professor (Open Source / Linux) via: http://teaching.idallen.com/
| Defend digital freedom: http://eff.org/ and have fun: http://fools.ca/
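P.S. A rough sketch of the two fixes suggested above for split_and_splice_header() and the sizeof usage. This is not mairix code: find_body_start(), has_prefix() and LIT_LEN() are hypothetical names, and the search assumes LF-only line endings as in a typical mbox file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of a string literal without its trailing NUL.
 * sizeof counts the NUL byte, strlen does not. */
#define LIT_LEN(s) (sizeof(s) - 1)

/* Return the offset of the first body byte (just past the first blank
 * line) within msg[0..len), or len if no blank line lies inside the
 * message. The loop never reads at or beyond msg[len]. */
static size_t find_body_start(const char *msg, size_t len)
{
    assert(msg != NULL || len == 0);
    for (size_t i = 0; i + 1 < len; i++) {
        if (msg[i] == '\n' && msg[i + 1] == '\n')
            return i + 2;
    }
    return len;
}

/* True if the len-byte buffer begins with the prefix_len-byte prefix.
 * The length check comes first so memcmp never reads past the buffer. */
static int has_prefix(const char *buf, size_t len,
                      const char *prefix, size_t prefix_len)
{
    return len >= prefix_len && memcmp(buf, prefix, prefix_len) == 0;
}
```

For example, has_prefix(buf, len, "From ", LIT_LEN("From ")) compares five bytes, where passing sizeof("From ") would compare six and include the NUL.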