Re: [Dovecot] mbox vs. maildir storage block waste

Robin Thu, 08 Nov 2012 17:54:55 -0800

The obvious caveats and qualifications apply throughout this email.

Christoph Anton Mitterer <cales...@scientia.net> wrote:
> I see... well I haven't tested AOX or dbmail so far (especially as
> they're not in Debian and I was too lazy till now to compile them)...
> 
> At least I had the impression that performance (especially in searches)
> was one of the major things these people were proud of.
> 
> 
> I'll stay tuned, whether we ever see a fully usable SQL backend for
> Dovecot :)


I wouldn't hold your breath.

It's a recurringly seductive "meme" in email circles, but the reality is that 
email is mostly unstructured data with a few fields of reasonably structured 
data (dates, from, to, maybe attachment types + filenames).  The body, which 
is the bulk of each email and the part that people really want to search 
quickly, is unstructured, and searching it is not fast with the stock "full 
text search" modules in the main SQL engines.

I gave dbmail2 a try with the MySQL 5 and 5.5 and the Postgres 8.4 and 9.1 
branches.  I dedicated a 3.4GHz 6-core AMD 1090T machine with 16GB of 
DDR3-1800 and hardware RAID local storage (12 x Seagate ES 7200RPM spindles), 
running 64 bit Slackware 13.37 with Linux 3.2 kernels built for the platform.

The performance was surprisingly bad ... doing almost everything.  Searches 
through IMAP, bulk importation of mail folders, large numbers of simultaneous 
mail deliveries, you name it.  There wasn't a task that the dbmail setup 
performed faster than Dovecot, in either low or high load situations.  When I 
tossed a test load that introduced lots of mail deliveries as well as searches 
and full folder pulls, things got really pear-shaped.  Even putting Dovecot's 
mailstore on NFS (GigE) didn't really slow Dovecot down enough to make dbmail 
competitive.

When pressed on this lack of performance, I was instructed to "add more RAM" to 
the DB machine, and that for ideal performance I should have more RAM than my 
mailbox sizes.  *sigh*  This sounds great for a very small installation, but 
this clearly is not something that scales.

I think the final humiliation was comparing the body + header searching 
performance using Timo's practically obsolete fts_squat plugin against 
dbmail's.  Wow.  Squat was multiple orders of magnitude faster.  Lucene and 
Solr are even more so when fed large datasets (mail folder hives of about 
100GB).  The SQL setups hit the obvious performance shelf once they were unable 
to maintain everything in RAM or cache.

The dbmail folk are earnest and hard-working, and I don't mean to cast the 
slightest bit of negativity on their project.  I think the assumptions about 
what SQL servers can do well often don't square with the reality of many 
applications that people try to fit them into.

In my initial round of tests, I imported 24,000 emails comprising a mere 
560MB of space.  Just about all of the non-SQL IMAP servers handled the 
importation (basically IMAP APPENDs) within 6 minutes.  dbmail2 required hours 
using MySQL, and somewhat less time (but still hours) with Postgres.
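
For anyone who wants to repeat this kind of import test, here is a rough 
sketch in Python.  It is not the script behind the numbers above; the function 
name timed_import and its parameters are made up for illustration, and it 
assumes the mail is available as an mbox file and that conn behaves like a 
logged-in imaplib.IMAP4 object.

```python
import mailbox
import time


def timed_import(conn, mbox_path, target="INBOX"):
    """APPEND every message in an mbox file to `target` and time the run.

    `conn` is anything with an imaplib.IMAP4-style append() method
    (an assumption of this sketch, not something from the original test).
    Returns (message_count, total_bytes, elapsed_seconds).
    """
    count = 0
    total_bytes = 0
    box = mailbox.mbox(mbox_path, create=False)
    start = time.perf_counter()
    try:
        for msg in box:
            raw = msg.as_bytes()
            # No flags and no internal date: the server applies its defaults.
            typ, data = conn.append(target, None, None, raw)
            if typ != "OK":
                raise RuntimeError("APPEND failed: %r" % (data,))
            count += 1
            total_bytes += len(raw)
    finally:
        box.close()
    return count, total_bytes, time.perf_counter() - start
```

Pointing the same mbox at each server in turn and comparing the elapsed times 
gives a like-for-like figure, provided the target mailbox starts out empty 
each time.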

From an old email:

> Searching INBOX #msgs = 24714
>  [NOFIND] Time=2.072423, matches=24714 <--- this should be zero *BUG*
>  [date] Time=2.07519, matches=24714 <--- this is correct
>  [here] Time=2.072075, matches=24714 <--- this should be about 30% of total # of msgs *BUG*
> 
> Does dbmail break IMAP SEARCH TEXT (i.e., search both body + headers)?  Is 
> this a result of relying on MySQL's search algorithms in text-like fields? 
> I'm still puzzled, because I can't believe that 'here' appears in EVERY 
> email.  It looks like dbmail's returning EVERY email on a SEARCH TEXT.  This 
> is not correct operation.
> 
> When I alter the search to use "FROM" as the key instead of "TEXT", the 
> results are more discriminating and meet expectations.
> 
> Searching INBOX #msgs = 24714
>  [NOFIND] Time=2.161049, matches=0
>  [james] Time=2.273255, matches=1049
>  [here] Time=2.165406, matches=2
> 
> Not that it matters, but it's much slower than Dovecot's fts_squat for 
> substring searches.
> 
> Dovecot's fts_squat IMAP SEARCH TEXT results are:
> 
> Searching INBOX #msgs = 55731
>  [Updating Index] Time=78.184637 (66% of the mailbox unindexed at start)
>  [NOFIND] Time=0.045654, matches=0
>  [date] Time=0.13364, matches=55731
>  [here] Time=0.069091, matches=24663
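
Timings in this format can be collected with a few lines of Python.  The 
following is only a sketch, not the tool that produced the figures quoted 
above: time_searches and report are hypothetical names, and conn is assumed to 
behave like a logged-in imaplib.IMAP4 object.

```python
import time


def time_searches(conn, mailbox, terms, key="TEXT"):
    """Run one IMAP SEARCH per term; record wall-clock time and match count.

    `conn` is anything with imaplib.IMAP4-style select() and search()
    (an assumption of this sketch).  Terms must not contain double quotes.
    """
    typ, data = conn.select(mailbox, readonly=True)
    if typ != "OK":
        raise RuntimeError("cannot select %s: %r" % (mailbox, data))
    total = int(data[0])  # select() reports the message count
    results = []
    for term in terms:
        start = time.perf_counter()
        typ, data = conn.search(None, key, '"%s"' % term)
        elapsed = time.perf_counter() - start
        if typ != "OK":
            raise RuntimeError("SEARCH failed for %r: %r" % (term, data))
        # The reply is a space-separated list of message numbers, or empty.
        matches = len(data[0].split()) if data and data[0] else 0
        results.append((term, elapsed, matches))
    return total, results


def report(mailbox, total, results):
    """Format results in the same layout as the listings above."""
    lines = ["Searching %s #msgs = %d" % (mailbox, total)]
    for term, elapsed, matches in results:
        lines.append(" [%s] Time=%f, matches=%d" % (term, elapsed, matches))
    return "\n".join(lines)
```

Passing key="FROM" instead of the default "TEXT" gives the header-only 
comparison.  A term that cannot occur in any message (NOFIND above) is a cheap 
sanity check: anything other than zero matches means the server is not really 
filtering.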

FWIW, I found Postgres to be faster than MySQL (5 and 5.5) where lots of 
write-commits were involved on the same exact setup, though running 5.5 with a 
hand-rolled config file using metrics supplied by a dbmail/MySQL guru helped a 
great deal for size(data_set) < size(PHYSICAL MEMORY) cases.  MySQL "got 
close" to PSQL's performance when I did crazy things like remove filesystem 
journaling, write barriers, etc. on the mail db mountpoint.  Obviously, this is 
desperation talking.

I concede that the motivations behind SQLising mail storage extend to 
administration/replication and other non-performance/scalability aspects.  I 
suspect that, weighed against those other considerations, "good enough" 
performance may make a SQL approach attractive enough for some people to use 
it.

I suspect a "NoSQL" key-value store type of database would offer much better 
performance than SQL RDBs, since most of the assumptions behind the storage and 
access patterns of email don't really fit into the SQL RDB model very 
efficiently.

dbmail's author and a couple of key dbmail users are very active and responsive 
on their mailing list, and bend over backwards to try to help new users with 
tuning and performance related problems.

I simply don't have enough of a budget for populating my DB machines with TBs 
of RAM to make it work as quickly as I need it to for my midrange mail store 
(10TB).

Good luck!

=R=
