Re: [rlug] mdir or mbox

Luke Skywalker Tue, 22 Aug 2006 00:46:54 -0700

On 8/17/06, Teo <[EMAIL PROTECTED]> wrote:

Hello,


could you please tell me whether it is more advantageous to use mdir or
mbox when running postfix+clamav+amavis+spamassassin?
What is the difference between the two alternatives? Performance?
Flexibility? Integrity?
If mdir is used, how are the mails structured? Is a directory created
for each message?


Thank you!

_______________________________________________
RLUG mailing list
RLUG@lists.lug.ro
http://lists.lug.ro/mailman/listinfo/rlug



I am posting an interesting little article that compares storing mail
in the file system with storing mail in databases (mbox, exchange,
etc.) and explains why the first solution is much more advantageous:


****

http://www.memoryhole.net/~kyle/databaseemail.html

Database Back-end for Email
The (mostly) bad idea that just won't die
Introduction
There has been a recent discussion on the qmail mailing list regarding
using a relational database (a la MySQL or Postgres or Oracle) as the
back end storage for email. This is an idea that comes up many times,
in many contexts, and I am writing this to hopefully educate those who
have come up with the same idea again. Thanks to Charles Cazabon and
the rest of the qmail mailing list for the ideas contained here.

The primary motivation typically behind desires to store email in a
database tends to be a desire to search through email quickly. This is
an understandable desire, and the urge to go to a database is obvious:
databases are built for searching. Delivering mail to a database has
been done many times by many people. One example is the DBMail
project. Another example is the Microsoft Exchange server, which uses
a relational database to store everything.

However, storing email in a database is, generally, a bad idea UNLESS
you have very particular, very unusual, very specific requirements
that justify using a database and which overcome the drawbacks
inherent in using a database. Storing email in a database rather than
a file-system is usually a bad idea for many reasons, which tend to
fall into two categories: Databases Are Bad for Email, and Email is
Bad for Databases.

Databases Are Bad for Email
In general, a good email system must have certain specific characteristics:

Reading email must be fast.
Many users must be able to access their mail at the same time, quickly.
Backup must be easy.
Corruption must be easy to recover from.
Let's take them one at a time:

1. Reading email must be fast.
In a file-system using (for example) Maildir format for storing
messages, the way reading an email works is: I send a request for a
message to (for example) my POP3 server, which trivially constructs
the path of the file I want, and asks the OS for it. The OS fetches
the message (say, mmap()'s it into the server's process space), and
returns, at which point the POP3 server sends it to me (which
transfers the data back into the OS to talk to the network). The speed
of this operation can be significantly improved if the server and OS
both support the sendfile() interface. If they can, then this
operation is much simpler and MUCH faster: the POP3 server tells the
OS to send the file containing the email to me, and the OS returns to
the POP3 server when it has been done.
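
The read path described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the function name
serve_message and the lookup in the cur/ subdirectory are hypothetical and
not taken from any particular POP3 server. Python's socket.sendfile() uses
the OS sendfile() call where it is available and falls back to plain
send() otherwise.

```python
import os
import socket


def serve_message(maildir: str, filename: str, conn: socket.socket) -> int:
    """Send one stored message to a connected client; return bytes sent."""
    # The path is built directly from the mailbox and the file name:
    # no index or meta-data lookup is needed to locate the message.
    path = os.path.join(maildir, "cur", filename)
    with open(path, "rb") as msg:
        # socket.sendfile() lets the kernel copy file -> socket when it can.
        return conn.sendfile(msg)
```

The server never copies the message body through its own buffers when the
kernel path is taken, which is the saving the paragraph above describes.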

This operation (reading an email) is more difficult if the message is
stored in even a simple database (such as an mbox-format file, or an
sqlite file). In such a case the POP3 server is forced to do several
disk accesses (going in and out of the kernel several times) to find
out first where the message is stored in the file by reading the
database meta data, and then fetching the message out of the file into
a local buffer, and then sending that buffer out to the client.
Reading this meta data is slow even if the entire database has been
mmap()'d into the server's memory space, because the whole thing will
most likely not fit into cache or memory at the same time, a condition
which is exacerbated if there are many users trying to read their
email at the same time. The more complex the database is, the more
complex this operation becomes, and thus the slower it becomes.
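
For contrast, here is a rough Python sketch of what locating one message in
an mbox file involves. The helper names are hypothetical, and the separator
test is deliberately simplified (real mbox variants also care about the
preceding blank line and about quoting body lines that start with "From ").
The point is only that the file must be scanned before the message can be
read.

```python
import os


def mbox_message_offsets(path):
    """Return the byte offset at which each message in an mbox file starts."""
    offsets = []
    position = 0
    with open(path, "rb") as mbox:
        for line in mbox:
            # Each message begins with a "From " separator line.
            if line.startswith(b"From "):
                offsets.append(position)
            position += len(line)
    return offsets


def read_mbox_message(path, index):
    """Fetch message number `index` (0-based): scan first, then seek and read."""
    offsets = mbox_message_offsets(path)  # full pass over the file
    start = offsets[index]
    if index + 1 < len(offsets):
        end = offsets[index + 1]
    else:
        end = os.path.getsize(path)
    with open(path, "rb") as mbox:
        mbox.seek(start)
        return mbox.read(end - start)
```

A server can cache the offsets, but then it has to keep that cache in step
with every delivery and deletion, which is the extra meta-data work the
paragraph above refers to.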

2. Many users must be able to access their mail at the same time, quickly.
The last word of this requirement, "quickly," is the difficult part,
of course. Generally, which solution is best depends on what kind of
concurrency you're going to need to handle. The solution is different
if your user base is 2,000 people checking their email occasionally
versus 50,000 people whose clients check their email every five
minutes.

It's possible to use a poor OS/file-system combination with this,
though this sort of thing gets down to very basic sysadmin training.
For example, many OSes (e.g. Linux before 2.6) use a single lock for
each file-system. This makes dealing with concurrent access slower
(and harder) than it needs to be. Choosing a good OS/file-system
implementation can improve the available concurrency quite a bit. For
example, Linux 2.6 with an XFS partition is capable of far higher
concurrency than Linux 2.4 with EXT2 partition. Storing your email on
a RAID system can also help improve the available concurrency,
depending on what variety of RAID you use (RAID 1 is better than RAID
0, for example, because it allows multiple requests to use different
disks for their accesses).

With databases, the questions you need to ask are very similar, and
just as file-system concurrency depends on the file-system, database
concurrency depends on the database. However, the documentation is
often harder to come by. There are additional important questions for
databases, such as: what's the format of the back-end, what's the
front-end like, and what kind of concurrency does it support? What are
the performance characteristics of the database? Is it optimized for
insertion, deletion, or for selection? A big database, like Oracle, is
going to support more concurrent accesses than something like sqlite,
but is likely to require heavy-duty hardware in order to do it.

3. Backup must be easy.
Backing up a file-system is a very well studied procedure, and is
basically trivial. There are the basic tools (tar, rsync, etc.),
and then the more advanced solutions (amanda, etc.).
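
As an illustration of how selective a file-based restore can be, the sketch
below pulls one message out of a tar backup using Python's standard tarfile
module. The function name and the archive layout (one member per message,
e.g. Maildir/cur/msg1) are assumptions made for the example.

```python
import tarfile


def read_message_from_backup(archive_path: str, member_name: str) -> bytes:
    """Return the contents of a single message file stored in a tar backup."""
    # "r:*" lets tarfile detect the compression (none, gzip, bz2, xz) itself.
    with tarfile.open(archive_path, "r:*") as backup:
        member = backup.extractfile(member_name)  # KeyError if not in archive
        if member is None:
            raise ValueError(f"{member_name} is not a regular file")
        return member.read()
```

Nothing else in the archive has to be unpacked or replayed to get that one
message back.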

Backing up a database is significantly more of a pain, especially if
you want to be able to do it without taking the database off-line
temporarily or want to be able to restore from backup very easily and
very selectively. The reason is, of course, the database back-end file
is not guaranteed to be in a consistent state at any time while the
database is running: the database front-end caches LOTS of things that
it hasn't necessarily written back to disk. The back-end file is, in
busy databases, almost never in a consistent state.

The easiest safe way to do a backup of, for example, MySQL is to use
the "mysqldump" utility, which outputs the database as a text file
full of all of the SQL statements necessary to regenerate the database
from scratch. This is not exactly convenient if you want to do
something like recover a single message from the backup. Some
databases give you things like replication and such, and have complex
solutions to the backup problem that function much better than
mysqldump. These solutions are, without exception, extremely
complicated and tend to require significant hardware investments (they
tend to be part of "high availability" packages).

4. Corruption must be easy to recover from.
There are times when unexpected events, or software bugs, cause data
corruption of some kind. A typical example is a power outage or surge,
causing your system to shut down suddenly. This sort of thing causes
problems primarily because of useful speed enhancements like
write-caching. In a file-system, both the meta-data (i.e. information
about the files stored) and the files themselves may become out of
sync, in which case the corruption may be unrecoverable. Fortunately,
this is a well-studied problem for file-systems, and there are several
common and easy solutions. For example, journaling file-systems
(available for all good server operating systems) ensure that the
meta-data is never in an unrecoverable inconsistent state.
Additionally, in an email system based on the Maildir storage format,
messages are not considered "delivered" until they are fully flushed
to disk, making them essentially crash-proof. With a well-designed
mail server (such as qmail), even the mail queue is protected against
unexpected crashes, which covers all of the places where mail may be
lost unexpectedly.
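
The Maildir delivery rule mentioned above (a message only counts as
delivered once it is fully on disk) can be sketched as follows. This is a
simplified Python illustration, not the code of qmail or any other MTA: the
function name deliver and the unique-name scheme are assumptions, and it
uses rename() where some implementations use link() followed by unlink().

```python
import os
import socket
import time


def deliver(maildir: str, message: bytes) -> str:
    """Store a message in maildir/new only after it has been flushed to disk."""
    # Unique file name from time, process id and host name (illustrative).
    unique = f"{time.time_ns()}.{os.getpid()}.{socket.gethostname()}"
    tmp_path = os.path.join(maildir, "tmp", unique)
    new_path = os.path.join(maildir, "new", unique)

    # "x" mode refuses to overwrite an existing file of the same name.
    with open(tmp_path, "xb") as tmp_file:
        tmp_file.write(message)
        tmp_file.flush()
        os.fsync(tmp_file.fileno())  # force the data out of the OS cache

    # Readers only look in new/ and cur/, so they see the whole file or none.
    os.rename(tmp_path, new_path)
    return new_path
```

A crash before the rename leaves at most a stray file in tmp/, which
readers ignore; a crash after it leaves a complete message in new/.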

Even if, say, you were using something like an mh back-end or a simple
database like an mbox back-end (assuming you're using a journaled
file-system), the worst that can happen is that messages in the middle
of delivery (in the mh case) or an entire mailbox (in the mbox case)
can be corrupted. The rest of the system is isolated from corruption.
Additionally, on most OSs, disk caches get flushed every couple of
minutes, so the potential corruption is limited even further.

If power goes out on a database, however, you are at significantly
more risk. More so than the OS, databases cache many many things, all
of which vanish when the power goes out. For the same reason backups
are hard, this is essentially instant corruption of your database,
which means you're going to have to rely heavily on a high-quality,
frequent, backup system for any sort of reliability through unexpected
events (power outages, kernel panics, what have you). There's been
some work on improving this problem with databases by stealing ideas
from file-systems (namely, journaling), but it's still a severe
problem in most databases. The only other way to address unpredictable
events (which is a losing game) is to go to extreme extents like
putting your system on battery backups, in an underground waterproof
chamber somewhere far from a seismic fault.

File-systems are also more convenient than databases for dealing with
problems. Namely, there are many filtering, parsing, searching, and
editing utilities for email files that have been available and refined
for decades. Storing mail in a database means that you cannot rely on
anyone else's solution for pretty much anything tricky. DBMail, for
example, gives you utilities for putting mail into the database and
for getting mail back out, but doing something like correcting
messages that got mangled by a poorly implemented filter, or searching
for a nonstandard header (to name a few things off the top of my
head), is something you'd have to write and debug your own tools for.
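
With mail stored as plain files, a search for a nonstandard header takes
only a few lines with off-the-shelf tooling. The sketch below uses
Python's standard mailbox module; the function name and the example header
X-Filter-Tag are made up for illustration.

```python
import mailbox


def find_by_header(maildir_path: str, header: str, needle: str) -> list:
    """Return the keys of messages whose `header` value contains `needle`."""
    box = mailbox.Maildir(maildir_path, create=False)
    matches = []
    for key, message in box.items():
        value = message.get(header)  # None when the header is absent
        if value is not None and needle in str(value):
            matches.append(key)
    box.close()
    return matches
```

For example, find_by_header("/home/user/Maildir", "X-Filter-Tag", "mangled")
would list the messages a misbehaving filter had tagged (the path and the
header are hypothetical).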

Email is Bad for Databases
While databases are not good for storing email (or at least, aren't as
good as file-systems at doing so), email is not particularly good for
storing in databases, for several reasons.

Perhaps the most understandable of these reasons is that email is not
relational data. While email does have some structured meta-data (like
sender, receiver, subject, date, and so forth), its general structure
is as a blob of unstructured text. Standard files in file-systems have
a similar amount of meta-data (name, access times, modify times,
owner, etc.), and are similarly unsuited for databases.

The content of email is essentially unstructured, as I said. This sort
of storage is not what relational databases are designed to handle
efficiently. The BLOB datatype in relational databases is intended for
very limited use, such as encoding a small icon for a data type. While
it can be used as the only thing that is stored in a database,
databases are not structured for storing generic BLOBs efficiently.

The typical size for a frequently used database is in the range of
several hundred megabytes. I worked on a system using a database for tracking
ticket sales for moderately busy theaters, and the database took
several months to get above a hundred megabytes. The volume of email,
on the other hand, is typically measured in gigabytes. Even a modestly
active user can easily accumulate several gigabytes of mail. While
there are databases that can handle multiple terabytes of data (such
as Oracle), they all require serious hardware, complex hardware, and
lots of administration to keep running at their peak.

But What About Searching Speed?
When trying to solve a problem like "searching speed," you should not
merely cast about and grab any random buzzword (like "database") and
throw out your existing solution in favor of this buzzword. Instead,
first carefully examine your current system and determine why it is
being slow. What is the common operation that is slow (and no, I don't
mean "searches", I mean things like "searching for the word X in the
body of all messages" versus "searching for all mails from user X").
What is the bottleneck? What is the limiting factor?

If your method of accessing the email is a POP3 server, your options
are limited: by the POP3 protocol, you can only fetch messages, and
then search them on your own time. Storing those messages, on the
server, in a database doesn't help with that. Where things like
"searching speed" become issues is when the mail is stored on the
server, as with an IMAP server.

In general, there are several things that might make IMAP searching
slow. The most common problem: you're not actually using IMAP's SEARCH
command, but are instead fetching every message and searching it by
hand (treating the IMAP server like a POP3 server). This is a common
behavior of many mail clients, in large part because messages are
often encoded in strange and unusual ways (PGP-encryption, Base64
encoding, and unusual charsets (like UTF-8) come to mind). A database
back-end will not help with this, just as it will not help with the
POP3 protocol. Using a different mail client, on the other hand, will
help. That said, there are valid reasons not to use the IMAP
SEARCH command; one of them being those unusually encoded emails. Most
IMAP servers simply treat mail as blobs of ASCII text (rather than
decoding each one into an internal representation that can handle all
possible unusual characters).

There are other limitations to consider. Maybe your file-system isn't
designed for lots of random accesses; for example, perhaps you should
move to an EXT3 or XFS or ReiserFS file-system instead of FAT32 or
NTFS. Maybe your file-system isn't set up properly; for example,
turning on read-caching, increasing the size of the read-cache, and/or
turning on directory hashing/indexing will usually help a great deal.
Maybe your OS has a poor IO scheduler and you'd be better off
switching to a different one; for example, OpenBSD's IO system is very
slow, whereas Linux's can be tailored to the types of access patterns
on your system. Another possibility is that your storage system
(disks, RAID, SAN, NAS, etc.) is the bottleneck: perhaps getting a
faster one (or configuring the one you have differently) is the real
answer. The default drive settings (i.e. hdparm) on Linux are not
always optimal; you may not be using the fastest RAID level for your
needs, and SAN/NAS storage may have its own problems (maybe you need
to upgrade to a gigabit or faster link between your server and the
storage device).

Finally, examine your IMAP server software itself. It may not be able
to handle the concurrency you want, or it may simply not be configured
properly to handle what you want it to do: never trust the defaults,
if your goal is to extract every last drop of performance. Also, many
IMAP servers have very simplistic SEARCH implementations; for example,
BincIMAP's SEARCH implementation is roughly equivalent to grep. You
may benefit from more advanced indexing and search caching techniques,
such as those found in CyrusIMAP.

Conclusion
In other words, if you're trying to store things that can be stored as
files (like email) in a safe, fast, reliable way, you need a really
REALLY good reason to avoid using a FILE-system. File-systems have
been refined and improved since the dawn of disk drives to do one
thing really well: store files. There's basically nothing better
suited for the job.

Databases can be used to improve the speed of searching mail, but that
doesn't mean that the database must be used on the mail server. The
database can be used as a cache of mail meta-data. For example, the
mutt mail client does this, to great effect.
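
A minimal sketch of that idea in Python, using the standard sqlite3 and
mailbox modules: the database holds only a few headers per message, and the
messages themselves stay in the Maildir. This is not how mutt implements
its cache; the table layout and function names are assumptions made for
the example.

```python
import mailbox
import sqlite3


def build_header_cache(maildir_path: str, db_path: str) -> sqlite3.Connection:
    """Index sender and subject per message key; the mail stays in files."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS headers "
        "(msg_key TEXT PRIMARY KEY, sender TEXT, subject TEXT)"
    )
    box = mailbox.Maildir(maildir_path, create=False)
    for key, message in box.items():
        db.execute(
            "INSERT OR REPLACE INTO headers VALUES (?, ?, ?)",
            (key, str(message.get("From", "")), str(message.get("Subject", ""))),
        )
    box.close()
    db.commit()
    return db


def keys_from_sender(db: sqlite3.Connection, sender: str) -> list:
    """Look up message keys by sender without opening any mail file."""
    rows = db.execute(
        "SELECT msg_key FROM headers WHERE sender LIKE ? ORDER BY msg_key",
        (f"%{sender}%",),
    )
    return [row[0] for row in rows]
```

If the cache is lost or corrupted it can simply be rebuilt from the files,
so it adds search speed without becoming the system of record.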

In any case, if the answer to your email question is to replace the
back-end with a relational database, chances are you're asking the
wrong question.

Acknowledgements
Contributors to this page:

Charles Cazabon
Karl Vogel
