Re: [Dovecot] Please advise on very fast search

2011-11-17 Thread Stan Hoeppner
On 11/15/2011 1:02 PM, Timo Sirainen wrote:
 On Tue, 2011-11-15 at 12:26 -0600, Stan Hoeppner wrote:
 
 This is why I recommended mbox in the first place.  If your only writes
 to these mailbox files are appends of new messages, mbox is the best
 format by far.  It's faster at appending than any other format, and it's
 faster for searching than any other.
 
 Just as long as you're not simultaneously trying to read and write the
 mbox file (or just write in 2+ sessions). Then there's a lot of waiting on
 locks. (mdbox has no read locks, and its write locks are very short
 lived.)

Of course.  My understanding of Alexander's workflow is that copies of
all daily new mail are written to an IMAP mailbox via some MTA bcc rule
or sieve rule.  A nightly script moves the daily mail to another mailbox
created and named by date.  These named mailboxes are then used for
backup and the search function, but are never written to again.  So I
assume there is no simultaneous read/write of the archive mailboxes he
performs searches on.  It's possible I don't fully understand
Alexander's work flow yet.

-- 
Stan


Re: [Dovecot] Please advise on very fast search

2011-11-16 Thread Stan Hoeppner
On 11/16/2011 12:15 AM, Alexander Chekalin wrote:
 Hello, Stan,
 
 This is why I recommended mbox in the first place.  If your only writes
 to these mailbox files are appends of new messages, mbox is the best
 format by far.  It's faster at appending than any other format, and it's
 faster for searching than any other.
 
 I am now seriously considering mdbox because of its nice self-regulation.
 After all, I believe mdbox should do file compression on its own, with no
 cron scripts required.

mbox and mdbox each have strengths and weaknesses.  mbox will compress
with a higher ratio than mdbox.  You already have a nightly script that
moves all mail from the day into a new file.  Piping that through gzip
or bzip2 is a no brainer.  It'll add one line to your existing script,
if that.  Dovecot will decompress the file transparently when you access
it via IMAP.  And again, since it's a single file, searching it is much
faster.  With mbox you will have a single file for each day of emails.
This seems ideal for archive purposes, one file per day.
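
The compression step added to the nightly script could look roughly like the following Python sketch. It is only an illustration: the function names and the choice to write a separate ".gz" copy next to the original are assumptions, not part of anyone's actual script. How Dovecot's zlib plugin expects compressed mbox files to be named should be checked against the documentation for the version in use.

```python
import gzip
import hashlib
import shutil
from pathlib import Path


def gzip_daily_mbox(mbox_path: Path) -> Path:
    """Write a gzip-compressed copy of one day's mbox file next to the original.

    The original is left untouched; removing or renaming it is left to the
    nightly script once the compressed copy has been verified.
    """
    gz_path = mbox_path.with_name(mbox_path.name + ".gz")
    with open(mbox_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams, so multi-GB files are fine
    return gz_path


def _sha256(stream) -> str:
    """Hash a binary stream in 1 MiB chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(1 << 20), b""):
        digest.update(chunk)
    return digest.hexdigest()


def gzip_roundtrip_ok(mbox_path: Path, gz_path: Path) -> bool:
    """Return True if the compressed copy decompresses to the original bytes."""
    with open(mbox_path, "rb") as src, gzip.open(gz_path, "rb") as packed:
        return _sha256(src) == _sha256(packed)
```

Checking the round trip before deleting anything costs one extra read of the file, which seems a fair price for an archive that is never written again.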

mdbox does fully transparent de/compression which is nice.  The downside
is that Dovecot does dbox compression on a per email basis, not a per
file basis.  So your compression ratio will be much less than with mbox,
especially with bzip2 which works best on files over 900KB in size.
Most emails are less than 8KB.  Using mdbox will yield multiple files
per day of emails instead of just one.
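
The per-email versus per-file difference can be illustrated with a small Python experiment. This is a sketch with synthetic messages (all addresses and header values are made up), and it uses zlib on in-memory data rather than Dovecot's own compression path, so the actual ratios on real mail will differ.

```python
import zlib


def make_message(i: int) -> bytes:
    """Build a small synthetic email; headers are nearly identical across messages."""
    return (
        f"From sender{i % 7}@example.com Thu Nov 17 10:00:00 2011\n"
        f"Return-Path: <sender{i % 7}@example.com>\n"
        f"Received: from mail.example.com by backup.example.com; "
        f"Thu, 17 Nov 2011 10:00:00 +0000\n"
        f"From: sender{i % 7}@example.com\n"
        f"To: user{i % 11}@example.com\n"
        f"Subject: Status report {i}\n"
        f"Message-ID: <{i}@example.com>\n"
        "\n"
        f"Hello, this is the body of status report number {i}.\n\n"
    ).encode("ascii")


messages = [make_message(i) for i in range(500)]

raw_size = sum(len(m) for m in messages)
# mdbox-style: every message is compressed on its own.
per_message_size = sum(len(zlib.compress(m)) for m in messages)
# mbox-style: the whole day's file is compressed as one stream.
whole_file_size = len(zlib.compress(b"".join(messages)))

print("raw bytes:          ", raw_size)
print("compressed per mail:", per_message_size)
print("compressed as file: ", whole_file_size)
```

The single stream wins because the compressor can reuse the repeated header text across messages, which it cannot do when each small message starts from an empty dictionary.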

Either format is much better than maildir for archiving.

 It's an archive.  You're not going to use maildir so you don't need
 random IOPS performance.  Thus RAID5/6 are a much better fit for an
 archive as you get better read performance, with more than adequate
 write performance, and you use fewer disks.  And as this is an archive,
 you don't need real time automatic/transparent compression.  Thus I
 recommend something like:

 1.  Debian 6 w/linux-image-2.6.39-bpo.2-amd64 or a custom rolled
  2.6.39 or later kernel
 2.  hardware RAID5 w/large (2TB) SATA disks, 512B native sectors
  e.g. MegaRAID SAS 9261-8i, 4 Seagate Constellation ES ST2000NM0011
  Specify a strip size of 256KB for the array
  Perma set /sys/block/sdX/read_ahead_kb to 512 so you're reading
  ahead 1024 sectors at a time instead of the default of 256.  This
  will speed up your searches quite a bit.
 3.  XFS filesystem on the RAID device, created with mkfs.xfs defaults
 4.  mbox w/zlib plugin.  Compress daily files each night with a script
 5.  You don't need LVM with a good RAID card (or with mdraid).  This
  controller can expand the RAID5 up to 8 drives (up to 32 drives max
  using SAS expanders)
 
 We are considering an HP DL180 G6 server with 8 or 14 drive bays

The P410 tops out at 8 drives, so get the 8 drive model.  Start with 4 x
2TB drives in RAID5.  Add 4 more drives when you need the capacity, and
when drive prices are back down to normal (see below).

http://h18004.www1.hp.com/products/quickspecs/13248_na/13248_na.html

 (the base model prices are about equal, but additional drives add to the cost)

Especially right now in 2011.  Flooding in Thailand, where 25% of the
world's drives are produced, has doubled the cost of all hard drives
worldwide.  Now is a horrible time to buy spinning drives.  I've read it
may be 12 months before prices start coming back down...

 with HP Smart Array P410 RAID controller (some servers are equipped with
 this controller by default) with 256 MB battery-backed cache, but I'll
 check your suggestions!

The P410 should be fine for a dedicated archive server.

 How much memory should I plan for the server? You're talking about an AMD64
 OS image, and 64-bit software is likely to consume more memory than
 32-bit, so it looks like you're talking about a pretty large amount of RAM.
 I don't believe that's necessary, or maybe I'm wrong?

The memory footprint of 64bit binaries is nothing to worry about.  The
additional amount consumed is more than offset by the performance gained
with direct access to RAM above 4GB compared to the performance of PAE.

Keep in mind that 90% of your memory will be eaten by Linux buffer
cache.  Your binaries will account for less than 5% of your
RAM consumption.  If I understand correctly how you will use this
archive server, then 8GB should be plenty.  8GB is standard on the 8
drive DL180 G6.

http://h18004.www1.hp.com/products/quickspecs/13248_na/13248_na.html

 Problem is I have no experience with XFS and not sure I can tune it in
 the best way, so I'll go with mkfs.xfs defaults, I think.

With only 4 drives and using a P410 w/cache and RAID5, doing manual XFS
tuning isn't necessary for good performance, especially for an archive
application which is data heavy, not metadata heavy.  Setting
sunit/swidth to match the RAID5 layout may increase performance slightly
due to stripe aligned writes, but not enough that I'd worry about it.
Just use the mkfs.xfs defaults.  If you get the BBWC for the P410,
enable the controller write cache, and mount 

Re: [Dovecot] Please advise on very fast search

2011-11-15 Thread Stan Hoeppner
On 11/14/2011 3:16 PM, Alexander Chekalin wrote:
 Locking issues with mbox are the reason for my long-lasting love affair with
 maildir,

Same reason most others fell in love with it.  Many now want to divorce
maildir, as the cost of the storage to maintain acceptable performance
is now too high.

 and it has lasted for many years. OK, life's lessons are like this: learn
 something and move on with it ;) even if it's a new old thing. Thank you for
 pointing that out!

Many old UNIX gurus still use mbox, not maildir, and never will.  If you
ask them why they'll likely say you don't use a screwdriver to drive a
nail do you?

 What I had doubts about is the default rotate size of 2M, since I'm used to seeing
 pretty reasonable default settings throughout the Dovecot config. 32 or 64 is much
 closer to what I'd personally prefer.

Given the fact that we're talking about an archive server, you'd be
better off using a very large mdbox file size, say 1GB.  You're never
deleting individual messages from this archive correct?  No expunges?

This is why I recommended mbox in the first place.  If your only writes
to these mailbox files are appends of new messages, mbox is the best
format by far.  It's faster at appending than any other format, and it's
faster for searching than any other.

 What I also have to choose now is the OS and FS for the archive. I'm seriously
 thinking about ZFS with compression (in fact it will be stripes over a couple of
 mirrors = the software equivalent of RAID 10 on SATA drives, with compression at
 the FS level) on FreeBSD, or XFS over LVM on Debian with compression in mdbox
 itself. I see pros and cons for both, so that's the question to answer!

It's an archive.  You're not going to use maildir so you don't need
random IOPS performance.  Thus RAID5/6 are a much better fit for an
archive as you get better read performance, with more than adequate
write performance, and you use fewer disks.  And as this is an archive,
you don't need real time automatic/transparent compression.  Thus I
recommend something like:

1.  Debian 6 w/linux-image-2.6.39-bpo.2-amd64 or a custom rolled
2.6.39 or later kernel
2.  hardware RAID5 w/large (2TB) SATA disks, 512B native sectors
e.g. MegaRAID SAS 9261-8i, 4 Seagate Constellation ES ST2000NM0011
Specify a strip size of 256KB for the array
Perma set /sys/block/sdX/read_ahead_kb to 512 so you're reading
ahead 1024 sectors at a time instead of the default of 256.  This
will speed up your searches quite a bit.
3.  XFS filesystem on the RAID device, created with mkfs.xfs defaults
4.  mbox w/zlib plugin.  Compress daily files each night with a script
5.  You don't need LVM with a good RAID card (or with mdraid).  This
controller can expand the RAID5 up to 8 drives (up to 32 drives max
using SAS expanders)
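
The numbers in item 2 can be checked with a little arithmetic. The Python sketch below is illustrative only; the function names are made up, and it assumes 512-byte sectors and a 4-disk RAID5 (three data strips plus one parity strip per stripe), as in the list above. The sunit/swidth values are expressed in 512-byte units, which is how mkfs.xfs takes them, although with the mkfs.xfs defaults they do not need to be set at all.

```python
SECTOR_BYTES = 512  # matches the 512B native sectors in the list above


def readahead_sectors(read_ahead_kb: int) -> int:
    """Convert a read_ahead_kb setting into a count of 512-byte sectors."""
    return read_ahead_kb * 1024 // SECTOR_BYTES


def raid5_geometry(disks: int, strip_kb: int) -> dict:
    """Derive stripe geometry for a RAID5 array (one strip per stripe is parity)."""
    data_disks = disks - 1
    return {
        "data_disks": data_disks,
        "full_stripe_kb": data_disks * strip_kb,
        "sunit_sectors": strip_kb * 1024 // SECTOR_BYTES,
        "swidth_sectors": data_disks * strip_kb * 1024 // SECTOR_BYTES,
    }


print("default read-ahead (128 KB):", readahead_sectors(128), "sectors")
print("tuned read-ahead (512 KB):  ", readahead_sectors(512), "sectors")
print("4-disk RAID5, 256 KB strip: ", raid5_geometry(4, 256))
```

With a 256KB strip and three data disks, a full stripe is 768KB, so a 512KB read-ahead window still covers less than one full stripe.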

-- 
Stan


Re: [Dovecot] Please advise on very fast search

2011-11-15 Thread Timo Sirainen
On Tue, 2011-11-15 at 12:26 -0600, Stan Hoeppner wrote:

 This is why I recommended mbox in the first place.  If your only writes
 to these mailbox files are appends of new messages, mbox is the best
 format by far.  It's faster at appending than any other format, and it's
 faster for searching than any other.

Just as long as you're not simultaneously trying to read and write the
mbox file (or just write in 2+ sessions). Then there's a lot of waiting on
locks. (mdbox has no read locks, and its write locks are very short
lived.)




Re: [Dovecot] Please advise on very fast search

2011-11-14 Thread Alexander Chekalin

Timo, Stan,

I've just tested mdbox and find it pretty nice for me, but now I got 
some questions for you:


1. mdbox uses 'a lot' of files (m.1, m.2 ... etc), and the default size is 
2 MB. It looks like not every message can even fit into such a storage 
container (nowadays we see messages of 20 MB and even 
more). Should I tune it (at least mdbox_rotate_size and 
mdbox_rotate_interval), or is this size on purpose? For now I store 
each day's messages in separate IMAP folders (mailboxes), which gives me 
2000-6000 messages and 2-5 GB (on disk) per folder.


2. I can use no compression, gz, or bz2. Which one is better for 
storing archived messages? I've just tested mdbox by copying 5800+ msgs 
from maildir to compressed mdbox, and it took exactly the same space (2.8 
GB) in 100+ small m.* files. No good so far.


3. What if I use maildir as I do now but turn on compression, will this 
speed things up?


I'd like to use mdbox as storage, but for now it is very new to me and I am 
simply afraid of what I should do if I need to fix the storage manually 
(maildir is really good for that, surely).


After all, I simply need to speed up the search and restore process in 
the archive.


Yours,
  Alexander


Re: [Dovecot] Please advise on very fast search

2011-11-14 Thread Alexander Chekalin
Locking issues with mbox are the reason for my long-lasting love affair with 
maildir, and it has lasted for many years. OK, life's lessons are like this: learn 
something and move on with it ;) even if it's a new old thing. Thank you for 
pointing that out!

What I had doubts about is the default rotate size of 2M, since I'm used to seeing pretty 
reasonable default settings throughout the Dovecot config. 32 or 64 is much closer to 
what I'd personally prefer.

What I also have to choose now is the OS and FS for the archive. I'm seriously thinking 
about ZFS with compression (in fact it will be stripes over a couple of mirrors = 
the software equivalent of RAID 10 on SATA drives, with compression at the FS level) on 
FreeBSD, or XFS over LVM on Debian with compression in mdbox itself. I see pros 
and cons for both, so that's the question to answer!

Yours, Alexander

 On 11/14/2011 8:35 AM, Alexander Chekalin wrote:
 Timo, Stan,
 
 I've just tested mdbox and find it pretty nice for me, but now I got
 some questions for you:
 
 1. mdbox uses 'a lot' of files (m.1, m.2 ... etc), and the default size is
 2 MB. It looks like not every message can even fit into such a storage
 container volume (nowadays we used to see messages of 20Mb and even
 more). Should I tune it (at least mdbox_rotate_size and
 mdbox_rotate_interval) or its size is on purpose? As for now I store
 each day's messages in separate IMAP folders (mailboxes), which gives me
 2000-6000 messages and 2-5 Gb (on disk) per folder.
 
 mdbox_rotate_size of 2MB is too small for your needs.  Test 32MB and 64MB.
 
 2. I can use no compression, gz and bz2 - which one will be better for
 storing archive messages? I've just tested mdbox by copying 5800+ msgs
 from maildir to compressed mdbox, and it took exactly the same size (2.8
 G) in 100+ small m.* files. No good as far.
 
 bzip2 may give you a little better compression but at the cost of much
 lower de/compression speed and higher CPU and memory consumption.  gzip
 will be faster all around, between 4x-8x, with lower mem usage, but with
 less compression resulting in slightly larger file sizes than bzip2.
 
 3. What if I use maildir as I do now but turn on compression, will this
 speed things up?
 
 No.  Maildir performance is limited by the disk head actuator speed,
 which is between 150-300 seeks per second depending on your disk (7.2k
 vs 15k RPM).  Compressing the files doesn't change the seek physics of
 the disk drives.  You're still reading tens of thousands of files when
 doing your searches thus bouncing the heads tens of thousands of times.
 
 mbox uses a single file, so head speed isn't a factor, as it may only
 move a few times when reading an entire mailbox file.  Thus, bandwidth
 becomes the potential bottleneck.  Using compression with large mbox
 files can substantially increase search performance as effective
 bandwidth is increased by ~4x using gzip and 6x using bzip2.  This
 assumes you have plenty of excess CPU power.  mdbox should see similar
 compression speedups if you use file sizes much larger than the 2MB
 default.  Doing so should keep your IOPS well below the drive's head
 saturation point as you're reading only a fraction of the file count
 compared to maildir.
 
 I'd like to use mdbox as storage, but for now it is very new to me and I am
 simply afraid of what I should do if I need to manually fix the storage
 (maildir is really good for that, surely).
 
 Doveadm handles such tasks pretty well.  Just make sure you keep good
 backups of your mdbox files.
 
 After all, I simply need to speed up the search and restore process in
 the archive.
 
 The only way to accomplish this with maildir is with much bigger,
 faster, more expensive storage hardware.  And the gain will still be
 much less than simply switching to a larger file format such as mbox or
 mdbox.
 
 As with many things some computer technologies come full circle over
 time.  One of the reasons the creators of the UNIX mbox mail file format
 decided upon a single file many decades ago was the horribly limited
 seek performance of the slow SCSI disks of that period.  Doing something
 like the maildir format was simply impossible at that time.  In the
 early days of the public internet, disk became faster than the average
 load and maildir was born to fix the locking and corruption shortcomings
 of mbox.
 
 Today many sites are hitting the seek problem of a few decades ago
 because boxes are oversubscribed with users, emails now frequently
 contain attachments, everyone is storing more email, and the total
 volume of email is a few orders of magnitude greater.
 
 IIRC, this is one of the reasons Timo created mdbox--to decrease the
 massive IOPS load, and thus slow performance, of large maildir stores.
 
 -- 
 Stan
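
Stan's seek-versus-bandwidth argument above can be put into rough numbers. The Python sketch below is a back-of-the-envelope model, not a benchmark: the 100 MB/s sequential read rate is an assumption that does not come from the thread, one seek per maildir file is a lower bound, and the 4x gzip gain only holds if the CPU keeps up. The folder size and message count are picked from within the ranges Alexander gave (2000-6000 messages, 2-5 GB per folder).

```python
def maildir_scan_seconds(files: int, seeks_per_second: float,
                         size_mb: float, mb_per_second: float) -> float:
    """Estimate a full scan of a maildir: one seek per file plus the data transfer."""
    return files / seeks_per_second + size_mb / mb_per_second


def single_file_scan_seconds(size_mb: float, mb_per_second: float,
                             compression_ratio: float = 1.0) -> float:
    """Estimate a sequential scan of one large file, optionally stored compressed."""
    return (size_mb / compression_ratio) / mb_per_second


FILES = 6000          # messages in one daily folder (upper end of the range)
SIZE_MB = 3000        # about 3 GB on disk for that folder
SEEKS_7K2 = 150       # seeks per second quoted above for a 7.2k RPM drive
SEQ_MB_S = 100        # ASSUMED sequential read rate, not from the thread

print("maildir:    ", maildir_scan_seconds(FILES, SEEKS_7K2, SIZE_MB, SEQ_MB_S), "s")
print("mbox:       ", single_file_scan_seconds(SIZE_MB, SEQ_MB_S), "s")
print("mbox + gzip:", single_file_scan_seconds(SIZE_MB, SEQ_MB_S, 4.0), "s")
```

Real maildir scans usually need more than one seek per message (directory and inode lookups as well as data), so the gap is likely wider than this model suggests.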


Re: [Dovecot] Please advise on very fast search

2011-11-14 Thread Timo Sirainen
On 14.11.2011, at 16.35, Alexander Chekalin wrote:

 1. mdbox uses 'a lot' of files (m.1, m.2 ... etc), and the default size is 2 MB. 
 Looks like not even every message can fit into such storage container volume 
 (nowadays we used to see messages of 20Mb and even more).

The messages are never split into multiple files. So if you have a 20 MB 
message, it gets stored into its own m.* file.

 Should I tune it (at least mdbox_rotate_size and mdbox_rotate_interval) or 
 its size is on purpose? As for now I store each day's messages in separate 
 IMAP folders (mailboxes), which gives me 2000-6000 messages and 2-5 Gb (on 
 disk) per folder.

The main problem with larger mdbox files is that if you expunge messages, 
there's more data to write when packing the data into a new file. I don't 
really know the best value for mdbox_rotate_size setting. But even a 2 MB 
mdbox file can contain thousands of small mails, so it's not too bad.



Re: [Dovecot] Please advise on very fast search

2011-11-10 Thread Stan Hoeppner
On 11/9/2011 11:35 PM, Alexander Chekalin wrote:
 Hello, Stan,
 
 in fact the only thing I miss even with my current scheme is permanent
 ID assigned to the message so I can easily find it despite the IMAP
 mailbox it is now (so if someone moved the message from one
 mailbox/folder to another, the ID allows to retrieve it fast anyway).
 
 You see, what I need is not only find message from|to someone on
 specified date, I also sometime need to restore that message back to
 user's original box. As far our mailserver and backup-mailserver are
 different machines, it is a bit tricky to copy messages between it fast
 enough. Say, if I need to find and restore all mails from
 u...@domain.com within 2009 year, and search yields in some 1000's of
 messages, then use IMAP to copy it over to another server takes some
 time - and if you consider both search time and restore/copy time the
 whole process may take ages.

Apparently I didn't fully understand all of your requirements.

Moving the archived mail to mbox/mdbox and/or getting a good indexing
search engine installed will cut the search time down tremendously.
Whether that would make up for the time consumed with an IMAP copy of
many emails I don't know.  If your servers aren't old and slow, and are
not already overloaded, I would think the IMAP message copying over GbE
would be pretty quick, even for the 1000 messages scenario.

There may be some Dovecot tweaks that might make this copy process
faster.  Timo would need to chime in on that.  Do you perform the IMAP
transfers with a GUI IMAP client on your management PC?  Or are you
using imapsync or some other util directly on the servers?

If the former you may be able to tweak your IMAP client to speed up the
transfers as well.  Try using IMAP and not IMAPS for the transfers.
What is the network infrastructure between the servers and your
management workstation?  Is it all GbE with jumbo frames enabled?

 With maildir I can rsync/scp needed files to another host and that's
 fast way - that's why I stick with maildir.

There is definitely some flexibility here.

 FTS in my case can help (I can search for u...@domain.com, for example),
 but it also return messages that contains such a string in message body
 (and that takes index space, too), so I'll need to filter it later, but
 surely it'll be faster than checking every message in the archive.

Sure.  So you're concerned with your poor performance, but also with
disk space.  Unfortunately there's no free lunch to be had.  You'll have
to make sacrifices somewhere.  You could go with mdbox and use
compression, trading that saved space for search index files space.

-- 
Stan



Re: [Dovecot] Please advise on very fast search

2011-11-10 Thread Timo Sirainen
On 10.11.2011, at 6.37, Alexander Chekalin wrote:

 Are there any ways I can search or parse mboxes or mdboxes without reading the 
 files directly and without IMAP (I'm afraid IMAP is slow for bulk parsing)?

See doveadm fetch / doveadm search.

 in fact the only thing I miss even with my current scheme is permanent ID 
 assigned to the message so I can easily find it despite the IMAP mailbox it 
 is now (so if someone moved the message from one mailbox/folder to another, 
 the ID allows to retrieve it fast anyway).

Dovecot has message GUIDs (with maildir it's filename), but there's no quick 
lookup for them, even though doveadm can fetch them easily:

doveadm fetch text guid 12312312
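
For scripting, the doveadm calls can be wrapped in a few lines of Python. The sketch below only builds argument lists and hands them to subprocess; the helper names and the "backup" user in the usage note are hypothetical, and it assumes the -u option and the IMAP-style search keys (mailbox, header, guid) are available in the installed doveadm version.

```python
import subprocess


def doveadm_search_argv(user: str, mailbox: str, header: str, value: str) -> list:
    """Build a 'doveadm search' command matching one header in one mailbox."""
    return ["doveadm", "search", "-u", user,
            "mailbox", mailbox, "header", header, value]


def doveadm_fetch_by_guid_argv(user: str, guid: str, fields: str = "text") -> list:
    """Build a 'doveadm fetch' command that fetches a message by its GUID."""
    return ["doveadm", "fetch", "-u", user, fields, "guid", guid]


def run(argv: list) -> str:
    """Run a doveadm command and return its standard output as text."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout
```

For example, run(doveadm_search_argv("backup", "2011.11.03", "To", "user@example.com")) would search one daily mailbox of a hypothetical "backup" account. Passing arguments as a list avoids shell quoting problems with addresses and mailbox names.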



[Dovecot] Please advise on very fast search

2011-11-09 Thread Alexander Chekalin

Hello,

I am trying to create a kind of mail backup system. What I need is a system 
that will store mail for the whole domain and allow me to restore 
messages from/to a specified email address at that domain.


The scheme is pretty simple: on our main mail server the SMTP server 
itself has a rule to send a copy of every message to 
'bac...@backupserver.host', and the backupserver.host domain is hosted 
nearby on a second server.


The SMTP server on the second machine does a simple 'catchall' redirect of all 
messages to a single box. There is also a Dovecot that takes care of remote 
IMAP access to that box. And, finally, I've created some scripts to sort 
all messages in INBOX into folders named after each message's date.


So I have a lot of mailboxes inside the catchall box:
INBOX
2011.11.03
2011.11.04
2011.11.05
2011.11.06
...etc...

and each folder holds the messages for that day. Simple, and it works perfectly.

The problem is that once my archive became big (several years), it 
became painful to find specific message(s). When someone 
suddenly needs to find an old message, it is mostly guesses like 'I 
think the message was between June and July of 2009, or maybe a month or 
two before that', so I need to search all mailboxes (with 1000s of 
messages in each). And that takes a really long time.



I tried to play with Dovecot indexes, but that didn't help much. The 
bad part is that I need to search for all email addresses in each message's 
headers, not only in From or To, since some messages are sent to 
mailing lists, so To = the list address, not the person's own email.


Then I tried to index messages on my own, storing address info in a 
MySQL database ('email' - 'mailbox', 'message filename'), but I soon 
found out that message files can be renamed by Dovecot.
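
One way around the renaming problem is to key such an index on the Message-ID header instead of the file name. That is a swapped-in idea, not something tried in this thread, and Message-IDs are not guaranteed to be unique or even present. The Python sketch below uses SQLite instead of MySQL only to stay self-contained; the table layout, function names, and the address regex are illustrative assumptions.

```python
import re
import sqlite3
from email import message_from_bytes

ADDRESS_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
# These headers hold message identifiers that merely look like addresses.
SKIP_HEADERS = {"message-id", "references", "in-reply-to"}


def header_addresses(raw: bytes) -> set:
    """Collect every email address found in any header (not the body)."""
    msg = message_from_bytes(raw)
    found = set()
    for name, value in msg.items():
        if name.lower() in SKIP_HEADERS:
            continue
        found.update(a.lower() for a in ADDRESS_RE.findall(str(value)))
    return found


def open_index(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the address index."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS addr_index ("
        " address TEXT, mailbox TEXT, message_id TEXT,"
        " PRIMARY KEY (address, mailbox, message_id))"
    )
    return db


def index_message(db: sqlite3.Connection, raw: bytes, mailbox: str) -> None:
    """Record address -> (mailbox, Message-ID) rows for one message."""
    message_id = (message_from_bytes(raw)["Message-ID"] or "").strip()
    for address in header_addresses(raw):
        db.execute("INSERT OR IGNORE INTO addr_index VALUES (?, ?, ?)",
                   (address, mailbox, message_id))
    db.commit()


def lookup(db: sqlite3.Connection, address: str) -> list:
    """Return (mailbox, message_id) pairs for an address, case-insensitively."""
    rows = db.execute(
        "SELECT mailbox, message_id FROM addr_index WHERE address = ?"
        " ORDER BY mailbox, message_id", (address.lower(),))
    return rows.fetchall()
```

The mailbox column goes stale if a message is later moved, but the Message-ID still identifies the message and can be searched for as a header.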


Could you please advise me how to speed up message search?


Sorry for such a long question, hope you can help!

Yours,
  Alexander Chekalin



Re: [Dovecot] Please advise on very fast search

2011-11-09 Thread Robert Schetterer
On 09.11.2011 14:57, Alexander Chekalin wrote:
 Hello,
 
 I try to create some kind of mail backup system. What I need is system
 that will store mail for the whole domain, and allow me to restore
 messages from/to specified email at that domain.
 
 The scheme is pretty simple: on our main mail server the SMTP server
 itself has a rule to send a copy of every message to
 'bac...@backupserver.host', and the backupserver.host domain is placed
 nearby on second server.
 
 The SMTP on second server do simple 'catchall' redirect of all messages
 to the single box. There is also a Dovecot that takes care for remote
 IMAP access to that box. And, finally, I've create some scripts to sort
 all messages in INBOX to folders named after message's date.
 
 So I have a lot of mailboxes inside the catchall box:
 INBOX
 2011.11.03
 2011.11.04
 2011.11.05
 2011.11.06
 ...etc...
 
 and each folder holds messages for that day. Simply, and works perfectly.
 
 The problem is that when my archive become big (several years), it
 appears to be painful to find specified message(s). When someone
 suddenly needs to find his/her old message, it is mostly guesses like 'I
 think the message was between june and july of 2009, or maybe month or
 two before that', so I need to search all mailboxes (with 1000's
 messages in each). And it takes really long time.
 
 
 I tried to play with Dovecot indexes, but it won't help too much. The
 bad part is that I need to search for all emails in each message
 headers, not only for From or To, since some messages are sent to
 maillists so To = list address, not person's personal email.
 
 Then I tried to index messages on my own, storing info on emails into
 MySQL database ('email' - 'mailbox', 'message filename'), but soon I
 find out that message files can be renamed by Dovecot.
 
 Could you please advise me how to speed up message search?
 
 
 Sorry for such a long question, hope you can help!
 
 Yours,
   Alexander Chekalin
 

I guess you're searching over IMAP?
Perhaps compression will help speed things up, along with other speed-related
tuning, or you need some other approach to indexing.
At least, if it's maildir, how fast is grep and the like?
There are some ideas here:
http://wiki.dovecot.org/HowTo/ReadOnlyArchive

Anyway, I think you really need another kind of archive solution.
In Germany there is a law that requires archiving some kinds of business
mail for up to 10 years for financial and other audits, so there are a lot of
solutions you can buy now, and these have solved the problems you
discovered (indexing etc.).
I was shown, for example,
http://www.bytstormail.de which looked fine to me,

or
perhaps you might have a look at
http://www.archiveopteryx.org/
too.

-- 
Best Regards

MfG Robert Schetterer

Germany/Munich/Bavaria


Re: [Dovecot] Please advise on very fast search

2011-11-09 Thread Timo Sirainen
On Wed, 2011-11-09 at 16:57 +0300, Alexander Chekalin wrote:

 The problem is that when my archive become big (several years), it 
 appears to be painful to find specified message(s). When someone 
 suddenly needs to find his/her old message, it is mostly guesses like 'I 
 think the message was between june and july of 2009, or maybe month or 
 two before that', so I need to search all mailboxes (with 1000's 
 messages in each). And it takes really long time.
 
 
 I tried to play with Dovecot indexes, but it won't help too much. 

They'll help with the dates.

 The 
 bad part is that I need to search for all emails in each message 
 headers, not only for From or To, since some messages are sent to 
 maillists so To = list address, not person's personal email.

Headers only, not message body? Anyway, some of the full text search
backends would support searching from both. I'd recommend using either
Solr or with Dovecot v2.1 you can also use Lucene:
http://wiki2.dovecot.org/Plugins/FTS




Re: [Dovecot] Please advise on very fast search

2011-11-09 Thread Alexander Chekalin

Thanks, Robert,

I will take a look.

What I'm afraid of is how the database storage should be planned (storage, 
CPU, RAM, scaling when it fills up). When dealing with files 
(I'm using maildir), it is much easier to understand and to fix just about 
everything. Adding a database involves tuning it too, and I'll have more 
places that need 'tuning a bit'.


In fact working with Dovecot is pretty nice, but I think I can tune it to 
work faster.


I now run it on FreeBSD (on UFS2); maybe I should change the OS + FS, but I 
need to test (I really hope ZFS on SAS drives will help; I still can find 
no benchmarks for such a setup). I will also try full text search, 
but I'm afraid of the index size (and I need no search on body, just on headers).


Anyway, thank you for pointing me in the right direction!

Yours,
  Alexander


Re: [Dovecot] Please advise on very fast search

2011-11-09 Thread Timo Sirainen
On Wed, 2011-11-09 at 19:16 +0300, Alexander Chekalin wrote:
 Will also try to use full text search, 
 but afraid of index size (and I need no search on body, just on headers).

It wouldn't be difficult to patch Dovecot to skip indexing message
bodies. Of course then you'd need to remember to keep applying the patch
when updating.




Re: [Dovecot] Please advise on very fast search

2011-11-09 Thread Stan Hoeppner
On 11/9/2011 10:16 AM, Alexander Chekalin wrote:
 Thanks, Robert,
 
 will take a look at.
 
 What I'm afraid for is how database storage should be planned (storage,
 CPU, RAM, scaling when will be over-filled). When dealing with files
 (I'm using maildir)

Bingo.^^^

Maildir is very likely a huge factor in your current slow search time.
With a maildir search, every mail file must be opened and searched.  How
many total mail files are opened for each of your searches?  Thousands?
 Tens of thousands?  Maildir causes a massive disk IO bottleneck when
searching so many files.  Run iostat the next time you do one of these
searches, and look at the %iowait value.  It will likely be very high.
If it is, this confirms maildir is a big part of the problem.

mbox, and mdbox, would be many many times faster than maildir WRT
searching as the total number of files is lower by orders of magnitude.
 Switching from maildir to mbox/mdbox shifts the workload burden from
the disk subsystem to the processor/memory.  And I'm sure as with
everyone else on the planet today, you have massive spare CPU cycles,
but extremely limited spindle throughput.

And as Timo suggested, using one of the indexing search plugins would be
much faster yet, as long as you keep the indexes updated.

-- 
Stan


Re: [Dovecot] Please advise on very fast search

2011-11-09 Thread Stan Hoeppner
On 11/9/2011 10:40 AM, Timo Sirainen wrote:
 On Wed, 2011-11-09 at 19:16 +0300, Alexander Chekalin wrote:
 Will also try to use full text search, 
 but afraid of index size (and I need no search on body, just on headers).
 
 It wouldn't be difficult to patch Dovecot to skip indexing message
 bodies. Of course then you'd need to remember to keep applying the patch
 when updating.

Also keep in mind that, in general, many/most message headers today are
often as large, or larger than, the actual message body, especially for
list mail.  Just take a look at messages from this list for evidence.

Thus, I'd think that going out of your way to avoid indexing message
bodies wouldn't be worth the effort/headaches involved.

-- 
Stan


Re: [Dovecot] Please advise on very fast search

2011-11-09 Thread Alexander Chekalin
Oh, that's the point to consider. 

But I must confess I've been in love with Maildir for maybe 10 years, for the simple 
reason that I can do anything with each and every single message right on disk (=much 
faster than via IMAP). If I dealt with mbox directly I'd need to parse 
huge files, b.

Are there any ways I can search or parse mboxes or mdboxes without reading the 
files directly and without IMAP (I'm afraid IMAP is slow for bulk parsing)?
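
For plain (uncompressed) mbox files, one option that avoids both IMAP and hand-written parsing is the mailbox module in Python's standard library. The sketch below is illustrative only: the function name is made up, it does a simple substring match over all header values, and it does not handle mdbox or compressed files.

```python
import mailbox


def find_in_mbox(path: str, needle: str) -> list:
    """Return (key, Message-ID, Subject) for messages with `needle` in any header."""
    needle = needle.lower()
    hits = []
    box = mailbox.mbox(path, create=False)  # create=False: never create a new file
    try:
        for key, msg in box.iteritems():
            # msg.values() yields every header value; the body is not searched.
            if any(needle in str(value).lower() for value in msg.values()):
                hits.append((key, msg.get("Message-ID", ""), msg.get("Subject", "")))
    finally:
        box.close()
    return hits
```

Because every header is checked, this also finds messages where the address appears only in a header such as Delivered-To, which covers the mailing-list case where To holds the list address.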

On 10.11.2011, at 3:42, Stan Hoeppner s...@hardwarefreak.com wrote:

 On 11/9/2011 10:16 AM, Alexander Chekalin wrote:
 Thanks, Robert,
 
 will take a look at.
 
 What I'm afraid for is how database storage should be planned (storage,
 CPU, RAM, scaling when will be over-filled). When dealing with files
 (I'm using maildir)
 
 Bingo.^^^
 
 Maildir is very likely a huge factor in your current slow search time.
 With a maildir search, every mail file must be opened and searched.  How
 many total mail files are opened for each of your searches?  Thousands?
 Tens of thousands?  Maildir causes a massive disk IO bottleneck when
 searching so many files.  Run iostat the next time you do one of these
 searches, and look at the %iowait value.  It will likely be very high.
 If it is, this confirms maildir is a big part of the problem.
 
 mbox, and mdbox, would be many many times faster than maildir WRT
 searching as the total number of files is lower by orders of magnitude.
 Switching from maildir to mbox/mdbox shifts the workload burden from
 the disk subsystem to the processor/memory.  And I'm sure as with
 everyone else on the planet today, you have massive spare CPU cycles,
 but extremely limited spindle throughput.
 
 And as Timo suggested, using one of the indexing search plugins would be
 much faster yet, as long as you keep the indexes updated.
 
 -- 
 Stan


Re: [Dovecot] Please advise on very fast search

2011-11-09 Thread Alexander Chekalin

Hello, Stan,

in fact the only thing I miss even with my current scheme is a permanent 
ID assigned to each message so I can easily find it regardless of the IMAP 
mailbox it is in now (so if someone moved the message from one 
mailbox/folder to another, the ID would still allow retrieving it quickly).


You see, what I need is not only to find a message from|to someone on 
a specified date; I also sometimes need to restore that message back to 
the user's original box. As our mail server and backup mail server are 
different machines, it is a bit tricky to copy messages between them fast 
enough. Say, if I need to find and restore all mails from 
u...@domain.com within the year 2009, and the search yields some 1000s of 
messages, then using IMAP to copy them over to another server takes some 
time - and if you consider both search time and restore/copy time the 
whole process may take ages.


With maildir I can rsync/scp the needed files to another host, and that's 
a fast way - that's why I stick with maildir.


FTS can help in my case (I can search for u...@domain.com, for example), 
but it also returns messages that contain such a string in the message body 
(and that takes index space, too), so I'll need to filter the results later, but 
surely it'll be faster than checking every message in the archive.


Yours,
  Alexander


Maildir is very likely a huge factor in your current slow search time.
With a maildir search, every mail file must be opened and searched.  How
many total mail files are opened for each of your searches?  Thousands?
  Tens of thousands?  Maildir causes a massive disk IO bottleneck when
searching so many files.  Run iostat the next time you do one of these
searches, and look at the %iowait value.  It will likely be very high.
If it is, this confirms maildir is a big part of the problem.

mbox, and mdbox, would be many many times faster than maildir WRT
searching as the total number of files is lower by orders of magnitude.
  Switching from maildir to mbox/mdbox shifts the workload burden from
the disk subsystem to the processor/memory.  And I'm sure as with
everyone else on the planet today, you have massive spare CPU cycles,
but extremely limited spindle throughput.

And as Timo suggested, using one of the indexing search plugins would be
much faster yet, as long as you keep the indexes updated.




--

Best regards,
  Alexander Chekalin
  Lazurit
  Kaliningrad
  +7 909 799 2549
  acheka...@lazurit.com