Re: [Dovecot] Spliting Folders for Efficiency

2009-06-27 Thread Timo Sirainen
On Thu, 2009-06-25 at 11:17 +0100, Daniel Watts wrote:
> Digging up this thread from 2007. Just had another conversation in my 
> company about how to spread old non-accessed files to cheaper slower 
> storage.
> 
> Is this now feasible? I noticed dbox is now v2.0 but see no reference to 
> virtual folders or auto-archiving etc.

Multi-dbox is in v2.0, but single-dbox is already in v1.1! You can
configure it to use two directories, e.g.:

mail_location = dbox:~/dbox:ALTPATH=/cheapstorage/%h

v2.0 implements dbox somewhat better, but v1.1's version should work
well enough too. v1.1 just creates a pretty useless dbox.index file and
also writes (for backup) flag changes to dbox files once in a while.

Moving old messages can be done using expire plugin:
http://wiki.dovecot.org/Plugins/Expire#Alternative_dbox_directory_expiration or 
you can do it manually with mv command as well.

With the above configuration there's no need to use virtual mailboxes.
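
A minimal sketch of the manual "mv" approach, assuming messages are plain files under the primary directory and that age is judged by file modification time. The names `move_old_files` and `is_message_file` are hypothetical, and the thread does not describe the on-disk dbox layout, so the predicate that decides which files count as messages is left as a placeholder. Whether a plain move is sufficient for a given Dovecot version should be checked against its documentation.

```python
import os
import shutil
import time


def move_old_files(primary_root, alt_root, max_age_days, now=None,
                   is_message_file=lambda path: True):
    """Move files older than max_age_days from primary_root to the same
    relative path under alt_root. Returns the relative paths moved.

    is_message_file is a hypothetical hook: pass a predicate that accepts
    only message files so that index or control files stay where they are.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    moved = []
    for dirpath, _dirnames, filenames in os.walk(primary_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            if not is_message_file(src):
                continue
            if os.path.getmtime(src) >= cutoff:
                continue  # still recent, leave it on fast storage
            rel = os.path.relpath(src, primary_root)
            dst = os.path.join(alt_root, rel)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.move(src, dst)
            moved.append(rel)
    return sorted(moved)
```

With the example configuration above, `primary_root` would be the user's `~/dbox` directory and `alt_root` the matching directory under `/cheapstorage`.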




Re: [Dovecot] Spliting Folders for Efficiency

2009-06-25 Thread Daniel Watts

Timo Sirainen wrote:
> On Thu, 2007-10-11 at 10:00 +0100, Daniel Watts wrote:
> > .Folder__1.new
> > .Folder__1.cur
> > .Folder__1.tmp
> > and
> > .Folder__2.new
> > .Folder__2.cur
> > .Folder__2.tmp
> >
> > with Dovecot merging them before display as just "Folder" within the
> > mail client.
>
> Virtual folders would enable this, if they're implemented one day..
>
> > This could be further extended so that Dovecot could be configured to
> > store 'old' message folders in a separate location. We could then have
> > slower+cheaper+larger storage mounted so that 'old mail' does not take
> > up the expensive local SCSI disks on the machine. Mail from 2 years ago
> > is much less likely to be accessed than mail from the last week.
>
> dbox format will support this soon. So that you can configure two (or
> more) directories for it and then Dovecot will look up the mail files
> from each of them in order. It would also support automatically moving
> non-recently accessed mails to the slower dirs.
>
> The current dbox implementation in v1.1 supports only
> one-message-per-file mode so it's quite similar to maildir. The main
> problem with implementing fast/slow storage for maildir is that the
> maildir filenames change all the time, so it would waste the slow
> storage's I/O all the time when trying to figure out if a file is there
> or not. dbox doesn't have this problem.



Hi Timo!
Digging up this thread from 2007. Just had another conversation in my 
company about how to spread old non-accessed files to cheaper slower 
storage.


Is this now feasible? I noticed dbox is now v2.0 but see no reference to 
virtual folders or auto-archiving etc.


Hope you're having a good time State-side!

Best wishes,
Dan




Re: [Dovecot] Spliting Folders for Efficiency

2007-11-01 Thread Kyle Wheeler

On Saturday, October 13 at 09:25 AM, quoth Daniel W:
> Thanks for the insights. Is it also true that to read a single
> message in an 800MB mbox, you need to load 800MB of data into memory
> which is then searched for that message?


Not at all. If you don't know what message you're looking for, then 
yes (kinda: you could just mmap the mbox file, which reduces your 
latency before beginning the search), but Maildir has an even worse 
problem: if you don't know what message you're looking for, you have 
to open and close every single message-file. And open()/close() 
typically has quite a bit more overhead than lseek(). More to the 
point, when searching for a file in an mbox, the OS has a very good 
idea of what you're going to be looking at next (linear search is 
predictable that way), so it can do a much better job of prefetching 
and I/O scheduling for a search through an mbox than it can for a 
Maildir search. Again, mbox wins.


On the other hand, if you know exactly what message you're looking 
for, the necessary I/O is only slightly different. In an mbox, 
"knowing" which message you're looking for is best expressed as an 
offset within the file. Similarly, in a Maildir, "knowing" which 
message you're looking for is best expressed as a filename, or (better 
still, in some cases) an inode number. In an mbox, then, you have to 
open() the file and lseek() to the correct offset (which, in an 
exceedingly large mbox, may require log(sizeoffile) disk accesses to 
begin the first read). In a Maildir, you have to merely open() the 
file, however rather than dealing with the filesystem's method of 
storing a file, you have to deal with the filesystem's method of 
storing filenames. In fancy filesystems (e.g. ReiserFS or ext3 with 
dir_hashing turned on), this can be pretty fast ---on the order of 
log(numberofmessages), but in boring filesystems (e.g. ext2, ext3 
without dir_hashing, vfat, etc.) this can take a lot of time. Between 
the two, on average, the I/O load is about the same for both actions, 
though the filesystem particulars are what really make one or the 
other a better fit for a given situation.


The really irritating thing about Maildir is that the filenames can 
change, meaning that "knowing" which message you want (i.e. you have a  
filename) may still mean you have to scan through the list of 
available filenames and see which ones are similar to the name you 
wanted (see why having an inode number can be more useful?), which 
takes MUCH longer than lseek().
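
A minimal sketch of that fallback, assuming the usual Maildir naming where the flags follow a ":" at the end of the filename, so the part before the ":" stays stable when flags change. The function name `find_maildir_message` is hypothetical.

```python
import os


def find_maildir_message(cur_dir, known_name):
    """Return the current path of a message whose filename may have changed.

    Tries the remembered name first (one cheap lookup); if the flags were
    changed since, falls back to scanning the whole directory for an entry
    with the same part before the ':'.
    """
    exact = os.path.join(cur_dir, known_name)
    if os.path.exists(exact):
        return exact
    base = known_name.split(":", 1)[0]
    for name in os.listdir(cur_dir):  # the expensive part: a full directory scan
        if name.split(":", 1)[0] == base:
            return os.path.join(cur_dir, name)
    return None
```

The scan is linear in the number of files in the directory, which is the cost being compared against a single lseek().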


> That would suggest that mbox is only scalable to a relatively small
> inbox size.


Not really.

> e.g. Splitting by message size. If a message is much smaller than the
> block size, use a single file format and if larger, write out to
> its own file. Every folder would have two mechanisms and Dovecot
> could just look at each message as it comes in to decide how to
> store it.


Yes, but then you get to the question of: what does that buy you? And, 
better still: how do you find any given message? Filename+offset? 
You'd be compounding the worst details of both designs. Not only do 
you have to lseek() to find your small message, but you have to pay 
the filename lookup penalty as well---even if you know exactly where 
your message is. On the other hand, you've reduced the cost of both by 
relying on the other: your lseek overhead is lower because you are 
dealing with a smaller file than you'd ordinarily have to, and your 
filename lookup overhead is lower because you've got fewer files. So, 
whether this is a good idea probably, once again, depends very much on 
where the performance curves bend (e.g. if your filesystem gets much 
slower for more than 10,000 files in one directory, or if it gets much 
slower if your file is over 1G, or something like that). If your 
filesystem scales linearly, though, it's not a net gain.


~Kyle
--
Come to me, son of Jor-El. Kneel before Zod. Snootchie-bootchies.
-- Jay




Re: [Dovecot] Spliting Folders for Efficiency

2007-10-20 Thread Timo Sirainen
On Thu, 2007-10-11 at 10:00 +0100, Daniel Watts wrote:

> .Folder__1.new
> .Folder__1.cur
> .Folder__1.tmp
> and
> .Folder__2.new
> .Folder__2.cur
> .Folder__2.tmp
> 
> with Dovecot merging them before display as just "Folder" within the 
> mail client.

Virtual folders would enable this, if they're implemented one day..

> This could be further extended so that Dovecot could be configured to 
> store 'old' message folders in a separate location. We could then have 
> slower+cheaper+larger storage mounted so that 'old mail' does not take 
> up the expensive local SCSI disks on the machine. Mail from 2 years ago 
> is much less likely to be accessed than mail from the last week.

dbox format will support this soon. So that you can configure two (or
more) directories for it and then Dovecot will look up the mail files
from each of them in order. It would also support automatically moving
non-recently accessed mails to the slower dirs.

The current dbox implementation in v1.1 supports only
one-message-per-file mode so it's quite similar to maildir. The main
problem with implementing fast/slow storage for maildir is that the
maildir filenames change all the time, so it would waste the slow
storage's I/O all the time when trying to figure out if a file is there
or not. dbox doesn't have this problem.





Re: [Dovecot] Spliting Folders for Efficiency

2007-10-13 Thread Richard Laager
On Sat, 2007-10-13 at 09:25 +0100, Daniel W wrote:
> Is it also true that to read a single message
> in an 800MB mbox, you need to load 800MB of data into memory which is
> then searched for that message?

Of course not! That's what an index is for.
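
As a rough illustration of the idea (not a description of Dovecot's own index files), the sketch below scans an mbox once, records where each message starts, and afterwards reads a single message with one seek. It assumes the common convention that a message begins with a line starting with "From " at the start of the file or after a blank line. The function names are hypothetical.

```python
def build_mbox_index(path):
    """Scan the mbox once and return a list of (offset, length) per message."""
    starts = []
    pos = 0
    prev_blank = True  # the start of the file counts as a message boundary
    with open(path, "rb") as f:
        for line in f:
            if prev_blank and line.startswith(b"From "):
                starts.append(pos)
            prev_blank = line in (b"\n", b"\r\n")
            pos += len(line)
    ends = starts[1:] + [pos]
    return [(start, end - start) for start, end in zip(starts, ends)]


def read_message(path, index, n):
    """Read message n using the index: one seek, one read of its length."""
    offset, length = index[n]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

Once the index exists, reading one message touches only that message's bytes, however large the mbox file is.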

Richard




Re: [Dovecot] Spliting Folders for Efficiency

2007-10-13 Thread Daniel W

Kyle Wheeler wrote:

> On Friday, October 12 at 11:06 AM, quoth Daniel Watts:
> > What actually ARE the advantages of a 'one file per folder' format??
>
> It depends on the environment. It's exceedingly efficient at storage: on
> a filesystem with 4k blocks, three 1k messages take up 1 block (4k),
> where in a one-file-per-message format they take up 3 blocks (12k). Some
> filesystems have mechanisms of coping with files that only occupy a
> partial block, but those mechanisms tend to be expensive, and are often
> only employed when strapped for space. The one-file-per-folder
> arrangement also helps when doing sequential reads (i.e. searches, or
> loading it into memory, or processing it with a filter, or whatever
> else): when the OS spools the file from disk, it loads it up a block at
> a time, which in a one-file-per-folder format is several messages, but
> in a one-file-per-message format is only ever a single message.
>
> I've often contemplated setting up a separate mbox-based namespace in my
> Dovecot setup (e.g. everything in the Archive folder is saved as an
> mbox), just for the space savings.




Thanks for the insights. Is it also true that to read a single message 
in an 800MB mbox, you need to load 800MB of data into memory which is 
then searched for that message? That would suggest that mbox is only 
scalable to a relatively small inbox size.


There are other tactics that could be considered as well.

e.g. Splitting by message size. If a message is much smaller than the 
block size, use a single file format and if larger, write out to its 
own file. Every folder would have two mechanisms and Dovecot could just 
look at each message as it comes in to decide how to store it.


Messages are normally quite small but attachments are not. One could 
have a separate attachments directory that stores files individually. 
This would keep the mbox small and Dovecot would fetch attachments as 
needed and never load them into memory otherwise.


However, the mbox will inevitably still grow large and the original 
(proposed) problem of "reading a large file to find a single small 
message" returns, so I remain unconvinced about the scalability of 
mbox.


Re: [Dovecot] Spliting Folders for Efficiency

2007-10-12 Thread Kyle Wheeler

On Friday, October 12 at 11:06 AM, quoth Daniel Watts:
> What actually ARE the advantages of a 'one file per folder' format??


It depends on the environment. It's exceedingly efficient at storage: 
on a filesystem with 4k blocks, three 1k messages take up 1 block 
(4k), where in a one-file-per-message format they take up 3 blocks 
(12k). Some filesystems have mechanisms of coping with files that only 
occupy a partial block, but those mechanisms tend to be expensive, and  
are often only employed when strapped for space. The 
one-file-per-folder arrangement also helps when doing sequential reads 
(i.e. searches, or loading it into memory, or processing it with a 
filter, or whatever else): when the OS spools the file from disk, it 
loads it up a block at a time, which in a one-file-per-folder format 
is several messages, but in a one-file-per-message format is only ever 
a single message.
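
The block arithmetic above can be checked with a short sketch. It assumes sizes in bytes, that "1k" means 1024 bytes, and that every file occupies a whole number of blocks. It ignores filesystem metadata and the separator lines an mbox adds between messages. The function names are hypothetical.

```python
def blocks_one_file_per_message(sizes, block_size=4096):
    """Each message is its own file, so each rounds up to whole blocks."""
    return sum(-(-size // block_size) for size in sizes)


def blocks_one_file_per_folder(sizes, block_size=4096):
    """All messages share one file, so only the total rounds up."""
    return -(-sum(sizes) // block_size)


messages = [1024, 1024, 1024]  # three 1k messages
per_message = blocks_one_file_per_message(messages)
per_folder = blocks_one_file_per_folder(messages)
print(per_message, per_folder)  # 3 blocks (12k) versus 1 block (4k)
```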


I've often contemplated setting up a separate mbox-based namespace in 
my Dovecot setup (e.g. everything in the Archive folder is saved as an 
mbox), just for the space savings.


~Kyle
--
Only the fool hopes to repeat an experience; the wise man knows that 
every experience is to be viewed as a blessing.

   -- Henry Miller




Re: [Dovecot] Spliting Folders for Efficiency

2007-10-12 Thread Daniel Watts

Chris Laif wrote:

> On 10/11/07, Daniel Watts <[EMAIL PROTECTED]> wrote:
> > Dear Timo,
> >
> > Would there be any sense in giving Dovecot the option to split folders
> > into multiple subfolders when they reached a specified size (probably
> > message count) limit?
>
> Many modern file systems offer the possibility to use optimized
> directory indexes. Listing these directories scales very well.
> Splitting files into subdirectories would have a negative effect: You
> have to walk through every directory and merge all file names into one
> data table.
>
> Chris



That is true. But it still leaves the motivation of being able to store 
rarely accessed 'old' mail in a separate, perhaps remote, location which 
I can see as valuable. Even though storage is pretty cheap, expensive 
disks...are not cheap =)




Re: [Dovecot] Spliting Folders for Efficiency

2007-10-12 Thread Chris Laif
On 10/11/07, Daniel Watts <[EMAIL PROTECTED]> wrote:
> Dear Timo,
>
> Would there be any sense in giving Dovecot the option to split folders
> into multiple subfolders when they reached a specified size (probably
> message count) limit?
>

Many modern file systems offer the possibility to use optimized
directory indexes. Listing these directories scales very well.
Splitting files into subdirectories would have a negative effect: You
have to walk through every directory and merge all file names into one
data table.
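
A minimal sketch of that extra work, assuming the split folder is simply a list of directories on disk. The function name `merged_listing` and the example directory names are hypothetical.

```python
import os


def merged_listing(cur_dirs):
    """List every directory of a split folder and merge the file names
    into one table mapping name -> path."""
    table = {}
    for directory in cur_dirs:
        with os.scandir(directory) as entries:
            for entry in entries:
                if entry.is_file():
                    table[entry.name] = entry.path
    return table
```

Every request for the full folder contents pays for one directory listing per split part plus the merge, where an unsplit folder needs a single listing.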

Chris


Re: [Dovecot] Spliting Folders for Efficiency

2007-10-12 Thread Daniel Watts



Curtis Maloney wrote:

> Daniel Watts wrote:
> > Dear Timo,
> >
> > Would there be any sense in giving Dovecot the option to split
> > folders into multiple subfolders when they reached a specified size
> > (probably message count) limit?
>
> My understanding is this is partially covered in Timo's "dbox" format,
> which tries to take the best features of mbox and Maildir.

Is dbox production ready? It looks interesting.
http://wiki.dovecot.org/MailboxFormat/dbox this page says it is not 
finished.

What actually ARE the advantages of a 'one file per folder' format?? We 
switched to Maildir because mbox was killing our server. I wouldn't ever 
switch back.
The only thing perhaps is faster Search since you don't have to open 
lots of files. But for this I reckon it would be best to keep a separate 
index of content. Dreams of offering a 'google like' imap-search 
function anyone? =) Are there any (preferably open source) products out 
there for this?

> > .Folder.new
> > .Folder.cur
> > .Folder.tmp
> >
> > could become:
> >
> > .Folder__1.new
> > .Folder__1.cur
> > .Folder__1.tmp
> > and
> > .Folder__2.new
> > .Folder__2.cur
> > .Folder__2.tmp
>
> You would only need to split "cur", unless you expect someone to have
> over 10,000 new messages waiting.  "tmp" is only used _whilst_ messages
> are being delivered, so mail clients don't see a partially written
> message.

Ah yes, this is true.

> > This could be further extended so that Dovecot could be configured to
> > store 'old' message folders in a separate location. We could then
> > have slower+cheaper+larger storage mounted so that 'old mail' does
> > not take up the expensive local SCSI disks on the machine. Mail from
> > 2 years ago is much less likely to be accessed than mail from the
> > last week.
>
> Also, instead of __N, you could try a different path, so
> /foo/bar/User/ is for new mail, and /old/slow/disk/User is for older
> stuff.

Ah yes - and if it is on the same disk it could just be 
$HOME/Maildir/cur and $HOME/Maildir/old/cur

> > This would provide very neat behind-the-scenes archiving functionality.
>
> There are really two ideas here... one is the mechanism of
> multi-directory folders, the other is the policy of separating by age.

Ideally there would be a few limits set by the system admin:
Min Age of mail
Max Age of mail
Min number of messages.
Max number of messages.

You can then split by either volume or age and control how many emails 
to keep in 'fast' storage as a minimum - eg always have the most recent 
50 emails in local storage, regardless of age.
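
One possible reading of these four limits, sketched under assumptions: the thread only names the limits, so the precedence between them below is a guess. The newest `min_messages` mails always stay in fast storage, anything older than `max_age_days` is archived, and mail beyond `max_messages` is archived once it is at least `min_age_days` old. All names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ArchivePolicy:
    min_age_days: float  # mail younger than this is never archived for volume
    max_age_days: float  # mail older than this is archived
    min_messages: int    # this many newest mails always stay in fast storage
    max_messages: int    # above this count, older mail becomes eligible


def select_for_archive(ages_days, policy):
    """Return the indices of the messages to move to slow storage.

    ages_days[i] is the age in days of message i.
    """
    newest_first = sorted(range(len(ages_days)), key=lambda i: ages_days[i])
    to_archive = []
    for rank, i in enumerate(newest_first):
        age = ages_days[i]
        if rank < policy.min_messages:
            continue  # always kept, regardless of age
        if age > policy.max_age_days:
            to_archive.append(i)
        elif rank >= policy.max_messages and age >= policy.min_age_days:
            to_archive.append(i)
    return sorted(to_archive)
```

For example, `ArchivePolicy(min_age_days=7, max_age_days=365, min_messages=50, max_messages=10000)` would keep the most recent 50 emails in local storage regardless of age, as described above.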


Dan


Re: [Dovecot] Spliting Folders for Efficiency

2007-10-11 Thread Curtis Maloney

Daniel Watts wrote:

> Dear Timo,
>
> Would there be any sense in giving Dovecot the option to split folders
> into multiple subfolders when they reached a specified size (probably
> message count) limit?

My understanding is this is partially covered in Timo's "dbox" format, which 
tries to take the best features of mbox and Maildir.

> .Folder.new
> .Folder.cur
> .Folder.tmp
>
> could become:
>
> .Folder__1.new
> .Folder__1.cur
> .Folder__1.tmp
> and
> .Folder__2.new
> .Folder__2.cur
> .Folder__2.tmp

You would only need to split "cur", unless you expect someone to have over 10,000 
new messages waiting.  "tmp" is only used _whilst_ messages are being delivered, 
so mail clients don't see a partially written message.

> This could be further extended so that Dovecot could be configured to
> store 'old' message folders in a separate location. We could then have
> slower+cheaper+larger storage mounted so that 'old mail' does not take
> up the expensive local SCSI disks on the machine. Mail from 2 years ago
> is much less likely to be accessed than mail from the last week.

Also, instead of __N, you could try a different path, so /foo/bar/User/ is for 
new mail, and /old/slow/disk/User is for older stuff.

> This would provide very neat behind-the-scenes archiving functionality.

There are really two ideas here... one is the mechanism of multi-directory 
folders, the other is the policy of separating by age.


--
Curtis Maloney
[EMAIL PROTECTED]



[Dovecot] Spliting Folders for Efficiency

2007-10-11 Thread Daniel Watts

Dear Timo,

Would there be any sense in giving Dovecot the option to split folders 
into multiple subfolders when they reached a specified size (probably 
message count) limit?


Dovecot would monitor folders and when they reached, say, 10,000 
messages, silently split the folder on the filesystem to ensure that 
access remains fast.


I know that Dovecot scales very well but this would give practically 
unlimited storage capability and also keep things fast. You could even 
have it so that the latest 100 messages are kept in their own folder for 
fast access.


.Folder.new
.Folder.cur
.Folder.tmp

could become:

.Folder__1.new
.Folder__1.cur
.Folder__1.tmp
and
.Folder__2.new
.Folder__2.cur
.Folder__2.tmp

with Dovecot merging them before display as just "Folder" within the 
mail client.


This could be further extended so that Dovecot could be configured to 
store 'old' message folders in a separate location. We could then have 
slower+cheaper+larger storage mounted so that 'old mail' does not take 
up the expensive local SCSI disks on the machine. Mail from 2 years ago 
is much less likely to be accessed than mail from the last week.


This would provide very neat behind-the-scenes archiving functionality.

Looking forward to hearing your thoughts.

Best,
Daniel

--
Squirrelmail Stable 1.4.8 (and developing on 1.5.2)
PHP 5.x Hardened with Eaccelerator
Apache 2.x
Mysql 5.0.x
Imapproxy over Dovecot 1.0.rc27 with Maildir
all running on Gentoo Linux
for ~5,000 users.