Re: [Dovecot] Integrating Dovecot with Amazon Web Services

2012-06-28 Thread Charles Marcus

On 2012-06-28 4:22 PM, Alex Crow  wrote:

On 28/06/12 20:28, Charles Marcus wrote:

On 2012-06-28 2:04 PM, Gary Mort  wrote:

That's probably due to the different structures they use.   sdbox
can safely use either because each email message has a unique
filename, and if it exists in both places it doesn't matter.



Eh?? Sdbox is like mbox - one file per mailbox/folder... it is NOT
like maildir (one email = one file).



Not according to the wiki:

http://wiki2.dovecot.org/MailboxFormat/dbox

dbox can be used in two ways:

 single-dbox (sdbox in mail location): One message per file,
similar to Maildir. For backwards compatibility, dbox is an alias to
sdbox in mail_location.


Now how  the heck did I remember that so wrong??

Oh well, thanks for the correction...

Sorry, OP...

--

Best regards,

Charles


Re: [Dovecot] Integrating Dovecot with Amazon Web Services

2012-06-28 Thread Alex Crow

On 28/06/12 20:28, Charles Marcus wrote:

On 2012-06-28 2:04 PM, Gary Mort  wrote:

That's probably due to the different structures they use.   sdbox
can safely use either because each email message has a unique
filename, and if it exists in both places it doesn't matter.


Eh?? Sdbox is like mbox - one file per mailbox/folder... it is NOT 
like maildir (one email = one file).




Not according to the wiki:

http://wiki2.dovecot.org/MailboxFormat/dbox

   dbox can be used in two ways:

single-dbox (sdbox in mail location): One message per file,
   similar to Maildir. For backwards compatibility, dbox is an alias to
   sdbox in mail_location.

multi-dbox (mdbox in mail location): Multiple messages per
   file, but unlike mbox multiple files per mailbox.


So the parent appears to be right.

Alex




Re: [Dovecot] Integrating Dovecot with Amazon Web Services

2012-06-28 Thread Charles Marcus

On 2012-06-28 2:04 PM, Gary Mort  wrote:

That's probably due to the different structures they use.   sdbox
can safely use either because each email message has a unique
filename, and if it exists in both places it doesn't matter.


Eh?? Sdbox is like mbox - one file per mailbox/folder... it is NOT like 
maildir (one email = one file).



mdbox though is different, multiple messages are stored in a single
file.


The diff between mdbox and sdbox is sdbox puts all messages for any 
given mailbox/folder in one sdbox file (just like mbox). Mdbox has a 
setting for the max filesize of the dbox file, and once an mdbox file 
exceeds that size, it creates a new mdbox file to start adding messages to.


--

Best regards,

Charles


Re: [Dovecot] Integrating Dovecot with Amazon Web Services

2012-06-28 Thread Timo Sirainen
On 28.6.2012, at 21.04, Gary Mort wrote:

> mdbox though is different, multiple messages are stored in a single file.
> The index indicates in which file each message is located.  When the data
> is moved to alt storage, the filename can change in which case the index is
> updated.
> IE:
> Primary/Msg06282012 -- contains Msg007, Msg008, Msg009
> Primary/Msg06272012 -- contains Msg004, Msg005, Msg006
> Primary/Msg06262012 -- contains Msg001, Msg002, Msg003
> 
> along comes archiving and the new format is:
> Primary/Msg06292012 -- contains Msg010, Msg011, Msg012
> Primary/Msg06282012 -- contains Msg007,  Msg009
> Primary/Msg06272012 -- contains Msg004,  Msg006
> Primary/Msg06262012 -- contains Msg003
> Alt/Msg06292012 -- contains Msg001, Msg002, Msg005, Msg008

Yes, doveadm altmove works like this now.

> Since the archive rules can be based on a lot of different scenarios[and a
> message can even be archived from the command line], the filenames between
> Primary and Alternate are not the same - and in fact the same filename in
> each place could have different messages.  For example: if messages are
> archived when a user sets an imap flag on them.

There shouldn't normally ever be a situation where the same filename is used in 
both storages, because every time a new file is created to either of the 
storages a new unique number is used.

> So with the way it's written now, it's not possible to have a simple
> fallback by filename.
> 
> It would be possible if the naming convention was strictly enforced, ie
> after archiving you have:
> Primary/Msg06292012 -- contains Msg010, Msg011, Msg012
> Primary/Msg06282012 -- contains Msg007,  Msg009
> Primary/Msg06272012 -- contains Msg004,  Msg006
> Primary/Msg06262012 -- contains Msg003
> Alt/Msg06282012 -- contains Msg008
> Alt/Msg06272012 -- contains Msg005
> Alt/Msg06262012 -- contains Msg001, Msg002
> 
> Now the index can simply say what file a message is in and doesn't have to
> specify primary or secondary, and the primary file with that name can be
> checked first, and then if it is not there check the alternate.

This already works like that on the reading side. If you did altmoving by "mv 
m.123 /altstorage/..." instead of doveadm it would work.

Re: [Dovecot] Integrating Dovecot with Amazon Web Services

2012-06-28 Thread Gary Mort
On Thu, Jun 28, 2012 at 1:21 PM, Timo Sirainen  wrote:

> On 28.6.2012, at 20.14, Timo Sirainen wrote:
>
> >> "An upshot of the way alternate storage works is that any given storage
> >> file (mailboxes//dbox-Mails/u.* (sdbox) or storage/m.* (mdbox)) can
> >> only appear *either* in the primary storage area *or* the alternate
> >> storage area but not both — if the corresponding file appears in both
> >> areas then there is an inconsistency."
> >
> > Whoever wrote that wasn't exactly correct (or clear). There's no problem
> > having the same file in both primary and alt storage. Only if the files are
> > different there's a problem, but that shouldn't happen..
>
> Hmm. Although looking at the mdbox index rebuilding code:
>
>   /* duplicate file. either readdir() returned it twice
>      (unlikely) or it exists in both alt and primary storage.
>      to make sure we don't lose any mails from either of the
>      files, give this file a new ID and rename it. */
>
> It probably shouldn't be doing that. sdbox isn't doing that:
>
>   /* we were supposed to open the file in alt storage, but it
>      exists in primary storage as well. skip it to avoid adding
>      it twice. */
>
>
That's probably due to the different structures they use.   sdbox can
safely use either because each email message has a unique filename, and if
it exists in both places it doesn't matter.

mdbox though is different, multiple messages are stored in a single file.
 The index indicates in which file each message is located.  When the data
is moved to alt storage, the filename can change in which case the index is
updated.
IE:
Primary/Msg06282012 -- contains Msg007, Msg008, Msg009
Primary/Msg06272012 -- contains Msg004, Msg005, Msg006
Primary/Msg06262012 -- contains Msg001, Msg002, Msg003

along comes archiving and the new format is:
Primary/Msg06292012 -- contains Msg010, Msg011, Msg012
Primary/Msg06282012 -- contains Msg007,  Msg009
Primary/Msg06272012 -- contains Msg004,  Msg006
Primary/Msg06262012 -- contains Msg003
Alt/Msg06292012 -- contains Msg001, Msg002, Msg005, Msg008

Since the archive rules can be based on a lot of different scenarios[and a
message can even be archived from the command line], the filenames between
Primary and Alternate are not the same - and in fact the same filename in
each place could have different messages.  For example: if messages are
archived when a user sets an imap flag on them.

So with the way it's written now, it's not possible to have a simple
fallback by filename.

It would be possible if the naming convention was strictly enforced, ie
after archiving you have:
Primary/Msg06292012 -- contains Msg010, Msg011, Msg012
Primary/Msg06282012 -- contains Msg007,  Msg009
Primary/Msg06272012 -- contains Msg004,  Msg006
Primary/Msg06262012 -- contains Msg003
Alt/Msg06282012 -- contains Msg008
Alt/Msg06272012 -- contains Msg005
Alt/Msg06262012 -- contains Msg001, Msg002

Now the index can simply say what file a message is in and doesn't have to
specify primary or secondary, and the primary file with that name can be
checked first, and then if it is not there check the alternate.
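
The lookup order described above (primary first, then alternate) can be
sketched roughly as follows. This is only an illustration in Python, not
Dovecot's C code; the function name find_storage_file and the directory
layout are made up for the example.

```python
import os
import tempfile


def find_storage_file(name, primary_dir, alt_dir):
    """Return the path of `name`, preferring primary over alternate storage.

    Hypothetical helper: returns None when the file is in neither area.
    """
    for base in (primary_dir, alt_dir):
        path = os.path.join(base, name)
        if os.path.exists(path):
            return path
    return None


# Demo with throwaway directories standing in for the two storage areas.
with tempfile.TemporaryDirectory() as root:
    primary = os.path.join(root, "primary")
    alt = os.path.join(root, "alt")
    os.makedirs(primary)
    os.makedirs(alt)
    # m.1 was archived (alt only); m.2 exists in both places.
    open(os.path.join(alt, "m.1"), "w").close()
    open(os.path.join(primary, "m.2"), "w").close()
    open(os.path.join(alt, "m.2"), "w").close()

    for name in ("m.1", "m.2", "m.3"):
        found = find_storage_file(name, primary, alt)
        where = os.path.relpath(found, root) if found else "not found"
        print(name, "->", where)
```

With this order the index only needs to record a filename; which storage
area the file currently lives in is resolved at read time.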


Re: [Dovecot] Integrating Dovecot with Amazon Web Services

2012-06-28 Thread Timo Sirainen
On 28.6.2012, at 20.55, Gary Mort wrote:

>> The indexes have to be in primary storage.
>> 
> True, but the data they are based on I'm assuming does not include the full
> email message, just a few key pieces:
> uniqueid, subject, from, to, etc.
> 
> For an always running server, the indexes are always up to date in primary.
> 
> For a server starting up with no index data, it will need to rebuild the
> index information[or for a second server running when new email has been
> delivered].
> As such, rather than download every single email message just for a few
> bits of key info, I can run a re-index process to pull just the meta
> information and grab the data from there.

With sdbox you can't lose index files without also losing all message flags. 
And in general sdbox assumes that indexes are always up to date.

>>> When a client attempts to retrieve an email message, Dovecot would check
>>> primary storage as it does now, if the message is not found then it will
>>> retrieve it from the alternate storage system AND store a copy in the
>>> primary storage.
>> 
>> I think the storing wouldn't be very useful. Most clients download the
>> message once. There's no reason to cache it if it doesn't get downloaded
>> again. The way it should work is that new mails are immediately delivered to
>> both primary and alt storage.
>> 
>> 
> I've got tons of space - so I don't mind having 750MB or so for primary
> email message storage.   If I can track how many times a message was
> actually read, over time I can get an idea of how I use it and set up the
> primary storage purge rules accordingly.

I'd be interested in knowing what those statistics will end up looking like. My 
guess is that it's not worth coding such a feature, but of course some real world 
data would be better than my guesses :)

>>> Secondly, I'd like to replace the MySQL database usage with a simpleDB
>>> database.  While simpleDB lacks much of MySQL's sophistication, it doesn't
>>> seem that Dovecot is really using any of that, so simpleDB can be
>>> functionally equivalent.
>> 
>> Dovecot will probably get a Redis and/or memcache backend for passdb+userdb.
>> If simpledb is a similar key-value database I guess the same code could be
>> used partially.
>> 
>> 
> simpleDB is more like SQLite:
..
> You query the data like an SQL table:
> http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/UsingSelect.html

OK, so that would mean implementing a lib-sql driver for SimpleDB and using the 
sql passdb/userdb.

Re: [Dovecot] Integrating Dovecot with Amazon Web Services

2012-06-28 Thread Gary Mort
On Thu, Jun 28, 2012 at 1:14 PM, Timo Sirainen  wrote:

> On 28.6.2012, at 17.43, Gary Mort wrote:
> > First I want to add AWS S3 as a storage option for alternate storage.
> >
> > Then instead of the above model, the new model would be that email is
> > always stored in alternate storage, and may be in primary storage.  So,
> > when mail comes in, I'd have Dovecot save the email to the alternate
> > storage S3 bucket and update the indexes and other information[ideally,
> > for convenience purposes, a few bits of relevant indexing information can
> > be stored as metadata in the S3 object  - sufficient so that instead of
> > retrieving the entire S3 object, just the meta data can be pulled to
> > build indexes].
>
> The indexes have to be in primary storage.
>
>
True, but the data they are based on I'm assuming does not include the full
email message, just a few key pieces:
uniqueid, subject, from, to, etc.

For an always running server, the indexes are always up to date in primary.

For a server starting up with no index data, it will need to rebuild the
index information[or for a second server running when new email has been
delivered].
As such, rather than download every single email message just for a few
bits of key info, I can run a re-index process to pull just the meta
information and grab the data from there.


> > When a client attempts to retrieve an email message, Dovecot would check
> > primary storage as it does now, if the message is not found then it will
> > retrieve it from the alternate storage system AND store a copy in the
> > primary storage.
>
> I think the storing wouldn't be very useful. Most clients download the
> message once. There's no reason to cache it if it doesn't get downloaded
> again. The way it should work is that new mails are immediately delivered to
> both primary and alt storage.
>
>
I've got tons of space - so I don't mind having 750MB or so for primary
email message storage.   If I can track how many times a message was
actually read, over time I can get an idea of how I use it and set up the
primary storage purge rules accordingly.
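
A rough sketch of the read-through idea with read counting, in Python for
brevity. Everything here is hypothetical: plain dicts stand in for the S3
bucket and for primary storage, and the class and method names are not
part of Dovecot.

```python
from collections import Counter


class ReadThroughStore:
    """Alt storage always has the message; primary acts as a local cache."""

    def __init__(self, alt):
        self.alt = alt          # dict standing in for the S3 bucket
        self.primary = {}       # dict standing in for primary storage
        self.reads = Counter()  # how many times each message was fetched

    def deliver(self, msg_id, body):
        # New mail always lands in alternate storage.
        self.alt[msg_id] = body

    def fetch(self, msg_id):
        if msg_id not in self.primary:
            # Miss: pull from alt storage and keep a copy in primary.
            # Raises KeyError if the message is unknown everywhere.
            self.primary[msg_id] = self.alt[msg_id]
        self.reads[msg_id] += 1
        return self.primary[msg_id]

    def purge_candidates(self, max_reads=1):
        """Cached messages read at most `max_reads` times."""
        return sorted(m for m in self.primary if self.reads[m] <= max_reads)


store = ReadThroughStore(alt={})
store.deliver("Msg001", b"hello")
store.deliver("Msg002", b"world")
store.fetch("Msg001")
store.fetch("Msg001")
store.fetch("Msg002")
print(store.purge_candidates())
```

The read counts are the statistic of interest: if nearly every cached
message shows up as a purge candidate, the primary copy is not buying
much.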


> > Secondly, I'd like to replace the MySQL database usage with a simpleDB
> > database.  While simpleDB lacks much of MySQL's sophistication, it
> > doesn't seem that Dovecot is really using any of that, so simpleDB can be
> > functionally equivalent.
>
> Dovecot will probably get a Redis and/or memcache backend for passdb+userdb.
> If simpledb is a similar key-value database I guess the same code could be
> used partially.
>
>
simpleDB is more like SQLite:
"Amazon SimpleDB is a highly available and flexible non-relational data
store that offloads the work of database administration. Developers simply
store and query data items via web services requests and Amazon SimpleDB
does the rest."
http://aws.amazon.com/simpledb/

Data model:
http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/DataModel.html

Domain == Table
Item == row
ItemName == primary key
Attributes == column
Value == data in column[multi value, so there can be multiple values for an
attribute of an item]

There is no built in key relationship between data, it's just one big flat
table.   Attribute values are all stored as strings.

You query the data like an SQL table:
http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/UsingSelect.html


Because there is no date type and comparisons are done on strings, it's best
to store dates as UTC timestamps zero-padded to a fixed width, so that they
sort and compare correctly.
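
One detail worth spelling out: SimpleDB compares attribute values as
strings, so a timestamp only orders correctly if it is padded to a fixed
width. A small sketch in Python; the helper names and the width of 10
digits are my own choices for the example.

```python
from datetime import datetime, timezone

# Assumed fixed width: 10 digits covers Unix timestamps up to the year 2286.
WIDTH = 10


def encode_timestamp(dt):
    """Encode a timezone-aware datetime as a zero-padded Unix timestamp."""
    return str(int(dt.timestamp())).zfill(WIDTH)


def decode_timestamp(value):
    """Turn the stored string back into a UTC datetime."""
    return datetime.fromtimestamp(int(value), tz=timezone.utc)


older = datetime(2001, 9, 9, 1, 46, 39, tzinfo=timezone.utc)
newer = datetime(2012, 6, 28, tzinfo=timezone.utc)

# Without padding, "999999999" would sort after "1340841600" as a string.
print(encode_timestamp(older), encode_timestamp(newer))
print(encode_timestamp(older) < encode_timestamp(newer))
```

Pass timezone-aware datetimes only; a naive datetime would be interpreted
in the local timezone by timestamp().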

The datastore is spread over multiple Amazon data servers and can take up
to a second to sync, so there are two methods of querying the data.
Default: eventually consistent read: get the data quickly
Optional: consistent read: check /all/ datastores and get the latest data

Since the data in simpleDB may not be updated frequently, a simple hack
using the notification system could be:
Before updating simpleDB send SNS notice that the data is being updated and
where[domain, user, config]
Update Data
After updating simpleDB send SNS notice that the update is complete

Other servers running can record data updating notices in memory and expire
them in about 15 seconds.   For any queries they want to make for that type
of data in the next 15 seconds, they will use consistent read.
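
The bookkeeping on the receiving servers could look something like the
sketch below (Python, all names hypothetical). It only tracks which keys
were recently announced as changing; wiring it to SNS and passing the
result as the ConsistentRead option of a SimpleDB Select is left out.

```python
import time


class UpdateNoticeTracker:
    """Remember update notices for `ttl` seconds, keyed by e.g. (domain, user)."""

    def __init__(self, ttl=15.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock   # injectable so the expiry can be tested
        self._expires = {}   # key -> time at which the notice expires

    def record_notice(self, key):
        # Called when an "about to update" or "update complete" notice arrives.
        self._expires[key] = self.clock() + self.ttl

    def needs_consistent_read(self, key):
        expiry = self._expires.get(key)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            # Notice is stale: forget it and fall back to the default read.
            del self._expires[key]
            return False
        return True


tracker = UpdateNoticeTracker(ttl=15.0)
tracker.record_notice(("example.com", "gary"))
print(tracker.needs_consistent_read(("example.com", "gary")))
print(tracker.needs_consistent_read(("example.com", "timo")))
```

Recording both the "before" and "after" notices restarts the 15 second
window, so the window is measured from the end of the update.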


The nice thing about using S3 and simpleDB is that I can completely skip a
lot of steps in replication/distributed services as it is all handled
already.  And one can always take one set of api calls and substitute
another for a different notification system, distributed database, and
cloud file storage.


Re: [Dovecot] Integrating Dovecot with Amazon Web Services

2012-06-28 Thread Timo Sirainen
On 28.6.2012, at 20.21, Timo Sirainen wrote:

> On 28.6.2012, at 20.14, Timo Sirainen wrote:
> 
>>> "An upshot of the way alternate storage works is that any given storage
>>> file (mailboxes//dbox-Mails/u.* (sdbox) or storage/m.* (mdbox)) can
>>> only appear *either* in the primary storage area *or* the alternate storage
>>> area but not both — if the corresponding file appears in both areas then
>>> there is an inconsistency."
>> 
>> Whoever wrote that wasn't exactly correct (or clear). There's no problem 
>> having the same file in both primary and alt storage. Only if the files are 
>> different there's a problem, but that shouldn't happen..
> 
> Hmm. Although looking at the mdbox index rebuilding code:
> 
>   /* duplicate file. either readdir() returned it twice
>  (unlikely) or it exists in both alt and primary storage.
>  to make sure we don't lose any mails from either of the
>  files, give this file a new ID and rename it. */
> 
> It probably shouldn't be doing that.

Hmm. I already implemented this by having it ignore the problem if the files 
have the same sizes, but then started wondering if there's really any point in 
doing that. m.* files can be appended to later, and altmoving always creates 
files with new numbers, and even if it does renaming there's duplicate 
suppression, so .. I guess there wasn't any point in doing that after all.

Re: [Dovecot] Integrating Dovecot with Amazon Web Services

2012-06-28 Thread Timo Sirainen
On 28.6.2012, at 20.14, Timo Sirainen wrote:

>> "An upshot of the way alternate storage works is that any given storage
>> file (mailboxes//dbox-Mails/u.* (sdbox) or storage/m.* (mdbox)) can
>> only appear *either* in the primary storage area *or* the alternate storage
>> area but not both — if the corresponding file appears in both areas then
>> there is an inconsistency."
> 
> Whoever wrote that wasn't exactly correct (or clear). There's no problem 
> having the same file in both primary and alt storage. Only if the files are 
> different there's a problem, but that shouldn't happen..

Hmm. Although looking at the mdbox index rebuilding code:

/* duplicate file. either readdir() returned it twice
   (unlikely) or it exists in both alt and primary storage.
   to make sure we don't lose any mails from either of the
   files, give this file a new ID and rename it. */

It probably shouldn't be doing that. sdbox isn't doing that:

/* we were supposed to open the file in alt storage, but it
   exists in primary storage as well. skip it to avoid adding
   it twice. */



Re: [Dovecot] Integrating Dovecot with Amazon Web Services

2012-06-28 Thread Timo Sirainen
On 28.6.2012, at 17.43, Gary Mort wrote:

> http://wiki2.dovecot.org/MailboxFormat/dbox
> 
> To make life easy, I'll stick with just single-dbox as a start, however
> multi-dbox would be doable.
> 
> With dbox, the only thing that I need to change is the alternate storage
> model:
> "An upshot of the way alternate storage works is that any given storage
> file (mailboxes//dbox-Mails/u.* (sdbox) or storage/m.* (mdbox)) can
> only appear *either* in the primary storage area *or* the alternate storage
> area but not both — if the corresponding file appears in both areas then
> there is an inconsistency."

Whoever wrote that wasn't exactly correct (or clear). There's no problem having 
the same file in both primary and alt storage. Only if the files are different 
there's a problem, but that shouldn't happen..

> First I want to add AWS S3 as a storage option for alternate storage.
> 
> Then instead of the above model, the new model would be that email is
> always stored in alternate storage, and may be in primary storage.  So,
> when mail comes in, I'd have Dovecot save the email to the alternate
> storage S3 bucket and update the indexes and other information[ideally, for
> convenience purposes, a few bits of relevant indexing information can be
> stored as metadata in the S3 object  - sufficient so that instead of
> retrieving the entire S3 object, just the meta data can be pulled to build
> indexes].

The indexes have to be in primary storage.

> When a client attempts to retrieve an email message, Dovecot would check
> primary storage as it does now, if the message is not found then it will
> retrieve it from the alternate storage system AND store a copy in the
> primary storage.

I think the storing wouldn't be very useful. Most clients download the message 
once. There's no reason to cache it if it doesn't get downloaded again. The way 
it should work is that new mails are immediately delivered to both primary and 
alt storage.

> Secondly, I'd like to replace the MySQL database usage with a simpleDB
> database.  While simpleDB lacks much of MySQL's sophistication, it doesn't
> seem that Dovecot is really using any of that, so simpleDB can be
> functionally equivalent.

Dovecot will probably get a Redis and/or memcache backend for passdb+userdb. If 
simpledb is a similar key-value database I guess the same code could be used 
partially.