Re: [Dovecot] Integrating Dovecot with Amazon Web Services
On 2012-06-28 4:22 PM, Alex Crow wrote:
> On 28/06/12 20:28, Charles Marcus wrote:
>> On 2012-06-28 2:04 PM, Gary Mort wrote:
>>> That's probably due to the different structures they use. sdbox can
>>> safely use either because each email message has a unique filename,
>>> and if it exists in both places it doesn't matter.
>>
>> Eh?? Sdbox is like mbox - one file per mailbox/folder... it is NOT
>> like maildir (one email = one file).
>
> Not according to the wiki:
>
> http://wiki2.dovecot.org/MailboxFormat/dbox
>
> dbox can be used in two ways:
>
> single-dbox (sdbox in mail location): One message per file, similar to
> Maildir. For backwards compatibility, dbox is an alias to sdbox in
> mail_location.

Now how the heck did I remember that so wrong??

Oh well, thanks for the correction...

Sorry, OP...

--
Best regards,
Charles
Re: [Dovecot] Integrating Dovecot with Amazon Web Services
On 28/06/12 20:28, Charles Marcus wrote:
> On 2012-06-28 2:04 PM, Gary Mort wrote:
>> That's probably due to the different structures they use. sdbox can
>> safely use either because each email message has a unique filename,
>> and if it exists in both places it doesn't matter.
>
> Eh?? Sdbox is like mbox - one file per mailbox/folder... it is NOT
> like maildir (one email = one file).

Not according to the wiki:

http://wiki2.dovecot.org/MailboxFormat/dbox

dbox can be used in two ways:

single-dbox (sdbox in mail location): One message per file, similar to Maildir. For backwards compatibility, dbox is an alias to sdbox in mail_location.

multi-dbox (mdbox in mail location): Multiple messages per file, but unlike mbox multiple files per mailbox.

So the parent appears to be right.

Alex
Re: [Dovecot] Integrating Dovecot with Amazon Web Services
On 2012-06-28 2:04 PM, Gary Mort wrote:
> That's probably due to the different structures they use. sdbox can
> safely use either because each email message has a unique filename,
> and if it exists in both places it doesn't matter.

Eh?? Sdbox is like mbox - one file per mailbox/folder... it is NOT like maildir (one email = one file).

> mdbox though is different, multiple messages are stored in a single file.

The diff between mdbox and sdbox is sdbox puts all messages for any given mailbox/folder in one sdbox file (just like mbox). Sdbox has a setting for the max filesize of the dbox file, and once an mdbox file exceeds that size, it creates a new mdbox file to start adding messages to.

--
Best regards,
Charles
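The size-based rotation described in the last sentence above can be illustrated with a toy model. This is only a sketch, not Dovecot's code: the class name, the `rotate_size` parameter, and the `m.N` naming are assumptions chosen for illustration (in Dovecot the behaviour belongs to mdbox and is controlled by the `mdbox_rotate_size` setting).

```python
from dataclasses import dataclass, field


@dataclass
class MdboxSim:
    """Toy model of size-based storage file rotation (hypothetical)."""

    rotate_size: int                           # assumed limit, in bytes
    files: dict = field(default_factory=dict)  # file id -> list of message sizes
    current_id: int = 1

    def append(self, msg_size: int) -> str:
        """Store one message and return the name of the file it went to."""
        msgs = self.files.setdefault(self.current_id, [])
        # Start a new file once the current one has exceeded the limit.
        if msgs and sum(msgs) > self.rotate_size:
            self.current_id += 1
            msgs = self.files.setdefault(self.current_id, [])
        msgs.append(msg_size)
        return f"m.{self.current_id}"
```

With a limit of 100 bytes, two 60-byte messages share the first file and the next message opens a second one, so a mailbox ends up spread over several files rather than one.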
Re: [Dovecot] Integrating Dovecot with Amazon Web Services
On 28.6.2012, at 21.04, Gary Mort wrote:

> mdbox though is different, multiple messages are stored in a single file.
> The index indicates in which file each message is located. When the data
> is moved to alt storage, the filename can change in which case the index is
> updated.
> IE:
> Primary/Msg06282012 -- contains Msg007, Msg008, Msg009
> Primary/Msg06272012 -- contains Msg004, Msg005, Msg006
> Primary/Msg06262012 -- contains Msg001, Msg002, Msg003
>
> along comes archiving and the new format is:
> Primary/Msg06292012 -- contains Msg010, Msg011, Msg012
> Primary/Msg06282012 -- contains Msg007, Msg009
> Primary/Msg06272012 -- contains Msg004, Msg006
> Primary/Msg06262012 -- contains Msg003
> Alt/Msg06292012 -- contains Msg001, Msg002, Msg005, Msg008

Yes, doveadm altmove works like this now.

> Since the archive rules can be based on a lot of different scenarios [and a
> message can even be archived from the command line], the filenames between
> Primary and Alternate are not the same - and in fact the same filename in
> each place could have different messages. For example: if messages are
> archived when a user sets an imap flag on them.

There shouldn't normally ever be a situation where the same filename is used in both storages, because every time a new file is created in either of the storages a new unique number is used.

> So with the way it's written now, it's not possible to have a simple
> fallback by filename.
>
> It would be possible if the naming convention was strictly enforced, ie
> after archiving you have:
> Primary/Msg06292012 -- contains Msg010, Msg011, Msg012
> Primary/Msg06282012 -- contains Msg007, Msg009
> Primary/Msg06272012 -- contains Msg004, Msg006
> Primary/Msg06262012 -- contains Msg003
> Alt/Msg06282012 -- contains Msg008
> Alt/Msg06272012 -- contains Msg005
> Alt/Msg06262012 -- contains Msg001, Msg002
>
> Now the index can simply say what file a message is in and doesn't have to
> specify primary or secondary, and the primary file with that name can be
> checked first, and then if it is not there check the alternate.

This already works like that on the reading side. If you did altmoving by "mv m.123 /altstorage/..." instead of doveadm it would work.
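The read-side behaviour mentioned in the reply (look in primary storage first, then in alt storage under the same file name) can be sketched as follows. This is an illustration only: `find_storage_file` and its parameters are hypothetical names, and the real lookup lives in Dovecot's C code.

```python
from pathlib import Path
from typing import Optional


def find_storage_file(name: str, primary: Path, alt: Path) -> Optional[Path]:
    """Return the path of storage file `name`, preferring primary over alt.

    `primary` and `alt` are the two storage directories (assumed layout).
    Returns None when the file exists in neither place.
    """
    for root in (primary, alt):
        candidate = root / name
        if candidate.is_file():
            return candidate
    return None
```

Under this model a file moved with a plain `mv` from the primary directory to the alt directory is still found, because only the directory changes and the name stays the same.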
Re: [Dovecot] Integrating Dovecot with Amazon Web Services
On Thu, Jun 28, 2012 at 1:21 PM, Timo Sirainen wrote:

> On 28.6.2012, at 20.14, Timo Sirainen wrote:
>
>>> "An upshot of the way alternate storage works is that any given storage
>>> file (mailboxes//dbox-Mails/u.* (sdbox) or storage/m.* (mdbox)) can
>>> only appear *either* in the primary storage area *or* the alternate storage
>>> area but not both — if the corresponding file appears in both areas then
>>> there is an inconsistency."
>>
>> Whoever wrote that wasn't exactly correct (or clear). There's no problem
>> having the same file in both primary and alt storage. Only if the files are
>> different there's a problem, but that shouldn't happen..
>
> Hmm. Although looking at the mdbox index rebuilding code:
>
>     /* duplicate file. either readdir() returned it twice
>        (unlikely) or it exists in both alt and primary storage.
>        to make sure we don't lose any mails from either of the
>        files, give this file a new ID and rename it. */
>
> It probably shouldn't be doing that. sdbox isn't doing that:
>
>     /* we were supposed to open the file in alt storage, but it
>        exists in primary storage as well. skip it to avoid adding
>        it twice. */

That's probably due to the different structures they use. sdbox can safely use either because each email message has a unique filename, and if it exists in both places it doesn't matter.

mdbox though is different, multiple messages are stored in a single file. The index indicates in which file each message is located. When the data is moved to alt storage, the filename can change, in which case the index is updated.

IE:
Primary/Msg06282012 -- contains Msg007, Msg008, Msg009
Primary/Msg06272012 -- contains Msg004, Msg005, Msg006
Primary/Msg06262012 -- contains Msg001, Msg002, Msg003

along comes archiving and the new format is:
Primary/Msg06292012 -- contains Msg010, Msg011, Msg012
Primary/Msg06282012 -- contains Msg007, Msg009
Primary/Msg06272012 -- contains Msg004, Msg006
Primary/Msg06262012 -- contains Msg003
Alt/Msg06292012 -- contains Msg001, Msg002, Msg005, Msg008

Since the archive rules can be based on a lot of different scenarios [and a message can even be archived from the command line], the filenames between Primary and Alternate are not the same - and in fact the same filename in each place could have different messages. For example: if messages are archived when a user sets an imap flag on them.

So with the way it's written now, it's not possible to have a simple fallback by filename.

It would be possible if the naming convention was strictly enforced, ie after archiving you have:
Primary/Msg06292012 -- contains Msg010, Msg011, Msg012
Primary/Msg06282012 -- contains Msg007, Msg009
Primary/Msg06272012 -- contains Msg004, Msg006
Primary/Msg06262012 -- contains Msg003
Alt/Msg06282012 -- contains Msg008
Alt/Msg06272012 -- contains Msg005
Alt/Msg06262012 -- contains Msg001, Msg002

Now the index can simply say what file a message is in and doesn't have to specify primary or secondary, and the primary file with that name can be checked first, and then if it is not there check the alternate.
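The strict naming convention proposed above can be tried out on the example data with a few lines of Python. This is a sketch of the proposal only, not of how `doveadm altmove` behaves; the function names and the dict-based "storage" are assumptions made for illustration.

```python
def archive_keep_names(primary: dict, to_archive: set) -> dict:
    """Move the selected messages to alt storage under the same file name.

    `primary` maps file name -> list of message ids and is updated in place.
    Returns the resulting alt mapping (file name -> list of message ids).
    """
    alt = {}
    for name in list(primary):
        moved = [m for m in primary[name] if m in to_archive]
        if moved:
            alt[name] = moved
            primary[name] = [m for m in primary[name] if m not in to_archive]
    return alt


def locate(msg: str, index: dict, primary: dict, alt: dict):
    """Find a message using only its file name: primary first, then alt."""
    name = index[msg]
    if msg in primary.get(name, []):
        return "Primary/" + name
    if msg in alt.get(name, []):
        return "Alt/" + name
    return None
```

Because a message never changes file name when it is archived, the index built before archiving stays valid afterwards, which is the point of the proposal.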
Re: [Dovecot] Integrating Dovecot with Amazon Web Services
On 28.6.2012, at 20.55, Gary Mort wrote:

>> The indexes have to be in primary storage.
>
> True, but the data they are based on I'm assuming does not include the full
> email message, just a few key pieces:
> uniqueid, subject, from, to, etc.
>
> For an always running server, the indexes are always up to date in primary.
>
> For a server starting up with no index data, it will need to rebuild the
> index information [or for a second server running when new email has been
> delivered].
> As such, rather than download every single email message just for a few
> bits of key info, I can run a re-index process to pull just the meta
> information and grab the data from there.

With sdbox you can't lose index files without also losing all message flags. And in general sdbox assumes that indexes are always up to date.

>>> When a client attempts to retrieve an email message, Dovecot would check
>>> primary storage as it does now, if the message is not found then it will
>>> retrieve it from the alternate storage system AND store a copy in the
>>> primary storage.
>>
>> I think the storing wouldn't be very useful. Most clients download the
>> message once. There's no reason to cache it if it doesn't get downloaded
>> again. The way it should work is that new mails are immediately delivered to
>> both primary and alt storage.
>
> I've got tons of space - so I don't mind having 750MB or so for primary
> email message storage. If I can track how many times a message was
> actually read, over time I can get an idea of how I use it and set up the
> primary storage purge rules accordingly.

I'd be interested in knowing what those statistics will end up looking like. My guess is that it's not worth coding such a feature, but of course some real world data would be better than my guesses :)

>>> Secondly, I'd like to replace the MySQL database usage with a simpleDB
>>> database. While simpleDB lacks much of MySQL's sophistication, it doesn't
>>> seem that Dovecot is really using any of that, so simpleDB can be
>>> functionally equivalent.
>>
>> Dovecot will probably get Redis and/or memcache backend for passdb+userdb.
>> If simpledb is a similar key-value database I guess the same code could be
>> used partially.
>
> simpleDB is more like SQLite:
> ..
> You query the data like an SQL table:
> http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/UsingSelect.html

OK, so that would mean implementing a lib-sql driver for SimpleDB and using the sql passdb/userdb.
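To give a rough idea of what such a driver would have to send, here is a sketch that builds a SimpleDB Select expression for a user lookup. Everything specific in it is an assumption: the domain name `dovecot_users`, the choice of the username as the item name, and the helper names are made up for illustration, and the quoting rules (doubling the quote character) are as I understand them from the Select documentation linked above. No request is actually sent.

```python
def quote_value(value: str) -> str:
    """Single-quote a value, escaping embedded single quotes by doubling."""
    return "'" + value.replace("'", "''") + "'"


def quote_domain(name: str) -> str:
    """Backtick-quote a domain name, escaping embedded backticks by doubling."""
    return "`" + name.replace("`", "``") + "`"


def build_user_select(domain: str, username: str) -> str:
    """Build a Select expression that fetches one user item by item name."""
    return (
        f"select * from {quote_domain(domain)} "
        f"where itemName() = {quote_value(username)}"
    )


# Hypothetical domain and user, for illustration only.
print(build_user_select("dovecot_users", "gary@example.com"))
```

Since each user is a single item keyed by name, a passdb/userdb lookup maps to one Select per login, which is about as much SQL as Dovecot needs here.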
Re: [Dovecot] Integrating Dovecot with Amazon Web Services
On Thu, Jun 28, 2012 at 1:14 PM, Timo Sirainen wrote:

> On 28.6.2012, at 17.43, Gary Mort wrote:
>> First I want to add AWS S3 as a storage option for alternate storage.
>>
>> Then instead of the above model, the new model would be that email is
>> always stored in alternate storage, and may be in primary storage. So,
>> when mail comes in, I'd have Dovecot save the email to the alternate
>> storage S3 bucket and update the indexes and other information [ideally, for
>> convenience purposes, a few bits of relevant indexing information can be
>> stored as metadata in the S3 object - sufficient so that instead of
>> retrieving the entire S3 object, just the meta data can be pulled to build
>> indexes].
>
> The indexes have to be in primary storage.

True, but the data they are based on I'm assuming does not include the full email message, just a few key pieces:
uniqueid, subject, from, to, etc.

For an always running server, the indexes are always up to date in primary.

For a server starting up with no index data, it will need to rebuild the index information [or for a second server running when new email has been delivered].
As such, rather than download every single email message just for a few bits of key info, I can run a re-index process to pull just the meta information and grab the data from there.

>> When a client attempts to retrieve an email message, Dovecot would check
>> primary storage as it does now, if the message is not found then it will
>> retrieve it from the alternate storage system AND store a copy in the
>> primary storage.
>
> I think the storing wouldn't be very useful. Most clients download the
> message once. There's no reason to cache it if it doesn't get downloaded
> again. The way it should work is that new mails are immediately delivered to
> both primary and alt storage.

I've got tons of space - so I don't mind having 750MB or so for primary email message storage. If I can track how many times a message was actually read, over time I can get an idea of how I use it and set up the primary storage purge rules accordingly.

>> Secondly, I'd like to replace the MySQL database usage with a simpleDB
>> database. While simpleDB lacks much of MySQL's sophistication, it doesn't
>> seem that Dovecot is really using any of that, so simpleDB can be
>> functionally equivalent.
>
> Dovecot will probably get Redis and/or memcache backend for passdb+userdb.
> If simpledb is a similar key-value database I guess the same code could be
> used partially.

simpleDB is more like SQLite:

"Amazon SimpleDB is a highly available and flexible non-relational data store that offloads the work of database administration. Developers simply store and query data items via web services requests and Amazon SimpleDB does the rest."
http://aws.amazon.com/simpledb/

Data model:
http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/DataModel.html

Domain == Table
Item == row
ItemName == primary key
Attributes == column
Value == data in column [multi value, so there can be multiple values for an attribute of an item]

There is no built in key relationship between data, it's just one big flat table.
Columns/Attributes only have 2 types, string or integer

You query the data like an SQL table:
http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/UsingSelect.html

Because there are no dates, it's best to store dates as UTC timestamps which are integers and can then be compared against numerically.

The datastore is spread over multiple Amazon data servers and can take up to a second to sync, so there are two methods of querying the data.
Default: eventually consistent read: get the data quickly
Optional: consistent read: check /all/ datastores and get the latest data

Since the data in simpleDB may not be updated frequently, a simple hack using the notification system could be:

Before updating simpleDB, send an SNS notice that the data is being updated and where [domain, user, config]
Update Data
After updating simpleDB, send an SNS notice that the update is complete

Other servers running can record data updating notices in memory and expire them in about 15 seconds. For any queries they want to make for that type of data in the next 15 seconds, they will use consistent read.

The nice thing about using S3 and simpleDB is that I can completely skip a lot of steps in replication/distributed services as it is all handled already. And one can always take one set of api calls and substitute another for a different notification system, distributed database, and cloud file storage.
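The in-memory bookkeeping for the notification hack above might look roughly like this on each receiving server. It is a sketch under stated assumptions: the class and method names are hypothetical, the key is assumed to be a (domain, user, config) tuple taken from the notice, and the SNS delivery and the SimpleDB request themselves are left out. The returned boolean would be passed on as the consistent-read option of the query.

```python
import time


class UpdateNoticeCache:
    """Remember recent 'data is being updated' notices for a short time."""

    def __init__(self, ttl_seconds: float = 15.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so the expiry can be tested
        self._expires = {}       # key -> time after which the notice lapses

    def record_notice(self, key) -> None:
        """Call when an update notice for `key` arrives."""
        self._expires[key] = self._clock() + self._ttl

    def needs_consistent_read(self, key) -> bool:
        """True while a notice for `key` is younger than the TTL."""
        expiry = self._expires.get(key)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._expires[key]   # lapsed: fall back to the fast read
            return False
        return True
```

Usage would be one `record_notice(("example.com", "gary", "userdb"))` per incoming notice and one `needs_consistent_read(...)` call before each lookup, so the slower consistent read is only paid for during the short window after a change.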
Re: [Dovecot] Integrating Dovecot with Amazon Web Services
On 28.6.2012, at 20.21, Timo Sirainen wrote:

> On 28.6.2012, at 20.14, Timo Sirainen wrote:
>
>>> "An upshot of the way alternate storage works is that any given storage
>>> file (mailboxes//dbox-Mails/u.* (sdbox) or storage/m.* (mdbox)) can
>>> only appear *either* in the primary storage area *or* the alternate storage
>>> area but not both — if the corresponding file appears in both areas then
>>> there is an inconsistency."
>>
>> Whoever wrote that wasn't exactly correct (or clear). There's no problem
>> having the same file in both primary and alt storage. Only if the files are
>> different there's a problem, but that shouldn't happen..
>
> Hmm. Although looking at the mdbox index rebuilding code:
>
>     /* duplicate file. either readdir() returned it twice
>        (unlikely) or it exists in both alt and primary storage.
>        to make sure we don't lose any mails from either of the
>        files, give this file a new ID and rename it. */
>
> It probably shouldn't be doing that.

Hmm. I already implemented this by having it ignore the problem if the files have the same sizes, but then started wondering if there's really any point in doing that. m.* files can be appended to later, and altmoving always creates files with new numbers, and even if it does renaming there's duplicate suppression, so .. I guess there wasn't any point in doing that after all.
Re: [Dovecot] Integrating Dovecot with Amazon Web Services
On 28.6.2012, at 20.14, Timo Sirainen wrote:

>> "An upshot of the way alternate storage works is that any given storage
>> file (mailboxes//dbox-Mails/u.* (sdbox) or storage/m.* (mdbox)) can
>> only appear *either* in the primary storage area *or* the alternate storage
>> area but not both — if the corresponding file appears in both areas then
>> there is an inconsistency."
>
> Whoever wrote that wasn't exactly correct (or clear). There's no problem
> having the same file in both primary and alt storage. Only if the files are
> different there's a problem, but that shouldn't happen..

Hmm. Although looking at the mdbox index rebuilding code:

    /* duplicate file. either readdir() returned it twice
       (unlikely) or it exists in both alt and primary storage.
       to make sure we don't lose any mails from either of the
       files, give this file a new ID and rename it. */

It probably shouldn't be doing that. sdbox isn't doing that:

    /* we were supposed to open the file in alt storage, but it
       exists in primary storage as well. skip it to avoid adding
       it twice. */
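The difference between the two comments can be shown with a toy model of an index rebuild. This is based only on the two comments quoted above, not on the actual Dovecot source; the function names, the scan order (primary before alt), and the way a fresh ID is picked are all assumptions.

```python
def rebuild_sdbox(primary_files, alt_files):
    """sdbox-style policy: a file already seen in primary is skipped in alt."""
    in_primary = set(primary_files)
    return list(primary_files) + [f for f in alt_files if f not in in_primary]


def rebuild_mdbox(primary_ids, alt_ids):
    """mdbox-style policy: a duplicate file gets a new, unused ID.

    Returns (storage, old_id, new_id) tuples; old_id != new_id means rename.
    """
    next_id = max(list(primary_ids) + list(alt_ids), default=0) + 1
    used = set()
    result = []
    for storage, ids in (("primary", primary_ids), ("alt", alt_ids)):
        for file_id in ids:
            if file_id in used:
                # Duplicate: keep both copies by renaming this one.
                result.append((storage, file_id, next_id))
                used.add(next_id)
                next_id += 1
            else:
                used.add(file_id)
                result.append((storage, file_id, file_id))
    return result
```

Skipping is safe when each file holds exactly one message, whereas a file holding several messages may differ between the two storages, which would explain why the mdbox rebuild keeps both copies.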
Re: [Dovecot] Integrating Dovecot with Amazon Web Services
On 28.6.2012, at 17.43, Gary Mort wrote:

> http://wiki2.dovecot.org/MailboxFormat/dbox
>
> To make life easy, I'll stick with just single-dbox as a start, however
> multi-dbox would be doable.
>
> With dbox, the only thing that I need to change is the alternate storage
> model:
> "An upshot of the way alternate storage works is that any given storage
> file (mailboxes//dbox-Mails/u.* (sdbox) or storage/m.* (mdbox)) can
> only appear *either* in the primary storage area *or* the alternate storage
> area but not both — if the corresponding file appears in both areas then
> there is an inconsistency."

Whoever wrote that wasn't exactly correct (or clear). There's no problem having the same file in both primary and alt storage. Only if the files are different there's a problem, but that shouldn't happen..

> First I want to add AWS S3 as a storage option for alternate storage.
>
> Then instead of the above model, the new model would be that email is
> always stored in alternate storage, and may be in primary storage. So,
> when mail comes in, I'd have Dovecot save the email to the alternate
> storage S3 bucket and update the indexes and other information [ideally, for
> convenience purposes, a few bits of relevant indexing information can be
> stored as metadata in the S3 object - sufficient so that instead of
> retrieving the entire S3 object, just the meta data can be pulled to build
> indexes].

The indexes have to be in primary storage.

> When a client attempts to retrieve an email message, Dovecot would check
> primary storage as it does now, if the message is not found then it will
> retrieve it from the alternate storage system AND store a copy in the
> primary storage.

I think the storing wouldn't be very useful. Most clients download the message once. There's no reason to cache it if it doesn't get downloaded again. The way it should work is that new mails are immediately delivered to both primary and alt storage.

> Secondly, I'd like to replace the MySQL database usage with a simpleDB
> database. While simpleDB lacks much of MySQL's sophistication, it doesn't
> seem that Dovecot is really using any of that, so simpleDB can be
> functionally equivalent.

Dovecot will probably get Redis and/or memcache backend for passdb+userdb. If simpledb is a similar key-value database I guess the same code could be used partially.
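The two delivery models discussed in this message can be compared with a small in-memory sketch. It is not Dovecot code: the class, its method names, and the use of plain dicts in place of local disk and an S3 bucket are assumptions for illustration. `cache_on_miss=True` corresponds to the proposal of copying a message back to primary storage when it is read; the default corresponds to the suggestion of delivering to both storages up front and not caching on reads.

```python
class TwoTierStore:
    """Toy model of primary plus alternate message storage."""

    def __init__(self):
        self.primary = {}   # stand-in for local primary storage
        self.alt = {}       # stand-in for the alternate (e.g. S3) storage

    def deliver(self, msg_id: str, body: bytes) -> None:
        """New mail is written to both storages immediately."""
        self.alt[msg_id] = body
        self.primary[msg_id] = body

    def purge_primary(self, msg_id: str) -> None:
        """Drop the local copy; the alt copy remains authoritative."""
        self.primary.pop(msg_id, None)

    def fetch(self, msg_id: str, cache_on_miss: bool = False) -> bytes:
        """Read from primary if present, otherwise fall back to alt."""
        if msg_id in self.primary:
            return self.primary[msg_id]
        body = self.alt[msg_id]
        if cache_on_miss:
            self.primary[msg_id] = body
        return body
```

Counting how often `fetch` falls through to the alt storage for the same message would give the kind of real-world data needed to decide whether caching on read is worth having.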