Re: [Dovecot] dsync redesign

2012-04-03 Thread Charles Marcus

On 2012-04-02 7:15 PM, Micah Anderson wrote:
> Charles Marcus writes:
>> On 2012-03-27 11:47 AM, Micah Anderson wrote:
>>> One would be the ability to perform *intelligent* incremental /
>>> rotated backups. I can do this now by running a dsync backup
>>> operation and then doing manual hardlinking or moving of the
>>> backup directories (daily.1, daily.2, weekly.1, monthly.1, etc.),
>>> but it would be more intelligent if this were baked into the
>>> backup process.
>>
>> There are already numerous tools that do this flawlessly - I've been
>> using rsnapshot (which uses rsync) for this for years.
>
> Are you snapshotting your filesystem (using LVM, or SAN, or similar)
> before doing rsnapshot? Because if you aren't then rsync will not
> assuredly get everything in a consistent state.


No, and you are correct... but I run it in the middle of the night, and 
the system is only barely utilized at the time, so the very minor 
inconsistencies are not a problem overall.


I will, however, be changing this to using FS snapshots once I get my 
mailserver virtualized (already being planned for when our new office 
location comes online), so that will allow me to perform snapshots 
multiple times during the day (I'm thinking 4 times per day will be enough).
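
When that happens, each backup run would look something like this (a
rough sketch with hypothetical LVM volume names; adjust mount options
to your filesystem):

    # snapshot the mail volume, back up the frozen copy, then drop it
    lvcreate --size 5G --snapshot --name mail-snap /dev/vg0/mail
    mount -o ro /dev/vg0/mail-snap /mnt/mail-snap
    rsnapshot daily    # with its backup source pointed at /mnt/mail-snap
    umount /mnt/mail-snap
    lvremove -f /dev/vg0/mail-snap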



>> I don't know if Timo should be spending his time reinventing the
>> wheel.
>
> dsync backup is already here, and it is quite useful.

I'm not saying it isn't, I'm just saying that there are already *plenty*
of different backup tools, and I don't see the sense in Timo spending
lots of time creating a new one just for Dovecot. His time would be
better spent making it easier for existing backup tools to work well
with Dovecot.



>> Although, one interesting piece that I am hopeful I'll be able to
>> implement soon (with Timo's professional help) is the ability to
>> easily and automatically map my rsnapshot snapshots directory to a
>> read-only 'Backups' namespace that automatically shows the
>> snapshots by date and time as they are produced. This way users
>> could 'go back in time' anytime they wanted without having to call
>> me... :)
>
> Interesting idea, would be a great one to share with the community
> if you decide to do so.

Absolutely - that is already on my list for when I pay Timo's company to
do this - document it on the wiki. Hopefully if any code changes are
needed to make it work right, they will be minor.


--

Best regards,

Charles


Re: [Dovecot] dsync redesign

2012-04-02 Thread Micah Anderson
Charles Marcus  writes:

> On 2012-03-27 11:47 AM, Micah Anderson  wrote:
>> One would be the ability to perform *intelligent* incremental /
>> rotated backups. I can do this now by running a dsync backup
>> operation and then doing manual hardlinking or moving of the backup
>> directories (daily.1, daily.2, weekly.1, monthly.1, etc.), but it
>> would be more intelligent if this were baked into the backup process.
>
> There are already numerous tools that do this flawlessly - I've been using
> rsnapshot (which uses rsync) for this for years.

Are you snapshotting your filesystem (using LVM, or SAN, or similar)
before doing rsnapshot? Because if you aren't then rsync will not
assuredly get everything in a consistent state.

> I don't know if Timo should be spending his time reinventing the wheel.

dsync backup is already here, and it is quite useful.
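
For anyone who hasn't tried it, typical invocations look something like
this (2.1-era syntax; the user and paths are examples only):

    # one-way backup of a user's mail into a local mdbox location
    dsync -u jdoe backup mdbox:/backup/jdoe/mdbox

    # or push to a dsync running on another server over ssh
    dsync -u jdoe backup ssh backuphost dsync -u jdoe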

> I'm much more interested in dsync working flawlessly to keep one or more
> secondary servers in sync, and leave backups to backup software.

I'm not against that idea, I just have not yet found a good way to use
any backup software in such a way as to handle large numbers of users'
mail.

> Although, one interesting piece that I am hopeful I'll be able to implement 
> soon
> (with Timo's professional help) is the ability to easily and automatically map
> my rsnapshot snapshots directory to a read-only 'Backups' namespace that
> automatically shows the snapshots by date and time as they are produced. This
> way users could 'go back in time' anytime they wanted without having to call
> me... :)

Interesting idea, would be a great one to share with the community if
you decide to do so.

micah



Re: [Dovecot] dsync redesign

2012-03-29 Thread Robin
On 3/29/2012 5:24 AM, Stan Hoeppner wrote:
> This happens with a lot of "fan boys".  There was so much hype
> surrounding ZFS that even many logically thinking people were frothing
> at the mouth waiting to get their hands on it.  Then, as with many/most
> things in the tech world, the goods didn't live up to the hype.

The problem with zfs especially is that there are so many different 
implementations, with only the commercial Sun, er, Oracle paid Solaris having 
ALL of the promised features and the bug-fixes to make them safely usable.  For 
those users, with very large RAM-backed Sun, er, Oracle, hardware, it probably 
works well.

FreeBSD and even the last versions of OpenSolaris lack fixes for some wickedly 
nasty box-bricking bugs in de-dup, as well as many of the "sexy" features in 
zpool that had people flocking to it in the first place.  

The bug database that used to be on Sun's OpenSolaris portal has gone
dark, but you may have some luck through archive.org. I know when I tried
it out for myself using the "Community Edition" of Solaris, I did feel
annoyed by the bait-and-switch, and the RAM requirements to run de-dupe
with merely adequate performance were staggering if I wanted to have
plenty of spare block cache left over for improving performance overall.

Sun left some of the FOSS operating systems a poison pill with its CDDL
licence, which is the main reason why the implementations of zfs on Linux
are immature and why it is being "re-implemented" with US DOE sponsorship,
ostensibly under a GNU-compatible licence.

zfs reminds me a great deal of TIFF - lots of great ideas in the "White Paper", 
but an elusive (or very very costly) white elephant to acquire.  "Rapidly 
changing", "bleeding edge", and "hot & new" are not descriptors for filesystems 
I want to trust more than a token amount of data to.

=R=


Re: [Dovecot] dsync redesign

2012-03-29 Thread Stan Hoeppner
On 3/28/2012 3:54 PM, Jeff Gustafson wrote:
> On Wed, 2012-03-28 at 11:07 -0500, Stan Hoeppner wrote:
> 
>> Locally attached/internal/JBOD storage typically offers the best
>> application performance per dollar spent, until you get to things like
>> backup scenarios, where off node network throughput is very low, and
>> your backup software may suffer performance deficiencies, as is the
>> issue titling this thread.  Shipping full or incremental file backups
>> across ethernet is extremely inefficient, especially with very large
>> filesystems.  This is where SAN arrays with snapshot capability come in
>> really handy.
> 
>   I'm a new employee at the company. I was a bit surprised they were not
> using iSCSI. They claim they just can't risk the extra latency. I

The tiny amount of extra latency using a software initiator is a
non-argument for a mail server workload, unless the server is undersized for
the workload--high CPU load and low memory constantly.  As I said, in
that case you drop in an iSCSI HBA and eliminate any possibility of
block latency.

> believe that you are right. It seems to me that offloading snapshots and
> backups to an iSCSI SAN would improve things. 

If you get the right unit you won't understand how you ever lived
without it.  The snaps complete transparently, and the data is on the
snap LUN within a few minutes, depending on the priority you give to
internal operations, snaps/rebuilds/etc, vs external IO requests.
Depending on model

> The problem is that this
> company has been burned on storage solutions more than once and they are
> a little skeptical that a product can scale to what they need. There are

More than once?  More than once??  Hmm...

> some SAN vendor names that are a four letter word here. So far, their
> newest FC SAN is performing well.

Interesting.  Care to name them (off list)?

>   I think having more, small, iSCSI boxes would be a good solution. One
> problem I've seen with smaller iSCSI products is that feature sets like
> snapshotting are not the best implementation. It works, but doing any
> sort of automation can be painful.

As is most often the case, you get what you pay for.

>> The snap takes place wholly within the array and is very fast, without
>> the problems you see with host based snapshots such as with Linux LVM,
>> where you must first freeze the filesystem, wait for the snapshot to
>> complete, which could be a very long time with a 1TB FS.  While this
>> occurs your clients must wait or timeout while trying to access
>> mailboxes.  With a SAN array snapshot system this isn't an issue as the
>> snap is transparent to hosts with little or no performance degradation
>> during the snap.  Two relatively inexpensive units that have such
>> snapshot capability are:
> 
>   How does this work? I've always had Linux create a snapshot. Would the
> SAN doing a snapshot without any OS buy-in cause the filesystem to be
> saved in an inconsistent state? I know that ext4 is pretty good at
> logging, but still, wouldn't this be a problem?

Instead of using "SAN" as a generic term for a "box", which it is not,
please use the terms "SAN" for "storage area network", "SAN array" or
"SAN controller" when talking about a box with or without disks that
performs the block IO shipping and other storage functions, "SAN switch"
for a fiber channel switch, or ethernet switch dedicated to the SAN
infrastructure.  The acronym "SAN" is an umbrella covering many
different types of hardware and network topologies.  It drives me nuts
when people call a fiber channel or iSCSI disk array a "SAN".  These can
be part of a SAN, but are not themselves, a SAN.  If they are direct
connected to a single host they are simple disk arrays, and the word
"SAN" isn't relevant.  Only uneducated people, or those who simply don't
care to be technically correct, call a single intelligent disk box a
"SAN".  Ok, end rant on "SAN".

Read this primer from Dell:
http://files.accord.com.au/EQL/Docs/CB109_Snapshot_Basic.pdf

The snapshots occur entirely at the controller/disk level inside the
box.  This is true of all SAN units that offer snap ability.  No host OS
involvement at all in the snap.  As I previously said, it's transparent.
Snaps are filesystem independent, and are point-in-time, or PIT, copies
of one LUN to another.  Read up on "LUN" if you're not familiar with the
term.  Everything in SAN storage is based on LUNs.

Now, as the document above will tell you, array based snapshots may or
may not be a total backup solution for your environment.  You need to
educate yourself and see if this technology is a feature that fits your
file backup and disaster avoidance and recovery needs.

>> http://www.equallogic.com/products/default.aspx?id=10613
>>
>> http://h10010.www1.hp.com/wwpc/us/en/sm/WF04a/12169-304616-241493-241493-241493.html
>>
>> The Equallogic units are 1/10 GbE iSCSI only IIRC, whereas the HP can be
>> had in 8Gb FC, 1/10Gb iSCSI, or 6Gb direct attach SAS.  Each off

Re: [Dovecot] dsync redesign

2012-03-28 Thread Timo Sirainen
On 23.3.2012, at 23.25, Timo Sirainen wrote:

> and even if you don't understand that, here's another document disguising as 
> an algorithm class problem :) If anyone has thoughts on how to solve it, 
> would be great:
> 
> http://dovecot.org/tmp/dsync-redesign-problem.txt
> 
> It only deals with saving new messages, not expunges/flag changes/etc, but 
> those should be much simpler.

Step #3 was more difficult than I first realized. I spent the last two days
figuring out a way to make it work, and it looks like I finally did. I haven't
updated the document yet, but I wrote a test program:
http://dovecot.org/tmp/test-dsync.c

Step #2 should be easy enough.

Step #4 I think I'll forget about and just implement a per-mailbox dsync lock.
The main reason I wanted to get rid of locks was that a per-user lock can't
work with shared mailboxes, but a per-mailbox lock is okay enough. Note that #3
allows two dsyncs to run in parallel and send duplicate changes, just not
modify the same mailbox at the same time (which would duplicate mails due to
two transactions adding the same mails).



Re: [Dovecot] dsync redesign

2012-03-28 Thread Timo Sirainen
On 27.3.2012, at 1.14, Michescu Andrei wrote:

> This being said and acknowledged here are my 2 cents:
> 
> I think that the current '1 brain / 2 workers' model is the correct one.
> The "client" connects to the "server" and pushes the local changes, and
> afterwards retrieves the updated/new items from the "server". "The brain"
> considers the first server as the "local storage" and the second server
> as the "server storage".

This design makes it too easy to implement in a way that adds extra roundtrips
= extra latency. It also kind of hides other problems. For example, right now
dsync can way too easily just fail if something unexpected happens during the
sync (e.g. a mailbox gets renamed/deleted). And there are of course some bugs
where I don't really understand why some people are seeing them at all.

> For the split design, "come to the same conclusion of the state" is very
> race-condition prone.

It's race-condition prone with the brain design as well. dsync can't just lock 
the mailbox during its sync, since the sync can take a long time. With a 
"brainless" design it's clear from the beginning that there are race conditions 
and they need to be dealt with.

Re: [Dovecot] dsync redesign

2012-03-28 Thread Jeff Gustafson
On Wed, 2012-03-28 at 11:07 -0500, Stan Hoeppner wrote:

> Locally attached/internal/JBOD storage typically offers the best
> application performance per dollar spent, until you get to things like
> backup scenarios, where off node network throughput is very low, and
> your backup software may suffer performance deficiencies, as is the
> issue titling this thread.  Shipping full or incremental file backups
> across ethernet is extremely inefficient, especially with very large
> filesystems.  This is where SAN arrays with snapshot capability come in
> really handy.

I'm a new employee at the company. I was a bit surprised they were not
using iSCSI. They claim they just can't risk the extra latency. I
believe that you are right. It seems to me that offloading snapshots and
backups to an iSCSI SAN would improve things. The problem is that this
company has been burned on storage solutions more than once and they are
a little skeptical that a product can scale to what they need. There are
some SAN vendor names that are a four letter word here. So far, their
newest FC SAN is performing well.

I think having more, small, iSCSI boxes would be a good solution. One
problem I've seen with smaller iSCSI products is that feature sets like
snapshotting are not the best implementation. It works, but doing any
sort of automation can be painful.

> The snap takes place wholly within the array and is very fast, without
> the problems you see with host based snapshots such as with Linux LVM,
> where you must first freeze the filesystem, wait for the snapshot to
> complete, which could be a very long time with a 1TB FS.  While this
> occurs your clients must wait or timeout while trying to access
> mailboxes.  With a SAN array snapshot system this isn't an issue as the
> snap is transparent to hosts with little or no performance degradation
> during the snap.  Two relatively inexpensive units that have such
> snapshot capability are:

How does this work? I've always had Linux create a snapshot. Would the
SAN doing a snapshot without any OS buy-in cause the filesystem to be
saved in an inconsistent state? I know that ext4 is pretty good at
logging, but still, wouldn't this be a problem?

> 
> http://www.equallogic.com/products/default.aspx?id=10613
> 
> http://h10010.www1.hp.com/wwpc/us/en/sm/WF04a/12169-304616-241493-241493-241493.html
> 
> The Equallogic units are 1/10 GbE iSCSI only IIRC, whereas the HP can be
> had in 8Gb FC, 1/10Gb iSCSI, or 6Gb direct attach SAS.  Each offer 4 or
> more host/network connection ports when equipped with dual controllers.
>  There are many other vendors with similar models/capabilities.  I
> mention these simply because Dell/HP are very popular and many OPs are
> already familiar with their servers and other products.

I will take a look. I might have some convincing to do. 


> There are 3 flavors of ZFS:  native Oracle Solaris, native FreeBSD,
> Linux FUSE.  Which were you using?  If the last, that would fully
> explain the suck.

There is one more that I had never used before coming on board here:
ZFSonLinux. ZFSonLinux is a real kernel-level filesystem module. My
understanding is that they were using it on the backup machines, with the
front-end dovecot machines using ext4. I'm told the metadata issue is a
ZFS thing and they have the same problem on Solaris/Nexenta.

 
> > I'm relatively new here, but I'll ask around about XFS and see if
> > anyone has tested it in the development environment.
> 
> If they'd tested it properly, and relatively recently, I would think
> they'd have already replaced EXT4 on your Dovecot server.  Unless others
> factors prevented such a migration.  Or unless I've misunderstood the
> size of your maildir workload.

I don't know the entire history of things. I think they really wanted
to use ZFS for everything and then fell back to ext4 because it
performed well enough in the cluster. Performance becomes an issue with
backups using rsync. Rsync is faster than Dovecot's native dsync by a
very large margin. I know that dsync is doing more than rsync, but
still, seconds compared to over five minutes? That is a significant
difference. The problem is that rsync can't get a perfect backup.

...Jeff




Re: [Dovecot] dsync redesign

2012-03-28 Thread Stan Hoeppner
On 3/27/2012 3:57 PM, Jeff Gustafson wrote:

>   We do have a FC system that another department is using. The company
> dropped quite a bit of cash on it for a specific purpose. Our department
> does not have access to it.  People are somewhat afraid of iSCSI around
> here because they believe it will add too much latency to the overall IO
> performance. They're big believers in locally attached disks. Fewer
> features, but very good performance.

If you use a software iSCSI initiator with standard GbE ports, block IO
latency can become a problem, but basically in only 3 scenarios:

1.  Slow CPUs or not enough CPUs/cores.  This is unlikely to be a
problem in 2012, given the throughput of today's multi-core CPUs.  Low
CPU throughput hasn't generally been the cause of software iSCSI
initiator latency problems since pre-2007/8 with most applications.  I'm
sure some science/sim apps that tax both CPU and IO may have still had
issues.  Those would be prime candidates for iSCSI HBAs.

2.  An old OS kernel that doesn't thread IP stack, SCSI encapsulation,
and/or hardware interrupt processing amongst all cores.  Recent Linux
kernels do this rather well, especially with MSI-X enabled, older ones
not so well.  I don't know about FreeBSD, Solaris, AIX, HP-UX, Windows, etc.

3.  System under sufficiently high CPU load to slow IP stack and iSCSI
encapsulation processing, and/or interrupt handling.  Again, with
today's multi-core fast CPUs this probably isn't going to be an issue,
especially given that POP/IMAP are IO latency bound, not CPU bound.
Most people running Dovecot today are going to have plenty of idle CPU
cycles to perform the additional iSCSI initiator and TCP stack
processing without introducing undue block IO latency effects.

As always, YMMV.  The simple path is to acquire your iSCSI SAN array and
use software initiators on client hosts.  In the unlikely event you do
run into block IO latency issues, you simply drop an iSCSI HBA into each
host suffering the latency.  They run ~$700-900 USD each for single port
models, and they eliminate block IO latency completely, which is one
reason they cost so much.  They have an onboard RISC chip and memory
doing the TCP and SCSI encapsulation processing.  They also give you the
ability to boot diskless servers from LUNs on the SAN array.  This is
very popular with blade server systems, and I've done this many times
myself, albeit with fibre channel HBAs/SANs, not iSCSI.
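
On Linux the software initiator side is just open-iscsi; the target IP
and IQN below are made up:

    # discover targets on the array, then log in to one
    iscsiadm -m discovery -t sendtargets -p 192.168.10.20
    iscsiadm -m node -T iqn.2001-05.com.equallogic:mail-lun0 \
        -p 192.168.10.20 --login
    # the LUN then appears as an ordinary block device, e.g. /dev/sdc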

Locally attached/internal/JBOD storage typically offers the best
application performance per dollar spent, until you get to things like
backup scenarios, where off node network throughput is very low, and
your backup software may suffer performance deficiencies, as is the
issue titling this thread.  Shipping full or incremental file backups
across ethernet is extremely inefficient, especially with very large
filesystems.  This is where SAN arrays with snapshot capability come in
really handy.

The snap takes place wholly within the array and is very fast, without
the problems you see with host based snapshots such as with Linux LVM,
where you must first freeze the filesystem, wait for the snapshot to
complete, which could be a very long time with a 1TB FS.  While this
occurs your clients must wait or timeout while trying to access
mailboxes.  With a SAN array snapshot system this isn't an issue as the
snap is transparent to hosts with little or no performance degradation
during the snap.  Two relatively inexpensive units that have such
snapshot capability are:

http://www.equallogic.com/products/default.aspx?id=10613

http://h10010.www1.hp.com/wwpc/us/en/sm/WF04a/12169-304616-241493-241493-241493.html

The Equallogic units are 1/10 GbE iSCSI only IIRC, whereas the HP can be
had in 8Gb FC, 1/10Gb iSCSI, or 6Gb direct attach SAS.  Each offer 4 or
more host/network connection ports when equipped with dual controllers.
 There are many other vendors with similar models/capabilities.  I
mention these simply because Dell/HP are very popular and many OPs are
already familiar with their servers and other products.

>   We thought ZFS would provide us with a nice snapshot and backup system
> (with zfs send). We never got that far once we discovered that ZFS
> doesn't work very well in this context. Running rsync on it gave us
> terrible performance.

There are 3 flavors of ZFS:  native Oracle Solaris, native FreeBSD,
Linux FUSE.  Which were you using?  If the last, that would fully
explain the suck.

>> Also, you speak of a very large maildir store, with hundreds of
>> thousands of directories, obviously many millions of files, of 1TB total
>> size.  Thus I would assume you have many thousands of users, if not 10s
>> of thousands.
>>
>> It's a bit hard to believe you're not running XFS on your storage, given
>> your level of parallelism.  You'd get much better performance using XFS
>> vs EXT4.  Especially with kernel 2.6.39 or later which includes the
>> delayed logging patch.  This patch increases m

Re: [Dovecot] dsync redesign

2012-03-27 Thread Jeff Gustafson
On Tue, 2012-03-27 at 15:09 -0500, Stan Hoeppner wrote:
> On 3/26/2012 2:34 PM, Jeff Gustafson wrote:
> 
> > Do you have any suggestions for a distributed replicated filesystem
> > that works well with dovecot? I've looked into glusterfs, but the
> > latency is way too high for lots of small files. They claim this problem
> is fixed in glusterfs 3.3. NFS is too slow for my installation, so I don't
> > see how any of the distributed filesystems would help me. I've also
> > tried out ZFS, but it appears to have issues with metadata look ups with
> > directories that have tens or hundreds of thousands of files in them.
> > For me, the best filesystem is straight up ext4 running on locally
> > attached storage. 
> 
> It sounds like you're in need of a more robust and capable
> storage/backup solution, such as an FC/iSCSI SAN array with PIT and/or
> incremental snapshot capability.

We do have a FC system that another department is using. The company
dropped quite a bit of cash on it for a specific purpose. Our department
does not have access to it. People are somewhat afraid of iSCSI around
here because they believe it will add too much latency to the overall IO
performance. They're big believers in locally attached disks. Fewer
features, but very good performance.

We thought ZFS would provide us with a nice snapshot and backup system
(with zfs send). We never got that far once we discovered that ZFS
doesn't work very well in this context. Running rsync on it gave us
terrible performance.
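
The workflow we had in mind was the usual snapshot-and-send one (a
sketch; the pool and host names are made up):

    zfs snapshot tank/mail@2012-03-27
    # incremental send of only the changes since the previous snapshot
    zfs send -i tank/mail@2012-03-26 tank/mail@2012-03-27 | \
        ssh backuphost zfs receive backup/mail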

> Also, you speak of a very large maildir store, with hundreds of
> thousands of directories, obviously many millions of files, of 1TB total
> size.  Thus I would assume you have many thousands of users, if not 10s
> of thousands.
> 
> It's a bit hard to believe you're not running XFS on your storage, given
> your level of parallelism.  You'd get much better performance using XFS
> vs EXT4.  Especially with kernel 2.6.39 or later which includes the
> delayed logging patch.  This patch increases metadata write throughput
> by a factor of 2-50+ depending on thread count, and decreases IOPS and
> MB/s hitting the storage by about the same factor, depending on thread
> count.

I'm relatively new here, but I'll ask around about XFS and see if
anyone has tested it in the development environment.

...Jeff



Re: [Dovecot] dsync redesign

2012-03-27 Thread Stan Hoeppner
On 3/26/2012 2:34 PM, Jeff Gustafson wrote:

>   Do you have any suggestions for a distributed replicated filesystem
> that works well with dovecot? I've looked into glusterfs, but the
> latency is way too high for lots of small files. They claim this problem
> is fixed in glusterfs 3.3. NFS is too slow for my installation, so I don't
> see how any of the distributed filesystems would help me. I've also
> tried out ZFS, but it appears to have issues with metadata look ups with
> directories that have tens or hundreds of thousands of files in them.
> For me, the best filesystem is straight up ext4 running on locally
> attached storage. 
>   I think a solid, fast dsync implementation would be very useful for a
> large installation.

It sounds like you're in need of a more robust and capable
storage/backup solution, such as an FC/iSCSI SAN array with PIT and/or
incremental snapshot capability.

Also, you speak of a very large maildir store, with hundreds of
thousands of directories, obviously many millions of files, of 1TB total
size.  Thus I would assume you have many thousands of users, if not 10s
of thousands.

It's a bit hard to believe you're not running XFS on your storage, given
your level of parallelism.  You'd get much better performance using XFS
vs EXT4.  Especially with kernel 2.6.39 or later which includes the
delayed logging patch.  This patch increases metadata write throughput
by a factor of 2-50+ depending on thread count, and decreases IOPS and
MB/s hitting the storage by about the same factor, depending on thread
count.

Before this patch XFS sucked at the write portion of the maildir
workload due to the extremely high IOPS and MB/s hitting just the log
journal, not including the actual file writes.  Its parallel maildir
read performance was better than any other, but the write was so bad it
bogged down the storage, producing high latency for everything.  With the
delaylog patch, XFS now trounces every filesystem at medium to high
parallelism levels.  Delaylog was introduced in mid-2010, included in
2.6.35 as experimental, and is the default in 2.6.39 and later.  If
you're a Red Hat or CentOS user it's included in 6.2.
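
On the kernels where it isn't yet the default, enabling it is just a
mount option (device and mountpoint hypothetical):

    # explicit on 2.6.35-2.6.38; the default from 2.6.39 onward
    mount -t xfs -o delaylog /dev/sdb1 /var/vmail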

This one patch, which was 5+ years in development, dramatically changed
the character of XFS with this class of metadata intensive parallel
workloads.  Many people with such a workload who ran from XFS in the
past, as if it were the Fukushima reactor, are now adopting it in droves.

What a difference a few hundred lines of very creative code can make...

-- 
Stan


Re: [Dovecot] dsync redesign

2012-03-27 Thread Charles Marcus

On 2012-03-27 11:47 AM, Micah Anderson wrote:
> One would be the ability to perform *intelligent* incremental /
> rotated backups. I can do this now by running a dsync backup
> operation and then doing manual hardlinking or moving of the backup
> directories (daily.1, daily.2, weekly.1, monthly.1, etc.), but it
> would be more intelligent if this were baked into the backup process.


There are already numerous tools that do this flawlessly - I've been 
using rsnapshot (which uses rsync) for this for years.
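
For anyone who hasn't used it, the interesting bits of the config are
only a few lines (a minimal sketch with hypothetical paths; note that
rsnapshot requires tabs between fields):

    # /etc/rsnapshot.conf - rotated, hard-linked backups of the mail store
    snapshot_root   /backup/mail/
    retain  daily   7
    retain  weekly  4
    # unchanged files are hard-linked between runs, so rotation is cheap
    backup  /var/vmail/     localhost/

A nightly "rsnapshot daily" from cron then keeps seven rotated copies
while storing each unchanged file only once.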


I don't know if Timo should be spending his time reinventing the wheel.

I'm much more interested in dsync working flawlessly to keep one or more 
secondary servers in sync, and leave backups to backup software.



> Lastly, there isn't a good method for restoring backups. I can reverse
> the backup process, onto the user's "live" mailbox, but that brings the
> user into an undesirable state (eg. their mailbox state one day
> ago). Better would be if their backup could be restored in such a way
> that the user can resolve the missing pieces manually, as they know
> best.


Again, best left to the backup software I think?

Although, one interesting piece that I am hopeful I'll be able to 
implement soon (with Timo's professional help) is the ability to easily 
and automatically map my rsnapshot snapshots directory to a read-only 
'Backups' namespace that automatically shows the snapshots by date and 
time as they are produced. This way users could 'go back in time' 
anytime they wanted without having to call me... :)
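
My rough idea of what that might look like in dovecot.conf (an untested
sketch: the location path is hypothetical, read-only enforcement would
come from filesystem permissions or the ACL plugin, and exposing every
dated snapshot would likely need some scripting or code changes):

    namespace backups {
      prefix = Backups/
      separator = /
      # point at the newest rsnapshot tree for this user
      location = maildir:/backup/mail/daily.0/var/vmail/%u/Maildir
      list = children
      subscriptions = no
    }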



> thanks again for your work on this, from my position dovecot is an
> amazing piece of software, the only part that seems to have some issues
> is dsync and I applaud the effort to redesign to fix things!


Ditto all of that! :)

--

Best regards,

Charles


Re: [Dovecot] dsync redesign

2012-03-27 Thread Micah Anderson
Timo Sirainen  writes:

> In case anyone is interested in reading (and maybe helping!) with a dsync 
> redesign that's intended to fix all of its current problems, here are some 
> possibly incoherent ramblings about it:

thank you for opening this discussion about dsync!

besides the problems I've encountered with dsync, there are a couple
things that I think would be great to build into the new vision of the
protocol. 

One would be the ability to perform *intelligent* incremental/rotated
backups. I can do this now by running a dsync backup operation and then
doing manual hardlinking or moving of the backup directories (daily.1,
daily.2, weekly.1, monthly.1, etc.), but it would be more intelligent if
this were baked into the backup process.
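
The manual rotation around each run amounts to something like this (a
sketch; the paths and retention count are invented, and hard-link
rotation is only safe with formats like maildir that don't modify
message files in place):

    # shuffle older snapshots up, hard-link the newest, then refresh it
    rm -rf /backup/jdoe/daily.6
    for i in 5 4 3 2 1; do
        mv /backup/jdoe/daily.$i /backup/jdoe/daily.$((i+1)) 2>/dev/null
    done
    cp -al /backup/jdoe/daily.0 /backup/jdoe/daily.1
    dsync -u jdoe backup maildir:/backup/jdoe/daily.0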

Secondly, being able to filter out mailboxes could result in much more
efficient syncing. Right now there is the capability to operate on only
specific mailboxes, but this doesn't scale well when I am trying to back
up thousands of users and I want to omit the Spam and Trash folders
from the sync. I would have to get a mailbox list for each user, and then
iterate over each mailbox for each user, skipping the Spam and Trash
folders, forking a new 'dsync backup' for each of their mailboxes, for
each user.
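
Concretely, that means a wrapper along these lines (a sketch; it
assumes doveadm can enumerate your users, and folder names will vary):

    for user in $(doveadm user '*'); do
        for box in $(doveadm mailbox list -u "$user"); do
            case "$box" in Spam|Trash) continue ;; esac
            dsync -u "$user" -m "$box" backup mdbox:/backup/"$user"
        done
    done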

Lastly, there isn't a good method for restoring backups. I can reverse
the backup process, onto the user's "live" mailbox, but that brings the
user into an undesirable state (eg. their mailbox state one day
ago). Better would be if their backup could be restored in such a way
that the user can resolve the missing pieces manually, as they know
best.

thanks again for your work on this, from my position dovecot is an
amazing piece of software, the only part that seems to have some issues
is dsync and I applaud the effort to redesign to fix things!

micah



Re: [Dovecot] dsync redesign

2012-03-26 Thread Michescu Andrei
Hello Timo,

Thank you very much for planning a redesign of dsync and for opening
this discussion.

As I can see from the replies so far, everybody misses the main point of
IMAP: IMAP has been designed to work as a disconnected, high-latency
data store.

To make this more clear: once an IMAP client finishes the synchronization
with the server, both the client and the server have a consistent state of
the mailbox. After this, both the "client" and the "server" act as masters
for their own local copy (on the "server" new emails get created etc.; on
the "client" existing emails get changed (flags) and moved, and new emails
appear (sent items)).

So the protocol was originally designed to handle master-master
replication. And as such it makes sense for a global-wide deployment, where
servers work independently and from time to time "merge" the changes.

This being said and acknowledged here are my 2 cents:

I think that the current '1 brain / 2 workers' model is the correct one.
The "client" connects to the "server" and pushes the local changes, and
afterwards retrieves the updated/new items from the "server". "The brain"
considers the first server as the "local storage" and the second server
as the "server storage".

For the split design, "come to the same conclusion of the state" is very
race-condition prone.

As long as the algorithm is kept as you described it in the original
document, the backups should really be incremental (because you only
apply the changes since the last sync).

As most changes are "metadata-only", the sync can be pretty fast by
merging indexes.

Thank you,
Andrei


> In case anyone is interested in reading (and maybe helping!) with a dsync
> redesign that's intended to fix all of its current problems, here are some
> possibly incoherent ramblings about it:
>
> http://dovecot.org/tmp/dsync-redesign.txt
>
> and even if you don't understand that, here's another document disguising
> as an algorithm class problem :) If anyone has thoughts on how to solve
> it, would be great:
>
> http://dovecot.org/tmp/dsync-redesign-problem.txt
>
> It only deals with saving new messages, not expunges/flag changes/etc, but
> those should be much simpler.




Re: [Dovecot] dsync redesign

2012-03-26 Thread Jeff Gustafson
On Sat, 2012-03-24 at 08:19 +0100, Attila Nagy wrote:

> 
> I personally think that Dovecot could gain much more if the amount of 
> work going into fixing or improving dsync would go into making Dovecot 
> (be able to) use a high-scale, distributed storage backend.
> I know it's much harder, because there are several major differences 
> compared to the "low latency" and consistency problem free local file 
> systems, but its fruits are also sweeter for the long term. :)

Do you have any suggestions for a distributed replicated filesystem
that works well with dovecot? I've looked into glusterfs, but the
latency is way too high for lots of small files. They claim this problem
is fixed in glusterfs 3.3. NFS is too slow for my installation, so I don't
see how any of the distributed filesystems would help me. I've also
tried out ZFS, but it appears to have issues with metadata look ups with
directories that have tens or hundreds of thousands of files in them.
For me, the best filesystem is straight up ext4 running on locally
attached storage.

I think a solid, fast dsync implementation would be very useful for a
large installation.

...Jeff



Re: [Dovecot] dsync redesign

2012-03-24 Thread Timo Sirainen
On 24.3.2012, at 9.19, Attila Nagy wrote:

> Well, dsync is a very useful tool, but with continuous replication it tries 
> to solve a problem which should be handled -at least partially- elsewhere. 
> Storing stuff in plain file systems and duplicating them to another one just 
> doesn't scale.

dsync solves several other problems besides replication. Even if Dovecot had a 
super efficient replicated storage, dsync would still exist for doing things 
like:

 - migrating between mailbox formats
 - migrating from other imap/pop3 servers
 - creating (incremental) backups
 - the redesign works great for super-high latency replication (USB sticks, 
cross-planet replication :)
 - and when you really just don't want any kind of a complex replicated 
database, just something simple

So I'll need to get this working well in any case. And with the redesign the 
replication should be efficient enough to scale pretty well.

> I personally think that Dovecot could gain much more if the amount of work 
> going into fixing or improving dsync would go into making Dovecot (be able
> to) use a high-scale, distributed storage backend.
> I know it's much harder, because there are several major differences compared 
> to the "low latency" and consistency problem free local file systems, but its 
> fruits are also sweeter for the long term. :)

Yes, I'm also planning on implementing that, but not yet.

> It would bring Dovecot into the class of open source mail servers where there 
> are currently no contenders.
> 
> BTW, for the previous question in this topic (are there any nosql dbs 
> supporting application-level conflict resolution?), there are similar 
> solutions (like CouchDB, but having some experiences with it, I wouldn't 
> recommend it for massive mail storage -at least the plain CouchDB product), 
> but I guess you would be better off designing a schema which doesn't
> need it in the first place.
> For example, messages are immutable, so you won't face this issue in this 
> area.
> And for metadata, maybe the solution is not to store "digested" snapshots of 
> the current metadata (folders, flags, message links for folders etc), but to 
> store the changes happening on the user's mailbox and occasionally aggregate 
> them into a last known good and consistent state.

My plan was to create similar index files to those that currently exist on
the filesystem. It would work pretty much the same as you described: there's
a "log" where changes are appended, and once in a while the changes are
written into an "index" snapshot. When reading, you first read the snapshot
and then apply the new changes from the log. The conflict resolution, if the
DB supports it, would work by reading the two logs in parallel and figuring
out a way to merge them consistently, similar to how dsync does pretty much
the same thing. Hmm. Perhaps the metadata log could exist exactly as the
dsync data format and have the dsync code do the merging?..

> Also, there are other interesting ideas, maybe with real single instance 
> store (splitting mime parts? Storing attachments in plain binary form? This 
> always brings up the question of whether the mail server should modify the 
> mails, which can be pretty bad for encrypted/signed stuff).

This is already optionally done in v2.0+dbox. MIME attachments can be stored in 
plain binary form if they can be reconstructed back into their original form. 
It doesn't break any signed stuff.

Re: [Dovecot] dsync redesign

2012-03-24 Thread Jan-Frode Myklebust
On Sat, Mar 24, 2012 at 08:19:48AM +0100, Attila Nagy wrote:
> On 03/23/12 22:25, Timo Sirainen wrote:
> >
> Well, dsync is a very useful tool, but with continuous replication
> it tries to solve a problem which should be handled -at least
> partially- elsewhere. Storing stuff in plain file systems and
> duplicating them to another one just doesn't scale.

I don't see why this shouldn't scale. Mailboxes are after all changed
relatively infrequently. One idea for making it more scalable might be
to treat indexes/metadata and messages differently. Make index/metadata
updates synchronous over the clusters/locations (with re-sync capability
in case of lost synchronisation), while messages are stored in one
"altstorage" per cluster/location.

For a two-location solution, message-data should be stored in:

mail_location = mdbox:~/mdbox
ALTcache=mdbox:~/mdbox-remoteip-cache
ALT=dfetch://remoteip/   <-- new protocol

If a message is in the index, look for it in that order:

local mdbox
ALTcache
ALT

if it finds the message in ALT, make a copy into ALTcache (or local
mdbox?).

Synchronizing messages could be a very low frequency job, and could be
handled by a simple rsync of ALT to ALTcache. No need for a specialized
tool for this job. Synchronizing ALTcache to local mdbox could be done
with a reversed doveadm-altmove, but might not be necessary.
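
Something as dumb as this per user would do (all paths hypothetical,
following the sketch above):

    # pull the remote alt storage into the local cache directory
    rsync -a remoteip:/var/vmail/jdoe/mdbox/storage/ \
        /var/vmail/jdoe/mdbox-remoteip-cache/storage/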

Of course this is probably all very naive.. but you get the idea :-)


   -jf


Re: [Dovecot] dsync redesign

2012-03-24 Thread Attila Nagy

On 03/23/12 22:25, Timo Sirainen wrote:
> In case anyone is interested in reading (and maybe helping!) with a dsync
> redesign that's intended to fix all of its current problems, here are some
> possibly incoherent ramblings about it:
>
> http://dovecot.org/tmp/dsync-redesign.txt
>
> and even if you don't understand that, here's another document disguising
> as an algorithm class problem :) If anyone has thoughts on how to solve it,
> would be great:
>
> http://dovecot.org/tmp/dsync-redesign-problem.txt
>
> It only deals with saving new messages, not expunges/flag changes/etc, but
> those should be much simpler.

Well, dsync is a very useful tool, but with continuous replication it 
tries to solve a problem which should be handled -at least partially- 
elsewhere. Storing stuff in plain file systems and duplicating them to 
another one just doesn't scale.


I personally think that Dovecot could gain much more if the amount of 
work going into fixing or improving dsync would go into making Dovecot 
(be able to) use a high-scale, distributed storage backend.
I know it's much harder, because there are several major differences 
compared to the "low latency" and consistency problem free local file 
systems, but its fruits are also sweeter for the long term. :)


It would bring Dovecot into the class of open source mail servers where 
there are currently no contenders.


BTW, for the previous question in this topic (are there any nosql dbs 
supporting application-level conflict resolution?), there are similar 
solutions (like CouchDB, but having some experiences with it, I wouldn't 
recommend it for massive mail storage -at least the plain CouchDB 
product), but I guess you would be better off designing a schema
which doesn't need it in the first place.
For example, messages are immutable, so you won't face this issue in 
this area.
And for metadata, maybe the solution is not to store "digested" 
snapshots of the current metadata (folders, flags, message links for 
folders etc), but to store the changes happening on the user's mailbox 
and occasionally aggregate them into a last known good and consistent state.
Also, there are other interesting ideas, maybe with real single instance 
store (splitting mime parts? Storing attachments in plain binary form? 
This always brings up the question of whether the mail server should 
modify the mails, which can be pretty bad for encrypted/signed stuff).


And of course there is always the problem of designing a good, 
consistent method which is also efficient.