Re: reconstruct caused mailboxes (skiplist) corruption?

2010-11-12 Thread Bron Gondwana
On Fri, Nov 12, 2010 at 05:10:20PM -0200, Sergio Bruder wrote:
> We saw something similar:
> 
> syslog() messages showed up 'on the wire' (imap, pop3, etcetera) when we
> restarted syslog on an in-production cyrus backend.
> 
> In summary, DON'T DO IT (syslog stop) with cyrus running.

Ooh, that's interesting actually.  How old is your data?
What platform?  Can you please create a bugzilla entry
for it, particularly if you can reproduce!

Thanks,

Bron ( actually using Bugzilla now! )

Cyrus Home Page: http://www.cyrusimap.org/
List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/


Re: reconstruct caused mailboxes (skiplist) corruption?

2010-11-12 Thread Sergio Bruder
We saw something similar:

syslog() messages showed up 'on the wire' (imap, pop3, etcetera) when we
restarted syslog on an in-production cyrus backend.

In summary, DON'T DO IT (syslog stop) with cyrus running.


On 11/11/2010 07:54 PM, Bron Gondwana wrote:
> On Thu, Nov 11, 2010 at 02:24:47PM -0200, Henrique de Moraes Holschuh wrote:
>> On Thu, 11 Nov 2010, Paul Dekkers wrote:
> >>> Uhoh! And then I looked at mailboxes.db: It looks like part of it was
> >>> completely rewritten, including the skiplist header, and the first line
> >>> now said:
>>> user.bla: System I/O error System I/O error
>> This is something that has plagued cyrus for a long time.  Can we find a
>> way to actually keep tabs on our FDs so it cannot ever happen again,
>> please?  I recall reports of crap showing inside prot streams 10 years
>> ago... if now it is leaking into even worse places, well...
> It's a standalone program.  Reconstruct was running all by itself.
>
>> This probably needs a redesign of master/service fd-passing protocol,
>> and of prot streams to be fixed for good.   While at it, we should
>> switch the master/service interaction to a modern design, since the
> >> operating systems worth bothering with nowadays deal sanely with the
>> thundering herd effect, and all of them have proper socket event support
>> (epoll-like. Would require one of the event abstraction libraries,
>> though, so as to support linux/bsd/solaris with minimum fuss).
> Since that wasn't the issue - why on earth was it allowed to have fd 2
> in the first place?  Is Cyrus closing fd 2, or is truss closing it??
>
> There was no issue outside truss, it was when it ran under truss that
> the issue happened.
>
> Here's the start of an strace of a reconstruct run on my machine:
>
> execve("/usr/cyrus/bin/reconstruct", ["/usr/cyrus/bin/reconstruct", "-C", 
> "/tmp/ct-slot2/etc/imapd.conf", "-s"], [/* 20 vars */]) = 0
> brk(0)  = 0x12f1000
> access("/etc/ld.so.nohwcap", F_OK)  = -1 ENOENT (No such file or 
> directory)
> mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
> 0x7fceb52d8000
> access("/etc/ld.so.preload", R_OK)  = -1 ENOENT (No such file or 
> directory)
> open("db-4.6/lib/tls/x86_64/libsasl2.so.2", O_RDONLY) = -1 ENOENT (No such 
> file or directory)
> open("db-4.6/lib/tls/libsasl2.so.2", O_RDONLY) = -1 ENOENT (No such file or 
> directory)
> open("db-4.6/lib/x86_64/libsasl2.so.2", O_RDONLY) = -1 ENOENT (No such file 
> or directory)
> open("db-4.6/lib/libsasl2.so.2", O_RDONLY) = -1 ENOENT (No such file or 
> directory)
> open("/etc/ld.so.cache", O_RDONLY)  = 3
>
>
> Notice the first fd allocated: 3.
>
> And here's a run under truss on FreeBSD:
>
> [r...@cyrus1 /var/imap]# sudo -u cyrus truss /usr/local/cyrus/bin/reconstruct 
> user.foo
> __sysctl(0x7fffe390,0x2,0x7fffe3ac,0x7fffe3a0,0x0,0x0) = 0 (0x0)
> mmap(0x0,672,PROT_READ|PROT_WRITE,MAP_ANON,-1,0x0) = 34366398464 (0x80065a000)
> munmap(0x80065a000,672)= 0 (0x0)
> __sysctl(0x7fffe400,0x2,0x800763428,0x7fffe3f8,0x0,0x0) = 0 (0x0)
> mmap(0x0,32768,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = 
> 34366398464 (0x80065a000)
> issetugid(0x80065b015,0x800654cc4,0x80076fc50,0x80076fc20,0x6351,0x0) = 0 
> (0x0)
> open("/etc/libmap.conf",O_RDONLY,0666) ERR#2 'No such file or 
> directory'
> access("/usr/lib/libsasl2.so.2",0) ERR#2 'No such file or directory'
> access("/usr/local/lib/libsasl2.so.2",0) = 0 (0x0)
> open("/usr/local/lib/libsasl2.so.2",O_RDONLY,035431400) = 2 (0x2)
>
> Note the first fd allocated: 2!
>
>
> The question is - why is fd 2 being allocated?  Is it necessary to explicitly
> open stderr?  The function that's scribbling all over everything is com_err,
> which is supposed to be a BSD error reporting library, it SHOULD know what
> it's doing...
>
> Bron ( a while later, fd 2 gets re-used as the mailboxes.db handle, and hence
> the mess is created )
> 




Re: reconstruct caused mailboxes (skiplist) corruption?

2010-11-12 Thread Henrique de Moraes Holschuh
On Fri, 12 Nov 2010, Bron Gondwana wrote:
> On Thu, Nov 11, 2010 at 11:58:04PM -0200, Henrique de Moraes Holschuh wrote:
> > It _will_ write to stderr (aka fd 2).  If we want to be safe, we make sure
> > fds 0-2 are sane, and we check when we open sockets/files that we did not
> > get fds below 3...
> > 
> > > Bron ( a while later, fd 2 gets re-used as the mailboxes.db handle, and 
> > > hence
> > >the mess is created )
> > 
> > Indeed.
> > 
> > We *CANNOT* afford to have any files or sockets opened with fd 0, 1 or 2. We
> > should core-dump immediately if that happens, I think.
> 
> How about this skanky patch (attached?) - checks fds at the start, and if it
> gets fd 2, it holds it open (to /dev/null) for the life of the process, making
> sure nothing else gets it.  If it gets 0 or 1 it just croaks.

IMHO we should be even more paranoid.  Create an xfopen() that we use
anywhere where we are not explicitly dealing with fd 0, 1 or 2.  If xfopen()
detects the problem, it should dump info to syslog priority error and
request that the report is sent to us.  And, of course, return -1 (or, if
you prefer, heal the fds).

That should help us find out why sometimes one of those fds just disappears.
I don't think it is just a case of truss doing something idiotic, we have
had spurious (and _rare_) problem reports where it clearly happened over the
years...

Maybe the truss issue is FreeBSD breakage, but the weirdness has happened on
Linux as well in the past.
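
A minimal sketch of the wrapper idea above, written around open() rather
than fopen().  The name checked_open() and the choice of EBADF are
assumptions made for illustration; nothing like this is claimed to exist
in Cyrus.  It takes the "return -1" option rather than healing the fds.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <syslog.h>
#include <unistd.h>

/*
 * Hypothetical wrapper (not Cyrus code): behaves like open(), but refuses
 * to hand back descriptor 0, 1 or 2.  Getting one of those means a
 * standard stream was closed somewhere, so the event is logged and the
 * call fails instead of letting a data file sit where stderr should be.
 */
int checked_open(const char *path, int flags, mode_t mode)
{
    assert(path != NULL);

    int fd = open(path, flags, mode);
    if (fd < 0 || fd > 2)
        return fd;      /* ordinary failure, or a safe descriptor */

    /* Log while fd is still held, so that syslog's own socket is not
     * handed that same descriptor. */
    syslog(LOG_ERR,
           "open(%s) returned reserved fd %d: one of fds 0-2 was closed; "
           "please report this", path, fd);
    close(fd);
    errno = EBADF;      /* arbitrary choice for this sketch */
    return -1;
}
```

Callers would use it wherever they are not deliberately dealing with
fd 0, 1 or 2, and treat -1 as a hard error.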

> From 5a6433511db0002227aad069ee9e92c34932879a Mon Sep 17 00:00:00 2001
> From: Bron Gondwana 
> Date: Fri, 12 Nov 2010 14:05:02 +1100
> Subject: [PATCH] Protect STDERR on FreeBSD
> 
> ---
>  lib/libcyr_cfg.c |   20 ++++++++++++++++++++
>  1 files changed, 20 insertions(+), 0 deletions(-)
> 
> diff --git a/lib/libcyr_cfg.c b/lib/libcyr_cfg.c
> index 83a376c..d8c6986 100644
> --- a/lib/libcyr_cfg.c
> +++ b/lib/libcyr_cfg.c
> @@ -59,6 +59,8 @@
>  #define CFGVAL(t,v)  {(void *)(v)}
>  #endif
>  
> +static int protect_stderr = -1;
> +
>  struct cyrusopt_s cyrus_options[] = {
>  { CYRUSOPT_ZERO, { NULL }, CYRUS_OPT_NOTOPT },
>  
> @@ -221,10 +223,28 @@ void libcyrus_config_setswitch(enum cyrus_opt opt, int 
> val)
>  
>  void libcyrus_init()
>  {
> +protect_stderr = open("/dev/null", O_RDWR, 0666);
> +if (protect_stderr > 2) {
> + /* Ok, we're safe */
> + close(protect_stderr);
> + protect_stderr = -1;
> +}
> +else if (protect_stderr == 2) {
> + syslog(LOG_ERR, "WARNING: Protecting stderr from dangerous re-open. "
> + "Are you running under broken truss on FreeBSD?");
> +}
> +else {
> + abort();
> +}
> +
>  cyrusdb_init();
>  }
>  
>  void libcyrus_done()
>  {
>  cyrusdb_done();
> +if (protect_stderr > -1) {
> + close(protect_stderr);
> + protect_stderr = -1;
> +}
>  }


-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh



Re: reconstruct caused mailboxes (skiplist) corruption?

2010-11-12 Thread Bron Gondwana
On Fri, Nov 12, 2010 at 08:41:50AM +0100, Per olof Ljungmark wrote:
> File a bug with FreeBSD?
> http://www.freebsd.org/support/bugreports.html

Already did.

http://www.freebsd.org/cgi/query-pr.cgi?pr=152151

Bron.



Re: reconstruct caused mailboxes (skiplist) corruption?

2010-11-11 Thread Per olof Ljungmark
On 11/12/10 00:06, Bron Gondwana wrote:
> On Thu, Nov 11, 2010 at 10:12:31AM +0100, Paul Dekkers wrote:
>> Hmm, all right, so I ran it with truss (like strace for FreeBSD) to
>> give me a bit more verbosity, and I realized I should chown.
>>
>> But then:
>>
>> # chown cyrus 22003.
>> # sudo -u cyrus /usr/local/cyrus/bin/reconstruct user.bla
>> fatal error: can't read mailboxes file
> 
> Confirmed: bug in FreeBSD's truss.  If you must truss, always
> run it with a specified output file, which should avoid the
> bug.  Otherwise a random file (mailboxes.db in this case) will
> get random stderr junk written to it!

File a bug with FreeBSD?
http://www.freebsd.org/support/bugreports.html



Re: reconstruct caused mailboxes (skiplist) corruption?

2010-11-11 Thread Bron Gondwana
On Thu, Nov 11, 2010 at 11:58:04PM -0200, Henrique de Moraes Holschuh wrote:
> It _will_ write to stderr (aka fd 2).  If we want to be safe, we make sure
> fds 0-2 are sane, and we check when we open sockets/files that we did not
> get fds below 3...
> 
> > Bron ( a while later, fd 2 gets re-used as the mailboxes.db handle, and 
> > hence
> >the mess is created )
> 
> Indeed.
> 
> We *CANNOT* afford to have any files or sockets opened with fd 0, 1 or 2. We
> should core-dump immediately if that happens, I think.

How about this skanky patch (attached?) - checks fds at the start, and if it
gets fd 2, it holds it open (to /dev/null) for the life of the process, making
sure nothing else gets it.  If it gets 0 or 1 it just croaks.

Bron.
From 5a6433511db0002227aad069ee9e92c34932879a Mon Sep 17 00:00:00 2001
From: Bron Gondwana 
Date: Fri, 12 Nov 2010 14:05:02 +1100
Subject: [PATCH] Protect STDERR on FreeBSD

---
 lib/libcyr_cfg.c |   20 ++++++++++++++++++++
 1 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/lib/libcyr_cfg.c b/lib/libcyr_cfg.c
index 83a376c..d8c6986 100644
--- a/lib/libcyr_cfg.c
+++ b/lib/libcyr_cfg.c
@@ -59,6 +59,8 @@
 #define CFGVAL(t,v)	{(void *)(v)}
 #endif
 
+static int protect_stderr = -1;
+
 struct cyrusopt_s cyrus_options[] = {
 { CYRUSOPT_ZERO, { NULL }, CYRUS_OPT_NOTOPT },
 
@@ -221,10 +223,28 @@ void libcyrus_config_setswitch(enum cyrus_opt opt, int val)
 
 void libcyrus_init()
 {
+protect_stderr = open("/dev/null", O_RDWR, 0666);
+if (protect_stderr > 2) {
+	/* Ok, we're safe */
+	close(protect_stderr);
+	protect_stderr = -1;
+}
+else if (protect_stderr == 2) {
+	syslog(LOG_ERR, "WARNING: Protecting stderr from dangerous re-open. "
+			"Are you running under broken truss on FreeBSD?");
+}
+else {
+	abort();
+}
+
 cyrusdb_init();
 }
 
 void libcyrus_done()
 {
 cyrusdb_done();
+if (protect_stderr > -1) {
+	close(protect_stderr);
+	protect_stderr = -1;
+}
 }
-- 
1.7.2.3



Re: reconstruct caused mailboxes (skiplist) corruption?

2010-11-11 Thread Henrique de Moraes Holschuh
On Thu, 11 Nov 2010, Gary Mills wrote:
> Isn't the modern design multiple threads, rather than multiple
> processes?  That seems to me to be the right direction for Cyrus.
> It might even make for a simpler design.

Ehh... not really. Multithreading means locking, futexes, and other pains. It
also means almost rewriting master, etc.  It means VERY painful debugging.

Modern design just means high-performance event dispatching.  Cyrus can
already do pre-fork, so it just needs an update to the connection handling;
you don't need to go multithreaded.

Besides, fork()-based designs are much more resilient.



Re: reconstruct caused mailboxes (skiplist) corruption?

2010-11-11 Thread Henrique de Moraes Holschuh
On Fri, 12 Nov 2010, Bron Gondwana wrote:
> Since that wasn't the issue - why on earth was it allowed to have fd 2
> in the first place?  Is Cyrus closing fd 2, or is truss closing it??

That is the issue that caused the leaks into protstreams, AFAIK.  It is
always com_err writing to fd 2, and something unexpected being on fd 2.

> open stderr?  The function that's scribbling all over everything is com_err,
> which is supposed to be a BSD error reporting library, it SHOULD know what
> it's doing...

It _will_ write to stderr (aka fd 2).  If we want to be safe, we make sure
fds 0-2 are sane, and we check when we open sockets/files that we did not
get fds below 3...

> Bron ( a while later, fd 2 gets re-used as the mailboxes.db handle, and hence
>the mess is created )

Indeed.

We *CANNOT* afford to have any files or sockets opened with fd 0, 1 or 2. We
should core-dump immediately if that happens, I think.
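
A minimal sketch of the "make sure fds 0-2 are sane" step, assuming a
hypothetical helper named ensure_std_fds() that is called once at startup
before anything else opens a file or socket.  It relies only on the POSIX
guarantee that open() returns the lowest-numbered free descriptor.  It
plugs any hole with /dev/null rather than dumping core; that is a choice
made for this sketch, not the behaviour argued for above.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not Cyrus code): guarantee that descriptors 0, 1
 * and 2 are all open.  Any that are closed get pointed at /dev/null, so a
 * later open() or socket() can never be handed one of them.
 * Returns how many descriptors had to be filled in, or -1 on error.
 */
int ensure_std_fds(void)
{
    int filled = 0;

    for (;;) {
        /* open() returns the lowest free descriptor, so repeating it
         * fills every hole below 3 before it can return 3 or more. */
        int fd = open("/dev/null", O_RDWR);
        if (fd < 0)
            return -1;
        if (fd > 2) {
            close(fd);          /* 0-2 are all occupied: done */
            return filled;
        }
        assert(fd >= 0 && fd <= 2);
        filled++;               /* keep it open for the process lifetime */
    }
}
```

A non-zero return value is worth logging, since it means the process was
started with at least one standard descriptor missing.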



Re: reconstruct caused mailboxes (skiplist) corruption?

2010-11-11 Thread Gary Mills
On Fri, Nov 12, 2010 at 10:33:15AM +1100, Bron Gondwana wrote:
> Sorry - I've been busy working on the specific problem rather than the
> overview, and I realised I kind of glossed over this bit:
> 
> On Thu, Nov 11, 2010 at 02:24:47PM -0200, Henrique de Moraes Holschuh wrote:
> > This probably needs a redesign of master/service fd-passing protocol,
> > and of prot streams to be fixed for good.   While at it, we should
> > switch the master/service interaction to a modern design, since the
> > operating systems worth bothering with nowadays deal sanely with the
> > thundering herd effect, and all of them have proper socket event support
> > (epoll-like. Would require one of the event abstraction libraries,
> > though, so as to support linux/bsd/solaris with minimum fuss).
> 
> Certainly worth considering.  I won't have the time to work on it for a
> while, since what we have now works fine for us.  I'll be focussing my
> work on new features pretty soon, once 2.4.x is stable enough that I
> can trust that it will be reliable for people!  But if you want to look
> at it and come up with something better for 2.5 or even further ahead,
> that would be fantastic.  There's certainly plenty of parts of Cyrus
> that could do with some modernising!

Isn't the modern design multiple threads, rather than multiple
processes?  That seems to me to be the right direction for Cyrus.
It might even make for a simpler design.

-- 
-Gary Mills--Unix Group--Computer and Network Services-



Re: reconstruct caused mailboxes (skiplist) corruption?

2010-11-11 Thread Bron Gondwana
Sorry - I've been busy working on the specific problem rather than the
overview, and I realised I kind of glossed over this bit:

On Thu, Nov 11, 2010 at 02:24:47PM -0200, Henrique de Moraes Holschuh wrote:
> This probably needs a redesign of master/service fd-passing protocol,
> and of prot streams to be fixed for good.   While at it, we should
> switch the master/service interaction to a modern design, since the
> operating systems worth bothering with nowadays deal sanely with the
> thundering herd effect, and all of them have proper socket event support
> (epoll-like. Would require one of the event abstraction libraries,
> though, so as to support linux/bsd/solaris with minimum fuss).

Certainly worth considering.  I won't have the time to work on it for a
while, since what we have now works fine for us.  I'll be focussing my
work on new features pretty soon, once 2.4.x is stable enough that I
can trust that it will be reliable for people!  But if you want to look
at it and come up with something better for 2.5 or even further ahead,
that would be fantastic.  There's certainly plenty of parts of Cyrus
that could do with some modernising!

Thanks,

Bron.



Re: reconstruct caused mailboxes (skiplist) corruption?

2010-11-11 Thread Bron Gondwana
On Thu, Nov 11, 2010 at 10:12:31AM +0100, Paul Dekkers wrote:
> Hmm, all right, so I ran it with truss (like strace for FreeBSD) to
> give me a bit more verbosity, and I realized I should chown.
> 
> But then:
> 
> # chown cyrus 22003.
> # sudo -u cyrus /usr/local/cyrus/bin/reconstruct user.bla
> fatal error: can't read mailboxes file

Confirmed: bug in FreeBSD's truss.  If you must truss, always
run it with a specified output file, which should avoid the
bug.  Otherwise a random file (mailboxes.db in this case) will
get random stderr junk written to it!

Bron.



Re: reconstruct caused mailboxes (skiplist) corruption?

2010-11-11 Thread Bron Gondwana
On Thu, Nov 11, 2010 at 02:24:47PM -0200, Henrique de Moraes Holschuh wrote:
> On Thu, 11 Nov 2010, Paul Dekkers wrote:
> > Uhoh! And then I looked at mailboxes.db: It looks like part of it was
> > completely rewritten, including the skiplist header, and the first line
> > now said:
> > user.bla: System I/O error System I/O error
> 
> This is something that has plagued cyrus for a long time.  Can we find a
> way to actually keep tabs on our FDs so it cannot ever happen again,
> please?  I recall reports of crap showing inside prot streams 10 years
> ago... if now it is leaking into even worse places, well...

Truss on Solaris 10:

-bash-3.00# truss ls
execve("/usr/ucb/ls", 0x08047D9C, 0x08047DA4)  argc = 1
resolvepath("/usr/lib/ld.so.1", "/lib/ld.so.1", 1023) = 12
mmap(0x, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON, 
-1, 0) = 0xFEFF
resolvepath("/usr/ucb/ls", "/usr/ucb/ls", 1023) = 11
sysconfig(_CONFIG_PAGESIZE) = 4096
xstat(2, "/usr/ucb/ls", 0x08047B78) = 0
open("/var/ld/ld.config", O_RDONLY) = 3


Truss on FreeBSD:

[r...@cyrus1 /tmp]# truss ls
__sysctl(0x7fffe470,0x2,0x7fffe48c,0x7fffe480,0x0,0x0) = 0 (0x0)
mmap(0x0,672,PROT_READ|PROT_WRITE,MAP_ANON,-1,0x0) = 34365202432 (0x800536000)
munmap(0x800536000,672)  = 0 (0x0)
__sysctl(0x7fffe4e0,0x2,0x80063f428,0x7fffe4d8,0x0,0x0) = 0 (0x0)
mmap(0x0,32768,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = 34365202432 
(0x800536000)
issetugid(0x800537015,0x800530cc4,0x80064bc50,0x80064bc20,0x6351,0x0) = 0 (0x0)
open("/etc/libmap.conf",O_RDONLY,0666)   ERR#2 'No such file or directory'
open("/var/run/ld-elf.so.hints",O_RDONLY,057)= 2 (0x2)


It's definitely FreeBSD that's at fault here!

Bron.



Re: reconstruct caused mailboxes (skiplist) corruption?

2010-11-11 Thread Bron Gondwana
On Thu, Nov 11, 2010 at 02:24:47PM -0200, Henrique de Moraes Holschuh wrote:
> On Thu, 11 Nov 2010, Paul Dekkers wrote:
> > Uhoh! And then I looked at mailboxes.db: It looks like part of it was
> > completely rewritten, including the skiplist header, and the first line
> > now said:
> > user.bla: System I/O error System I/O error
> 
> This is something that has plagued cyrus for a long time.  Can we find a
> way to actually keep tabs on our FDs so it cannot ever happen again,
> please?  I recall reports of crap showing inside prot streams 10 years
> ago... if now it is leaking into even worse places, well...

Here's the ktrace/kdump output on FreeBSD:

 45426 reconstruct CALL  access(0x80065d000,F_OK)
 45426 reconstruct NAMI  "/usr/local/lib/libsasl2.so.2"
 45426 reconstruct RET   access 0
 45426 reconstruct CALL  open(0x80065e000,O_RDONLY,0x763300)
 45426 reconstruct NAMI  "/usr/local/lib/libsasl2.so.2"
 45426 reconstruct RET   open 3
 45426 reconstruct CALL  fstat(0x3,0x7fffe350)
 45426 reconstruct STRU  struct stat {dev=87, ino=1084119, mode=-rwxr-xr-x , 
nlink=1, uid=0, gid=0, rdev=4338624, atime=1289519226, stime=1276299162, 
ctime=1289480449, birthtime=1276299162, size=114591, blksize=16384, blocks=224, 
flags=0x0 }
 45426 reconstruct RET   fstat 0
 45426 reconstruct CALL  pread(0x3,0x8007622e0,0x1000,0)
 45426 reconstruct GIO   fd 3 read 4096 bytes


What do you know: fd 3.  It's almost certainly truss doing something
stupid.

We may be able to work around it in Cyrus - and that might actually be
worth it - but I don't think it's Cyrus' fault.

Bron.



Re: reconstruct caused mailboxes (skiplist) corruption?

2010-11-11 Thread Bron Gondwana
On Thu, Nov 11, 2010 at 02:24:47PM -0200, Henrique de Moraes Holschuh wrote:
> On Thu, 11 Nov 2010, Paul Dekkers wrote:
> > Uhoh! And then I looked at mailboxes.db: It looks like part of it was
> > completely rewritten, including the skiplist header, and the first line
> > now said:
> > user.bla: System I/O error System I/O error
> 
> This is something that has plagued cyrus for a long time.  Can we find a
> way to actually keep tabs on our FDs so it cannot ever happen again,
> please?  I recall reports of crap showing inside prot streams 10 years
> ago... if now it is leaking into even worse places, well...

It's a standalone program.  Reconstruct was running all by itself.
 
> This probably needs a redesign of master/service fd-passing protocol,
> and of prot streams to be fixed for good.   While at it, we should
> switch the master/service interaction to a modern design, since the
> operating systems worth bothering with nowadays deal sanely with the
> thundering herd effect, and all of them have proper socket event support
> (epoll-like. Would require one of the event abstraction libraries,
> though, so as to support linux/bsd/solaris with minimum fuss).

Since that wasn't the issue - why on earth was it allowed to have fd 2
in the first place?  Is Cyrus closing fd 2, or is truss closing it??

There was no issue outside truss, it was when it ran under truss that
the issue happened.

Here's the start of an strace of a reconstruct run on my machine:

execve("/usr/cyrus/bin/reconstruct", ["/usr/cyrus/bin/reconstruct", "-C", 
"/tmp/ct-slot2/etc/imapd.conf", "-s"], [/* 20 vars */]) = 0
brk(0)  = 0x12f1000
access("/etc/ld.so.nohwcap", F_OK)  = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x7fceb52d8000
access("/etc/ld.so.preload", R_OK)  = -1 ENOENT (No such file or directory)
open("db-4.6/lib/tls/x86_64/libsasl2.so.2", O_RDONLY) = -1 ENOENT (No such file 
or directory)
open("db-4.6/lib/tls/libsasl2.so.2", O_RDONLY) = -1 ENOENT (No such file or 
directory)
open("db-4.6/lib/x86_64/libsasl2.so.2", O_RDONLY) = -1 ENOENT (No such file or 
directory)
open("db-4.6/lib/libsasl2.so.2", O_RDONLY) = -1 ENOENT (No such file or 
directory)
open("/etc/ld.so.cache", O_RDONLY)  = 3


Notice the first fd allocated: 3.

And here's a run under truss on FreeBSD:

[r...@cyrus1 /var/imap]# sudo -u cyrus truss /usr/local/cyrus/bin/reconstruct 
user.foo
__sysctl(0x7fffe390,0x2,0x7fffe3ac,0x7fffe3a0,0x0,0x0) = 0 (0x0)
mmap(0x0,672,PROT_READ|PROT_WRITE,MAP_ANON,-1,0x0) = 34366398464 (0x80065a000)
munmap(0x80065a000,672)  = 0 (0x0)
__sysctl(0x7fffe400,0x2,0x800763428,0x7fffe3f8,0x0,0x0) = 0 (0x0)
mmap(0x0,32768,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = 34366398464 
(0x80065a000)
issetugid(0x80065b015,0x800654cc4,0x80076fc50,0x80076fc20,0x6351,0x0) = 0 (0x0)
open("/etc/libmap.conf",O_RDONLY,0666)   ERR#2 'No such file or directory'
access("/usr/lib/libsasl2.so.2",0)   ERR#2 'No such file or directory'
access("/usr/local/lib/libsasl2.so.2",0) = 0 (0x0)
open("/usr/local/lib/libsasl2.so.2",O_RDONLY,035431400) = 2 (0x2)

Note the first fd allocated: 2!


The question is - why is fd 2 being allocated?  Is it necessary to explicitly
open stderr?  The function that's scribbling all over everything is com_err,
which is supposed to be a BSD error reporting library, it SHOULD know what
it's doing...

Bron ( a while later, fd 2 gets re-used as the mailboxes.db handle, and hence
   the mess is created )
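
The mechanism described above can be reproduced in a few lines.  This is
a hypothetical demonstration, not Cyrus code: the function name and the
scratch file are assumptions, and plain fprintf(stderr, ...) stands in
for com_err.  A child process starts without fd 2, opens a file (which is
therefore given descriptor 2, provided 0 and 1 are open), and then
reports an error the usual way.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if the error text ended up inside `path`, 0 if it did not,
 * and -1 if the demonstration could not be set up.
 */
int stderr_lands_in_file(const char *path)
{
    assert(path != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        close(2);                   /* start life without stderr */
        int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd != 2)                /* needs fds 0 and 1 to be open */
            _exit(1);
        /* The C library still believes stderr is descriptor 2. */
        fprintf(stderr, "user.bla: System I/O error\n");
        fflush(stderr);
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    /* Read back the first line of the "database" the child opened. */
    char buf[128] = "";
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof buf, f) == NULL)
        buf[0] = '\0';
    fclose(f);
    return strstr(buf, "System I/O error") != NULL;
}
```

Calling it on a scratch path (never on a real mailboxes.db) shows the
error message sitting at the start of the file, overwriting whatever
header belonged there.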



Re: reconstruct caused mailboxes (skiplist) corruption?

2010-11-11 Thread Henrique de Moraes Holschuh
On Thu, 11 Nov 2010, Paul Dekkers wrote:
> Uhoh! And then I looked at mailboxes.db: It looks like part of it was
> completely rewritten, including the skiplist header, and the first line
> now said:
> user.bla: System I/O error System I/O error

This is something that has plagued cyrus for a long time.  Can we find a
way to actually keep tabs on our FDs so it cannot ever happen again,
please?  I recall reports of crap showing inside prot streams 10 years
ago... if now it is leaking into even worse places, well...

This probably needs a redesign of master/service fd-passing protocol,
and of prot streams to be fixed for good.   While at it, we should
switch the master/service interaction to a modern design, since the
operating systems worth bothering with nowadays deal sanely with the
thundering herd effect, and all of them have proper socket event support
(epoll-like. Would require one of the event abstraction libraries,
though, so as to support linux/bsd/solaris with minimum fuss).



reconstruct caused mailboxes (skiplist) corruption?

2010-11-11 Thread Paul Dekkers
Hi,

Maybe I've some more 2.4.3 badness:

I just decided to restore one message, and copied it from the archive.
Unfortunately, I didn't copy properly, so the ownership was root instead
of cyrus.

I then ran a
# sudo -u cyrus /usr/local/cyrus/bin/reconstruct user.bla
user.bla: System I/O error System I/O error

Hmm, all right, so I ran it with truss (like strace for FreeBSD) to
give me a bit more verbosity, and I realized I should chown.

But then:

# chown cyrus 22003.
# sudo -u cyrus /usr/local/cyrus/bin/reconstruct user.bla
fatal error: can't read mailboxes file

Ehm, that's bad.

# sudo -u cyrus /usr/local/cyrus/bin/ctl_mboxlist -d
fatal error: can't read mailboxes file

Uhoh! And then I looked at mailboxes.db: It looks like part of it was
completely rewritten, including the skiplist header, and the first line
now said:
user.bla: System I/O error System I/O error

Oops? Fortunately I had a recent copy, I just decided to make a tarball
of /var/imap after realizing I probably lost (a recent backup of) seen
state (related to other thread) because /var/imap is not on ZFS :-S

Regards,
Paul
