Re: [Dovecot] dsync timeout?

2013-02-08 Thread micah anderson

Sean Kamath  writes:

> On Jan 30, 2013, at 3:46 PM, micah anderson  wrote:
>> Seems that only the above process was still around and no other dsync
>> processes. I have three machines that all have this happening it seems.
>> 
>> I wonder if there is a ssh configuration option I could set to make
>> these die off.
>
> If the ssh process isn't sending anything, and just waiting for read()s, and 
> keepalives are turned off, the SSH session might never know the remote side 
> is long gone. . .

This time I managed to capture a process that was stuck and look at it
from the server side, and the client side:

on the server:

2000 19470  0.0  0.0   7512  3816 ?  Ss  Feb05   0:01 /usr/bin/dsync
dsync-server -E -u foo
# strace -s 1024 -F -p 19470
Process 19470 attached - interrupt to quit
write(2, "dsync-remote(foo): Error: mdbox 
/srv/maildirbackups/foo/daily.1/storage: Duplicate GUID 
96860517f68aa94f8b5197f19f0b in m.41:682501 and m.37:653225\n", 167

on the client:

root 19001  0.0  0.0  41308  1600 ?  S  Feb05   0:00 ssh -i
/root/.ssh/backmaildir_id_rsa backmaildir@hoopoe-pn /usr/bin/dsync -u foo server

# strace -s 1024 -F -p 19001
Process 19001 attached - interrupt to quit
select(8, [4], [], NULL, NULL

Interestingly, now that I've been watching this more, the same users
keep getting wedged.

When I attempt to do a dsync of that user by hand, I get this:

dsync-local(foo): Error: Unexpected reply from server: 13   
d2a100118c45d24f760f97f19f0b3561128 \Recent 1353980259

I tried one of the other users that was stuck, and it gave me:

dsync-remote(bar): Error: Corrupted dbox file 
/srv/maildirbackups/bar/daily.1/storage/m.130 (around offset=22532): msg header 
has bad magic value

This looks like something is corrupted in the dbox for the user
on the client side. Is there something I can do to repair them?

> If any data were transmitted, it would discover the remote side is turned off.

One thing I am doing is using a ssh controlmaster socket, and if I kill
the process on the client's side, the server side process also dies.

micah


Re: [Dovecot] dsync timeout?

2013-02-02 Thread Sean Kamath

On Feb 1, 2013, at 8:09 AM, micah anderson  wrote:

> Sean Kamath  writes:
> 
>> On Jan 30, 2013, at 3:46 PM, micah anderson  wrote:
>>> Seems that only the above process was still around and no other dsync
>>> processes. I have three machines that all have this happening it seems.
>>> 
>>> I wonder if there is a ssh configuration option I could set to make
>>> these die off.
>> 
>> If the ssh process isn't sending anything, and just waiting for read()s, and 
>> keepalives are turned off, the SSH session might never know the remote side 
>> is long gone. . .
>> 
>> If any data were transmitted, it would discover the remote side is turned 
>> off.
>> 
>> See man ssh_config and the option TCPKeepAlive.
>> 
>> BTW: Since it's not on the command line, it's likely in /etc/ssh_config or 
>> /etc/ssh/ssh_config.  Or ~/.ssh/config.
> 
> In /etc/ssh/sshd_config on the server I'm sending to, TCPKeepAlive yes
> is set.

Did you check ~/.ssh/config for the user running the dsync?

> The default on this system, according to the man page, seems to be to
> have TCPKeepAlive set. 
> 
> Perhaps I should set ServerAliveInterval?


Perhaps.  That sets how many seconds ssh waits without hearing from the
server before it sends a keepalive message and expects a reply.

There are many settings that can affect this, including

ServerAliveCountMax
ServerAliveInterval
TCPKeepAlive

There are also the sshd_config settings

ClientAliveCountMax
ClientAliveInterval
TCPKeepAlive
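
For the client side, a minimal sketch of passing the first set of options on
the ssh command line rather than in a config file. The helper name
keepalive_opts and the values 60 and 3 are placeholders I made up for
illustration, not settings anyone in this thread has tested. With those values
ssh should give up on an unresponsive server after roughly 180 seconds.

```shell
#!/bin/sh
# Hypothetical helper: print ssh options that enable keepalive probing.
# $1 = seconds of silence before a probe is sent (assumed default: 60)
# $2 = unanswered probes tolerated before ssh disconnects (assumed default: 3)
keepalive_opts() {
    interval=${1:-60}
    count=${2:-3}
    printf '%s\n' "-o ServerAliveInterval=$interval -o ServerAliveCountMax=$count -o TCPKeepAlive=yes"
}

# Example use (not executed here; the output is left unquoted on purpose so
# that it splits into separate arguments):
#   ssh $(keepalive_opts 60 3) -i /root/.ssh/backmaildir_id_rsa \
#       backmaildir@hoopoe-pn /usr/bin/dsync -u foo server
```

Note that these probes only detect a peer that has stopped answering. If the
remote sshd is still alive and responding, the session stays up.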

At this point, I think you need to see what's happening on both sides of the 
SSH connection.  I don't recall what system you're on, but on Linux you can 
use netstat -anp (as root) to find out which process is connected to which 
port, and on Linux and other systems you can use lsof to find out what is 
connected to ports.
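
As a rough sketch of the netstat check, assuming the usual Linux column layout
(Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State, PID/Program
name). The function name established_ssh is made up for this example, and it
only matches the client-side ssh program, not sshd.

```shell
#!/bin/sh
# Hypothetical filter: read `netstat -anp` style output on stdin and print
# local address, remote address and PID/program for every established TCP
# connection owned by an ssh client process.
established_ssh() {
    awk '$1 ~ /^tcp/ && $6 == "ESTABLISHED" && $7 ~ /\/ssh$/ { print $4, $5, $7 }'
}

# Example use (as root, not executed here):
#   netstat -anp | established_ssh
```

Running the same kind of check on the server shows whether both ends still
believe the connection exists.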

Maybe the TCP port is open and valid and there's no data coming through?  This 
can happen if, for example, you have any port forwarding or X session 
forwarding through SSH (i.e., if ssh -X is the default) and something 
accidentally is holding that port open.  This can happen in your regular shell 
if, for example, you start an X application, background it, and forget about 
it: your logout of the server will hang until the X applications are closed.  
Note that it isn't always a visible client that will do this. :-(

Sean



Re: [Dovecot] dsync timeout?

2013-02-01 Thread micah anderson
Sean Kamath  writes:

> On Jan 30, 2013, at 3:46 PM, micah anderson  wrote:
>> Seems that only the above process was still around and no other dsync
>> processes. I have three machines that all have this happening it seems.
>> 
>> I wonder if there is a ssh configuration option I could set to make
>> these die off.
>
> If the ssh process isn't sending anything, and just waiting for read()s, and 
> keepalives are turned off, the SSH session might never know the remote side 
> is long gone. . .
>
> If any data were transmitted, it would discover the remote side is turned off.
>
> See man ssh_config and the option TCPKeepAlive.
>
> BTW: Since it's not on the command line, it's likely in /etc/ssh_config or 
> /etc/ssh/ssh_config.  Or ~/.ssh/config.

In /etc/ssh/sshd_config on the server I'm sending to, TCPKeepAlive yes
is set.

The default on this system, according to the man page, seems to be to
have TCPKeepAlive set. 

Perhaps I should set ServerAliveInterval?

micah


Re: [Dovecot] dsync timeout?

2013-01-30 Thread Sean Kamath

On Jan 30, 2013, at 3:46 PM, micah anderson  wrote:
> Seems that only the above process was still around and no other dsync
> processes. I have three machines that all have this happening it seems.
> 
> I wonder if there is a ssh configuration option I could set to make
> these die off.

If the ssh process isn't sending anything, and just waiting for read()s, and 
keepalives are turned off, the SSH session might never know the remote side is 
long gone. . .

If any data were transmitted, it would discover the remote side is turned off.

See man ssh_config and the option TCPKeepAlive.

BTW: Since it's not on the command line, it's likely in /etc/ssh_config or 
/etc/ssh/ssh_config.  Or ~/.ssh/config.

Sean

Re: [Dovecot] dsync timeout?

2013-01-30 Thread micah anderson
Timo Sirainen  writes:

> On 31.1.2013, at 0.06, Micah Anderson  wrote:
>
>> I'm using dsync for a regular backup. The backup system flocks so that
>> two cannot run at the same time, which is generally a good thing. The
>> problem is that it seems like dsync sometimes goes off into the weeds
>> and never comes back, leaving a process running and doing nothing
>> forever, hogging the lock and causing my backups never to run again. I
>> just finally figured out that what was causing the backups not to
>> run on this system was this process:
>> 
>> root 17836  0.0  0.0  40888  1600 ?  S  2012   0:00 ssh -i
>> /root/.ssh/backmaildir_id_rsa backmaildir@arg /usr/bin/dsync -u foobar server
>> 
>> yeah, that has been running since 2012 :(
>
> So that's the ssh process. What about the dsync process that started it? 
> Does/did it exist?

Seems that only the above process was still around and no other dsync
processes. I have three machines that all have this happening it seems.

I wonder if there is a ssh configuration option I could set to make
these die off.

>> There doesn't seem to be a timeout in dsync, but perhaps there should
>> be? At this point my only option is to write a cronjob that will look
>> for dsync processes that are over a certain amount of time old and then
>> kill them, after I do that I will need to take a shower because that is
>> a very dirty solution :P
>
> There is a 15 minute timeout in dsync after which it stops itself. Normally 
> the child process should also die.. v2.2 now will make sure that the child 
> process dies: http://hg.dovecot.org/dovecot-2.2/rev/070ca24e5846

Interesting... I wonder why the child is not dying off properly, maybe
the wrong signal is sent?

looking forward to using 2.2!
micah

-- 


Re: [Dovecot] dsync timeout?

2013-01-30 Thread Timo Sirainen
On 31.1.2013, at 0.06, Micah Anderson  wrote:

> I'm using dsync for a regular backup. The backup system flocks so that
> two cannot run at the same time, which is generally a good thing. The
> problem is that it seems like dsync sometimes goes off into the weeds
> and never comes back, leaving a process running and doing nothing
> forever, hogging the lock and causing my backups never to run again. I
> just finally figured out that what was causing the backups not to
> run on this system was this process:
> 
> root 17836  0.0  0.0  40888  1600 ?  S  2012   0:00 ssh -i
> /root/.ssh/backmaildir_id_rsa backmaildir@arg /usr/bin/dsync -u foobar server
> 
> yeah, that has been running since 2012 :(

So that's the ssh process. What about the dsync process that started it? 
Does/did it exist?

> There doesn't seem to be a timeout in dsync, but perhaps there should
> be? At this point my only option is to write a cronjob that will look
> for dsync processes that are over a certain amount of time old and then
> kill them, after I do that I will need to take a shower because that is
> a very dirty solution :P

There is a 15 minute timeout in dsync after which it stops itself. Normally the 
child process should also die.. v2.2 now will make sure that the child process 
dies: http://hg.dovecot.org/dovecot-2.2/rev/070ca24e5846



[Dovecot] dsync timeout?

2013-01-30 Thread Micah Anderson

I'm using dsync for a regular backup. The backup system flocks so that
two cannot run at the same time, which is generally a good thing. The
problem is that it seems like dsync sometimes goes off into the weeds
and never comes back, leaving a process running and doing nothing
forever, hogging the lock and causing my backups never to run again. I
just finally figured out that what was causing the backups not to
run on this system was this process:

root 17836  0.0  0.0  40888  1600 ?  S  2012   0:00 ssh -i
/root/.ssh/backmaildir_id_rsa backmaildir@arg /usr/bin/dsync -u foobar server

yeah, that has been running since 2012 :(

root:/tmp# strace -p 17836
Process 17836 attached - interrupt to quit
select(8, [4], [], NULL, NULL

very exciting...

There doesn't seem to be a timeout in dsync, but perhaps there should
be? At this point my only option is to write a cronjob that will look
for dsync processes that are over a certain amount of time old and then
kill them, after I do that I will need to take a shower because that is
a very dirty solution :P
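
A rough sketch of what that cron job might look like, assuming a ps that
supports the etimes (elapsed seconds) output field, as current procps does.
The function name stale_dsync_pids and the one-hour threshold are made up for
illustration.

```shell
#!/bin/sh
# Hypothetical helper: read lines of "PID ELAPSED_SECONDS COMMAND..." on stdin
# and print the PID of every process whose command line mentions dsync and
# which has been running longer than $1 seconds.
stale_dsync_pids() {
    max_age=$1
    awk -v max="$max_age" '$2 > max && /dsync/ { print $1 }'
}

# Example use from cron (not executed here; 3600 is an arbitrary threshold,
# and `xargs -r` is a GNU extension that skips kill when there is no input):
#   ps -eo pid=,etimes=,args= | stale_dsync_pids 3600 | xargs -r kill
```

This also matches the ssh process that carries the dsync command line, which
is the one holding the lock here.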

thanks for any ideas, or help!
micah
--