Re: [Mailman-Users] Stuck OutgoingRunner

2018-03-16 Thread Sebastian Hagedorn

It happened again yesterday. Details below.

--On 7 February 2018 at 12:43:18 +0900 Yasuhito FUTATSUKI wrote:



In fact,

On 02/02/18 19:26, Sebastian Hagedorn wrote:

[root@mailman3/usr/lib/mailman/bin]$ strace -p 1677
Process 1677 attached
recvfrom(10, ^CProcess 1677 detached


indicates that OutgoingRunner process 1677 was still in the recvfrom(2)
system call (perhaps called from recv(2)) on FD 10, and


[root@mailman3/usr/lib/mailman/bin]$ lsof -p 1677
COMMAND    PID    USER   FD   TYPE   DEVICE SIZE/OFF   NODE NAME
python2.7 1677 mailman  cwd    DIR    253,0     4096 173998 /usr/lib/mailman
python2.7 1677 mailman  rtd    DIR    253,0     4096      2 /
...
python2.7 1677 mailman   10u  IPv6 46441320      0t0    TCP mailman3.rrz.uni-koeln.de:55764->smtp-out.rrz.uni-koeln.de:smtp (ESTABLISHED)


indicates that its FD 10 was an ESTABLISHED connection to the MTA.


That situation was exactly the same. This time we confirmed on the MTA that 
there was no trace of that connection anymore. At the time of the incident, 
the MTA was once again under high load and delaying commands. That 
definitely seems to be a contributing factor. We didn't find any evidence 
of a connection that was dropped by the MTA, but with four OutgoingRunners 
we didn't find a way to determine which transaction related to which runner.



If the MTA hangs (or makes very slow progress) at the application layer
while keeping the TCP connection alive at the lower layer, a client that
uses smtplib without specifying a timeout, as the current SMTPDirect handler
in Mailman does, must wait until a response arrives or the MTA dies.


If I understood Mark correctly, when the MTA dropped the connection that 
should have raised socket.error regardless of timeouts. The question is why 
it didn't. I suppose that could be either a bug in our version of the 
Python libraries or in the OS. Any ideas how we should proceed to determine 
the root cause?

--
   .:.Sebastian Hagedorn - Weyertal 121 (Gebäude 133), Zimmer 2.02.:.
.:.Regionales Rechenzentrum (RRZK).:.
  .:.Universität zu Köln / Cologne University - ✆ +49-221-470-89578.:.

--
Mailman-Users mailing list Mailman-Users@python.org
https://mail.python.org/mailman/listinfo/mailman-users
Mailman FAQ: http://wiki.list.org/x/AgA3
Security Policy: http://wiki.list.org/x/QIA9
Searchable Archives: http://www.mail-archive.com/mailman-users%40python.org/
Unsubscribe: 
https://mail.python.org/mailman/options/mailman-users/archive%40jab.org


Re: [Mailman-Users] Stuck OutgoingRunner

2018-02-06 Thread Yasuhito FUTATSUKI

On 02/07/18 01:01, Mark Sapiro wrote:

On 02/06/2018 03:51 AM, Sebastian Hagedorn wrote:


--On 4 February 2018 at 12:54:43 +0900 Yasuhito FUTATSUKI wrote:


As far as I can tell from the code, if OutgoingRunner catches SIGINT while
waiting for a response from the MTA, the SIGINT handler in qrunner sets the
flag to exit the loop and the socket module raises socket.error for EINTR,
but the SMTP module retries the read and keeps waiting until it receives a
response or the connection is closed (by the MTA or by an error). Thus it
cannot reach the exit code as long as the connection is kept alive and the
MTA sends no data.


Sorry, the above is partly wrong: it is not the smtplib.SMTP object that
keeps reading but the socket module itself (socket._fileobject.readline()
on Python 2.7.14). That does not affect the main point, though.
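
The effect can be reproduced in isolation. The following is a minimal sketch, not Mailman code. It assumes a Unix system and Python 3.5 or later, where the interpreter itself retries a blocking recv() once the signal handler has returned (PEP 475); on Python 2.7 the retry sits in socket._fileobject.readline(), but the visible result is the same. The names blocked_read_survives_signal and state are made up for the illustration, and SIGALRM stands in for the SIGINT sent by the master.

```python
import signal
import socket
import threading
import time


def blocked_read_survives_signal(signal_after=0.2, data_after=0.6):
    """Block in recv(), receive a signal, and report when recv() returned."""
    reader, writer = socket.socketpair()   # stands in for the SMTP connection
    state = {"stop": False}                # like the runner's "please exit" flag

    def handler(signum, frame):
        # Like the qrunner handler: only set a flag, do not raise.
        state["stop"] = True

    old_handler = signal.signal(signal.SIGALRM, handler)
    # The pretend MTA answers only after data_after seconds.
    answer = threading.Timer(data_after, writer.sendall, (b"250 ok\r\n",))
    answer.start()
    started = time.monotonic()
    try:
        signal.setitimer(signal.ITIMER_REAL, signal_after)
        data = reader.recv(1024)           # interrupted by the signal, then retried
        elapsed = time.monotonic() - started
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, old_handler)
        answer.join()
        reader.close()
        writer.close()
    return state["stop"], data, elapsed


if __name__ == "__main__":
    stop, data, elapsed = blocked_read_survives_signal()
    print("stop flag set:", stop)
    print("recv returned after %.1fs with %r" % (elapsed, data))
```

The flag is set as soon as the signal arrives, but the code that would look at it only runs once the peer finally sends something. If the peer never does, the flag is never checked.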


Thanks. I think that might be a possible explanation, but what could
cause a SIGINT to be sent to the OutgoingRunner?



The above is an explanation of why the runner doesn't exit when it
receives a SIGINT or SIGTERM from the master when you restart or stop
Mailman and why you have to SIGKILL it. It suggests that what's
happening when it's hung is it's waiting for a response from the MTA.


Thanks for explaining what I meant.

In fact,

On 02/02/18 19:26, Sebastian Hagedorn wrote:

[root@mailman3/usr/lib/mailman/bin]$ strace -p 1677
Process 1677 attached
recvfrom(10, ^CProcess 1677 detached


indicates that OutgoingRunner process 1677 was still in the recvfrom(2)
system call (perhaps called from recv(2)) on FD 10, and


[root@mailman3/usr/lib/mailman/bin]$ lsof -p 1677
COMMAND    PID    USER   FD   TYPE   DEVICE SIZE/OFF   NODE NAME
python2.7 1677 mailman  cwd    DIR    253,0     4096 173998 /usr/lib/mailman
python2.7 1677 mailman  rtd    DIR    253,0     4096      2 /
...
python2.7 1677 mailman   10u  IPv6 46441320      0t0    TCP mailman3.rrz.uni-koeln.de:55764->smtp-out.rrz.uni-koeln.de:smtp (ESTABLISHED)


indicates that its FD 10 was an ESTABLISHED connection to the MTA.


If the MTA hangs (or makes very slow progress) at the application layer
while keeping the TCP connection alive at the lower layer, a client that
uses smtplib without specifying a timeout, as the current SMTPDirect handler
in Mailman does, must wait until a response arrives or the MTA dies.

Unfortunately, smtplib in Python 2 before 2.6 has no way to specify a
timeout. It uses a socket in blocking mode unless a default timeout is set
with socket.setdefaulttimeout() before calling smtplib.SMTP.connect().
In Python 2.6 and above, the timeout can be specified when creating the
smtplib.SMTP object.
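
As a rough illustration of the difference a timeout makes, here is a minimal sketch (not a patch for SMTPDirect). It talks to a local listener that completes the TCP handshake but never sends an SMTP greeting, which imitates an MTA that is alive at the TCP level and silent above it. The helper names open_silent_listener and connect_with_timeout are hypothetical.

```python
import smtplib
import socket
import time


def open_silent_listener():
    """Listen on a free local port; connections are never accepted or answered."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    return listener


def connect_with_timeout(host, port, timeout):
    """Return (ok, seconds); ok is False if no greeting arrived in time."""
    started = time.monotonic()
    try:
        # The timeout argument covers the connect and every later read.
        conn = smtplib.SMTP(host, port, timeout=timeout)
    except (smtplib.SMTPException, OSError):
        # Covers socket.timeout as well as SMTPServerDisconnected.
        return False, time.monotonic() - started
    conn.close()
    return True, time.monotonic() - started


if __name__ == "__main__":
    listener = open_silent_listener()
    ok, seconds = connect_with_timeout("127.0.0.1", listener.getsockname()[1], 1.0)
    print("greeting received: %s after %.1fs" % (ok, seconds))
    listener.close()
```

Without the timeout argument the same call would block for as long as the listener stays silent. Calling socket.setdefaulttimeout() before connecting has a comparable effect, but it applies to every socket created afterwards in the process.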

--
Yasuhito FUTATSUKI 



Re: [Mailman-Users] Stuck OutgoingRunner

2018-02-06 Thread Mark Sapiro
On 02/06/2018 03:48 AM, Sebastian Hagedorn wrote:
> 
> Is it possible that the OutgoingRunner was done with transmitting the
> message and had already removed the queue file, but that the connection
> hadn't yet been closed?


Only if something went very wrong in SMTPDirect.process() which would
have had to return to OutgoingRunner before the .bak would be removed.

As I read the code in SMTPDirect.process(), delivery is in a try: ...
finally: and the connection is closed in the finally: clause.
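
The pattern described here can be sketched as follows. This is not the SMTPDirect code; FakeConnection and deliver_all are hypothetical stand-ins that only show how try/finally guarantees the close.

```python
class FakeConnection:
    """Hypothetical stand-in for the SMTP connection wrapper."""

    def __init__(self):
        self.closed = False

    def send(self, recipient):
        # Pretend that addresses starting with "bad" fail at the SMTP level.
        if recipient.startswith("bad"):
            raise IOError("delivery failed for %s" % recipient)

    def quit(self):
        self.closed = True


def deliver_all(conn, recipients):
    """Deliver to each recipient; always close the connection afterwards."""
    try:
        for recipient in recipients:
            conn.send(recipient)
    finally:
        conn.quit()   # runs on normal exit and on any exception


if __name__ == "__main__":
    conn = FakeConnection()
    deliver_all(conn, ["a@example.org", "b@example.org"])
    print("closed after delivery:", conn.closed)
```

The finally: clause only runs once the try: block is left, whether normally or through an exception. A read that blocks forever inside the try: block never gets that far, which would fit a connection that is still ESTABLISHED.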


> Now I wonder whether the MTA had already closed the connection and
> Mailman for some reason didn't notice, because the Runner was stuck
> longer than any timeout on the MTA would permit. But I failed to check
> that. Should it happen again I will have a look on the MTA end.


You should be able to see the connect and all that follows in the MTA
logs to see if the MTA closed the connection.

Anyway, if it did do that while the runner was waiting for response,
that should raise socket.error which would be caught and handled.
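
A small experiment shows what a blocked reader sees when the peer closes the connection in an orderly way. It is a sketch against a throwaway local listener, not against a real MTA, and read_until_peer_closes is a made-up name.

```python
import socket
import threading


def read_until_peer_closes():
    """Block in recv() with no timeout; return what recv() yields on close."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    def accept_and_close():
        conn, _ = listener.accept()
        conn.close()                  # orderly close: the kernel sends a FIN

    server = threading.Thread(target=accept_and_close)
    server.start()
    client = socket.create_connection(listener.getsockname())
    try:
        client.settimeout(None)       # blocking mode, as in the stuck runner
        return client.recv(1024)      # wakes up as soon as the FIN arrives
    finally:
        client.close()
        server.join()
        listener.close()


if __name__ == "__main__":
    print("recv returned %r" % read_until_peer_closes())
```

With an orderly close the read returns an empty result rather than raising; smtplib turns that into SMTPServerDisconnected. A reset is reported as socket.error. In both cases the reader wakes up without any timeout, but only if the FIN or RST actually reaches the client. If it never arrives, a blocking recv() has nothing to react to.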

-- 
Mark Sapiro        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan


Re: [Mailman-Users] Stuck OutgoingRunner

2018-02-06 Thread Sebastian Hagedorn


--On 6 February 2018 at 08:01:18 -0800 Mark Sapiro wrote:


On 02/06/2018 03:51 AM, Sebastian Hagedorn wrote:


--On 4 February 2018 at 12:54:43 +0900 Yasuhito FUTATSUKI wrote:


As far as I can tell from the code, if OutgoingRunner catches SIGINT while
waiting for a response from the MTA, the SIGINT handler in qrunner sets the
flag to exit the loop and the socket module raises socket.error for EINTR,
but the SMTP module retries the read and keeps waiting until it receives a
response or the connection is closed (by the MTA or by an error). Thus it
cannot reach the exit code as long as the connection is kept alive and the
MTA sends no data.


Thanks. I think that might be a possible explanation, but what could
cause a SIGINT to be sent to the OutgoingRunner?



The above is an explanation of why the runner doesn't exit when it
receives a SIGINT or SIGTERM from the master when you restart or stop
Mailman and why you have to SIGKILL it. It suggests that what's
happening when it's hung is it's waiting for a response from the MTA.


Ah OK, I misunderstood that part.
--
   .:.Sebastian Hagedorn - Weyertal 121 (Gebäude 133), Zimmer 2.02.:.
.:.Regionales Rechenzentrum (RRZK).:.
  .:.Universität zu Köln / Cologne University - ✆ +49-221-470-89578.:.


Re: [Mailman-Users] Stuck OutgoingRunner

2018-02-06 Thread Mark Sapiro
On 02/06/2018 03:51 AM, Sebastian Hagedorn wrote:
> 
> --On 4 February 2018 at 12:54:43 +0900 Yasuhito FUTATSUKI wrote:
>>
>> As far as I can tell from the code, if OutgoingRunner catches SIGINT while
>> waiting for a response from the MTA, the SIGINT handler in qrunner sets the
>> flag to exit the loop and the socket module raises socket.error for EINTR,
>> but the SMTP module retries the read and keeps waiting until it receives a
>> response or the connection is closed (by the MTA or by an error). Thus it
>> cannot reach the exit code as long as the connection is kept alive and the
>> MTA sends no data.
> 
> Thanks. I think that might be a possible explanation, but what could
> cause a SIGINT to be sent to the OutgoingRunner?


The above is an explanation of why the runner doesn't exit when it
receives a SIGINT or SIGTERM from the master when you restart or stop
Mailman and why you have to SIGKILL it. It suggests that what's
happening when it's hung is it's waiting for a response from the MTA.

-- 
Mark Sapiro        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan


Re: [Mailman-Users] Stuck OutgoingRunner

2018-02-06 Thread Sebastian Hagedorn


--On 4 February 2018 at 12:54:43 +0900 Yasuhito FUTATSUKI wrote:



On 02/04/18 12:13, Mark Sapiro wrote:

The status of 'S' for OutgoingRunner is "interruptible sleep". This
means it's either called time.sleep for QRUNNER_SLEEP_TIME (default = 1
second) which is unlikely as it should wake up, or it's waiting for
response from something, most likely a response from the MTA.


As far as I can tell from the code, if OutgoingRunner catches SIGINT while
waiting for a response from the MTA, the SIGINT handler in qrunner sets the
flag to exit the loop and the socket module raises socket.error for EINTR,
but the SMTP module retries the read and keeps waiting until it receives a
response or the connection is closed (by the MTA or by an error). Thus it
cannot reach the exit code as long as the connection is kept alive and the
MTA sends no data.


Thanks. I think that might be a possible explanation, but what could cause 
a SIGINT to be sent to the OutgoingRunner?

--
   .:.Sebastian Hagedorn - Weyertal 121 (Gebäude 133), Zimmer 2.02.:.
.:.Regionales Rechenzentrum (RRZK).:.
  .:.Universität zu Köln / Cologne University - ✆ +49-221-470-89578.:.


Re: [Mailman-Users] Stuck OutgoingRunner

2018-02-06 Thread Sebastian Hagedorn

--On 3 February 2018 at 19:13:33 -0800 Mark Sapiro wrote:


On 02/03/2018 01:03 AM, Sebastian Hagedorn wrote:


Did you look at the out queue, and if so, was there a .bak file there?
This would be the entry currently being processed.


I looked at the out queue, and there was no .bak file.



Interesting. That says that OutgoingRunner is not currently delivering a
message, but that is inconsistent with this:



Also, the TCP connection to the MTA being ESTABLISHED says the
OutgoingRunner has called SMTPDirect.process() and it in turn is
somewhere in its delivery loop of sending SMTP transactions.


Is it possible that the OutgoingRunner was done with transmitting the 
message and had already removed the queue file, but that the connection 
hadn't yet been closed?



Are there any clues in the MTA logs?


I just found this in Mailman's smtp-failures log:

Feb 01 14:28:49 2018 (1674) Low level smtp error: [Errno 111] Connection
refused, msgid:

Feb 01 14:28:49 2018 (1674) delivery to x...@uni-koeln.de failed with
code -1: [Errno 111] Connection refused


Normally, that won't cause a problem like this. This occurs at a fairly
low level in SMTPDirect.py when Mailman is initiating a transaction with
the MTA to send to one or more recipients. The recipients will be marked
as "refused retryably" and OutgoingRunner will queue the message for
those recipients in the retry queue to be retried.

You can set SMTPLIB_DEBUG_LEVEL = 1 in mm_cfg.py to log copious smtplib
debugging info to Mailman's error log. Then the log will show the last
thing that was done before the hang.


The problem with that is that we run 3,200 lists on that server. Not all of 
them are high-volume, but I'm worried that our log files would explode. I
just checked and yesterday we sent mails to 50,000 recipients.



If this should happen again, what should we look for? Would a gdb
backtrace be helpful?


It might be if you can find just where in the code it's hung. Also, I
didn't look carefully before, but in your OP, you show


mailman   1663  0.0  0.0 233860  2204 ?    Ss   Jan16   0:00 /usr/bin/python2.7 /usr/lib/mailman/bin/mailmanctl -s -q start
mailman   1677  0.1  0.9 295064 73284 ?    S    Jan16  35:35 /usr/bin/python2.7 /usr/lib/mailman/bin/qrunner --runner=OutgoingRunner:3:4 -s


The status of 'S' for OutgoingRunner is "interruptible sleep". This
means it's either called time.sleep for QRUNNER_SLEEP_TIME (default = 1
second) which is unlikely as it should wake up, or it's waiting for
response from something, most likely a response from the MTA.


Now I wonder whether the MTA had already closed the connection and Mailman
for some reason didn't notice, because the Runner was stuck longer than any
timeout on the MTA would permit. But I failed to check that. Should it
happen again I will have a look on the MTA end.

--
   .:.Sebastian Hagedorn - Weyertal 121 (Gebäude 133), Zimmer 2.02.:.
.:.Regionales Rechenzentrum (RRZK).:.
  .:.Universität zu Köln / Cologne University - ✆ +49-221-470-89578.:.


Re: [Mailman-Users] Stuck OutgoingRunner

2018-02-03 Thread Yasuhito FUTATSUKI


On 02/04/18 12:13, Mark Sapiro wrote:

The status of 'S' for OutgoingRunner is "interruptible sleep". This
means it's either called time.sleep for QRUNNER_SLEEP_TIME (default = 1
second) which is unlikely as it should wake up, or it's waiting for
response from something, most likely a response from the MTA.


As far as I can tell from the code, if OutgoingRunner catches SIGINT while
waiting for a response from the MTA, the SIGINT handler in qrunner sets the
flag to exit the loop and the socket module raises socket.error for EINTR,
but the SMTP module retries the read and keeps waiting until it receives a
response or the connection is closed (by the MTA or by an error). Thus it
cannot reach the exit code as long as the connection is kept alive and the
MTA sends no data.

--
Yasuhito FUTATSUKI 


Re: [Mailman-Users] Stuck OutgoingRunner

2018-02-03 Thread Mark Sapiro
On 02/03/2018 01:03 AM, Sebastian Hagedorn wrote:
>>
>> Did you look at the out queue, and if so, was there a .bak file there?
>> This would be the entry currently being processed.
> 
> I looked at the out queue, and there was no .bak file.


Interesting. That says that OutgoingRunner is not currently delivering a
message, but that is inconsistent with this:


>> Also, the TCP connection to the MTA being ESTABLISHED says the
>> OutgoingRunner has called SMTPDirect.process() and it in turn is
>> somewhere in its delivery loop of sending SMTP transactions.
>>
>> Are there any clues in the MTA logs?
> 
> I just found this in Mailman's smtp-failures log:
> 
> Feb 01 14:28:49 2018 (1674) Low level smtp error: [Errno 111] Connection
> refused, msgid:
> 
> Feb 01 14:28:49 2018 (1674) delivery to x...@uni-koeln.de failed with
> code -1: [Errno 111] Connection refused
> 
> I can't prove it, but this time stamp seems to coincide with the moment
> the OutgoingRunner got stuck, based on the age of the queue files. The
> receiving SMTP server was under heavy load at that moment, so it is
> possible that it might have refused the connection.

Normally, that won't cause a problem like this. This occurs at a fairly
low level in SMTPDirect.py when Mailman is initiating a transaction with
the MTA to send to one or more recipients. The recipients will be marked
as "refused retryably" and OutgoingRunner will queue the message for
those recipients in the retry queue to be retried.

You can set SMTPLIB_DEBUG_LEVEL = 1 in mm_cfg.py to log copious smtplib
debugging info to Mailman's error log. Then the log will show the last
thing that was done before the hang.
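
To get a feel for what that debugging output looks like, here is a minimal sketch. It assumes the setting ends up in smtplib's set_debuglevel(), and it talks to a hypothetical one-shot fake_mta on localhost instead of a real MTA. smtplib writes the trace to stderr, which the sketch captures in a string.

```python
import contextlib
import io
import smtplib
import socket
import threading


def fake_mta(listener):
    """Hypothetical one-shot SMTP server: greets, answers 250, ends on QUIT."""
    conn, _ = listener.accept()
    with conn, conn.makefile("rb") as lines:
        conn.sendall(b"220 fake.example ESMTP\r\n")
        for line in lines:
            if line.strip().upper() == b"QUIT":
                conn.sendall(b"221 bye\r\n")
                break
            conn.sendall(b"250 ok\r\n")


def traced_session():
    """Run NOOP and QUIT against fake_mta and return smtplib's debug trace."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    listener.settimeout(5)            # do not wait forever if the client fails
    server = threading.Thread(target=fake_mta, args=(listener,))
    server.start()
    trace = io.StringIO()
    try:
        with contextlib.redirect_stderr(trace):   # smtplib prints to stderr
            client = smtplib.SMTP(timeout=5)
            client.set_debuglevel(1)
            client.connect("127.0.0.1", listener.getsockname()[1])
            client.noop()
            client.quit()
    finally:
        server.join()
        listener.close()
    return trace.getvalue()


if __name__ == "__main__":
    print(traced_session())
```

Each command and each reply shows up as a "send:" or "reply:" line, so in a hang the last "send:" without a matching "reply:" would point at the command the MTA never answered.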


> If this should happen again, what should we look for? Would a gdb
> backtrace be helpful?


It might be if you can find just where in the code it's hung. Also, I
didn't look carefully before, but in your OP, you show

> mailman   1663  0.0  0.0 233860  2204 ?    Ss   Jan16   0:00 /usr/bin/python2.7 /usr/lib/mailman/bin/mailmanctl -s -q start
> mailman   1677  0.1  0.9 295064 73284 ?    S    Jan16  35:35 /usr/bin/python2.7 /usr/lib/mailman/bin/qrunner --runner=OutgoingRunner:3:4 -s

The status of 'S' for OutgoingRunner is "interruptible sleep". This
means it's either called time.sleep for QRUNNER_SLEEP_TIME (default = 1
second) which is unlikely as it should wake up, or it's waiting for
response from something, most likely a response from the MTA.

-- 
Mark Sapiro        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan


Re: [Mailman-Users] Stuck OutgoingRunner

2018-02-03 Thread Sebastian Hagedorn

Thanks for your reply!


On 02/02/2018 02:26 AM, Sebastian Hagedorn wrote:

[root@mailman3/usr/lib/mailman/bin]$ lsof -p 1677
COMMAND    PID    USER   FD   TYPE   DEVICE SIZE/OFF   NODE NAME
python2.7 1677 mailman  cwd    DIR    253,0     4096 173998 /usr/lib/mailman
python2.7 1677 mailman  rtd    DIR    253,0     4096      2 /
...
python2.7 1677 mailman   10u  IPv6 46441320      0t0    TCP mailman3.rrz.uni-koeln.de:55764->smtp-out.rrz.uni-koeln.de:smtp (ESTABLISHED)

In both instances the OutgoingRunner was stuck on an SMTP connection. I
had to use "kill -9" to get rid of it.

Any ideas what might be causing that?


I think I've seen this once or maybe twice, I don't recall details. I
wasn't able to determine a cause. I haven't seen it in years.

Did you look at the out queue, and if so, was there a .bak file there?
This would be the entry currently being processed.


I looked at the out queue, and there was no .bak file.


Also, the TCP connection to the MTA being ESTABLISHED says the
OutgoingRunner has called SMTPDirect.process() and it in turn is
somewhere in its delivery loop of sending SMTP transactions.

Are there any clues in the MTA logs?


I just found this in Mailman's smtp-failures log:

Feb 01 14:28:49 2018 (1674) Low level smtp error: [Errno 111] Connection 
refused, msgid: 

Feb 01 14:28:49 2018 (1674) delivery to x...@uni-koeln.de failed with code 
-1: [Errno 111] Connection refused


I can't prove it, but this time stamp seems to coincide with the moment the 
OutgoingRunner got stuck, based on the age of the queue files. The 
receiving SMTP server was under heavy load at that moment, so it is 
possible that it might have refused the connection.


The message was delivered successfully after I killed the stuck runner and 
restarted the service. I wasn't able to find anything pertinent on the 
receiving server.


If this should happen again, what should we look for? Would a gdb backtrace 
be helpful?

--
Sebastian Hagedorn - Weyertal 121, Zimmer 2.02
Regionales Rechenzentrum (RRZK)
Universität zu Köln / Cologne University - Tel. +49-221-470-89578


Re: [Mailman-Users] Stuck OutgoingRunner

2018-02-02 Thread Mark Sapiro
On 02/02/2018 02:26 AM, Sebastian Hagedorn wrote:
> Hi,
> 
> we've been running Mailman for many years and have never had stability
> issues, but about a month ago we moved the server from RHEL 5 to RHEL 6
> and to the current version (2.1.25), and since then it has already
> happened twice that one of our four OutgoingRunners got "stuck" and
> stopped handling mail. When that happens a simple restart of the service
> does not work. These processes remained:
> 
> mailman   1663  0.0  0.0 233860  2204 ?    Ss   Jan16   0:00
> /usr/bin/python2.7 /usr/lib/mailman/bin/mailmanctl -s -q start
> mailman   1677  0.1  0.9 295064 73284 ?    S    Jan16  35:35
> /usr/bin/python2.7 /usr/lib/mailman/bin/qrunner
> --runner=OutgoingRunner:3:4 -s


Because pid 1677 didn't respond to the SIGINT from the master and the
master is still waiting for it to exit.


> [root@mailman3/usr/lib/mailman/bin]$ strace -p 1677
> Process 1677 attached
> recvfrom(10, ^CProcess 1677 detached
> 
> [root@mailman3/usr/lib/mailman/bin]$ lsof -p 1677
> COMMAND    PID    USER   FD   TYPE   DEVICE SIZE/OFF   NODE NAME
> python2.7 1677 mailman  cwd    DIR    253,0 4096 173998
> /usr/lib/mailman
> python2.7 1677 mailman  rtd    DIR    253,0 4096  2 /
> ...
> python2.7 1677 mailman   10u  IPv6 46441320  0t0    TCP
> mailman3.rrz.uni-koeln.de:55764->smtp-out.rrz.uni-koeln.de:smtp
> (ESTABLISHED)
> 
> In both instances the OutgoingRunner was stuck on an SMTP connection. I
> had to use "kill -9" to get rid of it.
> 
> Any ideas what might be causing that?


I think I've seen this once or maybe twice, I don't recall details. I
wasn't able to determine a cause. I haven't seen it in years.

Did you look at the out queue, and if so, was there a .bak file there?
This would be the entry currently being processed.

Also, the TCP connection to the MTA being ESTABLISHED says the
OutgoingRunner has called SMTPDirect.process() and it in turn is
somewhere in its delivery loop of sending SMTP transactions.

Are there any clues in the MTA logs?

-- 
Mark Sapiro        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan


[Mailman-Users] Stuck OutgoingRunner

2018-02-02 Thread Sebastian Hagedorn

Hi,

we've been running Mailman for many years and have never had stability 
issues, but about a month ago we moved the server from RHEL 5 to RHEL 6 and 
to the current version (2.1.25), and since then it has already happened 
twice that one of our four OutgoingRunners got "stuck" and stopped handling 
mail. When that happens a simple restart of the service does not work. 
These processes remained:


mailman   1663  0.0  0.0 233860  2204 ?    Ss   Jan16   0:00 /usr/bin/python2.7 /usr/lib/mailman/bin/mailmanctl -s -q start
mailman   1677  0.1  0.9 295064 73284 ?    S    Jan16  35:35 /usr/bin/python2.7 /usr/lib/mailman/bin/qrunner --runner=OutgoingRunner:3:4 -s


[root@mailman3/usr/lib/mailman/bin]$ strace -p 1677
Process 1677 attached
recvfrom(10, ^CProcess 1677 detached

[root@mailman3/usr/lib/mailman/bin]$ lsof -p 1677
COMMAND    PID    USER   FD   TYPE   DEVICE SIZE/OFF   NODE NAME
python2.7 1677 mailman  cwd    DIR    253,0     4096 173998 /usr/lib/mailman
python2.7 1677 mailman  rtd    DIR    253,0     4096      2 /
...
python2.7 1677 mailman   10u  IPv6 46441320      0t0    TCP mailman3.rrz.uni-koeln.de:55764->smtp-out.rrz.uni-koeln.de:smtp (ESTABLISHED)


In both instances the OutgoingRunner was stuck on an SMTP connection. I had 
to use "kill -9" to get rid of it.


Any ideas what might be causing that?

Cheers
Sebastian
--
   .:.Sebastian Hagedorn - Weyertal 121 (Gebäude 133), Zimmer 2.02.:.
.:.Regionales Rechenzentrum (RRZK).:.
  .:.Universität zu Köln / Cologne University - ✆ +49-221-470-89578.:.