Hi Scott,
You beat me to it today. I've been trying to figure out why one of our
mailservers keeps crashing. It is the only one running up to date. The
others are running 16025 without crashing.
I get the following:
2016-02-10 15:21:39 [Main_Thread] Info: Main_Thread got connection request
2016-02-10 15:21:39 [Worker_6] 175.101.68.117 info: no (more) data readable
from 175.101.68.117 (connection closed by peer)
2016-02-10 15:21:39 [Worker_6] Disconnected: session:7FA9A1A18E00
175.101.68.117 - command list was 'n/a' - used 1 SocketCalls - processing
time 0 seconds - damped 0 seconds
2016-02-10 15:21:40 [Worker_9] Info: cleaned 32 kbyte of memory from 2
closed SMTP connections
2016-02-10 15:22:10 [Main_Thread] Info: unable to detect any running worker
for a new connection - wait (max 30 seconds)
*repeat many times per second
2016-02-10 15:24:14 [Main_Thread] Info: ConnectionTransferTimeOut (30
seconds) is now reached
2016-02-10 15:24:14 [Main_Thread] Warning: Main_Thread is unable to
transfer connection to any worker - try again!
2016-02-10 15:24:14 [Main_Thread] Error: Main_Thread is unable to transfer
connection to any worker within 120 seconds - restart ASSP!
2016-02-10 15:24:14 [Main_Thread] Initializing shutdown sequence
I think I've had the monitoring kick in about 5 times and restart assp
today.
All the best,
Colin Waring
On Wed, Feb 10, 2016 at 4:14 PM, Scott MacLean <a...@hollsco.com> wrote:
> I'm still unable to run versions of ASSP newer than build 16018. I'm
> seeing the same exact behavior on any version newer than 16018, but
> reverting to 16018 fixes the problem.
>
> Something is killing the main thread, or causing long delays, which just
> makes everything start to fail.
>
> ASSP will run for about five minutes after it is started, with no
> problems. Then it will start getting slow. I'll see things like this in
> the log:
>
> Feb-10-16 10:28:29 [Worker_10000] Warning: Worker_10000 - check the
> 'ADO' database connections has taken 32.774 seconds (max=1.000s)
> Feb-10-16 10:28:31 [Worker_10000] Info: Name Server x.x.x.x:
> ResponseTime = 1232 ms for sourceforge.net
> Feb-10-16 10:28:31 [Worker_10000] Info: Name Server y.y.y.y:
> ResponseTime = 282 ms for sourceforge.net
> Feb-10-16 10:28:31 [Worker_10000] Info: Name Server z.z.z.z:
> ResponseTime = 11 ms for sourceforge.net
> Feb-10-16 10:28:31 [Worker_10000] Info: switched (DNS) nameserver order
> from x.x.x.x , y.y.y.y , z.z.z.z to z.z.z.z , x.x.x.x , y.y.y.y
>
> The nameserver on IP x.x.x.x is on the same machine as ASSP, and
> absolutely nothing is going on with the machine. All eight cores of the
> CPU are essentially sitting idle, there are many GB of memory free, the
> disk queue is sitting at near zero - the machine is essentially idle, so
> there is no reason why a DNS query would take 1232 ms. I can only assume
> then that the DNS query test that is being run by ASSP is taking too
> long because of a problem within ASSP, and the response time being shown
> is invalid.
>
> Similarly, the SQL server, also running on the local machine, is 100%
> healthy and is sitting idle, waiting for a query from ASSP. There is no
> way it should be sitting there for 32 seconds waiting for a connection.
>
> After this ASSP starts sending some queued email:
>
> Feb-10-16 10:28:32 [Worker_10000] Info: looking for files to (re)send
> Feb-10-16 10:28:42 [Worker_10000] (re)send - try to open:
> D:/ASSP/resendmail/n000000153.eml
> Feb-10-16 10:28:48 [Worker_10000] (re)send - process:
> D:/ASSP/resendmail/n000000153.eml (first time)
> Feb-10-16 10:28:49 [Worker_10000] (re)send -
> D:/ASSP/resendmail/n000000153.eml - From: r...@domain.com - To:
> r...@domain.com
> Feb-10-16 10:28:49 [Worker_10000] (re)send
> D:/ASSP/resendmail/n000000153.eml to host: 127.0.0.1:6027
> (smtpDestination)
> Feb-10-16 10:28:50 NB-18099-00275 [Worker_2] [MessageOK] y.y.y.y
> <u...@domain.com> to: u...@domain.com message ok - (whiteListedIPs
> 'y.y.y.y/32 New Web server') - [New password activation] ->
> D:/ASSP/notspam/275.eml
> Feb-10-16 10:28:58 [Worker_10000] Info: Net::SMTP is used to send mail
> Feb-10-16 10:29:00 [Worker_5] Worker_5 wakes up
> Feb-10-16 10:29:01 [Main_Thread] IP 127.0.0.1 matches
> allowStatConnectionsFrom - with 127.0.0.1/32
> Feb-10-16 10:29:03 [Worker_10000] Info: successful sent file
> D:/ASSP/resendmail/n000000153.eml to 127.0.0.1:6027 (smtpDestination)
> Feb-10-16 10:29:04 [Worker_10000] (re)send - try to open:
> D:/ASSP/resendmail/n000000224.eml
> Feb-10-16 10:29:04 [Worker_10000] (re)send - process:
> D:/ASSP/resendmail/n000000224.eml (first time)
> Feb-10-16 10:29:04 [Worker_10000] (re)send -
> D:/ASSP/resendmail/n000000224.eml - From: r...@netbound.com - To:
> r...@netbound.com
> Feb-10-16 10:29:04 [Worker_10000] (re)send
> D:/ASSP/resendmail/n000000224.eml to host: 127.0.0.1:6027
> (smtpDestination)
> Feb-10-16 10:29:04 [Worker_10000] Info: Net::SMTP is used to send mail
> Feb-10-16 10:29:05 [Main_Thread] Info: no (more) data readable
> (connection possibly closed by browser)
> Feb-10-16 10:29:06 [Worker_10000] Info: successful sent file
> D:/ASSP/resendmail/n000000224.eml to 127.0.0.1:3027 (smtpDestination)
> Feb-10-16 10:29:07 [Main_Thread] Info: no (more) data readable
> (connection possibly closed by browser)
> Feb-10-16 10:29:32 [Main_Thread] Info: unable to detect any running
> worker for a new connection - wait (max 30 seconds)
> Feb-10-16 10:29:32 [Main_Thread] Info: unable to detect any running
> worker for a new connection - wait (max 30 seconds)
> Feb-10-16 10:29:32 [Main_Thread] Info: unable to detect any running
> worker for a new connection - wait (max 30 seconds)
> Feb-10-16 10:29:32 [Main_Thread] Info: unable to detect any running
> worker for a new connection - wait (max 30 seconds)
> Feb-10-16 10:29:32 [Main_Thread] Info: unable to detect any running
> worker for a new connection - wait (max 30 seconds)
>
> And then it dies. It will write literally thousands and thousands of
> these "unable to detect any running worker" lines into the log - several
> hundred of them per second. This goes on for a couple of minutes until
> it just gives up and restarts:
>
> Feb-10-16 10:31:36 [Main_Thread] Info: unable to detect any running
> worker for a new connection - wait (max 30 seconds)
> Feb-10-16 10:31:36 [Main_Thread] Info: unable to detect any running
> worker for a new connection - wait (max 30 seconds)
> Feb-10-16 10:31:36 [Main_Thread] Info: unable to detect any running
> worker for a new connection - wait (max 30 seconds)
> Feb-10-16 10:31:37 [Main_Thread] Info: unable to detect any running
> worker for a new connection - wait (max 30 seconds)
> Feb-10-16 10:31:37 [Main_Thread] Info: ConnectionTransferTimeOut (30
> seconds) is now reached
> Feb-10-16 10:31:37 [Main_Thread] Warning: Main_Thread is unable to
> transfer connection to any worker - try again!
> Feb-10-16 10:31:37 [Main_Thread] Info: notification message queued to
> sent to r...@netbound.com
> Feb-10-16 10:31:37 [Main_Thread] Error: Main_Thread is unable to
> transfer connection to any worker within 120 seconds - restart ASSP!
> Feb-10-16 10:31:38 [Main_Thread] Initializing shutdown sequence
> Feb-10-16 10:31:40 [Shutdown] Info: removing all SMTP and Proxy listeners
> Feb-10-16 10:31:41 [Shutdown] Tell Worker 3 - QUIT
> Feb-10-16 10:31:41 [Shutdown] Tell Worker 4 - QUIT
> Feb-10-16 10:31:41 [Shutdown] Tell Worker 5 - QUIT
> Feb-10-16 10:31:41 [Shutdown] Tell Worker 1 - QUIT
> Feb-10-16 10:31:41 [Shutdown] Tell Worker 2 - QUIT
> Feb-10-16 10:31:41 [Shutdown] Waiting for all SMTP-Workers to be finished
> Feb-10-16 10:31:45 [Shutdown] Warning: poll cycle (2) has taken
> 3.1647617816925 seconds - this is very much is too long
> Feb-10-16 10:32:08 NB-18036-09167 [Worker_3] 22.22.22.22
> <u...@domain.ca> to: u...@domain.on.ca info: received all data - all
> data moved to send queue (8)
> Feb-10-16 10:32:08 [Worker_3] Worker_3 has active connections. Will wait
> until all connections are finished but max 45 seconds!
> Feb-10-16 10:32:31 [Shutdown] Error: at least one of the SMTP workers
> has not finished work within 50 seconds
> Feb-10-16 10:32:33 [Shutdown] Closing all databases
> Feb-10-16 10:32:45 [Shutdown] Info: removing all WEB listeners
> Feb-10-16 10:32:45 [Shutdown] Info: shutdown reason was: restarting
> Feb-10-16 10:32:45 [Shutdown] ASSP finished work
>
> I need to find out what has changed in ASSP, and what I can do to fix
> this problem! Thomas, I did a compare between builds 16018 and 16021
> (16018 works fine, 16021 and newer exhibits the above behavior), and see
> very little - some SSL changes, griplist upload, not much else.
>
> Any idea where I could start to try to figure out what is going on?
>
>
>
> ------------------------------------------------------------------------------
> Site24x7 APM Insight: Get Deep Visibility into Application Performance
> APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
> Monitor end-to-end web transactions and take corrective actions now
> Troubleshoot faster and improve end-user experience. Signup Now!
> http://pubads.g.doubleclick.net/gampad/clk?id=272487151&iu=/4140
> _______________________________________________
> Assp-test mailing list
> Assp-test@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/assp-test
>
------------------------------------------------------------------------------
Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=272487151&iu=/4140
_______________________________________________
Assp-test mailing list
Assp-test@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/assp-test