[sniffer] Re: FW: [sniffer] Re: Sniffer 3.0 Froze Mail Server
Hello Andy,

Saturday, October 4, 2008, 10:21:31 PM, you wrote:

> Hi Pete,
>
> Well, I eliminated WeightGate for the time being, just to do my “due diligence”.
>
> Also, since there is a fixed-size buffer, I assume actually LOWERING the 3rd number (the allocation for each non-interactive process) would allow for MORE parallel processes to run (as long as the value is still large enough to support each of the applications that rely on it).
>
> Of course, I assume the “heap” issue in reality is actually a SECONDARY problem (a symptom of too many non-interactive tasks being launched and not completing). Since the ‘heap’ space is finite, there is a hard limit as to how many processes can be in a wait state at the same time. The problem to focus on is not the known, limited heap, but rather the reason why these processes were unable to complete, so that eventually too many processes were active.

Indeed. Eliminating WeightGate might impact this because it will represent one less process per message.

I just did a search of errors in the SNF logs and didn't find anything unusual. I was unable to pinpoint the time of the problem -- that will require a harder analysis of the data.

Indications are that SNFServer didn't see any significant issues during the period covered by the two logs you sent. When clients talked to it they were served (according to the logs).

You're showing about 40 msg/minute on average. According to a spot check of log entries, SNFServer finishes processing these in an unmeasurable amount of time (0 indicates < 15 ms for setup, read, scan, and response). Most of the performance metrics in the logs show s='0' and t='0' -- setup time in ms and scan time in ms. On occasion I see some nonzero t values, but nothing unusual (16, 47, 63, etc).

You probably don't need a lot of threads active on your system. If you have provided for a high number then you might consider reducing that number...
Processing 1 message per second would exceed your average handily and doesn't take a lot of threads.

If for some reason you were hit with a large number of messages and put them to work in parallel then that might have exhausted the heap. The new SNF is much more efficient than the old one and so it would have more easily allowed this...

Sometimes introducing a more efficient component into a system exposes problems that were hidden by the previous less efficient component -- the less efficient component may have masked the problem by artificially reducing or shaping throughput. When we see this kind of thing we call it a "lens effect" -- the newer component reshapes the dynamics of the system and brings previously unknown problems "into focus".

It's possible the heap problem you experienced was caused by a "lens effect" since the new SNF engine is more efficient and would naturally allow for more messages to be handled concurrently in a burst than the previous version would have allowed.

A theory -- the previous version would naturally be constrained by I/O contention since it would need to create, scan, modify, and remove job control files. This would naturally couple performance to other I/O intensive operations such as writing new messages to the spool etc.

The new version does not have any of this overhead and so would allow for an unconstrained ramp-up of new instances -- that might lead to a higher number of concurrent tasks and cause heap exhaustion. After the heap is exhausted, additional tasks build up in a failed and partially initialized state. This typically continues until the failed tasks are manually removed -- since none of them is ever properly initialized, none of the tasks can time out, fail, or shut down on their own.

Hope this helps,

_M

--
Pete McNeil
Chief Scientist,
Arm Research Labs, LLC.
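The "lens effect" theory above can be illustrated with a toy model. This is only a sketch under stated assumptions, not a description of how SNF or IMail actually schedule work: the function name, the tick-based timing, and the HEAP_SLOTS figure are all hypothetical, and both cases use the same per-task lifetime so that only the launch rate differs. It compares a launcher whose ramp-up is throttled (for example by I/O contention on job control files) with one that starts every task in a burst at once.

```python
HEAP_SLOTS = 100  # hypothetical number of tasks the finite heap can hold


def peak_concurrency(burst_size, launches_per_tick, service_ticks):
    """Return the peak number of simultaneously live tasks for one burst.

    launches_per_tick (must be > 0) models how quickly new instances can
    start; service_ticks is how many ticks each instance stays alive.
    """
    live = []              # remaining ticks for each live task
    pending = burst_size   # messages still waiting to be launched
    peak = 0
    while pending or live:
        started = min(pending, launches_per_tick)
        pending -= started
        live.extend([service_ticks] * started)
        peak = max(peak, len(live))
        # Advance one tick: drop tasks that just finished, age the rest.
        live = [t - 1 for t in live if t > 1]
    return peak


# Same burst, same per-task lifetime; only the launch rate differs.
throttled = peak_concurrency(burst_size=200, launches_per_tick=5, service_ticks=4)
unthrottled = peak_concurrency(burst_size=200, launches_per_tick=200, service_ticks=4)

print("throttled peak:", throttled, "exhausts heap:", throttled > HEAP_SLOTS)
print("unthrottled peak:", unthrottled, "exhausts heap:", unthrottled > HEAP_SLOTS)
```

In this model the throttled launcher never holds more than launches_per_tick * service_ticks tasks at once, while the unthrottled one holds the whole burst, which is the sense in which removing a bottleneck can expose a limit that was previously never reached.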
[sniffer] Re: FW: [sniffer] Re: Sniffer 3.0 Froze Mail Server
Hi Pete,

Well, I eliminated WeightGate for the time being, just to do my "due diligence".

Also, since there is a fixed-size buffer, I assume actually LOWERING the 3rd number (the allocation for each non-interactive process) would allow for MORE parallel processes to run (as long as the value is still large enough to support each of the applications that rely on it).

Of course, I assume the "heap" issue in reality is actually a SECONDARY problem (a symptom of too many non-interactive tasks being launched and not completing). Since the 'heap' space is finite, there is a hard limit as to how many processes can be in a wait state at the same time. The problem to focus on is not the known, limited heap, but rather the reason why these processes were unable to complete, so that eventually too many processes were active.

Best Regards,
Andy

From: Pete McNeil [mailto:[EMAIL PROTECTED]]
Sent: Saturday, October 04, 2008 10:07 PM
To: Andy Schmidt
Cc: [EMAIL PROTECTED]
Subject: Re: FW: [sniffer] Re: Sniffer 3.0 Froze Mail Server

Hello Andy,

Saturday, October 4, 2008, 9:22:39 PM, you wrote:

> Hi Pete,
>
> Here are the log files. I can't tell you WHEN the problem was triggered. I was off site and was alerted around noon that the SMTP service had become unresponsive. I assumed it had crashed, but found it running. Thus I tried restarting the SMTP service, but after shutting down, it wouldn't allow me to restart. That's when I started looking a bit more closely.
>
> Once I realized that I had all these SNFclient processes running, I checked the event log to see if it would give me any clue. But since the errors had been occurring for a while, my system event log had wrapped around, so I couldn't tell when it actually started or how long it took between the actual problem and the SMTP service becoming unresponsive.
>
> This Imail server is a PowerEdge 2950, Quad CPU, 3GHz, with 2 GB of RAM, normally using about 1.5 GB of virtual RAM. On weekends, CPU load is usually below 10%.
> When this was going on, I didn't pay close attention because I wasn't quite sure yet what was going on and was trying to figure out how to get out of it. But, based on the memory use graph, I would guess it had maxed out 4 GB of virtual RAM, which eventually starved the SMTP service and prevented it from accepting more connections. As soon as I flushed the command line programs, the memory curve dropped very sharply, by easily half.
>
> Sorry - don't have anything more specific.

I've been watching your telemetry and I don't think the problem was triggered by an ordinary overload. Your message rate is not high enough to cause that -- SNFClients will only wait about 30 seconds or so at most if they are unable to make contact -- even on the busiest systems.

The other thing that strikes me is that you had to kill a lot of imailsrv.exe instances as well -- this is new and very different.

Once the "mystery heap" was exhausted I would expect SNFClient instances to build up in a broken state (0x142), but there is no good reason for imailsrv instances to build up that I can think of -- except maybe some kind of list processing event? (IIRC, imailsrv is called to handle list processing requests through an alias -- it's been a while).

I will check the SNF log to see if I can identify anything useful.

Thanks,

_M

--
Pete McNeil
Chief Scientist,
Arm Research Labs, LLC.
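The point that the message rate is too low for an ordinary overload can be checked with back-of-the-envelope arithmetic. The sketch below applies Little's law (average number in the system = arrival rate × time in the system) to figures quoted in this thread: about 40 messages per minute, under 15 ms per scan, and a client wait of about 30 seconds at most. The function name is hypothetical, and steady arrivals are assumed, so it says nothing about bursts.

```python
def mean_in_flight(msgs_per_minute, seconds_in_system):
    """Little's law: average number in flight = arrival rate * time in system."""
    return msgs_per_minute * seconds_in_system / 60.0


# Normal operation: each message is handled in under 15 ms.
normal = mean_in_flight(40, 0.015)

# Worst case for waiting clients: each one gives up after about 30 seconds.
stalled = mean_in_flight(40, 30)

print(f"normal operation: about {normal:.2f} clients in flight on average")
print(f"server unreachable: about {stalled:.0f} clients waiting on average")
```

Under these assumptions the average is roughly 0.01 clients in flight during normal operation and roughly 20 if every client waits out the full 30 seconds. A much larger pile-up therefore suggests processes that never time out at all, which is consistent with instances stuck in a partially initialized state.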