[sniffer] Re: XCI Error!: snf_EngineHandler::MaxEvals
Hello Pi-Web,

Friday, November 2, 2007, 6:46:26 PM, you wrote:

> On 8438 " Min=0, Max=7211. (57 scans took above 1000, 6384 scans took less than 101).
> The server is rather old and serving both web mail, pop3 and smtp.
> And heavy usage of web mail does slow it down. This might be the case on the
> slow scans.
> The long scans are not at the same time, but from time to time during the day.
> Still this should not "lock up" snfserver.

True. Is the server at least a P3?

I have many production servers running now (hundreds) that never lock up, so it's not likely that I will be able to reproduce this error.

I recommend running the SNFServer component from the command line and piping its output to a file. You might even run it in a loop in a .cmd script so that it is sure to restart after a crash. By sending its output to a file we should be able to see any errors that it reports on its way down. This will give us somewhere to look.

> To call snf we use a DLL of our own development (plugged in to Merak mail server).
> The call to snfclient is done using WaitForSingleObject with an INFINITE wait
> time.
> (perhaps we should change this).

I think that's correct. However, since you've developed your own DLL, you might consider bypassing the client altogether and connecting to the SNFServer over TCP using the XCI protocol. On your somewhat overloaded server, launching an external process may be contributing to the performance reduction at the very least.

> When it finishes - and it does - we get the snf result using GetExitCodeProcess.
> This returns zero (which is good, else all messages would be rejected) when
> the snfserver is in the "Could Not Connect!" state.

Right. The client will return a fail-safe result when it has a problem getting a real result.

I have changed the maximum number of evaluators in the code. I hope to be able to put it up on the server some time today. However, I doubt that has anything to do with what's happening here.
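If a bounded wait is ever preferred over INFINITE, the idea can be sketched as below. This is only an illustration in Python rather than the Win32 calls used in the DLL: the function name, the timeout value, and the choice of 0 as the fail-safe result are assumptions based on the description above (zero lets the message through), and the command passed in stands in for however the client is actually launched.

```python
import subprocess

# Assumption: 0 is the "let the message through" result, as described in the thread.
FAIL_SAFE_RESULT = 0


def scan_with_timeout(cmd, timeout_s):
    """Run a scanner client command and return its exit code.

    If the client cannot be started or does not finish within timeout_s
    seconds, return the fail-safe result instead of blocking forever.
    """
    try:
        # subprocess.run kills the child and waits for it before raising
        # TimeoutExpired, so no stray process is left behind.
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return FAIL_SAFE_RESULT
    except OSError:
        # The executable is missing or could not be launched.
        return FAIL_SAFE_RESULT
    return completed.returncode
```

A generous timeout is probably the safer choice here, since a scan that is cancelled too early can leave the server looking for a message file that has already been removed.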
The max eval error is handled properly in the code and recovery is very simple and tidy. This particular case has been around for as long as the engine has been in place. Also, the max evals number is 1024 (now 2048) while almost every scan recorded is well below 100.

This does cause me to wonder if it's a good idea to change this safety check at all. It seems more likely that the test is doing what it should - and perhaps detecting some corrupt memory (you did say the server is old). The test was originally designed as a sanity check to avoid having the scanner run off in a tight loop allocating itself out of memory due to a corrupt rulebase file.

Anyway, I doubt that the max evals condition is directly connected to the SNFServer shutdowns. SNFServer should tell us why it shuts down when that happens, and we should be able to get that info if we run it from the command line and capture its output.

Hope this helps,

_M

--
Pete McNeil
Chief Scientist,
Arm Research Labs, LLC.

#
This message is sent to you because you are subscribed to the mailing list .
To unsubscribe, E-mail to: <[EMAIL PROTECTED]>
To switch to the DIGEST mode, E-mail to <[EMAIL PROTECTED]>
To switch to the INDEX mode, E-mail to <[EMAIL PROTECTED]>
Send administrative queries to <[EMAIL PROTECTED]>
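The restart-and-capture loop suggested above could be sketched roughly as follows. This is a Python stand-in for the .cmd loop, not a tested recipe for SNFServer itself: the function name, the log markers, and the bounded restart count are all assumptions made for illustration, and the command to run (including any arguments SNFServer needs) is left to the caller.

```python
import subprocess
import time


def run_and_restart(cmd, log_path, max_restarts, delay_s=1.0):
    """Run cmd up to max_restarts times, appending its output to log_path.

    Returns the list of exit codes, one per run. stdout and stderr are
    both sent to the log so anything printed on the way down is kept.
    """
    exit_codes = []
    with open(log_path, "a", encoding="utf-8") as log:
        for attempt in range(max_restarts):
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(f"--- start attempt {attempt + 1} at {stamp}\n")
            log.flush()  # keep our marker ahead of the child's output
            result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
            exit_codes.append(result.returncode)
            log.write(f"--- exited with code {result.returncode}\n")
            log.flush()
            time.sleep(delay_s)  # short pause before the next start
    return exit_codes
```

The restart count is bounded here only so that the sketch terminates; a production loop would more likely run until stopped by hand.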
[sniffer] Re: XCI Error!: snf_EngineHandler::MaxEvals
Hello Pi-Web,

Friday, November 2, 2007, 5:04:47 PM, you wrote:

> The SNFserver.exe is present on the task list, so it will not automatically
> restart.
> "ERROR" in today's log:
> text='ERROR_SYNC_FAILED'/>
> context='SNF_NETWORK' code='99' text='ERROR_SYNC_FAILED'/>
> u='20071102113453' context='SNF_NETWORK' code='99' text='ERROR_SYNC_FAILED'/>

The ERROR_SYNC_FAILED errors are caused by network congestion between your systems and ours. Ping times are well above 120ms at the moment, for example. I note that there are periods of time when there is no trouble making the connection, and your current telemetry also looks good, so we can ignore that error for the time being. Your latest SYNC took only 290ms and occurred with no retries. Here is my telemetry on that:

> error='ERROR_MAX_EVALS'/>

The above scan failed due to too many evaluators.

> ... cut a lot...

ERROR_MSG_FILE indicates that the SNFServer program was unable to open or read the file. Something must have removed it before it could be processed.

This error is unrelated to the SYNC and MAX_EVALS errors. I also noted that the SYNC errors do not seem to coincide closely with the MSG_FILE errors. For now we will need to treat all three as separate cases.

On some systems we have found cases where the system becomes so busy that scans take too long and are then cancelled before they are complete. This condition might account for some of the MSG_FILE errors. Is there a timeout on the mechanism that calls the SNFClient? If there is, then we might be able to mitigate the ERROR_MSG_FILE condition by extending that timeout.

Considering the SYNC errors -- they are not critical because the SNF engine will tolerate them provided it is able to make a connection most of the time. When a connection is made and the SYNC session is successful, all of the data from previously unsuccessful sessions is transferred in the process.

> "

The performance element always "belongs to" a scan element. A scan element represents a single message scan. The performance element describes the system's performance during that scan.

In the case of the element above, it took 0ms to set up the scan (read the file etc.) and then took 411ms to perform the scan. This would usually indicate that your system is CPU bound. Normally an SNF scan will take a very short time. This one took almost half a second.

The l indicates the length of the message scan in bytes and the d indicates the scan depth, that is, the maximum number of evaluators that were alive during the scan.

> ...
> error='ERROR_MAX_EVALS'/>
> ...

The performance element here does not belong to that scan element. It belongs to a different scan. Once the scan element closes, anything after that point belongs to a different event.

---

I don't have any other reports of MAX_EVAL errors. That doesn't mean that they are not out there, but it does mean that they are not usually a problem for other folks.

I'm not sure what can be causing your SNFServer to crash -- it should not be MAX_EVAL errors. They are handled safely by the code according to what I've seen so far in my search.

Nonetheless, I will be increasing the max eval setting in the next release and I will push it out sooner rather than later. Since you have reported this problem I won't wait for the other features before pushing out beta 1.6. If I can get to it tonight I will.

In the meantime, do you have any idea what might be causing your CPU to be so heavily loaded that your SNF scans are taking 400+ milliseconds? Do you have many records that show high t values like that? (I do see the 80 that you reported above. That's on the high end of normal.)

Your telemetry shows about 10 msg/minute on average, 90% capture. This seems a low number for such high scan times. In contrast, I have a generic single-CPU server that is currently showing 400-500 msg/minute with times consistently in the 20-30ms range.

Hope this helps,

Thanks,

_M

--
Pete McNeil
Chief Scientist,
Arm Research Labs, LLC.
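To answer the question about how common high t values are, a quick summary of the log can help. The sketch below is a rough illustration only: it pulls numeric attributes such as t (scan time in ms) and d (scan depth) out of log text with a regular expression, so it does not depend on element names. The sample lines, including the tag names and the l values, are made up for the example and are not copied from a real log; the thresholds mirror the "above 1000" and "less than 101" buckets mentioned elsewhere in this thread.

```python
import re

# Matches single-letter numeric attributes such as t='411' or d='80'.
# Longer names (text=, code=, context=) are deliberately not matched.
ATTR = re.compile(r"\b([a-z])='(\d+)'")


def summarize(log_text, attr="t", slow=1000, fast=101):
    """Summarize one numeric attribute across all elements in log_text."""
    values = [int(v) for name, v in ATTR.findall(log_text) if name == attr]
    if not values:
        return None
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "above_slow": sum(v > slow for v in values),
        "below_fast": sum(v < fast for v in values),
    }


# Hypothetical sample: tag names and values are assumptions for illustration.
SAMPLE = """
<s u='20071102113453'><p s='0' t='411' l='2048' d='80'/></s>
<s u='20071102113455'><p s='10' t='25' l='900' d='64'/></s>
<s u='20071102113459'><p s='0' t='1500' l='5000' d='120'/></s>
"""

if __name__ == "__main__":
    print("scan time:", summarize(SAMPLE, "t"))
    print("scan depth:", summarize(SAMPLE, "d"))
```

Running the same function with attr="d" gives a picture of typical scan depth, which is the number to compare against the max evals limit.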
[sniffer] Re: XCI Error!: snf_EngineHandler::MaxEvals
Hello Pi-Web,

Friday, November 2, 2007, 2:01:30 PM, you wrote:

> 31st oct. spam level raised, SNF was not validating the mails, the
> "snfclient.exe.err"
> shows lines like:
> C:\Program Files\Merak\temp\2007110215013101AF.tmp: Could Not Connect!

"Could Not Connect!" indicates (most likely) that the SNFServer was down. Any time the client produces a .err file it is unusual. Normal errors are reported to the SNF engine's log file(s).

> We restarted the SNFserver (running as a service) and scans ran smoothly
> until today (2nd nov.),
> where the same issue happened: "Could Not Connect!". No errors in between.

Something is knocking the server offline.

> The log also shows: (first line).
> C:\Program Files\Merak\temp\200711021416181623.tmp: XCI Error!:
> snf_EngineHandler::MaxEvals
> Think this "MaxEvals" is what causes the error.
> Is it due to the engine getting too many mails to evaluate?

No. MaxEvals is a condition that is theoretically possible but extremely rare.

As a message is scanned, little "creatures" called evaluators are created and re-used during the scan to identify any patterns that might exist in the message. The scan depth metric indicates the peak number of evaluators that were alive during the scan. Normally this number is between 60 and 150, though it changes all the time.

In order to detect possible rulebase corruption there is a hard-coded limit to the number of evaluators that are allowed to live for a particular scan. It is possible that this number needs to be adjusted. That hasn't happened in a while - but since you're not getting any other errors (that we know of) that's the most likely scenario.

The number of evaluators that are alive at one time for a particular scan depends on the active rules in the rulebase and the data in the message. The number is almost impossible to predict, though it does (and should) normally stay in a fairly restricted range.

> How do we avoid this?

First, let's verify that there were no other errors. Please look in your snf log files and check for any error elements. These will describe any other errors that occurred. If we find no other errors then I will make an adjustment to the maximum evals metric and we will go from there.

While you are in your logs, look at the performance elements and get an idea of what the scan depth typically is. That will help us compare your system to others and determine what the new limit should be.

Originally the scan depth limit was designed to help detect possible corruption or unexpected conditions in the scanning engine. It's been there since the first version. It's a kind of sanity check. Most likely it just needs to be adjusted since spam has changed so much over the years. In the early days scan depths were consistently well below 100 -- even in the 40-60 range. These days there are more abstracts in the rulebase, so more creatures are required to get a comprehensive idea of what is in each message.

Another thing I will look at is that this exception should be handled gracefully. I will look into this -- it may be that we want the SNFServer to fail under these conditions because it is a clue to something being out of adjustment. In this case, probably just the limit setting.

In the meantime, if you automatically restart your SNFServer after a failure it should be safe, and in most cases it will pick up any waiting clients before they fail.

> We also see this error, but this might be while restarting the service:
> C:\Program Files\Merak\temp\2007103119380319A0.tmp: XCI Error!:
> snf_EngineHandler::FileError

Most likely this is a request coming from an snfclient after the message file has already been handled and moved out of the temp folder. The FileError exception indicates that the SNFServer could not open and/or read the file.

Normally this wouldn't appear in a .err file - it would appear in the normal logs. If this error was in an snfclient .err file then I may need to look at the client code again.

Hope this helps,

Thanks,

_M

--
Pete McNeil
Chief Scientist,
Arm Research Labs, LLC.
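For readers who want a feel for what "scan depth" and the evaluator limit mean, here is a toy model. It is emphatically not the SNF engine: the real evaluators and rulebase format are not described in this thread, so the matcher below (plain substring patterns, one evaluator per partial match) and every name in it are assumptions chosen only to show a peak-count metric guarded by a hard sanity limit.

```python
class MaxEvalsError(Exception):
    """Raised when a scan needs more live evaluators than the limit allows."""


def scan(message, patterns, max_evals=1024):
    """Toy multi-pattern scan. Returns (matched_patterns, scan_depth).

    Each evaluator is a (pattern, position) pair tracking one partial match.
    scan_depth is the peak number of evaluators alive at the same time.
    Patterns are assumed to be non-empty strings.
    """
    live = []
    matched = set()
    depth = 0
    for ch in message:
        # Existing evaluators advance; a new one is tried for every pattern.
        candidates = live + [(p, 0) for p in patterns]
        live = []
        for pattern, pos in candidates:
            if pattern[pos] != ch:
                continue  # this evaluator dies
            if pos + 1 == len(pattern):
                matched.add(pattern)  # full match, evaluator is done
            else:
                live.append((pattern, pos + 1))
        if len(live) > max_evals:
            # Sanity check: something is wrong, stop instead of growing.
            raise MaxEvalsError(f"{len(live)} evaluators alive, limit {max_evals}")
        depth = max(depth, len(live))
    return matched, depth
```

In this toy, a repetitive message scanned against a repetitive pattern keeps many partial matches alive at once, which is the kind of growth the limit is there to stop.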