[sniffer] Re: XCI Error!: snf_EngineHandler::MaxEvals

2007-11-03 Thread Pete McNeil
Hello Pi-Web,

Friday, November 2, 2007, 6:46:26 PM, you wrote:

> On 8438 " Min=0, Max=7211. (57 scans took above 1000, 6384 scans took less than 101).

> The server is rather old and serves web mail, POP3 and SMTP.
> Heavy usage of web mail does slow it down. This might be the cause of the
> slow scans.

> The long scans do not happen at the same time, but from time to time during the day.

> Still this should not "lock up" snfserver.

True. Is the server at least a p3?

I have many production servers running now (hundreds) that never lock
up -- so it's not likely that I will be able to reproduce this error.

I recommend running the SNFServer component from the command line and
piping its output to a file -- you might even run it in a loop in a
.cmd script so that it will be sure to restart after a crash.

By sending its output to a file we should be able to see any errors
that it reports on its way down. This will give us somewhere to look.
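
As a rough illustration of the run-in-a-loop-and-capture idea, here is a
minimal sketch in Python rather than a .cmd script. The function name, the
log file name and the restart limit are assumptions made for the example;
nothing here is part of SNF itself.

```python
import subprocess
import time


def run_and_capture(command, log_path, max_restarts=3, pause_seconds=1.0):
    """Run `command` up to `max_restarts` times, appending output to `log_path`.

    Returns the list of exit codes, one per run.
    """
    exit_codes = []
    for attempt in range(1, max_restarts + 1):
        with open(log_path, "a", encoding="utf-8") as log:
            log.write(f"--- start, attempt {attempt} ---\n")
            log.flush()  # make sure the marker precedes the child's output
            # stderr is merged into stdout so that anything reported on the
            # way down lands in the same file, in order.
            result = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
            log.write(f"--- exited with code {result.returncode} ---\n")
        exit_codes.append(result.returncode)
        if attempt < max_restarts:
            time.sleep(pause_seconds)  # brief pause before restarting
    return exit_codes
```

A call might look like run_and_capture([r"C:\path\to\SNFServer.exe"], "snfserver_output.log"),
where the path is a placeholder and any arguments the server needs would
have to be added. The start and exit markers make it easy to see which
lines were written just before each shutdown.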

> To call snf we use a DLL of our own development (plugged in to Merak mail server).

> The call to snfclient is done using WaitForSingleObject with an INFINITE wait
> time (perhaps we should change this).

I think that's correct.

However, since you've developed your own DLL, you might consider
bypassing the client altogether and connecting to the SNFServer w/ TCP
using the XCI protocol.

On your somewhat overloaded server, launching an external process may
be contributing to the performance reduction, at the very least.

> When it finishes - and it does - we get the snf result using GetExitCodeProcess.
> This returns zero (which is good, else all messages would be rejected) when
> the snfserver is in the "Could Not Connect!" state.

Right. The client will return a fail-safe result when it has a problem
getting a real result.
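
For comparison, here is a minimal sketch of a caller that applies the same
fail-safe idea on its own side with a bounded wait instead of an infinite
one. It is written in Python for brevity, not as Win32 DLL code. The
function name and the 30 second default are assumptions; the only detail
taken from this thread is that a zero result means the message is not
rejected.

```python
import subprocess

FAIL_SAFE_RESULT = 0  # zero means "do not reject", as described in the thread


def scan_with_timeout(command, timeout_seconds=30.0):
    """Run a scanner client and return its exit code, or a fail-safe result."""
    try:
        completed = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so no stray
        # process is left behind when the wait runs out.
        return FAIL_SAFE_RESULT
    except OSError:
        # The client could not be launched at all.
        return FAIL_SAFE_RESULT
    return completed.returncode
```

The trade-off is that a timeout which is too short turns slow scans into
unscanned messages, so the limit would need to sit well above the longest
scan times seen in the logs.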

I have changed the maximum number of evaluators in the code. I hope to
be able to put it up on the server some time today. However, I doubt
that has anything to do with what's happening here. The max eval error
is handled properly in the code and recovery is very simple and tidy.
This particular case has been around for as long as the engine has
been in place. Also, the max evals number is 1024 (now 2048) while
almost every scan recorded is well below 100.

This does cause me to wonder if it's a good idea to change this safety
check at all.

It seems more likely that the test is doing what it should - and
perhaps detecting some corrupt memory (you did say the server is old).
The test was originally designed as a sanity check to avoid having the
scanner run off in a tight loop allocating itself out of memory due to
a corrupt rulebase file.

Anyway --- I doubt that the max evals condition is directly connected
to the SNF Server shutdowns.

SNFServer should tell us why it shuts down when that happens and we
should be able to get that info if we run it from the command line and
capture its output.

Hope this helps,

_M

-- 
Pete McNeil
Chief Scientist,
Arm Research Labs, LLC.


#
This message is sent to you because you are subscribed to
  the mailing list .
To unsubscribe, E-mail to: <[EMAIL PROTECTED]>
To switch to the DIGEST mode, E-mail to <[EMAIL PROTECTED]>
To switch to the INDEX mode, E-mail to <[EMAIL PROTECTED]>
Send administrative queries to  <[EMAIL PROTECTED]>



[sniffer] Re: XCI Error!: snf_EngineHandler::MaxEvals

2007-11-02 Thread Pi-Web - Frank Jensen


On 8438 "

--
Best regards, Frank Jensen
[EMAIL PROTECTED]
www.pi.dk



Impressive, fascinating and huge
Posters, e.g. 149 x 149 = 629 kr
We can also make a poster from your digital photo

www.plakatkunst.dk






[sniffer] Re: XCI Error!: snf_EngineHandler::MaxEvals

2007-11-02 Thread Pete McNeil
Hello Pi-Web,

Friday, November 2, 2007, 5:04:47 PM, you wrote:

> The SNFserver.exe is present on the task list, so it will not automatically
> restart.

> "ERROR" in today's log:



>  text='ERROR_SYNC_FAILED'/>  context='SNF_NETWORK' code='99' text='ERROR_SYNC_FAILED'/>  u='20071102113453' context='SNF_NETWORK' code='99'
> text='ERROR_SYNC_FAILED'/>



The ERROR_SYNC_FAILED errors are caused by network congestion between
your systems and ours. Ping times are well above 120ms at the moment,
for example. I note that there are periods of time when there is no
trouble making the connection and your current telemetry also looks
good so we can ignore that error for the time being.

Your latest SYNC took only 290ms and occurred with no retries. Here is
my telemetry on that:



>  error='ERROR_MAX_EVALS'/>

The above scan  failed due to too many evaluators.

> 
> ... cut a lot...
> 

ERROR_MSG_FILE indicates that the SNFServer program was unable to open
or read the file. Something must have removed it before it could be
processed. This error is unrelated to the SYNC and MAX_EVALS errors.

I also noted that the SYNC errors do not seem to coincide closely with
the MSG_FILE errors. For now we will need to treat all three as
separate cases.

On some systems we have found cases where the system becomes so busy
that scans take too long and are then cancelled before they are
complete. This condition might account for some of the MSG_FILE
errors.

Is there a timeout on the mechanism that calls the SNFClient?
If there is, then we might be able to mitigate the ERROR_MSG_FILE
condition by extending that timeout.

Considering the SYNC errors -- they are not critical because the SNF
engine will tolerate them provided it is able to make a connection
most of the time. When a connection is made and the SYNC session is
successful then all of the data from previously unsuccessful sessions
is transferred in the process.
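
The carry-over behaviour described above can be pictured with a small
sketch. This is not the SNF SYNC implementation; the class, its method
names and the use of ConnectionError to signal a failed session are all
assumptions made for illustration.

```python
class SyncBacklog:
    """Holds data until a session succeeds, then hands it over in one go."""

    def __init__(self):
        self.pending = []

    def record(self, item):
        """Queue one item for the next successful session."""
        self.pending.append(item)

    def sync(self, send):
        """Try to send everything pending; keep it all if the session fails."""
        batch = list(self.pending)
        try:
            send(batch)
        except ConnectionError:
            return False  # nothing is discarded, the next session retries it
        # Remove only what was sent, in case items arrived in the meantime.
        del self.pending[:len(batch)]
        return True
```

With this shape, an occasional failed session costs nothing but delay,
which is why such errors can be tolerated as long as most sessions succeed.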

> " 

The  element always "belongs to" an  element. An  element
represents a single message scan. The  element describes the
system's performance during that scan.

In the case of the  element above, it took 0ms to setup the scan
(read the file etc) and then took 411ms to perform the scan. This
would usually indicate that your system is CPU bound. Normally an SNF
scan will take a very short time. This one took almost half a second.

The l indicates the length of the message scan in bytes and the d
indicates the scan depth. That is, the maximum number of evaluators
that were alive during the scan.
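
To pull these numbers out of a log, something like the following sketch
could be used. It assumes attributes are written as name='value' in single
quotes, as in the log fragments quoted in this thread, and that the setup
time is stored in an attribute named s; the t, l and d names come from this
message. The element name and the l and d values in the sample are made up.

```python
import re

# Matches name='value' pairs such as t='411'.
ATTRIBUTE = re.compile(r"(\w+)='([^']*)'")


def parse_attributes(element_text):
    """Return the name='value' pairs found in one log element as a dict."""
    return dict(ATTRIBUTE.findall(element_text))


def describe_performance(element_text):
    """Convert the four performance attributes to named integers."""
    attrs = parse_attributes(element_text)
    return {
        "setup_ms": int(attrs["s"]),      # assumed attribute name
        "scan_ms": int(attrs["t"]),
        "length_bytes": int(attrs["l"]),
        "depth": int(attrs["d"]),
    }


# Hypothetical sample: only the 0 ms setup and 411 ms scan time are from
# the thread; the element name and the other values are illustrative.
sample = "<p s='0' t='411' l='2048' d='84'/>"
print(describe_performance(sample))
```

A regular expression is used instead of an XML parser so that the sketch
also copes with partial elements copied out of a log.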

> ...
>  error='ERROR_MAX_EVALS'/>
> ...
> 

The  element here does not belong to the  element. It belongs
to a different scan.

Once the  element closes (with ) anything after that point
belongs to a different event.

---

I don't have any other reports of MAX_EVAL errors. That doesn't mean
that they are not out there, but it does mean that they are not
usually a problem for other folks.

I'm not sure what can be causing your SNFServer to crash -- it should
not be MAX_EVAL errors. They are handled safely by the code according
to what I've seen so far in my search.

Nonetheless, I will be increasing the max eval setting in the next
release and I will push it out sooner rather than later. Since you
have reported this problem I won't wait for the other features before
pushing out beta 1.6. If I can get to it tonight I will.

In the meantime, do you have any idea what might be causing your CPU
to be so heavily loaded that your SNF scans are taking 400+
milliseconds?

Do you have many  records that show high t values like that? (I do
see the 80 that you reported above. That's on the high end of normal).
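
One way to answer that question is to collect the t values from the log
and summarise them. The sketch below assumes the values have already been
extracted as integers in milliseconds; the function name and the two
default thresholds are assumptions chosen only for illustration.

```python
def summarize_scan_times(times_ms, slow_ms=1000, fast_ms=101):
    """Summarise scan times: range plus counts of slow and fast scans."""
    if not times_ms:
        raise ValueError("no scan times supplied")
    return {
        "count": len(times_ms),
        "min": min(times_ms),
        "max": max(times_ms),
        "slow": sum(1 for t in times_ms if t > slow_ms),  # strictly above
        "fast": sum(1 for t in times_ms if t < fast_ms),  # strictly below
    }
```

If the slow count is a small fraction of the total, the high values are
more likely to reflect bursts of load than a steady CPU shortage.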

Your telemetry shows about 10 msg/minute on average, 90% capture. This
seems a low number for such high scan times. In contrast, I have a
generic single CPU server that is currently showing 400-500 msg/minute
w/ times in the 20-30ms range consistently.

Hope this helps,

Thanks,

_M

-- 
Pete McNeil
Chief Scientist,
Arm Research Labs, LLC.





[sniffer] Re: XCI Error!: snf_EngineHandler::MaxEvals

2007-11-02 Thread Pi-Web - Frank Jensen


The SNFserver.exe is present on the task list, so it will not automatically restart.

"ERROR" in today's log:

error='ERROR_MAX_EVALS'/>

... cut a lot...

"
...
error='ERROR_MAX_EVALS'/>
...





--
Best regards, Frank Jensen
[EMAIL PROTECTED]
www.pi.dk









[sniffer] Re: XCI Error!: snf_EngineHandler::MaxEvals

2007-11-02 Thread Pete McNeil
Hello Pi-Web,

Friday, November 2, 2007, 2:01:30 PM, you wrote:



> On 31 Oct. the spam level rose: SNF was not validating the mails, and
> "snfclient.exe.err" shows lines like:
> C:\Program Files\Merak\temp\2007110215013101AF.tmp: Could Not Connect!

Could not connect indicates (most likely) that the SNFServer was down.
Any time the client produces a .err it is unusual. Normal errors are
reported to the SNFengine's log file(s).

> We restarted the SNFserver (running as a service) and scans ran smoothly
> until today (2 Nov.), when the same issue happened: "Could Not Connect!".
> No errors in between.

Something is knocking the server offline.

> The log also shows (first line):
> C:\Program Files\Merak\temp\200711021416181623.tmp: XCI Error!: snf_EngineHandler::MaxEvals

> I think this "MaxEvals" is what causes the error.
> Is it due to the engine getting too many mails to evaluate?

No. MaxEvals is a condition that is theoretically possible but
extremely rare. As a message is scanned, little "creatures" called
evaluators are created and re-used during the scan to identify any
patterns that might exist in the message. The scan depth metric
indicates the peak number of evaluators that were alive during the
scan. Normally this number is between 60 and 150 though it changes all
the time.

In order to detect possible rulebase corruption there is a hard-coded
limit to the number of evaluators that are allowed to live for a
particular scan. It is possible that this number needs to be adjusted.
That hasn't happened in a while - but since you're not getting any
other errors (that we know of) that's the most likely scenario.

The number of evaluators that are alive at one time for a particular
scan depends on the active rules in the rulebase and the data in the
message. The number is almost impossible to predict though it does
(and should) normally stay in a fairly restricted range.
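
The relationship between the evaluator count, the scan depth and the hard
limit can be sketched as follows. This is an illustration of the idea
only, not the engine's code; the class and method names are invented, and
the limit is passed in, not hard-coded.

```python
class MaxEvalsError(RuntimeError):
    """Raised when a scan would exceed the evaluator limit."""


class EvaluatorPool:
    """Tracks live evaluators for one scan and enforces a hard limit."""

    def __init__(self, max_evals):
        self.max_evals = max_evals
        self.alive = 0
        self.depth = 0  # peak number alive at once during this scan

    def spawn(self):
        """Bring one evaluator to life, or fail the scan at the limit."""
        if self.alive >= self.max_evals:
            # Sanity check: far more evaluators than any normal scan needs
            # suggests corrupt data, so stop instead of allocating more.
            raise MaxEvalsError(f"more than {self.max_evals} evaluators alive")
        self.alive += 1
        self.depth = max(self.depth, self.alive)

    def retire(self):
        """Release one evaluator that has finished its work."""
        if self.alive == 0:
            raise ValueError("no evaluator to retire")
        self.alive -= 1
```

Because depth records the peak and not the total, re-using evaluators
keeps it low even on long messages, which is why a sudden jump is a useful
warning sign.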

> How do we avoid this?

First, let's verify that there were no other errors. Please look in
your snf log files and check for any  elements. These will
describe any other errors that occurred.

If we find no other errors then I will make an adjustment to the
maximum evals metric and we will go from there.

While you are in your logs -- look at the  (performance) elements
and get an idea what the scan depth is typically. That will help us
compare your system to others and to determine what the new limit
should be.

Originally the scan depth limit was designed to help detect possible
corruption or unexpected conditions in the scanning engine. It's been
there since the first version. It's a kind of sanity check -- Most
likely it just needs to be adjusted since spam has changed so much
over the years. In the early days scan depths were consistently well
below 100 -- even in the 40-60 range. These days there are more
abstracts in the rulebase so more creatures are required to get a
comprehensive idea of what is in each message.

Another thing I will look at is that this exception should be handled
gracefully. I will look into this -- it may be that we want the
SNFServer to fail under these conditions because it is a clue to
something being out of adjustment -- In this case, probably just the
limit setting.

In the meantime, if you automatically restart your SNFServer after a
failure it should be safe and will pick up any waiting clients before
they fail in most cases.

> We also see this error, but this might be while restarting the service:
> C:\Program Files\Merak\temp\2007103119380319A0.tmp: XCI Error!: snf_EngineHandler::FileError

Most likely this is a request coming from an snfclient after the
message file has already been handled and moved out of the temp
folder.

The FileError exception indicates that the SNFServer could not open
and/or read the file. Normally this wouldn't appear in a .err file -
it would appear in the normal logs. If this error was in a snfclient
.err file then I may need to look at the client code again.

Hope this helps,

Thanks,

_M

-- 
Pete McNeil
Chief Scientist,
Arm Research Labs, LLC.

