On 2013-06-27 20:01, Matt wrote:
I'm attaching a snippet of my log.  About 100 lines past the start, where you see a smattering of error messages, you then see a large block of them while the Sniffer service is restarting, and then after that no errors at all.  There have in fact been no errors at all in several hours since this restart of Sniffer.

I can promise you that the error you're reporting comes directly from a problem with the file system. Here is the code where that error is generated. Put simply the code tries to open the file and determine it's size. If that doesn't work it throws the ERROR_MSG_FILE exception in one of two forms -- that's what ends up in the log.

    try {                                                                       // Try opening the message file.
        MessageFile.open(MessageFilePath.c_str(), ios::in | ios::binary);       // Open the file, binary mode.
        MessageFile.seekg(0, ios::end);                                         // Find the end of the file,
        MessageFileSize = MessageFile.tellg();                                  // read that position as the size,
        MessageFile.seekg(0, ios::beg);                                         // then go back to the beginning.
        MyScanData.ScanSize = MessageFileSize;                                  // Capture the message file size.
    }
    catch(...) {                                                                // Trouble? Throw FileError.
        MyRulebase->MyLOGmgr.logThisError(                                      // Log the error.
          MyScanData, "scanMessageFile().open",
          snf_ERROR_MSG_FILE, "ERROR_MSG_FILE"
        );
        throw FileError("snf_EngineHandler::scanMessageFile() Open/Seek");
    }

    if(0 >= MessageFileSize) {                                                  // Handle zero length files.
        MessageFile.close();                                                    // No need to keep this open.
        MyRulebase->MyLOGmgr.logThisError(                                      // Log the error.
          MyScanData, "scanMessageFile().isFileEmpty?",
          snf_ERROR_MSG_FILE, "ERROR_MSG_FILE"
        );
        throw FileError("snf_EngineHandler::scanMessageFile() FileEmpty!");
    }

Another clue is that in the log snippet you provide, there are hints of a problem brewing when there are sporadic instances of this error. Then, when there is a large block -- virtually all requests to open the files for scan are rejected by the OS. Either something made those files unavailable, or the OS was unable to handle the request. I find it interesting also that the time required to report the error started at about 172 milliseconds and continued to climb to 406, 578, and then 656 before the restart.

SNF does not make log entries in the classic log during a restart, by the way.

Note also the timestamps associated with these events and you can see that the event was precipitated by a dramatic rise in message rates. The first part of your log seems to indicate about 7-10 messages per second. During the large block of errors, the message rate appears to have been in excess of 120 (I counted approximately 126 at timestamp 20130627183819). That's an increase at least an order of magnitude higher than the rate that was causing sporadic errors.

I suspect based on the data you have provided that something on your system generated a very large spike of activity that your IO subsystem was unable to manage and this caused snf scans to fail because snf was unable to open the files it was asked to scan.

Your restart of SNF apparently coincided with the event, but since all of the SMD file names are unique during the event, and since SNF has no way to generate scan requests on it's own, SNF does not appear to have been the cause of the event in any way. It was able to record the event, none the less.

So the question in my mind now is:

* Is there a way to improve your IO subsystem so that it can gain some headroom above 10 msg/sec?
* What caused the sudden dramatic spike that led to this episode?

On most tiny systems I monitor, scan times are generally < 100 ms. On your system they are frequently in excess of 400 ms which leads me to believe your system is a bit underpowered for it's current load.

Hope this helps,
_M

-- 
Pete McNeil
Chief Scientist
ARM Research Labs, LLC
www.armresearch.com
866-770-1044 x7010
twitter/codedweller 

#############################################################

This message is sent to you because you are subscribed to

  the mailing list <sniffer@sortmonster.com>.

This list is for discussing Message Sniffer,

Anti-spam, Anti-Malware, and related email topics.

For More information see http://www.armresearch.com

To unsubscribe, E-mail to: <sniffer-...@sortmonster.com>

To switch to the DIGEST mode, E-mail to <sniffer-dig...@sortmonster.com>

To switch to the INDEX mode, E-mail to <sniffer-in...@sortmonster.com>

Send administrative queries to  <sniffer-requ...@sortmonster.com>

Reply via email to