Hi Numenta community, I'm currently enrolled at NCSU, working on a senior team partnered with a local company to create an anomaly detection model for the company's web traffic. We have been given a set of sample Apache server logs taken from their server, and have been building a small application around your NuPIC analysis model.
Initially, we investigated swarming for our task but deemed it inapplicable (with input from a member of your community, who pointed out that we did not have enough contiguous data to run a swarm correctly) and moved to using your basic model. Before developing an application (web server, database, JavaScript charting library, Python interpreter stack), we fed our sample data into our NuPIC model and received a simple text file with the results. According to that text file, the anomaly scores for certain periods of troublesome (outage) server response times matched our expectations, so we moved forward with development, confident that our foundation was sound.

The most recent iteration of our development exposed an issue we were not expecting: when visualizing periods of time (and plotting anomaly scores on the graph), we noticed that the anomaly scores were very different from what we expected. For instance, over a four-hour window in which the metric we are tracking (web page response time) spikes sharply, the anomaly score barely increases. We are testing whether we can visualize the data set without experiencing this problem, but I wanted to reach out promptly and ask whether it is clear we are doing anything incorrectly. We have not modified the NuPIC model that receives the data since starting work on the application, and we are having difficulty understanding what the issue might be. I can submit any files you need for review; it would be great to correspond and try to find the root of our problem.

We are using three Python files to track and analyze the web-data input: one to create a NuPIC model for each URL, a second to parse the large log files and load them into the database, and a third to detect anomalous scores from the model's results.
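To make the parsing step concrete, here is a minimal sketch of the kind of extraction our parser performs. The log format shown (an access-log line with the response time appended as a final microsecond field, as Apache's `%D` directive would produce) is an assumption for illustration, not necessarily the company's actual format:

```python
import re
from datetime import datetime

# Regex for an Apache access-log line in an assumed custom format that appends
# the response time in microseconds as the final field, e.g.:
#   127.0.0.1 - - [10/Oct/2014:13:55:36 -0400] "GET /index.html HTTP/1.1" 200 2326 104580
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) (?P<response_us>\d+)'
)

def parse_line(line):
    """Parse one log line into (timestamp, url, response_time_seconds), or None."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    # Drop the timezone offset for simplicity in this sketch.
    ts = datetime.strptime(m.group('time').split()[0], '%d/%b/%Y:%H:%M:%S')
    return ts, m.group('url'), int(m.group('response_us')) / 1e6
```

Each parsed tuple is then written to the database keyed by URL, which is what the aggregation step below consumes.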
To handle the large data volume, we aggregate over 15-second periods, storing the maximum and minimum response-time values as well as the number of requests (count) over each period. We input the average of the response times into the NuPIC model for the specific URL and store the returned anomaly score in our time-series database.

The problem surfaced after we added visualization: we use an API to query the database and feed the data to our charting library (Google Charts), where we display the aggregated time series along with the anomaly scores. It was then, with anomaly scores way off the mark (e.g. 0.106 for a large spike in response time), that we realized our anomaly detection process had been invalidated somewhere.

Ultimately, we are looking for guidance as to where the root of our problem may lie. I am sure I have missed some detail that would help you investigate, so I will stay tuned to provide whatever you need. We will be running tests with different data in an effort to standardize and isolate the issue, and I will include the results in my next message.

Thank you for your time and consideration of our problem.

Arthur Harris
NCSU Senior, CSC Department
Attachments: AnomalyTry.py, LogParser.py, NuPICModel.py
