On Wed, Apr 1, 2009 at 1:08 PM, Tim Bass <tim.silkr...@gmail.com> wrote:
> I like the idea of clustering of mixture models and, think with a bit of
> effort, it would not be too difficult to create initial first order
> behavioral models.
> ...
> what might be the next steps?

I think the first steps are:

a) define and collect some sample data. This should include real data and
   some synthetic data for testing.
b) define the form of behavioral models for clustering
c) do an initial clustering
d) diagnose what didn't work as planned

You should say a bit about the data you have, but my guess is that you have
times, target host, general transaction type, source host and possibly user
name. For a first step, I would use times, target host, transaction type (or
protocol or port) and source host. I would obfuscate the source host by
picking a random salt, hashing the salted source IP or host name, and then
forgetting the salt.

For synthetic data, I would generate several data files:

1) a mixture of 2, 10 and 100 Poisson sources with rates selected over a
   fairly wide range. Each source should be identified by a single source
   host and go to a single target host.
2) a mixture of 100 Poisson sources and 10 periodic sources with rates over
   a wide range.
3) something more appropriate for the data that you have (and that I don't
   know about)

The simplest behavioral models would be Poisson sources with a known access
rate. That would not, however, handle periodic or nearly periodic traffic.
For the real data, my recommendation would be to start with models in which
the times between transactions follow a gamma distribution. Initially, I
would not include source and target hosts in the models.

The initial clustering should be run with differing amounts of the synthetic
data from (1) and Poisson models. For small amounts of data, the clustering
should identify the high frequency components pretty easily. With more data,
the lower frequency sources should become apparent.
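As an aside, the synthetic data in (1) and the salted-hash obfuscation can be sketched in a few lines of Python. This is only a toy sketch; the event layout, rate range, and host names are my own assumptions:

```python
import hashlib
import os
import random

def obfuscate(host, salt):
    # Hash the host name with a random salt; once the salt is forgotten,
    # the mapping back to the real host is irreversible.
    return hashlib.sha256(salt + host.encode()).hexdigest()[:12]

def poisson_source(rate, n_events, rng):
    # Exponential inter-arrival times yield a Poisson process at this rate.
    t = 0.0
    for _ in range(n_events):
        t += rng.expovariate(rate)
        yield t

def make_dataset(n_sources, n_events=100, seed=42):
    rng = random.Random(seed)
    salt = os.urandom(16)  # forget this after obfuscating
    events = []
    for i in range(n_sources):
        rate = 10 ** rng.uniform(-2, 2)  # rates over a fairly wide range
        src = obfuscate("source-%d" % i, salt)
        tgt = "target-%d" % i  # one source host, one target host per source
        for t in poisson_source(rate, n_events, rng):
            events.append((t, src, tgt))
    events.sort()  # interleave the sources by event time
    return events

# e.g. the 10-source mixture from case (1)
data = make_dataset(10)
```

The 2- and 100-source variants just change the first argument; case (2) would add sources whose gaps are nearly constant rather than exponential.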
With data (1) and gamma models, the system should determine that the sources
are largely Poisson. With data (2) and Poisson models, it should be possible
to show that the periodic sources are not well modeled, largely because the
periodic data won't be attached to a single model. With data (2) and gamma
models, the Poisson sources should have appropriate shape parameters and the
periodic sources should have a narrow range of predicted times between
transactions.

I don't know how much data will be required for this unlabeled source
separation task, and it may turn out to be difficult to see the low-rate
signals against the high-rate background. The choice of priors will be
important to avoid describing every data set as singleton observations from
a large number of very low rate sources. There may also be convergence
issues. For example, if you have a Poisson source with rate 1 and a periodic
source with rate 0.1, a mixture of a Poisson and a very narrow gamma
distribution would fit the data very well, but it would be very hard to
notice by accident that roughly every tenth event arrives at a constant
interval. With labels, this will be much easier, of course.

Depending on the results with synthetic data, it may then be time to look at
the real data, or we may need to move to more interesting models. The next
model I would be curious about is one that combines rate and target host,
likely using a gamma distribution for rate and a multinomial for target
host. One or more sources should be shared by individual source hosts to
emulate proxying and multi-tasking, so source host would provide additional
information. This could be used to specify a multi-level generative model in
which each source host is modeled by multiple traffic sources, with the
number of traffic sources per source host and the individual rate and shape
parameters as the latent variables of the model.

-- 
Ted Dunning, CTO DeepDyve
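To make "appropriate shape parameters" versus "a narrow range of predicted times" concrete: a gamma shape fitted to inter-event gaps is near 1 for a Poisson source and very large for a nearly periodic one. Here is a toy method-of-moments sketch in Python (my own illustration, not a proposal for the actual clustering code):

```python
import random

def gamma_moments(intervals):
    # Method-of-moments fit of a gamma distribution to inter-event times:
    # shape k = mean^2 / variance, rate = mean / variance.
    n = len(intervals)
    mean = sum(intervals) / n
    var = sum((x - mean) ** 2 for x in intervals) / n
    return mean * mean / var, mean / var  # (shape k, rate)

rng = random.Random(1)

# Poisson source: exponential gaps, so the fitted shape should be near 1.
poisson_gaps = [rng.expovariate(2.0) for _ in range(5000)]
k_poisson, _ = gamma_moments(poisson_gaps)

# Nearly periodic source: gaps tightly clustered around 10, so the fitted
# shape is very large, i.e. a very narrow gamma.
periodic_gaps = [rng.gauss(10.0, 0.1) for _ in range(5000)]
k_periodic, _ = gamma_moments(periodic_gaps)

# k_poisson comes out close to 1, k_periodic in the thousands.
```

A full solution would of course fit the mixture jointly (e.g. by EM with priors on the rates), but this is the signal the shape parameter carries.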