On Wed, Apr 1, 2009 at 1:08 PM, Tim Bass <tim.silkr...@gmail.com> wrote:

> I like the idea of clustering of mixture models and think that, with a
> bit of effort, it would not be too difficult to create initial first-order
> behavioral models.
> ...
> what might be the next steps?
>

I think the first steps are:

a) define and collect some sample data.  This should include real data and
some synthetic data for testing.

b) define the form of behavioral models for clustering

c) do an initial clustering

d) diagnose what didn't work as planned

You should say a bit about the data you have, but my guess is that you have
times, target host, general transaction type, source host, and possibly user
name.  For a first step, I would use times, target host, transaction type
(or protocol or port) and source host.  I would obfuscate the source host by
picking a random salt and hashing the source IP or host name (and then
forgetting the salt so the mapping can't be reversed).
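A minimal sketch of that obfuscation (assuming Python; the function name
and token length are my choices, nothing standard):

    import hashlib
    import os

    # Pick a random salt once per run, then discard it so the mapping
    # from real hosts to tokens cannot be reversed later.
    salt = os.urandom(16)

    def obfuscate_host(host):
        """Map a source IP or host name to a stable, opaque token."""
        return hashlib.sha256(salt + host.encode("utf-8")).hexdigest()[:16]

The same host maps to the same token within a run, so per-source structure
is preserved even though identities are hidden.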
For synthetic data, I would generate several data files:

1) a mixture of 2, 10, and 100 Poisson sources with rates selected over a
fairly wide range.  Each source should be identified by a single source host
and go to a single target host.

2) a mixture of 100 Poisson sources and 10 periodic sources with rates over
a wide range.

3) something more appropriate for the data that you have (and I don't know
about)
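For concreteness, here is one way data sets (1) and (2) might be generated
(a sketch assuming Python/numpy; the rate ranges, event counts and jitter
level are illustrative choices of mine):

    import numpy as np

    rng = np.random.default_rng(42)

    def poisson_source(rate, n_events):
        # A Poisson process has exponentially distributed inter-arrival times.
        return np.cumsum(rng.exponential(1.0 / rate, size=n_events))

    def periodic_source(rate, n_events, jitter=0.01):
        # Nearly periodic: a fixed interval of 1/rate plus a little noise.
        gaps = (1.0 / rate) * (1.0 + jitter * rng.standard_normal(n_events))
        return np.cumsum(gaps)

    # Data set (1): Poisson sources with rates spread over a wide range,
    # each tied to a single (source host, target host) pair.
    events = []
    for i, rate in enumerate(np.logspace(-2, 2, 100)):
        events += [(t, f"src{i:03d}", f"dst{i:03d}")
                   for t in poisson_source(rate, 1000)]

    # Data set (2): the Poisson sources plus 10 periodic sources.
    for i, rate in enumerate(np.logspace(-1, 1, 10)):
        events += [(t, f"psrc{i:02d}", f"pdst{i:02d}")
                   for t in periodic_source(rate, 1000)]

    events.sort()  # interleave everything into one time-ordered stream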

The simplest behavioral models would be Poisson sources with a known access
rate.  That would not, however, handle periodic or nearly periodic traffic.
My recommendation would be to start by modeling the times between
transactions with a gamma distribution for the real data; a gamma with
shape 1 is just an exponential, so this subsumes the Poisson case.
Initially, I would not look at source and target hosts in the models.
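A quick check of why the gamma shape parameter is the interesting knob
(assuming scipy; the rates and jitter here are arbitrary):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Gaps from a Poisson source (rate 2) and from a nearly periodic
    # source (interval 0.5 with 2% jitter).
    poisson_gaps = rng.exponential(0.5, size=5000)
    periodic_gaps = 0.5 * (1.0 + 0.02 * rng.standard_normal(5000))

    for name, gaps in [("poisson", poisson_gaps), ("periodic", periodic_gaps)]:
        shape, loc, scale = stats.gamma.fit(gaps, floc=0)  # pin location at 0
        print(name, "fitted shape:", round(shape, 1))

Expect a fitted shape near 1 for the Poisson source and a shape in the
thousands for the periodic source, whose gaps are tightly bunched.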

The initial clustering should be run on differing amounts of the synthetic
data from (1), using Poisson models.  For small amounts of data, the
clustering should identify the high frequency components pretty easily.
With more data, the lower frequency sources should become apparent.
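Nothing above pins down a particular clusterer.  As a bare-bones stand-in
for the idea, here is an EM sketch for a mixture of exponentially
distributed inter-arrival times, i.e. Poisson sources with unknown rates
(the initialization and names are my own choices):

    import numpy as np

    def em_exponential_mixture(gaps, k, n_iter=200, seed=0):
        """EM for a k-component mixture of exponentials over gap times."""
        rng = np.random.default_rng(seed)
        rates = rng.uniform(0.1, 10.0, size=k)   # initial rate guesses
        weights = np.full(k, 1.0 / k)
        for _ in range(n_iter):
            # E-step: responsibility of each component for each gap.
            dens = weights * rates * np.exp(-np.outer(gaps, rates))
            resp = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)
            # M-step: re-estimate weights and rates from weighted counts.
            nk = resp.sum(axis=0)
            weights = nk / len(gaps)
            rates = nk / (resp * gaps[:, None]).sum(axis=0)
        return weights, rates

Run on gaps from data set (1), the high rate components should lock on
with little data, while the low rate components only separate out as data
accumulates.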

With data (1) and gamma models, the system should determine that the sources
are largely Poisson.

With data (2) and Poisson models, it should be possible to show that the
periodic sources are not well modeled, largely because the periodic data
won't be attached to a single model.  With data (2) and gamma models, the
Poisson sources should have appropriate shape parameters (near 1) and the
periodic sources should have a narrow range of predicted times between
transactions (a very large shape parameter).

I don't know how much data will be required for this unlabeled source
separation task, and it may turn out to be difficult to see the low rate
signals against the high rate background.  The choice of priors will be
important to avoid describing every data set as singleton observations from
a large number of very low rate sources.  There may also be convergence
issues.  For example, if you have a Poisson source with rate 1 and a
periodic source with rate 0.1, the mixture of a Poisson and a very narrow
gamma distribution would fit the data very well, but it would be very hard
to notice by accident that roughly one event in ten arrives at a constant
interval.  With labels, this will be much easier, of course.

Depending on the results with synthetic data, it may be time to look at the
real data or we may need to move to more interesting models.

The next interesting model that I would be curious about would be one that
combines rate and target host.  This would likely be done using a gamma
distribution for rate and a multinomial for target host.  Each source host
should carry one or more traffic sources to emulate proxying and
multi-tasking, so source host would provide additional information without
identifying a traffic source by itself.  This could be used to specify a
multi-level generative model where each source host is modeled by multiple
traffic sources, and the number of traffic sources for each source host,
together with the individual rate and shape parameters, would be the latent
variables of the model.
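To make that concrete, a sampling sketch of the generative story
(everything here is illustrative: the Dirichlet concentration, the
parameter ranges and the cap on hidden sources per host are my guesses):

    import numpy as np

    rng = np.random.default_rng(7)
    targets = [f"dst{i:02d}" for i in range(20)]

    def sample_source_host(host, max_sources=4, n_events=200):
        """Emit events for one source host as a blend of hidden sources."""
        events = []
        # Latent: how many traffic sources this host hides
        # (proxying, multi-tasking).
        n_sources = rng.integers(1, max_sources + 1)
        for _ in range(n_sources):
            shape = rng.uniform(0.5, 50.0)    # latent gamma shape
            rate = rng.uniform(0.01, 10.0)    # latent event rate
            # Latent multinomial over target hosts.
            probs = rng.dirichlet(np.full(len(targets), 0.2))
            # Mean gap of gamma(shape, scale) is shape*scale, so use
            # scale = 1/(shape*rate) to get an average rate of `rate`.
            gaps = rng.gamma(shape, 1.0 / (shape * rate), size=n_events)
            dests = rng.choice(targets, size=n_events, p=probs)
            events += [(t, host, d) for t, d in zip(np.cumsum(gaps), dests)]
        events.sort()
        return events

Inference would then have to recover the number of sources per host, their
shapes and rates, and the per-source target multinomials from the merged
event stream.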





-- 
Ted Dunning, CTO
DeepDyve
