Re: Some questions for starters

Chuck Lam Tue, 07 Oct 2008 13:49:59 -0700

Werner,

The devil is in the details, but it sounds like your problem falls under the
general class of "anomaly detection," of which intrusion detection (as
mentioned by Isabel) is one application.

The basic kind of anomaly detection usually starts out with "hypothesis
testing" from statistics. Any introductory statistics book will have a
chapter on that. In particular, Student's t-test and chi-square testing are
often used. The textbook example of their application would be something
like deciding whether the medical condition of one group of people (who,
say, smoke) differs significantly from (i.e. is worse than) that of another
group.
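
To make the two-group comparison concrete, here is a minimal sketch of the
t statistic in Python. It uses Welch's variant (which does not assume equal
variances) rather than the classic pooled Student's form, and the function
name and the sample numbers are made up purely for illustration. You would
still compare |t| against a t-distribution table (or a stats library) to get
a p-value.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t, df): Welch's t statistic and its approximate
    degrees of freedom (Welch-Satterthwaite) for two samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # statistics.variance is the unbiased sample variance (divides by n - 1)
    se_a = statistics.variance(sample_a) / n_a
    se_b = statistics.variance(sample_b) / n_b
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical measurements for a "normal" group and a group under test.
normal_group = [1, 2, 3, 4, 5]
test_group = [2, 3, 4, 5, 6]
t, df = welch_t(normal_group, test_group)
print("t = %.3f, df = %.1f" % (t, df))
```

A large |t| relative to the critical value for df degrees of freedom
suggests the two group means really differ.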

Both Student's t-test and chi-square testing work by comparing two *groups*,
with one group generally considered "normal." If you want to decide whether
a *single* item/contract is anomalous, you usually build a statistical model
over the "normal" group and compute the probability of the single item
occurring under this model. Building a statistical model can be as simple as
fitting a Gaussian or can be super complicated. Again, it depends on the
details of your problem and what assumptions you can make.
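
For the simplest case (one numeric feature, fit a Gaussian to the "normal"
data), a rough sketch in Python might look like the following. The function
names, the sample values, and the 0.01 cutoff are all assumptions for
illustration only; a real contract would likely have several features and
might not be anywhere near Gaussian.

```python
import math
import statistics


def fit_gaussian(normal_values):
    """Estimate mean and sample standard deviation of the 'normal' group."""
    return statistics.mean(normal_values), statistics.stdev(normal_values)


def two_sided_tail_probability(value, mean, stdev):
    """P(|X - mean| >= |value - mean|) for X ~ Normal(mean, stdev^2)."""
    z = abs(value - mean) / stdev
    return math.erfc(z / math.sqrt(2))


def is_anomalous(value, normal_values, alpha=0.01):
    """Flag a value that is less likely than alpha under the fitted model.

    alpha is an arbitrary illustrative cutoff, not a recommendation.
    """
    mean, stdev = fit_gaussian(normal_values)
    return two_sided_tail_probability(value, mean, stdev) < alpha


# Hypothetical history of one feature for one client.
history = [98, 101, 100, 99, 102]
print(is_anomalous(100, history))
print(is_anomalous(1000, history))
```

Choosing the cutoff is the same "where should the limits come from"
question in disguise, but at least it is a single probability threshold
rather than one limit per feature per client.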

If you have labeled training data, then it becomes a classification problem.

I haven't thought through whether/which/how these algorithms can be
implemented in a Hadoop/MapReduce framework, but let's figure out which one
is more appropriate for your problem first...

On Tue, Oct 7, 2008 at 5:19 AM, Isabel Drost <[EMAIL PROTECTED]> wrote:

> On Monday 06 October 2008, werner mueller wrote:
> > of course there is the option: let the user choose the limits. this has
> > two drawbacks:
> >  - where should the user know the limits from? and
> >  - some user have to look at thousands of contracts.
> > so i would prefer the system to work on its own (as much as possible).
>
> To me your problem looks something like the following:
>
> You have a number of clients. For each client you store the same kind of
> data sets, but the exact values differ depending on the client.
>
> Your task is - given previous datasets - classify new incoming data as
> normal or unusual.
>
> You already have identified several features that might help you classify
> the incoming data. The only thing you are missing is for each client a
> good combination of the features.
>
> I see two possibilities of solving your problem:
>
> There are algorithms for instance in the intrusion detection community that
> deal with the problem of discovering unusual data in a stream of normal
> data. You might find the algorithms you are looking for there. Maybe
> someone more familiar with this area than myself can answer your questions
> on this list.
>
> The second possibility would be to manually label previous datasets as "ok"
> and "strange", train a classifier on it and apply it to new incoming data.
> Only problem here: You need labeled data for each client, you need to
> retrain each time the data changes.
>
> Isabel
>
> --
> Vail's Second Axiom:    The amount of work to be done increases in
> proportion to the amount of work already completed.
>  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
>  /,`.-'`'    -.  ;-;;,_
>  |,4-  ) )-,_..;\ (  `'-'
> '---''(_/--'  `-'\_) (fL)  IM:  <xmpp://[EMAIL PROTECTED]>
>
