[ 
https://issues.apache.org/jira/browse/METRON-562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15678076#comment-15678076
 ] 

ASF GitHub Bot commented on METRON-562:
---------------------------------------

Github user james-sirota commented on a diff in the pull request:

    https://github.com/apache/incubator-metron/pull/352#discussion_r88757519
  
    --- Diff: metron-analytics/metron-statistics/README.md ---
    @@ -0,0 +1,346 @@
    +# Statistics and Mathematical Functions
    +
    +A variety of non-trivial and advanced analytics make use of statistics
    +and advanced mathematical functions.  Particular, capturing the
    +statistical snapshots in a scalable way can open up doors for more
    +advanced analytics such as outlier analysis.  As such, this project is
    +aimed at capturing a robust set of statistical functions and
    +statistical-based algorithms in the form of Stellar functions.  These
    +functions can be used from everywhere where Stellar is used.
    +
    +##Stellar Functions
    +
    +### Mathematical Functions
    +* `ABS`
    +  * Description: Returns the absolute value of a number.
    +  * Input:
    +    * number - The number to take the absolute value of
    +  * Returns: The absolute value of the number passed in.
    +
    +
    +### Distributional Statistics
    +
    +* `STATS_ADD`
    +  * Description: Adds one or more input values to those that are used to 
calculate the summary statistics.
    +  * Input:
    +    * stats - The Stellar statistics object.  If null, then a new one is 
initialized.
    +    * value+ - One or more numbers to add
    +  * Returns: A Stellar statistics object
    +* `STATS_COUNT`
    +  * Description: Calculates the count of the values accumulated (or in the 
window if a window is used).
    +  * Input:
    +    * stats - The Stellar statistics object
    +  * Returns: The count of the values in the window or NaN if the 
statistics object is null.
    +* `STATS_GEOMETRIC_MEAN`
    +  * Description: Calculates the geometric mean of the accumulated values 
(or in the window if a window is used). See 
http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
 
    +  * Input:
    +    * stats - The Stellar statistics object
    +  * Returns: The geometric mean of the values in the window or NaN if the 
statistics object is null.
    +* `STATS_INIT`
    +  * Description: Initializes a statistics object
    +  * Input:
    +    * window_size - The number of input data values to maintain in a 
rolling window in memory.  If window_size is equal to 0, then no rolling window 
is maintained. Using no rolling window is less memory intensive, but cannot 
calculate certain statistics like percentiles and kurtosis.
    +  * Returns: A Stellar statistics object
    +* `STATS_KURTOSIS`
    +  * Description: Calculates the kurtosis of the accumulated values (or in 
the window if a window is used).  See 
http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
 
    +  * Input:
    +    * stats - The Stellar statistics object
    +  * Returns: The kurtosis of the values in the window or NaN if the 
statistics object is null.
    +* `STATS_MAX`
    +  * Description: Calculates the maximum of the accumulated values (or in 
the window if a window is used).
    +  * Input:
    +    * stats - The Stellar statistics object
    +  * Returns: The maximum of the accumulated values in the window or NaN if 
the statistics object is null.
    +* `STATS_MEAN`
    +  * Description: Calculates the mean of the accumulated values (or in the 
window if a window is used).
    +  * Input:
    +    * stats - The Stellar statistics object
    +  * Returns: The mean of the values in the window or NaN if the statistics 
object is null.
    +* `STATS_MERGE`
    +  * Description: Merges statistics objects.
    +  * Input:
    +    * statistics - A list of statistics objects
    +  * Returns: A Stellar statistics object
    +* `STATS_MIN`
    +  * Description: Calculates the minimum of the accumulated values (or in 
the window if a window is used).
    +  * Input:
    +    * stats - The Stellar statistics object
    +  * Returns: The minimum of the accumulated values in the window or NaN if 
the statistics object is null.
    +* `STATS_PERCENTILE`
    +  * Description: Computes the p'th percentile of the accumulated values 
(or in the window if a window is used).
    +  * Input:
    +    * stats - The Stellar statistics object
    +    * p - a double where 0 <= p < 1 representing the percentile
    +  * Returns: The p'th percentile of the data or NaN if the statistics 
object is null
    +* `STATS_POPULATION_VARIANCE`
    +  * Description: Calculates the population variance of the accumulated 
values (or in the window if a window is used).  See 
http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
 
    +  * Input:
    +    * stats - The Stellar statistics object
    +  * Returns: The population variance of the values in the window or NaN if 
the statistics object is null.
    +* `STATS_QUADRATIC_MEAN`
    +  * Description: Calculates the quadratic mean of the accumulated values 
(or in the window if a window is used).  See 
http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
 
    +  * Input:
    +    * stats - The Stellar statistics object
    +  * Returns: The quadratic mean of the values in the window or NaN if the 
statistics object is null.
    +* `STATS_SD`
    +  * Description: Calculates the standard deviation of the accumulated 
values (or in the window if a window is used).  See 
http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
 
    +  * Input:
    +    * stats - The Stellar statistics object
    +  * Returns: The standard deviation of the values in the window or NaN if 
the statistics object is null.
    +* `STATS_SKEWNESS`
    +  * Description: Calculates the skewness of the accumulated values (or in 
the window if a window is used).  See 
http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
 
    +  * Input:
    +    * stats - The Stellar statistics object
    +  * Returns: The skewness of the values in the window or NaN if the 
statistics object is null.
    +* `STATS_SUM`
    +  * Description: Calculates the sum of the accumulated values (or in the 
window if a window is used).
    +  * Input:
    +    * stats - The Stellar statistics object
    +  * Returns: The sum of the values in the window or NaN if the statistics 
object is null.
    +* `STATS_SUM_LOGS`
    +  * Description: Calculates the sum of the (natural) log of the 
accumulated values (or in the window if a window is used).  See 
http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
 
    +  * Input:
    +    * stats - The Stellar statistics object
    +  * Returns: The sum of the (natural) log of the values in the window or 
NaN if the statistics object is null.
    +* `STATS_SUM_SQUARES`
    +  * Description: Calculates the sum of the squares of the accumulated 
values (or in the window if a window is used).
    +  * Input:
    +    * stats - The Stellar statistics object
    +  * Returns: The sum of the squares of the values in the window or NaN if 
the statistics object is null.
    +* `STATS_VARIANCE`
    +  * Description: Calculates the variance of the accumulated values (or in 
the window if a window is used).  See 
http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.2_Descriptive_statistics
 
    +  * Input:
    +    * stats - The Stellar statistics object
    +  * Returns: The variance of the values in the window or NaN if the 
statistics object is null.
    +
    +### Statistical Outlier Detection
    +
    +* `OUTLIER_MAD_STATE_MERGE`
    +  * Description: Update the statistical state required to compute the 
Median Absolute Deviation.
    +  * Input:
    +    * [state] - A list of Median Absolute Deviation States to merge.  
Generally these are states across time.
    +    * currentState? - The current state (optional)
    +  * Returns: The Median Absolute Deviation state
    +* `OUTLIER_MAD_ADD`
    +  * Description: Add a piece of data to the state.
    +  * Input:
    +    * state - The MAD state
    +    * value - The numeric value to add
    +  * Returns: The MAD state
    +* `OUTLIER_MAD_SCORE`
    +  * Description: Get the modified z-score normalized by the MAD: scale * | 
x_i - median(X) | / MAD.  See the first page of 
http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/BetterThanMAD.pdf
    +  * Input:
    +    * state - The MAD state
    +    * value - The numeric value to score
    +    * scale? - Optionally the scale to use when computing the modified 
z-score.  Default is `0.6745`, see the first page of 
http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/BetterThanMAD.pdf
    +  * Returns: The modified z-score 
    +
    +# Outlier Analysis
    +
    +A common desire is to find anomalies in numerical data.  To that end,
    +we have some simple statistical anomaly detectors.
    +
    +## Median Absolute Deviation
    +
    +Much has been written about this robust estimator.  See the first page
    +of 
http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/BetterThanMAD.pdf
    +for a good coverage of the good and the bad of MAD.  The usage, however
    +is fairly straightforward:
    +* Gather the statistical state required to compute the MAD
    +  * The distribution of the values of a univariate random variable over 
time.
    +  * The distribution of the absolute deviations of the values from the 
median.
    +* Use this statistical state to score unseen values.  The higher the 
score, the more unlike the previously seen data the value is.
    +
    +There are a couple of issues which make MAD a bit hard to compute.
    +First, the statistical state requires computing median, which can be
    +computationally expensive to compute exactly.  To get around this, we
    +use the OnlineStatisticalProvider to compute a sketch rather than the
    +exact median.  Secondly, the statistical state for seasonal data should
    +be limited to a fixed, trailing window.  We do this by ensuring that the
    +MAD state is mergeable and able to be queried from within the Profiler.
    +
    +### Example
    +
    +We will create a dummy data stream of gaussian noise to illustrate how
    +to use the MAD functionality along with the profiler to tag messages as
    +outliers or not.
    +
    +To do this, we will create a 
    +* data generator
    +* parser
    +* profiler profile
    +* enrichment and threat triage
    +
    +#### Data Generator
    +
    +We can create a simple python script to generate a stream of gaussian
    +noise at the frequency of one message per second as a python script
    +which should be saved at `~/rand_gen.py`:
    +```
    +#!/usr/bin/python
    +import random
    +import sys
    +import time
    +def main():
    +  mu = float(sys.argv[1])
    +  sigma = float(sys.argv[2])
    +  freq_s = int(sys.argv[3])
    +  while True:
    +    print str(random.gauss(mu, sigma))
    +    sys.stdout.flush()
    +    time.sleep(freq_s)
    +
    +if __name__ == '__main__':
    +  main()
    +```
    +
    +This script will take the following as arguments:
    +* The mean of the data generated
    +* The standard deviation of the data generated
    +* The frequency (in seconds) of the data generated
    +
    +#### The Parser
    +
    +We will create a parser that will take the single numbers in and create
    +a message with a field called `value` in them using the `CSVParser`.
    +
    +Add the following file to
    +`$METRON_HOME/config/zookeeper/parsers/mad.json`:
    +```
    +{
    +  "parserClassName" : "org.apache.metron.parsers.csv.CSVParser"
    + ,"sensorTopic" : "mad"
    + ,"parserConfig" : {
    +    "columns" : {
    +      "value_str" : 0
    +                }
    +                   }
    + ,"fieldTransformations" : [
    +    {
    +    "transformation" : "STELLAR"
    +   ,"output" : [ "value" ]
    +   ,"config" : {
    +      "value" : "TO_DOUBLE(value_str)"
    +               }
    +    }
    +                           ]
    +}
    +```
    +
    +#### Enrichment and Threat Intel
    +
    +We will set a threat triage level of `10` if a message generates a outlier 
score of more than 3.5.
    +This cutoff will depend on your data and should be adjusted based on the
    +assumed underlying distribution.  Note that under the assumptions of
    +normality, MAD will act as a robust estimator of the standard deviation, 
so the cutoff
    +should be considered the number of standard deviations away.  For other
    +distributions, there are other interpretations which will make sense in
    +the context of measuring the "degree different".  See
    
+http://eurekastatistics.com/using-the-median-absolute-deviation-to-find-outliers/
    +for a brief discussion of this.
    +
    +Create the following in
    +`$METRON_HOME/config/zookeeper/enrichments/mad.json`:
    +
    +```
    +{
    +  "index": "mad",
    +  "batchSize": 1,
    +  "enrichment": {
    +    "fieldMap": {
    +      "stellar" : {
    +        "config" : {
    +          "parser_score" : "OUTLIER_MAD_SCORE(OUTLIER_MAD_STATE_MERGE(
    +PROFILE_GET( 'sketchy_mad', 'global', 10, 'MINUTES') ), value)"
    +         ,"is_alert" : "if parser_score > 3.5 then true else is_alert"
    --- End diff --
    
    Can 3.5 be pulled out into a hyper parameter?


> Add rudimentary statistical outlier detection
> ---------------------------------------------
>
>                 Key: METRON-562
>                 URL: https://issues.apache.org/jira/browse/METRON-562
>             Project: Metron
>          Issue Type: New Feature
>            Reporter: Casey Stella
>            Assignee: Casey Stella
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> With the advent of the profiler, we can now capture state.  Furthermore, with 
> Stellar, we can capture statistical summaries.  We should provide rudimentary 
> outlier detection functionality in the form of Stellar functions that can 
> operate on captured state from the profiler.
> To begin, we should enable simple outlier tests using distance from a central 
> measure such as Median Absolute Deviation (see 
> http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to