http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/alg-ref/DescriptiveBivarStats.tex ---------------------------------------------------------------------- diff --git a/alg-ref/DescriptiveBivarStats.tex b/alg-ref/DescriptiveBivarStats.tex new file mode 100644 index 0000000..a2d3db1 --- /dev/null +++ b/alg-ref/DescriptiveBivarStats.tex @@ -0,0 +1,438 @@ +\begin{comment}
+
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+
+\end{comment}
+
+\subsection{Bivariate Statistics}
+
+\noindent{\bf Description}
+\smallskip
+
+Bivariate statistics are used to quantitatively describe the association between
+two features in a sample, for example to test their statistical (in-)dependence
+or to measure how accurately one data feature predicts the other.
+The \BivarScriptName{} script computes common bivariate statistics,
+such as \NameStatR{} and \NameStatChi{}, in parallel for many pairs
+of data features. For a given dataset matrix, script \BivarScriptName{} computes
+certain bivariate statistics for the given feature (column) pairs in the
+matrix. The feature types govern the exact set of statistics computed for that pair.
+For example, \NameStatR{} can only be computed on two quantitative (scale) +features like `Height' and `Temperature'. +It does not make sense to compute the linear correlation of two categorical attributes +like `Hair Color'. + + +\smallskip +\noindent{\bf Usage} +\smallskip + +{\hangindent=\parindent\noindent\it%\tolerance=0 +{\tt{}-f }path/\/\BivarScriptName{} +{\tt{} -nvargs} +{\tt{} X=}path/file +{\tt{} index1=}path/file +{\tt{} index2=}path/file +{\tt{} types1=}path/file +{\tt{} types2=}path/file +{\tt{} OUTDIR=}path +% {\tt{} fmt=}format + +} + + +\smallskip +\noindent{\bf Arguments} +\begin{Description} +\item[{\tt X}:] +Location (on HDFS) to read the data matrix $X$ whose columns are the features +that we want to compare and correlate with bivariate statistics. +\item[{\tt index1}:] % (default:\mbox{ }{\tt " "}) +Location (on HDFS) to read the single-row matrix that lists the column indices +of the \emph{first-argument} features in pairwise statistics. +Its $i^{\textrm{th}}$ entry (i.e.\ $i^{\textrm{th}}$ column-cell) contains the +index $k$ of column \texttt{X[,$\,k$]} in the data matrix +whose bivariate statistics need to be computed. +% The default value means ``use all $X$-columns from the first to the last.'' +\item[{\tt index2}:] % (default:\mbox{ }{\tt " "}) +Location (on HDFS) to read the single-row matrix that lists the column indices +of the \emph{second-argument} features in pairwise statistics. +Its $j^{\textrm{th}}$ entry (i.e.\ $j^{\textrm{th}}$ column-cell) contains the +index $l$ of column \texttt{X[,$\,l$]} in the data matrix +whose bivariate statistics need to be computed. +% The default value means ``use all $X$-columns from the first to the last.'' +\item[{\tt types1}:] % (default:\mbox{ }{\tt " "}) +Location (on HDFS) to read the single-row matrix that lists the \emph{types} +of the \emph{first-argument} features in pairwise statistics. 
+Its $i^{\textrm{th}}$ entry (i.e.\ $i^{\textrm{th}}$ column-cell) contains the type +of column \texttt{X[,$\,k$]} in the data matrix, where $k$ is the $i^{\textrm{th}}$ +entry in the {\tt index1} matrix. Feature types must be encoded by +integer numbers: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal. +% The default value means ``treat all referenced $X$-columns as scale.'' +\item[{\tt types2}:] % (default:\mbox{ }{\tt " "}) +Location (on HDFS) to read the single-row matrix that lists the \emph{types} +of the \emph{second-argument} features in pairwise statistics. +Its $j^{\textrm{th}}$ entry (i.e.\ $j^{\textrm{th}}$ column-cell) contains the type +of column \texttt{X[,$\,l$]} in the data matrix, where $l$ is the $j^{\textrm{th}}$ +entry in the {\tt index2} matrix. Feature types must be encoded by +integer numbers: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal. +% The default value means ``treat all referenced $X$-columns as scale.'' +\item[{\tt OUTDIR}:] +Location path (on HDFS) where the output matrices with computed bivariate +statistics will be stored. The matrices' file names and format are defined +in Table~\ref{table:bivars}. +% \item[{\tt fmt}:] (default:\mbox{ }{\tt "text"}) +% Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv}; +% see read/write functions in SystemML Language Reference for details. 
+\end{Description}
+
+\begin{table}[t]\hfil
+\begin{tabular}{|lll|}
+\hline\rule{0pt}{12pt}%
+Output File / Matrix & Row$\,$\# & Name of Statistic \\[2pt]
+\hline\hline\rule{0pt}{12pt}%
+\emph{All Files} & 1 & 1-st feature column \\
+\rule{1em}{0pt}" & 2 & 2-nd feature column \\[2pt]
+\hline\rule{0pt}{12pt}%
+bivar.scale.scale.stats & 3 & \NameStatR \\[2pt]
+\hline\rule{0pt}{12pt}%
+bivar.nominal.nominal.stats & 3 & \NameStatChi \\
+\rule{1em}{0pt}" & 4 & Degrees of freedom \\
+\rule{1em}{0pt}" & 5 & \NameStatPChi \\
+\rule{1em}{0pt}" & 6 & \NameStatV \\[2pt]
+\hline\rule{0pt}{12pt}%
+bivar.nominal.scale.stats & 3 & \NameStatEta \\
+\rule{1em}{0pt}" & 4 & \NameStatF \\[2pt]
+\hline\rule{0pt}{12pt}%
+bivar.ordinal.ordinal.stats & 3 & \NameStatRho \\[2pt]
+\hline
+\end{tabular}\hfil
+\caption{%
+The output matrices of \BivarScriptName{} have one row per bivariate
+statistic and one column per pair of input features. This table lists
+the meaning of each matrix and each row.%
+% Signs ``+'' show applicability to scale or/and to categorical features.
+}
+\label{table:bivars}
+\end{table}
+
+
+
+\pagebreak[2]
+
+\noindent{\bf Details}
+\smallskip
+
+Script \BivarScriptName{} takes an input matrix \texttt{X} whose columns represent
+the features and whose rows represent the records of a data sample.
+Given \texttt{X}, the script computes certain relevant bivariate statistics
+for specified pairs of feature columns \texttt{X[,$\,i$]} and \texttt{X[,$\,j$]}.
+Command-line parameters \texttt{index1} and \texttt{index2} specify the files with
+the column pairs of interest to the user. Namely, the file given by \texttt{index1}
+contains the vector of the 1st-attribute column indices and the file given
+by \texttt{index2} has the vector of the 2nd-attribute column indices, with
+``1st'' and ``2nd'' referring to their places in bivariate statistics.
+Note that both the \texttt{index1} and \texttt{index2} files should contain a 1-row matrix
+of positive integers.
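To make the dispatch concrete, here is a small illustrative sketch (plain Python with the type codes from the Arguments section; it is not part of the DML script) of how a pair of feature types selects one of the output matrices listed in Table~\ref{table:bivars}:

```python
# Hypothetical sketch: map a pair of feature types to the output group
# described in the table of output files; not the actual DML implementation.
SCALE, NOMINAL, ORDINAL = 1, 2, 3  # type encoding used by types1/types2

def stat_group(t1, t2):
    """Return the output matrix that a column pair with types (t1, t2)
    contributes to, with 'ordinal' treated as 'nominal' in mixed pairs."""
    if t1 == SCALE and t2 == SCALE:
        return "bivar.scale.scale.stats"      # Pearson's r
    if SCALE in (t1, t2):
        return "bivar.nominal.scale.stats"    # Eta, F
    if t1 == ORDINAL and t2 == ORDINAL:
        return "bivar.ordinal.ordinal.stats"  # Spearman's rho
    return "bivar.nominal.nominal.stats"      # chi^2, its p-value, Cramer's V

print(stat_group(SCALE, SCALE))      # bivar.scale.scale.stats
print(stat_group(NOMINAL, ORDINAL))  # bivar.nominal.nominal.stats
```
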
+
+The bivariate statistics to be computed depend on the \emph{types}, or
+\emph{measurement levels}, of the two columns.
+The types for each pair are provided in the files whose locations are specified by
+the \texttt{types1} and \texttt{types2} command-line parameters.
+These files are also 1-row matrices, i.e.\ vectors, that list the 1st-attribute and
+the 2nd-attribute column types in the same order as their indices in the
+\texttt{index1} and \texttt{index2} files. The types must be provided as per
+the following convention: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
+
+The script organizes its results into (potentially) four output matrices, one for
+each type combination. The type of each bivariate statistic is determined by the
+types of its two argument columns, with ``ordinal'' sometimes treated as ``nominal.''
+Table~\ref{table:bivars} describes what each row
+in each output matrix contains. In particular, the script computes the following
+statistics:
+\begin{Itemize}
+\item For a pair of scale (quantitative) columns, \NameStatR;
+\item For a pair of nominal columns (with finite-sized, fixed, unordered domains),
+the \NameStatChi{} and its p-value;
+\item For a pair of one scale column and one nominal column, \NameStatEta{} and \NameStatF{};
+\item For a pair of ordinal columns (ordered domains depicting ranks), \NameStatRho.
+\end{Itemize}
+Note that, as shown in Table~\ref{table:bivars}, the output matrices contain the
+column indices of the features involved in each statistic.
+Moreover, if an output matrix does not contain
+a value in a certain cell, then that cell should be interpreted as a~$0$
+(sparse matrix representation).
+
+Below we list all bivariate statistics computed by script \BivarScriptName.
+The statistics are collected into several groups by the type of their input
+features.
We refer to the two input features as $v_1$ and $v_2$ unless +specified otherwise; the value pairs are $(v_{1,i}, v_{2,i})$ for $i=1,\ldots,n$, +where $n$ is the number of rows in \texttt{X}, i.e.\ the sample size. + + +\paragraph{Scale-vs-scale statistics.} +Sample statistics that describe association between two quantitative (scale) features. +A scale feature has numerical values, with the natural ordering relation. +\begin{Description} +%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% +\item[\it\NameStatR]: +A measure of linear dependence between two numerical features: +\begin{equation*} +r \,\,=\,\, \frac{\Cov(v_1, v_2)}{\sqrt{\Var v_1 \Var v_2}} +\,\,=\,\, \frac{\sum_{i=1}^n (v_{1,i} - \bar{v}_1) (v_{2,i} - \bar{v}_2)}% +{\sqrt{\sum_{i=1}^n (v_{1,i} - \bar{v}_1)^{2\mathstrut} \cdot \sum_{i=1}^n (v_{2,i} - \bar{v}_2)^{2\mathstrut}}} +\end{equation*} +Commonly denoted by~$r$, correlation ranges between $-1$ and $+1$, reaching ${\pm}1$ when all value +pairs $(v_{1,i}, v_{2,i})$ lie on the same line. Correlation near~0 means that a line is not a good +way to represent the dependence between the two features; however, this does not imply independence. +The sign indicates direction of the linear association: $r > 0$ ($r < 0$) if one feature tends to +linearly increase (decrease) when the other feature increases. Nonlinear association, if present, +may disobey this sign. +\NameStatR{} is symmetric: $r(v_1, v_2) = r(v_2, v_1)$; it does not change if we transform $v_1$ and $v_2$ +to $a + b v_1$ and $c + d v_2$ where $a, b, c, d$ are constants and $b, d > 0$. + +Suppose that we use simple linear regression to represent one feature given the other, say +represent $v_{2,i} \approx \alpha + \beta v_{1,i}$ by selecting $\alpha$ and $\beta$ +to minimize the least-squares error $\sum_{i=1}^n (v_{2,i} - \alpha - \beta v_{1,i})^2$. 
+Then the best error equals +\begin{equation*} +\min_{\alpha, \beta} \,\,\sum_{i=1}^n \big(v_{2,i} - \alpha - \beta v_{1,i}\big)^2 \,\,=\,\, +(1 - r^2) \,\sum_{i=1}^n \big(v_{2,i} - \bar{v}_2\big)^2 +\end{equation*} +In other words, $1\,{-}\,r^2$ is the ratio of the residual sum of squares to +the total sum of squares. Hence, $r^2$ is an accuracy measure of the linear regression. +\end{Description} + + +\paragraph{Nominal-vs-nominal statistics.} +Sample statistics that describe association between two nominal categorical features. +Both features' value domains are encoded with positive integers in arbitrary order: +nominal features do not order their value domains. +\begin{Description} +%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% +\item[\it\NameStatChi]: +A measure of how much the frequencies of value pairs of two categorical features deviate from +statistical independence. Under independence, the probability of every value pair must equal +the product of probabilities of each value in the pair: +$\Prob[a, b] - \Prob[a]\,\Prob[b] = 0$. But we do not know these (hypothesized) probabilities; +we only know the sample frequency counts. Let $n_{a,b}$ be the frequency count of pair +$(a, b)$, let $n_a$ and $n_b$ be the frequency counts of $a$~alone and of $b$~alone. Under +independence, difference $n_{a,b}{/}n - (n_a{/}n)(n_b{/}n)$ is unlikely to be exactly~0 due +to sample randomness, yet it is unlikely to be too far from~0. For some pairs $(a,b)$ it may +deviate from~0 farther than for other pairs. \NameStatChi{}~is an aggregate measure that +combines squares of these differences across all value pairs: +\begin{equation*} +\chi^2 \,\,=\,\, \sum_{a,\,b} \Big(\frac{n_a n_b}{n}\Big)^{-1} \Big(n_{a,b} - \frac{n_a n_b}{n}\Big)^2 +\,=\,\, \sum_{a,\,b} \frac{(O_{a,b} - E_{a,b})^2}{E_{a,b}} +\end{equation*} +where $O_{a,b} = n_{a,b}$ are the \emph{observed} frequencies and $E_{a,b} = (n_a n_b){/}n$ are +the \emph{expected} frequencies for all pairs~$(a,b)$. 
Under independence (plus other standard
+assumptions) the sample~$\chi^2$ closely follows a well-known distribution, making it a basis for
+statistical tests of independence; see~\emph{\NameStatPChi} for details. Note that \NameStatChi{}
+does \emph{not} measure the strength of dependence: even very weak dependence may result in a
+significant deviation from independence if the counts are large enough. Use~\NameStatV{} instead
+to measure the strength of dependence.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Degrees of freedom]:
+An integer parameter required for the interpretation of the \NameStatChi{} measure. Under independence
+(plus other standard assumptions) the sample~$\chi^2$ statistic is approximately distributed as the
+sum of $d$~squares of independent normal random variables with mean~0 and variance~1, where $d$ is
+this integer parameter. For a pair of categorical features such that the $1^{\textrm{st}}$~feature
+has $k_1$ categories and the $2^{\textrm{nd}}$~feature has $k_2$ categories, the number of degrees
+of freedom is $d = (k_1 - 1)(k_2 - 1)$.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it\NameStatPChi]:
+A measure of how likely we would be to observe the current frequencies of value pairs of two categorical
+features assuming their statistical independence. More precisely, it computes the probability that
+the sum of $d$~squares of independent normal random variables with mean~0 and variance~1
+(called the $\chi^2$~distribution with $d$ degrees of freedom) generates a value at least as large
+as the current sample \NameStatChi. The parameter $d$ is the number of \emph{degrees of freedom}; see above.
+Under independence (plus other standard assumptions) the sample \NameStatChi{} closely follows the
+$\chi^2$~distribution and is unlikely to land very far into its tail.
On the other hand, if the
+two features are dependent, their sample \NameStatChi{} becomes arbitrarily large as $n\to\infty$
+and lands extremely far into the tail of the $\chi^2$~distribution given a large enough data sample.
+\NameStatPChi{} returns the tail ``weight'' on the right-hand side of \NameStatChi:
+\begin{equation*}
+P\,\,=\,\, \Prob\big[r \geq \textrm{\NameStatChi} \,\,\big|\,\, r \sim \textrm{the $\chi^2$ distribution}\big]
+\end{equation*}
+Like any probability, $P$ ranges between 0 and~1. If $P\leq 0.05$, the dependence between the two
+features may be considered statistically significant (i.e.\ their independence is considered
+statistically ruled out). For highly dependent features, it is not unusual to have $P\leq 10^{-20}$,
+in which case our script will simply return $P = 0$. Independent features should have
+their $P\geq 0.05$ in about 95\% of cases.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it\NameStatV]:
+A measure of the strength of association, i.e.\ of statistical dependence, between two categorical
+features, conceptually similar to \NameStatR. It divides the observed~\NameStatChi{} by the maximum
+possible~$\chi^2_{\textrm{max}}$ given $n$ and the numbers $k_1, k_2$~of categories in each feature,
+then takes the square root. Thus, \NameStatV{} ranges from 0 to~1,
+where 0 implies no association and 1 implies the maximum possible association (one-to-one
+correspondence) between the two features.
See \emph{\NameStatChi} for the computation of~$\chi^2$; +its maximum${} = {}$% +$n\cdot\min\{k_1\,{-}\,1, k_2\,{-}\,1\}$ where the $1^{\textrm{st}}$~feature +has $k_1$ categories and the $2^{\textrm{nd}}$~feature has $k_2$ categories~\cite{AcockStavig1979:CramersV}, +so +\begin{equation*} +\textrm{\NameStatV} \,\,=\,\, \sqrt{\frac{\textrm{\NameStatChi}}{n\cdot\min\{k_1\,{-}\,1, k_2\,{-}\,1\}}} +\end{equation*} +As opposed to \NameStatPChi, which goes to~0 (rapidly) as the features' dependence increases, +\NameStatV{} goes towards~1 (slowly) as the dependence increases. Both \NameStatChi{} and +\NameStatPChi{} are very sensitive to~$n$, but in \NameStatV{} this is mitigated by taking the +ratio. +\end{Description} + + +\paragraph{Nominal-vs-scale statistics.} +Sample statistics that describe association between a categorical feature +(order ignored) and a quantitative (scale) feature. +The values of the categorical feature must be coded as positive integers. +\begin{Description} +%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% +\item[\it\NameStatEta]: +A measure for the strength of association (statistical dependence) between a nominal feature +and a scale feature, conceptually similar to \NameStatR. Ranges from 0 to~1, approaching 0 +when there is no association and approaching 1 when there is a strong association. +The nominal feature, treated as the independent variable, is assumed to have relatively few +possible values, all with large frequency counts. The scale feature is treated as the dependent +variable. Denoting the nominal feature by~$x$ and the scale feature by~$y$, we have: +\begin{equation*} +\eta^2 \,=\, 1 - \frac{\sum_{i=1}^{n} \big(y_i - \hat{y}[x_i]\big)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, +\,\,\,\,\textrm{where}\,\,\,\, +\hat{y}[x] = \frac{1}{\mathop{\mathrm{freq}}(x)}\sum_{i=1}^n +\,\left\{\!\!\begin{array}{rl} y_i & \textrm{if $x_i = x$}\\ 0 & \textrm{otherwise}\end{array}\right.\!\!\! 
+\end{equation*}
+and $\bar{y} = (1{/}n)\sum_{i=1}^n y_i$ is the mean. Value $\hat{y}[x]$ is the average
+of~$y_i$ among all records where $x_i = x$; it can also be viewed as the ``predictor''
+of $y$ given~$x$. Then $\sum_{i=1}^{n} (y_i - \hat{y}[x_i])^2$ is the residual error
+sum-of-squares and $\sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum-of-squares for~$y$.
+Hence, $\eta^2$ measures the accuracy of predicting $y$ with~$x$, just like the
+``R-squared'' statistic measures the accuracy of linear regression. Our output $\eta$
+is the square root of~$\eta^2$.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it\NameStatF]:
+A measure of how much the values of the scale feature, denoted here by~$y$,
+deviate from statistical independence of the nominal feature, denoted by~$x$.
+The same measure appears in the one-way analysis of vari\-ance (ANOVA).
+Like \NameStatChi, \NameStatF{} is used to test the hypothesis that
+$y$~is independent of~$x$, given the following assumptions:
+\begin{Itemize}
+\item The scale feature $y$ has an approximately normal distribution whose mean
+may depend only on~$x$ and whose variance is the same for all~$x$;
+\item The nominal feature $x$ has a relatively small value domain with large
+frequency counts; the $x_i$-values are treated as fixed (non-random);
+\item All records are sampled independently of each other.
+\end{Itemize}
+To compute \NameStatF{}, we first compute $\hat{y}[x]$ as the average of~$y_i$
+among all records where $x_i = x$. These $\hat{y}[x]$ can be viewed as
+``predictors'' of $y$ given~$x$; if $y$ is independent of~$x$, they should
+``predict'' only the global mean~$\bar{y}$. Then we form two sums-of-squares:
+\begin{Itemize}
+\item \emph{Residual} sum-of-squares of the ``predictor'' accuracy: $y_i - \hat{y}[x_i]$;
+\item \emph{Explained} sum-of-squares of the ``predictor'' variability: $\hat{y}[x_i] - \bar{y}$.
+\end{Itemize}
+\NameStatF{} is the ratio of the explained sum-of-squares to
+the residual sum-of-squares, each divided by its corresponding degrees
+of freedom:
+\begin{equation*}
+F \,\,=\,\,
+\frac{\sum_{x}\, \mathop{\mathrm{freq}}(x) \, \big(\hat{y}[x] - \bar{y}\big)^2 \,\big/\,\, (k\,{-}\,1)}%
+{\sum_{i=1}^{n} \big(y_i - \hat{y}[x_i]\big)^2 \,\big/\,\, (n\,{-}\,k)} \,\,=\,\,
+\frac{n\,{-}\,k}{k\,{-}\,1} \cdot \frac{\eta^2}{1 - \eta^2}
+\end{equation*}
+Here $k$ is the domain size of the nominal feature~$x$. The $k$ ``predictors''
+lose one degree of freedom due to their linear dependence with~$\bar{y}$; similarly,
+the $n$ values $y_i$ lose $k$ degrees of freedom due to the ``predictors''.
+
+The statistic can be used to test whether the hypothesis that $y$ is independent of $x$ is reasonable;
+more generally (with relaxed normality assumptions) it can test the hypothesis that
+\emph{the mean} of $y$ among records with a given~$x$ is the same for all~$x$.
+Under this hypothesis \NameStatF{} has, or approximates, the $F(k\,{-}\,1, n\,{-}\,k)$-distribution.
+But if the mean of $y$ given $x$ depends on~$x$, \NameStatF{}
+becomes arbitrarily large as $n\to\infty$ (with $k$~fixed) and lands extremely far
+into the tail of the $F(k\,{-}\,1, n\,{-}\,k)$-distribution given a large enough data sample.
+\end{Description}
+
+
+\paragraph{Ordinal-vs-ordinal statistics.}
+Sample statistics that describe association between two ordinal categorical features.
+Both features' value domains are encoded with positive integers, so that the natural
+order of the integers coincides with the order in each value domain.
+\begin{Description}
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it\NameStatRho]:
+A measure of the strength of association (statistical dependence) between
+two ordinal features, conceptually similar to \NameStatR. Specifically, it is \NameStatR{}
+applied to the feature vectors in which all values are replaced by their ranks, i.e.\
+their positions when the vector is sorted.
The ranks of identical (duplicate) values
+are replaced with their average rank. For example, in the vector
+$(15, 11, 26, 15, 8)$ the value ``15'' occurs twice with ranks 3 and~4 per the sorted
+order $(8_1, 11_2, 15_3, 15_4, 26_5)$; so, both values are assigned their average
+rank of $3.5 = (3\,{+}\,4)\,{/}\,2$ and the vector is replaced by~$(3.5,\, 2,\, 5,\, 3.5,\, 1)$.
+
+Our implementation of \NameStatRho{} is geared towards features having small value domains
+and large counts for the values. Given the two input vectors, we form a contingency table $T$
+of pairwise frequency counts, as well as a vector of frequency counts for each feature: $f_1$
+and~$f_2$. Here the indices $i$ and~$j$ in $T_{i,j}$, $f_{1,i}$, and $f_{2,j}$ refer to the
+order-preserving integer encoding of the feature values.
+We use prefix sums over $f_1$ and~$f_2$ to compute the values' average ranks:
+$r_{1,i} = \sum_{j=1}^{i-1} f_{1,j} + (f_{1,i}\,{+}\,1){/}2$, and analogously for~$r_2$.
+Finally, we compute rank variances for $r_1, r_2$ weighted by counts $f_1, f_2$ and their
+covariance weighted by~$T$, before applying the standard formula for \NameStatR:
+\begin{equation*}
+\rho \,\,=\,\, \frac{\Cov_T(r_1, r_2)}{\sqrt{\Var_{f_1}(r_1)\Var_{f_2}(r_2)}}
+\,\,=\,\, \frac{\sum_{i,j} T_{i,j} (r_{1,i} - \bar{r}_1) (r_{2,j} - \bar{r}_2)}%
+{\sqrt{\sum_i f_{1,i} (r_{1,i} - \bar{r}_1)^{2\mathstrut} \cdot \sum_j f_{2,j} (r_{2,j} - \bar{r}_2)^{2\mathstrut}}}
+\end{equation*}
+where $\bar{r}_1 = \sum_i r_{1,i} f_{1,i}{/}n$, and analogously for~$\bar{r}_2$.
+The value of $\rho$ lies between $-1$ and $+1$, with the sign indicating the prevalent direction
+of the association: $\rho > 0$ ($\rho < 0$) means that one feature tends to increase (decrease)
+when the other feature increases. The correlation becomes~${\pm}1$ when the two features are
+monotonically related.
+\end{Description}
+
+
+\smallskip
+\noindent{\bf Returns}
+\smallskip
+
+A collection of (potentially) 4 matrices.
Each matrix contains bivariate statistics that +resulted from a different combination of feature types. There is one matrix for scale-scale +statistics (which includes \NameStatR), one for nominal-nominal statistics (includes \NameStatChi{}), +one for nominal-scale statistics (includes \NameStatF) and one for ordinal-ordinal statistics +(includes \NameStatRho). If any of these matrices is not produced, then no pair of columns required +the corresponding type combination. See Table~\ref{table:bivars} for the matrix naming and +format details. + + +\smallskip +\pagebreak[2] + +\noindent{\bf Examples} +\smallskip + +{\hangindent=\parindent\noindent\tt +\hml -f \BivarScriptName{} -nvargs +X=/user/biadmin/X.mtx +index1=/user/biadmin/S1.mtx +index2=/user/biadmin/S2.mtx +types1=/user/biadmin/K1.mtx +types2=/user/biadmin/K2.mtx +OUTDIR=/user/biadmin/stats.mtx + +} +
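To illustrate the definitions above, the following sketch (plain Python, illustrative only; the actual computation is performed by the DML script) computes \NameStatR{} and \NameStatRho{} directly from their formulas, including the average-rank treatment of ties shown in the example vector $(15, 11, 26, 15, 8)$:

```python
# Illustrative sketch of the formulas in this subsection (plain Python);
# the production computation is done by bivar-stats.dml.
import math

def pearson_r(v1, v2):
    """Pearson's correlation coefficient of two equal-length value lists."""
    n = len(v1)
    m1, m2 = sum(v1) / n, sum(v2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(v1, v2))
    var1 = sum((a - m1) ** 2 for a in v1)
    var2 = sum((b - m2) ** 2 for b in v2)
    return cov / math.sqrt(var1 * var2)

def average_ranks(v):
    """1-based ranks, with tied values replaced by their average rank."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    ranks = [0.0] * len(v)
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
            j += 1  # extend the run of tied values
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def spearman_rho(v1, v2):
    """Spearman's rho: Pearson's r applied to the average ranks."""
    return pearson_r(average_ranks(v1), average_ranks(v2))

print(average_ranks([15, 11, 26, 15, 8]))  # [3.5, 2.0, 5.0, 3.5, 1.0]
```

The printed ranks reproduce the worked example from the ordinal-vs-ordinal section; note that the script itself computes the ranks via prefix sums over frequency counts rather than by sorting, which is equivalent but faster for small value domains.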
http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/alg-ref/DescriptiveStats.tex ---------------------------------------------------------------------- diff --git a/alg-ref/DescriptiveStats.tex b/alg-ref/DescriptiveStats.tex new file mode 100644 index 0000000..5a59ad4 --- /dev/null +++ b/alg-ref/DescriptiveStats.tex @@ -0,0 +1,115 @@ +\begin{comment} + + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. 
+
+\end{comment}
+
+\newcommand{\UnivarScriptName}{\texttt{\tt Univar-Stats.dml}}
+\newcommand{\BivarScriptName}{\texttt{\tt bivar-stats.dml}}
+
+\newcommand{\OutputRowIDMinimum}{1}
+\newcommand{\OutputRowIDMaximum}{2}
+\newcommand{\OutputRowIDRange}{3}
+\newcommand{\OutputRowIDMean}{4}
+\newcommand{\OutputRowIDVariance}{5}
+\newcommand{\OutputRowIDStDeviation}{6}
+\newcommand{\OutputRowIDStErrorMean}{7}
+\newcommand{\OutputRowIDCoeffVar}{8}
+\newcommand{\OutputRowIDQuartiles}{?, 13, ?}
+\newcommand{\OutputRowIDMedian}{13}
+\newcommand{\OutputRowIDIQMean}{14}
+\newcommand{\OutputRowIDSkewness}{9}
+\newcommand{\OutputRowIDKurtosis}{10}
+\newcommand{\OutputRowIDStErrorSkewness}{11}
+\newcommand{\OutputRowIDStErrorCurtosis}{12}
+\newcommand{\OutputRowIDNumCategories}{15}
+\newcommand{\OutputRowIDMode}{16}
+\newcommand{\OutputRowIDNumModes}{17}
+\newcommand{\OutputRowText}[1]{\mbox{(output row~{#1})\hspace{0.5pt}:}}
+
+\newcommand{\NameStatR}{Pearson's correlation coefficient}
+\newcommand{\NameStatChi}{Pearson's~$\chi^2$}
+\newcommand{\NameStatPChi}{$P\textrm{-}$value of Pearson's~$\chi^2$}
+\newcommand{\NameStatV}{Cram\'er's~$V$}
+\newcommand{\NameStatEta}{Eta statistic}
+\newcommand{\NameStatF}{$F$~statistic}
+\newcommand{\NameStatRho}{Spearman's rank correlation coefficient}
+
+Descriptive statistics are used to quantitatively describe the main characteristics of the data.
+They provide meaningful summaries computed over different observations or data records
+collected in a study. These summaries typically form the basis of the initial data exploration
+as part of a more extensive statistical analysis. Such a quantitative analysis assumes that
+every variable (also known as an attribute, feature, or column) in the data has a specific
+\emph{level of measurement}~\cite{Stevens1946:scales}.
+
+The measurement level of a variable, often called the {\bf variable type}, can either be
+\emph{scale} or \emph{categorical}.
A \emph{scale} variable represents data measured on
+an interval or ratio scale. Examples of scale variables include `Height', `Weight',
+`Salary', and `Temperature'. Scale variables are also referred to as \emph{quantitative}
+or \emph{continuous} variables. In contrast, a \emph{categorical} variable has a fixed,
+limited number of distinct values or categories. Examples of categorical variables
+include `Gender', `Region', `Hair color', `Zipcode', and `Level of Satisfaction'.
+Categorical variables can further be classified into two types, \emph{nominal} and
+\emph{ordinal}, depending on whether the categories in the variable can be ordered via an
+intrinsic ranking. For example, there is no meaningful ranking among distinct values in
+the `Hair color' variable, while the categories in `Level of Satisfaction' can be ranked from
+highly dissatisfied to highly satisfied.
+
+The input dataset for descriptive statistics is provided in the form of a matrix, whose
+rows are the records (data points) and whose columns are the features (i.e.~variables).
+Some scripts allow this matrix to be vertically split into two or three matrices. Descriptive
+statistics are computed over the specified features (columns) in the matrix. Which
+statistics are computed depends on the types of the features. It is important to keep
+in mind the following caveats and restrictions:
+\begin{Enumerate}
+\item Given a finite set of data records, i.e.~a \emph{sample}, we take their feature
+values and compute their \emph{sample statistics}. These statistics
+will vary from sample to sample even if the underlying distribution of feature values
+remains the same. Sample statistics are accurate for the given sample only.
+If the goal is to estimate the \emph{distribution statistics} that are parameters of
+the (hypothesized) underlying distribution of the features, the corresponding sample
+statistics may sometimes be used as approximations, but their accuracy will vary.
+\item In particular, the accuracy of the estimated distribution statistics will be low +if the number of values in the sample is small. That is, for small samples, the computed +statistics may depend on the randomness of the individual sample values more than on +the underlying distribution of the features. +\item The accuracy will also be low if the sample records cannot be assumed mutually +independent and identically distributed (i.i.d.), that is, sampled at random from the +same underlying distribution. In practice, feature values in one record often depend +on other features and other records, including unknown ones. +\item Most of the computed statistics will have low estimation accuracy in the presence of +extreme values (outliers) or if the underlying distribution has heavy tails, for example +obeys a power law. However, a few of the computed statistics, such as the median and +\NameStatRho{}, are \emph{robust} to outliers. +\item Some sample statistics are reported with their \emph{sample standard errors} +in an attempt to quantify their accuracy as distribution parameter estimators. But these +sample standard errors, in turn, only estimate the underlying distribution's standard +errors and will have low accuracy for small or \mbox{non-i.i.d.} samples, outliers in samples, +or heavy-tailed distributions. +\item We assume that the quantitative (scale) feature columns do not contain missing +values, infinite values, \texttt{NaN}s, or coded non-numeric values, unless otherwise +specified. We assume that each categorical feature column contains positive integers +from 1 to the number of categories; for ordinal features, the natural order on +the integers should coincide with the order on the categories. 
+\end{Enumerate} + +\input{DescriptiveUnivarStats} + +\input{DescriptiveBivarStats} + +\input{DescriptiveStratStats} http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/alg-ref/DescriptiveStratStats.tex ---------------------------------------------------------------------- diff --git a/alg-ref/DescriptiveStratStats.tex b/alg-ref/DescriptiveStratStats.tex new file mode 100644 index 0000000..be0cffd --- /dev/null +++ b/alg-ref/DescriptiveStratStats.tex @@ -0,0 +1,306 @@ +\begin{comment} + + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. + +\end{comment} + +\subsection{Stratified Bivariate Statistics} + +\noindent{\bf Description} +\smallskip + +The {\tt stratstats.dml} script computes common bivariate statistics, such +as correlation, slope, and their p-value, in parallel for many pairs of input +variables in the presence of a confounding categorical variable. The values +of this confounding variable group the records into strata (subpopulations), +in which all bivariate pairs are assumed free of confounding. The script +uses the same data model as in one-way analysis of covariance (ANCOVA), with +strata representing population samples. It also outputs univariate stratified +and bivariate unstratified statistics. 
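The confounding effect that stratification corrects can be checked numerically; the sketch below (plain Python, using the artificial sales data of Table~\ref{table:stratexample}) reproduces the per-month promotion effects of $+0.1$ and the reversed pooled effect of $-0.5$:

```python
# Numeric check of the stratification example (artificial sales data from
# the table below); illustrative only, not part of stratstats.dml.
months = {  # month -> {promo flag: (customers in millions, avg sales per 1000)}
    "Oct": {0: (0.6, 0.4), 1: (1.4, 0.5)},
    "Nov": {0: (1.4, 0.9), 1: (0.6, 1.0)},
    "Dec": {0: (3.0, 2.5), 1: (1.0, 2.6)},
}

# Within every stratum (month), the promotion effect on avg sales is positive:
for m, d in months.items():
    print(m, round(d[1][1] - d[0][1], 2))  # +0.1 in every month

def pooled_avg(promo):
    """Customer-weighted average sales over the quarter for one promo flag."""
    customers = sum(d[promo][0] for d in months.values())
    sales = sum(d[promo][0] * d[promo][1] for d in months.values())
    return sales / customers

effect = pooled_avg(1) - pooled_avg(0)  # 1.3 - 1.8
print(round(effect, 1))  # -0.5: pooling reverses and amplifies the effect
```
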
+
+\begin{table}[t]\hfil
+\begin{tabular}{|l|ll|ll|ll||ll|}
+\hline
+Month of the year & \multicolumn{2}{l|}{October} & \multicolumn{2}{l|}{November} &
+  \multicolumn{2}{l||}{December} & \multicolumn{2}{c|}{Oct$\,$--$\,$Dec} \\
+Customers, millions & 0.6 & 1.4 & 1.4 & 0.6 & 3.0 & 1.0 & 5.0 & 3.0 \\
+\hline
+Promotion (0 or 1) & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\
+Avg.\ sales per 1000 & 0.4 & 0.5 & 0.9 & 1.0 & 2.5 & 2.6 & 1.8 & 1.3 \\
+\hline
+\end{tabular}\hfil
+\caption{Stratification example: the effect of the promotion on average sales
+becomes reversed and amplified (from $+0.1$ to $-0.5$) if we ignore the months.}
+\label{table:stratexample}
+\end{table}
+
+To see how data stratification mitigates confounding, consider an (artificial)
+example in Table~\ref{table:stratexample}. A highly seasonal retail item
+was marketed with and without a promotion over the final 3~months of the year.
+In each month a sale was more likely with the promotion than without it.
+But during the peak holiday season, when shoppers came in greater numbers and
+bought the item more often, the promotion was less frequently used. As a result,
+if the 4-th quarter data is pooled together, the promotion's effect becomes
+reversed and magnified. Stratifying by month restores the positive correlation.
+
+The script computes its statistics in parallel over all possible pairs from two
+specified sets of covariates. The 1-st covariate is a column in input matrix~$X$
+and the 2-nd covariate is a column in input matrix~$Y$; matrices $X$ and~$Y$ may
+be the same or different. The columns of interest are given by their index numbers
+in special matrices. The stratum column, specified in its own matrix, is the same
+for all covariate pairs.
+
+Both covariates in each pair must be numerical, with the 2-nd covariate normally
+distributed given the 1-st covariate (see~Details). Missing covariate values or
+strata are represented by~``NaN''.
Records with NaN's are selectively omitted +wherever their NaN's are material to the output statistic. + +\smallskip +\pagebreak[3] + +\noindent{\bf Usage} +\smallskip + +{\hangindent=\parindent\noindent\it% +{\tt{}-f }path/\/{\tt{}stratstats.dml} +{\tt{} -nvargs} +{\tt{} X=}path/file +{\tt{} Xcid=}path/file +{\tt{} Y=}path/file +{\tt{} Ycid=}path/file +{\tt{} S=}path/file +{\tt{} Scid=}int +{\tt{} O=}path/file +{\tt{} fmt=}format + +} + + +\smallskip +\noindent{\bf Arguments} +\begin{Description} +\item[{\tt X}:] +Location (on HDFS) to read matrix $X$ whose columns we want to use as +the 1-st covariate (i.e.~as the feature variable) +\item[{\tt Xcid}:] (default:\mbox{ }{\tt " "}) +Location to read the single-row matrix that lists all index numbers +of the $X$-columns used as the 1-st covariate; the default value means +``use all $X$-columns'' +\item[{\tt Y}:] (default:\mbox{ }{\tt " "}) +Location to read matrix $Y$ whose columns we want to use as the 2-nd +covariate (i.e.~as the response variable); the default value means +``use $X$ in place of~$Y$'' +\item[{\tt Ycid}:] (default:\mbox{ }{\tt " "}) +Location to read the single-row matrix that lists all index numbers +of the $Y$-columns used as the 2-nd covariate; the default value means +``use all $Y$-columns'' +\item[{\tt S}:] (default:\mbox{ }{\tt " "}) +Location to read matrix $S$ that has the stratum column. +Note: the stratum column must contain small positive integers; all fractional +values are rounded; stratum IDs of value ${\leq}\,0$ or NaN are treated as +missing. The default value for {\tt S} means ``use $X$ in place of~$S$'' +\item[{\tt Scid}:] (default:\mbox{ }{\tt 1}) +The index number of the stratum column in~$S$ +\item[{\tt O}:] +Location to store the output matrix defined in Table~\ref{table:stratoutput} +\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"}) +Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv}; +see read/write functions in SystemML Language Reference for details. 
+\end{Description} + + +\begin{table}[t]\small\hfil +\begin{tabular}{|rcl|rcl|} +\hline +& Col.\# & Meaning & & Col.\# & Meaning \\ +\hline +\multirow{9}{*}{\begin{sideways}1-st covariate\end{sideways}}\hspace{-1em} +& 01 & $X$-column number & +\multirow{9}{*}{\begin{sideways}2-nd covariate\end{sideways}}\hspace{-1em} +& 11 & $Y$-column number \\ +& 02 & presence count for $x$ & +& 12 & presence count for $y$ \\ +& 03 & global mean $(x)$ & +& 13 & global mean $(y)$ \\ +& 04 & global std.\ dev. $(x)$ & +& 14 & global std.\ dev. $(y)$ \\ +& 05 & stratified std.\ dev. $(x)$ & +& 15 & stratified std.\ dev. $(y)$ \\ +& 06 & $R^2$ for $x \sim {}$strata & +& 16 & $R^2$ for $y \sim {}$strata \\ +& 07 & adjusted $R^2$ for $x \sim {}$strata & +& 17 & adjusted $R^2$ for $y \sim {}$strata \\ +& 08 & p-value, $x \sim {}$strata & +& 18 & p-value, $y \sim {}$strata \\ +& 09--10 & reserved & +& 19--20 & reserved \\ +\hline +\multirow{9}{*}{\begin{sideways}$y\sim x$, NO strata\end{sideways}}\hspace{-1.15em} +& 21 & presence count $(x, y)$ & +\multirow{10}{*}{\begin{sideways}$y\sim x$ AND strata$\!\!\!\!$\end{sideways}}\hspace{-1.15em} +& 31 & presence count $(x, y, s)$ \\ +& 22 & regression slope & +& 32 & regression slope \\ +& 23 & regres.\ slope std.\ dev. & +& 33 & regres.\ slope std.\ dev. \\ +& 24 & correlation${} = \pm\sqrt{R^2}$ & +& 34 & correlation${} = \pm\sqrt{R^2}$ \\ +& 25 & residual std.\ dev. & +& 35 & residual std.\ dev. 
\\
+& 26 & $R^2$ in $y$ due to $x$ &
+& 36 & $R^2$ in $y$ due to $x$ \\
+& 27 & adjusted $R^2$ in $y$ due to $x$ &
+& 37 & adjusted $R^2$ in $y$ due to $x$ \\
+& 28 & p-value for ``slope = 0'' &
+& 38 & p-value for ``slope = 0'' \\
+& 29 & reserved &
+& 39 & \# strata with ${\geq}\,2$ count \\
+& 30 & reserved &
+& 40 & reserved \\
+\hline
+\end{tabular}\hfil
+\caption{The {\tt stratstats.dml} output matrix has one row for each distinct
+pair of 1-st and 2-nd covariates, and 40 columns with the statistics described
+here.}
+\label{table:stratoutput}
+\end{table}
+
+
+
+
+\noindent{\bf Details}
+\smallskip
+
+Suppose we have $n$ records of format $(i, x, y)$, where $i\in\{1,\ldots, k\}$ is
+a stratum number and $(x, y)$ are two numerical covariates. We want to analyze
+the linear relationship between $y$ and $x$ conditioned on the stratum~$i$.
+Note that $x$, but not~$y$, may represent a categorical variable if we assign a
+numerical value to each category, for example 0 and 1 for two categories.
+
+We assume a linear regression model for~$y$:
+\begin{equation}
+y_{i,j} \,=\, \alpha_i + \beta x_{i,j} + \eps_{i,j}\,, \quad\textrm{where}\,\,\,\,
+\eps_{i,j} \sim \Normal(0, \sigma^2)
+\label{eqn:stratlinmodel}
+\end{equation}
+Here $i = 1\ldots k$ is a stratum number and $j = 1\ldots n_i$ is a record number
+in stratum~$i$; by $n_i$ we denote the number of records available in stratum~$i$.
+The noise term~$\eps_{i,j}$ is assumed to have the same variance in all strata.
+When $n_i\,{>}\,0$, we can estimate the means of $x_{i, j}$ and $y_{i, j}$ in +stratum~$i$ as +\begin{equation*} +\bar{x}_i \,= \Big(\sum\nolimits_{j=1}^{n_i} \,x_{i, j}\Big) / n_i\,;\quad +\bar{y}_i \,= \Big(\sum\nolimits_{j=1}^{n_i} \,y_{i, j}\Big) / n_i +\end{equation*} +If $\beta$ is known, the best estimate for $\alpha_i$ is $\bar{y}_i - \beta \bar{x}_i$, +which gives the prediction error sum-of-squares of +\begin{equation} +\sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - \beta x_{i,j} - (\bar{y}_i - \beta \bar{x}_i)\big)^2 +\,\,=\,\, \beta^{2\,}V_x \,-\, 2\beta \,V_{x,y} \,+\, V_y +\label{eqn:stratsumsq} +\end{equation} +where $V_x$, $V_y$, and $V_{x, y}$ are, correspondingly, the ``stratified'' sample +estimates of variance $\Var(x)$ and $\Var(y)$ and covariance $\Cov(x,y)$ computed as +\begin{align*} +V_x \,&=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(x_{i,j} - \bar{x}_i\big)^2; \quad +V_y \,=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - \bar{y}_i\big)^2;\\ +V_{x,y} \,&=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(x_{i,j} - \bar{x}_i\big)\big(y_{i,j} - \bar{y}_i\big) +\end{align*} +They are stratified because we compute the sample (co-)variances in each stratum~$i$ +separately, then combine by summation. The stratified estimates for $\Var(X)$ and $\Var(Y)$ +tend to be smaller than the non-stratified ones (with the global mean instead of $\bar{x}_i$ +and~$\bar{y}_i$) since $\bar{x}_i$ and $\bar{y}_i$ fit closer to $x_{i,j}$ and $y_{i,j}$ +than the global means. The stratified variance estimates the uncertainty in $x_{i,j}$ +and~$y_{i,j}$ given their stratum~$i$. 
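As a sanity check on these definitions, the stratified sums of squares and products can be sketched in Python (a hypothetical helper, not the {\tt stratstats.dml} implementation); it also forms the ratio $V_{x,y}/V_x$, which turns out to be the slope estimate~$\hat{\beta}$:

```python
from collections import defaultdict

def stratified_moments(records):
    """records: iterable of (stratum_id, x, y) triples.
    Returns (Vx, Vy, Vxy, slope), where Vx, Vy, Vxy are the per-stratum
    centered sums of squares/products and slope = Vxy / Vx."""
    groups = defaultdict(list)
    for i, x, y in records:
        groups[i].append((x, y))
    Vx = Vy = Vxy = 0.0
    for points in groups.values():
        n_i = len(points)
        mx = sum(x for x, _ in points) / n_i   # stratum mean of x
        my = sum(y for _, y in points) / n_i   # stratum mean of y
        Vx += sum((x - mx) ** 2 for x, _ in points)
        Vy += sum((y - my) ** 2 for _, y in points)
        Vxy += sum((x - mx) * (y - my) for x, y in points)
    return Vx, Vy, Vxy, Vxy / Vx

# Two strata with very different intercepts alpha_i but the same slope 1:
data = [(1, 0.0, 0.0), (1, 1.0, 1.0), (2, 0.0, 10.0), (2, 1.0, 11.0)]
Vx, Vy, Vxy, slope = stratified_moments(data)   # slope == 1.0
```

Centering within each stratum removes the intercepts $\alpha_i$, so the wide gap between the two strata does not inflate $V_x$, $V_y$, or $V_{x,y}$.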
+ +Minimizing over~$\beta$ the error sum-of-squares~(\ref{eqn:stratsumsq}) +gives us the regression slope estimate \mbox{$\hat{\beta} = V_{x,y} / V_x$}, +with~(\ref{eqn:stratsumsq}) becoming the residual sum-of-squares~(RSS): +\begin{equation*} +\mathrm{RSS} \,\,=\, \, +\sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - +\hat{\beta} x_{i,j} - (\bar{y}_i - \hat{\beta} \bar{x}_i)\big)^2 +\,\,=\,\, V_y \,\big(1 \,-\, V_{x,y}^2 / (V_x V_y)\big) +\end{equation*} +The quantity $\hat{R}^2 = V_{x,y}^2 / (V_x V_y)$, called \emph{$R$-squared}, estimates the fraction +of stratified variance in~$y_{i,j}$ explained by covariate $x_{i, j}$ in the linear +regression model~(\ref{eqn:stratlinmodel}). We define \emph{stratified correlation} as the +square root of~$\hat{R}^2$ taken with the sign of~$V_{x,y}$. We also use RSS to estimate +the residual standard deviation $\sigma$ in~(\ref{eqn:stratlinmodel}) that models the prediction error +of $y_{i,j}$ given $x_{i,j}$ and the stratum: +\begin{equation*} +\hat{\beta}\, =\, \frac{V_{x,y}}{V_x}; \,\,\,\, \hat{R} \,=\, \frac{V_{x,y}}{\sqrt{V_x V_y}}; +\,\,\,\, \hat{R}^2 \,=\, \frac{V_{x,y}^2}{V_x V_y}; +\,\,\,\, \hat{\sigma} \,=\, \sqrt{\frac{\mathrm{RSS}}{n - k - 1}}\,\,\,\, +\Big(n = \sum_{i=1}^k n_i\Big) +\end{equation*} + +The $t$-test and the $F$-test for the null-hypothesis of ``$\beta = 0$'' are +obtained by considering the effect of $\hat{\beta}$ on the residual sum-of-squares, +measured by the decrease from $V_y$ to~RSS. +The $F$-statistic is the ratio of the ``explained'' sum-of-squares +to the residual sum-of-squares, divided by their corresponding degrees of freedom. 
+There are $n\,{-}\,k$ degrees of freedom for~$V_y$, parameter $\beta$ reduces that
+to $n\,{-}\,k\,{-}\,1$ for~RSS, and their difference $V_y - {}$RSS has just 1 degree
+of freedom:
+\begin{equation*}
+F \,=\, \frac{(V_y - \mathrm{RSS})/1}{\mathrm{RSS}/(n\,{-}\,k\,{-}\,1)}
+\,=\, \frac{\hat{R}^2\,(n\,{-}\,k\,{-}\,1)}{1-\hat{R}^2}; \quad
+t \,=\, \hat{R}\, \sqrt{\frac{n\,{-}\,k\,{-}\,1}{1-\hat{R}^2}}.
+\end{equation*}
+The $t$-statistic is simply the square root of the $F$-statistic with the appropriate
+choice of sign. If the null hypothesis and the linear model are both true, the $t$-statistic
+has Student $t$-distribution with $n\,{-}\,k\,{-}\,1$ degrees of freedom. We can
+also compute it if we divide $\hat{\beta}$ by its estimated standard deviation:
+\begin{equation*}
+\stdev(\hat{\beta})_{\mathrm{est}} \,=\, \hat{\sigma}\,/\sqrt{V_x} \quad\Longrightarrow\quad
+t \,=\, \hat{R}\sqrt{V_y} \,/\, \hat{\sigma} \,=\, \hat{\beta} \,/\, \stdev(\hat{\beta})_{\mathrm{est}}
+\end{equation*}
+The standard deviation estimate for~$\hat{\beta}$ is included in the {\tt stratstats.dml} output.
+
+\smallskip
+\noindent{\bf Returns}
+\smallskip
+
+The output matrix format is defined in Table~\ref{table:stratoutput}.
+
+\smallskip
+\noindent{\bf Examples}
+\smallskip
+
+{\hangindent=\parindent\noindent\tt
+\hml -f stratstats.dml -nvargs X=/user/biadmin/X.mtx Xcid=/user/biadmin/Xcid.mtx
+  Y=/user/biadmin/Y.mtx Ycid=/user/biadmin/Ycid.mtx S=/user/biadmin/S.mtx Scid=2
+  O=/user/biadmin/Out.mtx fmt=csv
+
+}
+{\hangindent=\parindent\noindent\tt
+\hml -f stratstats.dml -nvargs X=/user/biadmin/Data.mtx Xcid=/user/biadmin/Xcid.mtx
+  Ycid=/user/biadmin/Ycid.mtx Scid=7 O=/user/biadmin/Out.mtx
+
+}
+
+%\smallskip
+%\noindent{\bf See Also}
+%\smallskip
+%
+%For non-stratified bivariate statistics with a wider variety of input data types
+%and statistical tests, see \ldots. For general linear regression, see
+%{\tt LinearRegDS.dml} and {\tt LinearRegCG.dml}.
%For logistic regression, appropriate
%when the response variable is categorical, see {\tt MultiLogReg.dml}.
+

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/alg-ref/DescriptiveUnivarStats.tex
----------------------------------------------------------------------
diff --git a/alg-ref/DescriptiveUnivarStats.tex b/alg-ref/DescriptiveUnivarStats.tex
new file mode 100644
index 0000000..5838e3e
--- /dev/null
+++ b/alg-ref/DescriptiveUnivarStats.tex
@@ -0,0 +1,603 @@
+
+\subsection{Univariate Statistics}
+
+\noindent{\bf Description}
+\smallskip
+
+\emph{Univariate statistics} are the simplest form of descriptive statistics in data
+analysis. They are used to quantitatively describe the main characteristics of each
+feature in the data. For a given dataset matrix, script \UnivarScriptName{} computes
+certain univariate statistics for each feature column in the
+matrix. The feature type governs the exact set of statistics computed for that feature.
+For example, the statistic \emph{mean} can only be computed on a quantitative (scale)
+feature such as `Height' or `Temperature'.
It does not make sense to compute the mean +of a categorical attribute like `Hair Color'. + + +\smallskip +\noindent{\bf Usage} +\smallskip + +{\hangindent=\parindent\noindent\it%\tolerance=0 +{\tt{}-f } \UnivarScriptName{} +{\tt{} -nvargs} +{\tt{} X=}path/file +{\tt{} TYPES=}path/file +{\tt{} STATS=}path/file +% {\tt{} fmt=}format + +} + + +\medskip +\pagebreak[2] +\noindent{\bf Arguments} +\begin{Description} +\item[{\tt X}:] +Location (on HDFS) to read the data matrix $X$ whose columns we want to +analyze as the features. +\item[{\tt TYPES}:] % (default:\mbox{ }{\tt " "}) +Location (on HDFS) to read the single-row matrix whose $i^{\textrm{th}}$ +column-cell contains the type of the $i^{\textrm{th}}$ feature column +\texttt{X[,$\,i$]} in the data matrix. Feature types must be encoded by +integer numbers: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal. +% The default value means ``treat all $X$-columns as scale.'' +\item[{\tt STATS}:] +Location (on HDFS) where the output matrix of computed statistics +will be stored. The format of the output matrix is defined by +Table~\ref{table:univars}. +% \item[{\tt fmt}:] (default:\mbox{ }{\tt "text"}) +% Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv}; +% see read/write functions in SystemML Language Reference for details. 
+\end{Description}
+
+\begin{table}[t]\hfil
+\begin{tabular}{|rl|c|c|}
+\hline
+\multirow{2}{*}{Row}& \multirow{2}{*}{Name of Statistic} & \multicolumn{2}{c|}{Applies to:} \\
+ & & Scale & Categ.\\
+\hline
+\OutputRowIDMinimum & Minimum & + & \\
+\OutputRowIDMaximum & Maximum & + & \\
+\OutputRowIDRange & Range & + & \\
+\OutputRowIDMean & Mean & + & \\
+\OutputRowIDVariance & Variance & + & \\
+\OutputRowIDStDeviation & Standard deviation & + & \\
+\OutputRowIDStErrorMean & Standard error of mean & + & \\
+\OutputRowIDCoeffVar & Coefficient of variation & + & \\
+\OutputRowIDSkewness & Skewness & + & \\
+\OutputRowIDKurtosis & Kurtosis & + & \\
+\OutputRowIDStErrorSkewness & Standard error of skewness & + & \\
+\OutputRowIDStErrorCurtosis & Standard error of kurtosis & + & \\
+\OutputRowIDMedian & Median & + & \\
+\OutputRowIDIQMean & Interquartile mean & + & \\
+\OutputRowIDNumCategories & Number of categories & & + \\
+\OutputRowIDMode & Mode & & + \\
+\OutputRowIDNumModes & Number of modes & & + \\
+\hline
+\end{tabular}\hfil
+\caption{The output matrix of \UnivarScriptName{} has one row for each
+univariate statistic and one column per input feature. This table lists
+the meaning of each row. Signs ``+'' show applicability to scale and/or
+categorical features.}
+\label{table:univars}
+\end{table}
+
+
+\pagebreak[1]
+
+\smallskip
+\noindent{\bf Details}
+\smallskip
+
+Given an input matrix \texttt{X}, this script computes the set of all
+relevant univariate statistics for each feature column \texttt{X[,$\,i$]}
+in~\texttt{X}. The list of statistics to be computed depends on the
+\emph{type}, or \emph{measurement level}, of each column.
+The \texttt{TYPES} command-line argument points to a vector containing
+the types of all columns. The types must be provided as per the following
+convention: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
+
+Below we list all univariate statistics computed by script \UnivarScriptName.
+The statistics are collected by relevance into several groups, namely: central
+tendency, dispersion, shape, and categorical measures. The first three groups
+contain statistics computed for a quantitative (also known as: numerical, scale,
+or continuous) feature; the last group contains the statistics for a categorical
+(either nominal or ordinal) feature.
+
+Let~$n$ be the number of data records (rows) with feature values.
+In what follows we fix a column index \texttt{idx} and consider
+sample statistics of feature column \texttt{X[}$\,$\texttt{,}$\,$\texttt{idx]}.
+Let $v = (v_1, v_2, \ldots, v_n)$ be the values of \texttt{X[}$\,$\texttt{,}$\,$\texttt{idx]}
+in their original unsorted order: $v_i = \texttt{X[}i\texttt{,}\,\texttt{idx]}$.
+Let $v^s = (v^s_1, v^s_2, \ldots, v^s_n)$ be the same values in the sorted order,
+preserving duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$.
+
+\paragraph{Central tendency measures.}
+Sample statistics that describe the location of the quantitative (scale) feature
+distribution, representing it with a single value.
+\begin{Description}
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Mean]
+\OutputRowText{\OutputRowIDMean}
+The arithmetic average over a sample of a quantitative feature.
+Computed as the ratio between the sum of values and the number of values:
+$\left(\sum_{i=1}^n v_i\right)\!/n$.
+Example: the mean of sample $\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
+equals~5.2.
+
+Note that the mean is significantly affected by extreme values in the sample
+and may be misleading as a central tendency measure if the feature varies on
+an exponential scale. For example, the mean of $\{$0.01, 0.1, 1.0, 10.0, 100.0$\}$
+is 22.222, greater than all the sample values except the~largest.
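Both numbers above can be reproduced in plain Python (a quick sketch, not the \UnivarScriptName{} implementation):

```python
sample = [2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8]
mean = sum(sample) / len(sample)          # arithmetic average: 5.2

# On an exponential-scale sample the mean exceeds every value but the largest:
skewed = [0.01, 0.1, 1.0, 10.0, 100.0]
skewed_mean = sum(skewed) / len(skewed)   # 22.222
```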
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\begin{figure}[t] +\setlength{\unitlength}{10pt} +\begin{picture}(33,12) +\put( 6.2, 0.0){\small 2.2} +\put(10.2, 0.0){\small 3.2} +\put(12.2, 0.0){\small 3.7} +\put(15.0, 0.0){\small 4.4} +\put(18.6, 0.0){\small 5.3} +\put(20.2, 0.0){\small 5.7} +\put(21.75,0.0){\small 6.1} +\put(23.05,0.0){\small 6.4} +\put(26.2, 0.0){\small 7.2} +\put(28.6, 0.0){\small 7.8} +\put( 0.5, 0.7){\small 0.0} +\put( 0.1, 3.2){\small 0.25} +\put( 0.5, 5.7){\small 0.5} +\put( 0.1, 8.2){\small 0.75} +\put( 0.5,10.7){\small 1.0} +\linethickness{1.5pt} +\put( 2.0, 1.0){\line(1,0){4.8}} +\put( 6.8, 1.0){\line(0,1){1.0}} +\put( 6.8, 2.0){\line(1,0){4.0}} +\put(10.8, 2.0){\line(0,1){1.0}} +\put(10.8, 3.0){\line(1,0){2.0}} +\put(12.8, 3.0){\line(0,1){1.0}} +\put(12.8, 4.0){\line(1,0){2.8}} +\put(15.6, 4.0){\line(0,1){1.0}} +\put(15.6, 5.0){\line(1,0){3.6}} +\put(19.2, 5.0){\line(0,1){1.0}} +\put(19.2, 6.0){\line(1,0){1.6}} +\put(20.8, 6.0){\line(0,1){1.0}} +\put(20.8, 7.0){\line(1,0){1.6}} +\put(22.4, 7.0){\line(0,1){1.0}} +\put(22.4, 8.0){\line(1,0){1.2}} +\put(23.6, 8.0){\line(0,1){1.0}} +\put(23.6, 9.0){\line(1,0){3.2}} +\put(26.8, 9.0){\line(0,1){1.0}} +\put(26.8,10.0){\line(1,0){2.4}} +\put(29.2,10.0){\line(0,1){1.0}} +\put(29.2,11.0){\line(1,0){4.8}} +\linethickness{0.3pt} +\put( 6.8, 1.0){\circle*{0.3}} +\put(10.8, 1.0){\circle*{0.3}} +\put(12.8, 1.0){\circle*{0.3}} +\put(15.6, 1.0){\circle*{0.3}} +\put(19.2, 1.0){\circle*{0.3}} +\put(20.8, 1.0){\circle*{0.3}} +\put(22.4, 1.0){\circle*{0.3}} +\put(23.6, 1.0){\circle*{0.3}} +\put(26.8, 1.0){\circle*{0.3}} +\put(29.2, 1.0){\circle*{0.3}} +\put( 6.8, 1.0){\vector(1,0){27.2}} +\put( 2.0, 1.0){\vector(0,1){10.8}} +\put( 2.0, 3.5){\line(1,0){10.8}} +\put( 2.0, 6.0){\line(1,0){17.2}} +\put( 2.0, 8.5){\line(1,0){21.6}} +\put( 2.0,11.0){\line(1,0){27.2}} +\put(12.8, 1.0){\line(0,1){2.0}} +\put(19.2, 1.0){\line(0,1){5.0}} +\put(20.0, 1.0){\line(0,1){5.0}} +\put(23.6, 
1.0){\line(0,1){7.0}}
+\put( 9.0, 4.0){\line(1,0){3.8}}
+\put( 9.2, 2.7){\vector(0,1){0.8}}
+\put( 9.2, 4.8){\vector(0,-1){0.8}}
+\put(19.4, 8.0){\line(1,0){3.0}}
+\put(19.6, 7.2){\vector(0,1){0.8}}
+\put(19.6, 9.3){\vector(0,-1){0.8}}
+\put(13.0, 2.2){\small $q_{25\%}$}
+\put(17.3, 2.2){\small $q_{50\%}$}
+\put(23.8, 2.2){\small $q_{75\%}$}
+\put(20.15,3.5){\small $\mu$}
+\put( 8.0, 3.75){\small $\phi_1$}
+\put(18.35,7.8){\small $\phi_2$}
+\end{picture}
+\caption{The computation of quartiles, median, and interquartile mean from the
+empirical distribution function of the 10-point
+sample $\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$. Each vertical step in
+the graph has height~$1{/}n = 0.1$. Values $q_{25\%}$, $q_{50\%}$, and $q_{75\%}$ denote
+the $1^{\textrm{st}}$, $2^{\textrm{nd}}$, and $3^{\textrm{rd}}$ quartiles respectively;
+value~$\mu$ denotes the median. Values $\phi_1$ and $\phi_2$ show the partial contribution
+of border points (quartiles) $v_3=3.7$ and $v_8=6.4$ to the interquartile mean.}
+\label{fig:example_quartiles}
+\end{figure}
+
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Median]
+\OutputRowText{\OutputRowIDMedian}
+The ``middle'' value that separates the higher half of the sample values
+(in a sorted order) from the lower half.
+To compute the median, we sort the sample in the increasing order, preserving
+duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$.
+If $n$ is odd, the median equals $v^s_i$ where $i = (n\,{+}\,1)\,{/}\,2$,
+the same as the $50^{\textrm{th}}$~percentile of the sample.
+If $n$ is even, there are two ``middle'' values $v^s_{n/2}$ and $v^s_{n/2\,+\,1}$,
+so we compute the median as the mean of these two values.
+(For even~$n$ we compute the $50^{\textrm{th}}$~percentile as~$v^s_{n/2}$,
+not as the median.) Example: the median of sample
+$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
+equals $(5.3\,{+}\,5.7)\,{/}\,2$~${=}$~5.5, see Figure~\ref{fig:example_quartiles}.
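The odd/even rule above, as a minimal Python sketch (not the script's implementation):

```python
def median(values):
    """Middle value for odd n; mean of the two middle values for even n."""
    vs = sorted(values)                     # v^s, duplicates preserved
    n = len(vs)
    if n % 2 == 1:
        return vs[(n + 1) // 2 - 1]         # v^s_{(n+1)/2}, 1-based index
    return (vs[n // 2 - 1] + vs[n // 2]) / 2.0

sample = [2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8]
# median(sample) gives (5.3 + 5.7) / 2 = 5.5, matching the example above
```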
+
+Unlike the mean, the median is not sensitive to extreme values in the sample,
+i.e.\ it is robust to outliers. It works better as a measure of central tendency
+for heavy-tailed distributions and features that vary on an exponential scale.
+However, the median is sensitive to small sample size.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Interquartile mean]
+\OutputRowText{\OutputRowIDIQMean}
+For a sample of a quantitative feature, this is
+the mean of the values greater than or equal to the $1^{\textrm{st}}$ quartile
+and less than or equal to the $3^{\textrm{rd}}$ quartile.
+In other words, it is a ``truncated mean'' where the lowest 25$\%$ and
+the highest 25$\%$ of the sorted values are omitted in its computation.
+The two ``border values'', i.e.\ the $1^{\textrm{st}}$ and the $3^{\textrm{rd}}$
+quartiles themselves, contribute to this mean only partially.
+This measure is occasionally used as the ``robust'' version of the mean
+that is less sensitive to the extreme values.
+
+To compute the measure, we sort the sample in the increasing order,
+preserving duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$.
+We set $j = \lceil n{/}4 \rceil$ for the $1^{\textrm{st}}$ quartile index
+and $k = \lceil 3n{/}4 \rceil$ for the $3^{\textrm{rd}}$ quartile index,
+then compute the following weighted mean:
+\begin{equation*}
+\frac{1}{3{/}4 - 1{/}4} \left[
+\left(\frac{j}{n} - \frac{1}{4}\right) v^s_j \,\,+
+\sum_{j<i<k} \left(\frac{i}{n} - \frac{i\,{-}\,1}{n}\right) v^s_i
+\,\,+\,\, \left(\frac{3}{4} - \frac{k\,{-}\,1}{n}\right) v^s_k\right]
+\end{equation*}
+In other words, all sample values between the $1^{\textrm{st}}$ and the $3^{\textrm{rd}}$
+quartile enter the sum with weights $2{/}n$, times their number of duplicates, while the
+two quartiles themselves enter the sum with reduced weights.
The weights are proportional
+to the vertical steps in the empirical distribution function of the sample, see
+Figure~\ref{fig:example_quartiles} for an illustration.
+Example: the interquartile mean of sample
+$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ equals the sum
+$0.1 (3.7\,{+}\,6.4) + 0.2 (4.4\,{+}\,5.3\,{+}\,5.7\,{+}\,6.1)$,
+which equals~5.31.
+\end{Description}
+
+
+\paragraph{Dispersion measures.}
+Statistics that describe the amount of variation or spread in a quantitative
+(scale) data feature.
+\begin{Description}
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Variance]
+\OutputRowText{\OutputRowIDVariance}
+A measure of dispersion, or spread, of sample values around their mean,
+expressed in units that are the square of those of the feature itself.
+Computed as the sum of squared differences between the values
+in the sample and their mean, divided by one less than the number of
+values: $\sum_{i=1}^n (v_i - \bar{v})^2\,/\,(n\,{-}\,1)$ where
+$\bar{v}=\left(\sum_{i=1}^n v_i\right)\!/n$.
+Example: the variance of sample
+$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ equals~3.24.
+Note that at least two values ($n\geq 2$) are required to avoid division
+by zero. Sample variance is sensitive to outliers, even more than the mean.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Standard deviation]
+\OutputRowText{\OutputRowIDStDeviation}
+A measure of dispersion around the mean, the square root of the variance.
+Computed by taking the square root of the sample variance;
+see \emph{Variance} above on computing the variance.
+Example: the standard deviation of sample
+$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ equals~1.8.
+At least two values are required to avoid division by zero.
+Note that standard deviation is sensitive to outliers.
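Both dispersion formulas can be checked against the running example with a short Python sketch (not the script's implementation):

```python
sample = [2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8]
n = len(sample)
mean = sum(sample) / n

# Sum of squared deviations divided by n - 1 (the sample variance):
variance = sum((v - mean) ** 2 for v in sample) / (n - 1)   # 3.24
std_dev = variance ** 0.5                                   # 1.8
```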
+ +Standard deviation is used in conjunction with the mean to determine +an interval containing a given percentage of the feature values, +assuming the normal distribution. In a large sample from a normal +distribution, around 68\% of the cases fall within one standard +deviation and around 95\% of cases fall within two standard deviations +of the mean. For example, if the mean age is 45 with a standard deviation +of 10, around 95\% of the cases would be between 25 and 65 in a normal +distribution. +%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% +\item[\it Coefficient of variation] +\OutputRowText{\OutputRowIDCoeffVar} +The ratio of the standard deviation to the mean, i.e.\ the +\emph{relative} standard deviation, of a quantitative feature sample. +Computed by dividing the sample \emph{standard deviation} by the +sample \emph{mean}, see above for their computation details. +Example: the coefficient of variation for sample +$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ +equals 1.8$\,{/}\,$5.2~${\approx}$~0.346. + +This metric is used primarily with non-negative features such as +financial or population data. It is sensitive to outliers. +Note: zero mean causes division by zero, returning infinity or \texttt{NaN}. +At least two values (records) are required to compute the standard deviation. +%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% +\item[\it Minimum] +\OutputRowText{\OutputRowIDMinimum} +The smallest value of a quantitative sample, computed as $\min v = v^s_1$. +Example: the minimum of sample +$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ +equals~2.2. +%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% +\item[\it Maximum] +\OutputRowText{\OutputRowIDMaximum} +The largest value of a quantitative sample, computed as $\max v = v^s_n$. +Example: the maximum of sample +$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ +equals~7.8. 
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% +\item[\it Range] +\OutputRowText{\OutputRowIDRange} +The difference between the largest and the smallest value of a quantitative +sample, computed as $\max v - \min v = v^s_n - v^s_1$. +It provides information about the overall spread of the sample values. +Example: the range of sample +$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ +equals 7.8$\,{-}\,$2.2~${=}$~5.6. +%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% +\item[\it Standard error of the mean] +\OutputRowText{\OutputRowIDStErrorMean} +A measure of how much the value of the sample mean may vary from sample +to sample taken from the same (hypothesized) distribution of the feature. +It helps to roughly bound the distribution mean, i.e.\ +the limit of the sample mean as the sample size tends to infinity. +Under certain assumptions (e.g.\ normality and large sample), the difference +between the distribution mean and the sample mean is unlikely to exceed +2~standard errors. + +The measure is computed by dividing the sample standard deviation +by the square root of the number of values~$n$; see \emph{standard deviation} +for its computation details. Ensure $n\,{\geq}\,2$ to avoid division by~0. +Example: for sample +$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ +with the mean of~5.2 the standard error of the mean +equals 1.8$\,{/}\sqrt{10}$~${\approx}$~0.569. + +Note that the standard error itself is subject to sample randomness. +Its accuracy as an error estimator may be low if the sample size is small +or \mbox{non-i.i.d.}, if there are outliers, or if the distribution has +heavy tails. +%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% +% \item[\it Quartiles] +% \OutputRowText{\OutputRowIDQuartiles} +% %%% dsDefn %%%% +% The values of a quantitative feature +% that divide an ordered/sorted set of data records into four equal-size groups. 
+% The $1^{\textrm{st}}$ quartile, or the $25^{\textrm{th}}$ percentile, splits +% the sorted data into the lowest $25\%$ and the highest~$75\%$. In other words, +% it is the middle value between the minimum and the median. The $2^{\textrm{nd}}$ +% quartile is the median itself, the value that separates the higher half of +% the data (in the sorted order) from the lower half. Finally, the $3^{\textrm{rd}}$ +% quartile, or the $75^{\textrm{th}}$ percentile, divides the sorted data into +% lowest $75\%$ and highest~$25\%$.\par +% %%% dsComp %%%% +% To compute the quartiles for a data column \texttt{X[,i]} with $n$ numerical values +% we sort it in the increasing order, preserving duplicates, then return +% \texttt{X}${}^{\textrm{sort}}$\texttt{[}$k$\texttt{,i]} +% where $k = \lceil pn \rceil$ for $p = 0.25$, $0.5$, and~$0.75$. +% When $n$ is even, the $2^{\textrm{nd}}$ quartile (the median) is further adjusted +% to equal the mean of two middle values +% $\texttt{X}^{\textrm{sort}}\texttt{[}n{/}2\texttt{,i]}$ and +% $\texttt{X}^{\textrm{sort}}\texttt{[}n{/}2\,{+}\,1\texttt{,i]}$. +% %%% dsWarn %%%% +% We assume that the feature column does not contain \texttt{NaN}s or coded non-numeric values. +% %%% dsExmpl %%% +% \textbf{Example(s).} +\end{Description} + + +\paragraph{Shape measures.} +Statistics that describe the shape and symmetry of the quantitative (scale) +feature distribution estimated from a sample of its values. +\begin{Description} +%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% +\item[\it Skewness] +\OutputRowText{\OutputRowIDSkewness} +It measures how symmetrically the values of a feature are spread out +around the mean. A significant positive skewness implies a longer (or fatter) +right tail, i.e. feature values tend to lie farther away from the mean on the +right side. A significant negative skewness implies a longer (or fatter) left +tail. 
The normal distribution is symmetric and has a skewness value of~0;
+however, its sample skewness is likely to be nonzero, though close to zero.
+As a guideline, a skewness value more than twice its standard error is taken
+to indicate a departure from symmetry.
+
+Skewness is computed as the $3^{\textrm{rd}}$~central moment divided by the cube
+of the standard deviation. We estimate the $3^{\textrm{rd}}$~central moment as
+the sum of cubed differences between the values in the feature column and their
+sample mean, divided by the number of values:
+$\sum_{i=1}^n (v_i - \bar{v})^3 / n$
+where $\bar{v}=\left(\sum_{i=1}^n v_i\right)\!/n$.
+The standard deviation is computed
+as described above in \emph{standard deviation}. To avoid division by~0,
+at least two different sample values are required. Example: for sample
+$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
+with the mean of~5.2 and the standard deviation of~1.8,
+skewness is estimated as $-1.0728\,{/}\,1.8^3 \approx -0.184$.
+Note: skewness is sensitive to outliers.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Standard error in skewness]
+\OutputRowText{\OutputRowIDStErrorSkewness}
+A measure of how much the sample skewness may vary from sample to sample,
+assuming that the feature is normally distributed, which makes its
+distribution skewness equal~0.
+Given the number~$n$ of sample values, the standard error is computed as
+\begin{equation*}
+\sqrt{\frac{6n\,(n-1)}{(n-2)(n+1)(n+3)}}
+\end{equation*}
+This measure can tell us, for example:
+\begin{Itemize}
+\item If the sample skewness lands within two standard errors from~0, its
+positive or negative sign is non-significant and may just be accidental.
+\item If the sample skewness lands outside this interval, the feature
+is unlikely to be normally distributed.
+\end{Itemize}
+At least 3~values ($n\geq 3$) are required to avoid arithmetic failure.
+Note that the standard error is inaccurate if the feature distribution is
+far from normal or if the number of samples is small.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Kurtosis]
+\OutputRowText{\OutputRowIDKurtosis}
+As a distribution parameter, kurtosis is a measure of the extent to which
+feature values cluster around a central point. In other words, it quantifies
+``peakedness'' of the distribution: how tall and sharp the central peak is
+relative to a standard bell curve.
+
+Positive kurtosis (\emph{leptokurtic} distribution) indicates that, relative
+to a normal distribution:
+\begin{Itemize}
+\item observations cluster more about the center (peak-shaped),
+\item the tails are thinner at non-extreme values,
+\item the tails are thicker at extreme values.
+\end{Itemize}
+Negative kurtosis (\emph{platykurtic} distribution) indicates that, relative
+to a normal distribution:
+\begin{Itemize}
+\item observations cluster less about the center (box-shaped),
+\item the tails are thicker at non-extreme values,
+\item the tails are thinner at extreme values.
+\end{Itemize}
+Kurtosis of a normal distribution is zero; however, the sample kurtosis
+(computed here) is likely to deviate from zero.
+
+Sample kurtosis is computed as the $4^{\textrm{th}}$~central moment divided
+by the $4^{\textrm{th}}$~power of the standard deviation, minus~3.
+We estimate the $4^{\textrm{th}}$~central moment as the sum of the
+$4^{\textrm{th}}$~powers of differences between the values in the feature column
+and their sample mean, divided by the number of values:
+$\sum_{i=1}^n (v_i - \bar{v})^4 / n$
+where $\bar{v}=\left(\sum_{i=1}^n v_i\right)\!/n$.
+The standard deviation is computed as described above; see \emph{standard deviation}.
+
+Note that kurtosis is sensitive to outliers and requires at least two different
+sample values.
Example: for sample
+$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
+with the mean of~5.2 and the standard deviation of~1.8,
+sample kurtosis equals $16.6962\,{/}\,1.8^4 - 3 \approx -1.41$.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Standard error in kurtosis]
+\OutputRowText{\OutputRowIDStErrorCurtosis}
+A measure of how much the sample kurtosis may vary from sample to sample,
+assuming that the feature is normally distributed, which makes its
+distribution kurtosis equal~0.
+Given the number~$n$ of sample values, the standard error is computed as
+\begin{equation*}
+\sqrt{\frac{24n\,(n-1)^2}{(n-3)(n-2)(n+3)(n+5)}}
+\end{equation*}
+This measure can tell us, for example:
+\begin{Itemize}
+\item If the sample kurtosis lands within two standard errors from~0, its
+positive or negative sign is non-significant and may just be accidental.
+\item If the sample kurtosis lands outside this interval, the feature
+is unlikely to be normally distributed.
+\end{Itemize}
+At least 4~values ($n\geq 4$) are required to avoid arithmetic failure.
+Note that the standard error is inaccurate if the feature distribution is
+far from normal or if the number of samples is small.
+\end{Description}
+
+
+\paragraph{Categorical measures.} Statistics that describe the sample of
+a categorical feature, either nominal or ordinal. We represent all
+categories by integers from~1 to the number of categories; we call
+these integers \emph{category~IDs}.
+\begin{Description}
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Number of categories]
+\OutputRowText{\OutputRowIDNumCategories}
+The maximum category~ID that occurs in the sample. Note that some
+categories with~IDs \emph{smaller} than this maximum~ID may have
+no~occurrences in the sample, without reducing the number of categories.
+However, any categories with~IDs \emph{larger} than the maximum~ID with
+no occurrences in the sample will not be counted.
+Example: in sample $\{$1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8$\}$
+the number of categories is reported as~8. Category~IDs 2 and~6, which have
+zero occurrences, are still counted; but if there is a category with
+ID${}=9$ and zero occurrences, it is not counted.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Mode]
+\OutputRowText{\OutputRowIDMode}
+The most frequently occurring category value.
+If several values share the greatest frequency of occurrence, then each
+of them is a mode; but here we report only the smallest of these modes.
+Example: in sample $\{$1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8$\}$
+the modes are 3 and~7, with 3 reported.
+
+Computed by counting the number of occurrences for each category,
+then taking the smallest category~ID that has the maximum count.
+Note that the sample modes may be different from the distribution modes,
+i.e.\ the categories whose (hypothesized) underlying probability is the
+maximum over all categories.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Number of modes]
+\OutputRowText{\OutputRowIDNumModes}
+The number of category values that each have the largest frequency
+count in the sample.
+Example: in sample $\{$1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8$\}$
+there are two category~IDs (3 and~7), each occurring the maximum count of
+4~times; hence, we return~2.
+
+Computed by counting the number of occurrences for each category,
+then counting how many categories have the maximum count.
+Note that the sample modes may be different from the distribution modes,
+i.e.\ the categories whose (hypothesized) underlying probability is the
+maximum over all categories.
+\end{Description}
+
+
+\smallskip
+\noindent{\bf Returns}
+\smallskip
+
+The output matrix containing all computed statistics is of size $17$~rows and
+as many columns as in the input matrix~\texttt{X}.
Each row corresponds to
+a particular statistic, according to the convention specified in
+Table~\ref{table:univars}. The first $14$~statistics are applicable to
+\emph{scale} columns, and the last $3$~statistics are applicable to categorical,
+i.e.\ nominal and ordinal, columns.
+
+
+\pagebreak[2]
+
+\smallskip
+\noindent{\bf Examples}
+\smallskip
+
+{\hangindent=\parindent\noindent\tt
+\hml -f \UnivarScriptName{} -nvargs X=/user/biadmin/X.mtx
+ TYPES=/user/biadmin/types.mtx
+ STATS=/user/biadmin/stats.mtx
+
+}
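+The worked examples in this section can be cross-checked numerically. The
+following standalone Python sketch is an illustration only (the actual script
+is implemented in DML, and the helper names \texttt{shape\_stats} and
+\texttt{categorical\_stats} are hypothetical); it recomputes the shape and
+categorical measures for the example samples using the conventions stated
+above: central moments divided by~$n$, and the standard deviation computed
+with $n\,{-}\,1$.

```python
import math
from collections import Counter

def shape_stats(v):
    """Shape measures as defined in this section: central moments use 1/n,
    while the standard deviation uses 1/(n-1)."""
    n = len(v)
    mean = sum(v) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in v) / (n - 1))
    m3 = sum((x - mean) ** 3 for x in v) / n   # 3rd central moment
    m4 = sum((x - mean) ** 4 for x in v) / n   # 4th central moment
    skewness = m3 / sd ** 3
    kurtosis = m4 / sd ** 4 - 3
    # Standard errors under the normality assumption, as in the text
    se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = math.sqrt(24 * n * (n - 1) ** 2
                        / ((n - 3) * (n - 2) * (n + 3) * (n + 5)))
    return skewness, se_skew, kurtosis, se_kurt

def categorical_stats(ids):
    """Categorical measures: number of categories (largest observed ID),
    mode (smallest most-frequent ID), and number of modes."""
    counts = Counter(ids)
    num_categories = max(ids)
    max_count = max(counts.values())
    modes = [k for k, c in counts.items() if c == max_count]
    return num_categories, min(modes), len(modes)

sample = [2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8]
skew, se_s, kurt, se_k = shape_stats(sample)
print(round(skew, 3), round(kurt, 2))    # -0.184 -1.41, as in the text

cats = [1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8]
print(categorical_stats(cats))           # (8, 3, 2)
```

+Running the sketch reproduces the values derived by hand above: skewness
+$\approx -0.184$ and kurtosis $\approx -1.41$ for the scale sample, and
+8~categories, mode~3, and 2~modes for the categorical sample.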