http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/alg-ref/DescriptiveBivarStats.tex ---------------------------------------------------------------------- diff --git a/alg-ref/DescriptiveBivarStats.tex b/alg-ref/DescriptiveBivarStats.tex new file mode 100644 index 0000000..a2d3db1 --- /dev/null +++ b/alg-ref/DescriptiveBivarStats.tex @@ -0,0 +1,438 @@ +\begin{comment}
+
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+
+\end{comment}
+
+\subsection{Bivariate Statistics}
+
+\noindent{\bf Description}
+\smallskip
+
+Bivariate statistics are used to quantitatively describe the association between
+two features in a sample, for example to test their statistical (in-)dependence
+or to measure how accurately one data feature predicts the other.
+The \BivarScriptName{} script computes common bivariate statistics,
+such as \NameStatR{} and \NameStatChi{}, in parallel for many pairs
+of data features. For a given dataset matrix, script \BivarScriptName{} computes
+certain bivariate statistics for the given feature (column) pairs in the
+matrix. The feature types govern the exact set of statistics computed for that pair.
+For example, \NameStatR{} can only be computed on two quantitative (scale) +features like `Height' and `Temperature'. +It does not make sense to compute the linear correlation of two categorical attributes +like `Hair Color'. + + +\smallskip +\noindent{\bf Usage} +\smallskip + +{\hangindent=\parindent\noindent\it%\tolerance=0 +{\tt{}-f }path/\/\BivarScriptName{} +{\tt{} -nvargs} +{\tt{} X=}path/file +{\tt{} index1=}path/file +{\tt{} index2=}path/file +{\tt{} types1=}path/file +{\tt{} types2=}path/file +{\tt{} OUTDIR=}path +% {\tt{} fmt=}format + +} + + +\smallskip +\noindent{\bf Arguments} +\begin{Description} +\item[{\tt X}:] +Location (on HDFS) to read the data matrix $X$ whose columns are the features +that we want to compare and correlate with bivariate statistics. +\item[{\tt index1}:] % (default:\mbox{ }{\tt " "}) +Location (on HDFS) to read the single-row matrix that lists the column indices +of the \emph{first-argument} features in pairwise statistics. +Its $i^{\textrm{th}}$ entry (i.e.\ $i^{\textrm{th}}$ column-cell) contains the +index $k$ of column \texttt{X[,$\,k$]} in the data matrix +whose bivariate statistics need to be computed. +% The default value means ``use all $X$-columns from the first to the last.'' +\item[{\tt index2}:] % (default:\mbox{ }{\tt " "}) +Location (on HDFS) to read the single-row matrix that lists the column indices +of the \emph{second-argument} features in pairwise statistics. +Its $j^{\textrm{th}}$ entry (i.e.\ $j^{\textrm{th}}$ column-cell) contains the +index $l$ of column \texttt{X[,$\,l$]} in the data matrix +whose bivariate statistics need to be computed. +% The default value means ``use all $X$-columns from the first to the last.'' +\item[{\tt types1}:] % (default:\mbox{ }{\tt " "}) +Location (on HDFS) to read the single-row matrix that lists the \emph{types} +of the \emph{first-argument} features in pairwise statistics. 
+Its $i^{\textrm{th}}$ entry (i.e.\ $i^{\textrm{th}}$ column-cell) contains the type +of column \texttt{X[,$\,k$]} in the data matrix, where $k$ is the $i^{\textrm{th}}$ +entry in the {\tt index1} matrix. Feature types must be encoded by +integer numbers: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal. +% The default value means ``treat all referenced $X$-columns as scale.'' +\item[{\tt types2}:] % (default:\mbox{ }{\tt " "}) +Location (on HDFS) to read the single-row matrix that lists the \emph{types} +of the \emph{second-argument} features in pairwise statistics. +Its $j^{\textrm{th}}$ entry (i.e.\ $j^{\textrm{th}}$ column-cell) contains the type +of column \texttt{X[,$\,l$]} in the data matrix, where $l$ is the $j^{\textrm{th}}$ +entry in the {\tt index2} matrix. Feature types must be encoded by +integer numbers: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal. +% The default value means ``treat all referenced $X$-columns as scale.'' +\item[{\tt OUTDIR}:] +Location path (on HDFS) where the output matrices with computed bivariate +statistics will be stored. The matrices' file names and format are defined +in Table~\ref{table:bivars}. +% \item[{\tt fmt}:] (default:\mbox{ }{\tt "text"}) +% Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv}; +% see read/write functions in SystemML Language Reference for details. 
+\end{Description}
+
+\begin{table}[t]\hfil
+\begin{tabular}{|lll|}
+\hline\rule{0pt}{12pt}%
+Output File / Matrix & Row$\,$\# & Name of Statistic \\[2pt]
+\hline\hline\rule{0pt}{12pt}%
+\emph{All Files} & 1 & 1-st feature column \\
+\rule{1em}{0pt}" & 2 & 2-nd feature column \\[2pt]
+\hline\rule{0pt}{12pt}%
+bivar.scale.scale.stats & 3 & \NameStatR \\[2pt]
+\hline\rule{0pt}{12pt}%
+bivar.nominal.nominal.stats & 3 & \NameStatChi \\
+\rule{1em}{0pt}" & 4 & Degrees of freedom \\
+\rule{1em}{0pt}" & 5 & \NameStatPChi \\
+\rule{1em}{0pt}" & 6 & \NameStatV \\[2pt]
+\hline\rule{0pt}{12pt}%
+bivar.nominal.scale.stats & 3 & \NameStatEta \\
+\rule{1em}{0pt}" & 4 & \NameStatF \\[2pt]
+\hline\rule{0pt}{12pt}%
+bivar.ordinal.ordinal.stats & 3 & \NameStatRho \\[2pt]
+\hline
+\end{tabular}\hfil
+\caption{%
+The output matrices of \BivarScriptName{} have one row per bivariate
+statistic and one column per pair of input features. This table lists
+the meaning of each matrix and each row.%
+% Signs ``+'' show applicability to scale or/and to categorical features.
+}
+\label{table:bivars}
+\end{table}
+
+
+
+\pagebreak[2]
+
+\noindent{\bf Details}
+\smallskip
+
+Script \BivarScriptName{} takes an input matrix \texttt{X} whose columns represent
+the features and whose rows represent the records of a data sample.
+Given \texttt{X}, the script computes certain relevant bivariate statistics
+for specified pairs of feature columns \texttt{X[,$\,i$]} and \texttt{X[,$\,j$]}.
+Command-line parameters \texttt{index1} and \texttt{index2} specify the files with
+the column pairs of interest to the user. Namely, the file given by \texttt{index1}
+contains the vector of the 1st-attribute column indices and the file given
+by \texttt{index2} has the vector of the 2nd-attribute column indices, with
+``1st'' and ``2nd'' referring to their places in bivariate statistics.
+Note that both the \texttt{index1} and \texttt{index2} files should contain a 1-row matrix
+of positive integers.
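To make the dispatch concrete, here is a small illustrative sketch (plain Python with the type codes from the Arguments section; it is not part of the DML script) of how a pair of feature types selects one of the output matrices listed in Table~\ref{table:bivars}:

```python
# Hypothetical sketch: map a pair of feature types to the output group
# described in the table of output files; not the actual DML implementation.
SCALE, NOMINAL, ORDINAL = 1, 2, 3  # type encoding used by types1/types2

def stat_group(t1, t2):
    """Return the output matrix that a column pair with types (t1, t2)
    contributes to, with 'ordinal' treated as 'nominal' in mixed pairs."""
    if t1 == SCALE and t2 == SCALE:
        return "bivar.scale.scale.stats"      # Pearson's r
    if SCALE in (t1, t2):
        return "bivar.nominal.scale.stats"    # Eta, F
    if t1 == ORDINAL and t2 == ORDINAL:
        return "bivar.ordinal.ordinal.stats"  # Spearman's rho
    return "bivar.nominal.nominal.stats"      # chi^2, its p-value, Cramer's V

print(stat_group(SCALE, SCALE))      # bivar.scale.scale.stats
print(stat_group(NOMINAL, ORDINAL))  # bivar.nominal.nominal.stats
```
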
+
+The bivariate statistics to be computed depend on the \emph{types}, or
+\emph{measurement levels}, of the two columns.
+The types for each pair are provided in the files whose locations are specified by
+the \texttt{types1} and \texttt{types2} command-line parameters.
+These files are also 1-row matrices, i.e.\ vectors, that list the 1st-attribute and
+the 2nd-attribute column types in the same order as their indices in the
+\texttt{index1} and \texttt{index2} files. The types must be provided as per
+the following convention: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
+
+The script organizes its results into (potentially) four output matrices, one for
+each type combination. The type of each bivariate statistic is determined by the
+types of its two argument columns, with ``ordinal'' sometimes treated as ``nominal.''
+Table~\ref{table:bivars} describes what each row
+in each output matrix contains. In particular, the script computes the following
+statistics:
+\begin{Itemize}
+\item For a pair of scale (quantitative) columns, \NameStatR;
+\item For a pair of nominal columns (with finite-sized, fixed, unordered domains),
+the \NameStatChi{} and its p-value;
+\item For a pair of one scale column and one nominal column, \NameStatEta{} and \NameStatF{};
+\item For a pair of ordinal columns (ordered domains depicting ranks), \NameStatRho.
+\end{Itemize}
+Note that, as shown in Table~\ref{table:bivars}, the output matrices contain the
+column indices of the features involved in each statistic.
+Moreover, if an output matrix does not contain
+a value in a certain cell, then that cell should be interpreted as a~$0$
+(sparse matrix representation).
+
+Below we list all bivariate statistics computed by script \BivarScriptName.
+The statistics are collected into several groups by the type of their input
+features.
We refer to the two input features as $v_1$ and $v_2$ unless +specified otherwise; the value pairs are $(v_{1,i}, v_{2,i})$ for $i=1,\ldots,n$, +where $n$ is the number of rows in \texttt{X}, i.e.\ the sample size. + + +\paragraph{Scale-vs-scale statistics.} +Sample statistics that describe association between two quantitative (scale) features. +A scale feature has numerical values, with the natural ordering relation. +\begin{Description} +%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% +\item[\it\NameStatR]: +A measure of linear dependence between two numerical features: +\begin{equation*} +r \,\,=\,\, \frac{\Cov(v_1, v_2)}{\sqrt{\Var v_1 \Var v_2}} +\,\,=\,\, \frac{\sum_{i=1}^n (v_{1,i} - \bar{v}_1) (v_{2,i} - \bar{v}_2)}% +{\sqrt{\sum_{i=1}^n (v_{1,i} - \bar{v}_1)^{2\mathstrut} \cdot \sum_{i=1}^n (v_{2,i} - \bar{v}_2)^{2\mathstrut}}} +\end{equation*} +Commonly denoted by~$r$, correlation ranges between $-1$ and $+1$, reaching ${\pm}1$ when all value +pairs $(v_{1,i}, v_{2,i})$ lie on the same line. Correlation near~0 means that a line is not a good +way to represent the dependence between the two features; however, this does not imply independence. +The sign indicates direction of the linear association: $r > 0$ ($r < 0$) if one feature tends to +linearly increase (decrease) when the other feature increases. Nonlinear association, if present, +may disobey this sign. +\NameStatR{} is symmetric: $r(v_1, v_2) = r(v_2, v_1)$; it does not change if we transform $v_1$ and $v_2$ +to $a + b v_1$ and $c + d v_2$ where $a, b, c, d$ are constants and $b, d > 0$. + +Suppose that we use simple linear regression to represent one feature given the other, say +represent $v_{2,i} \approx \alpha + \beta v_{1,i}$ by selecting $\alpha$ and $\beta$ +to minimize the least-squares error $\sum_{i=1}^n (v_{2,i} - \alpha - \beta v_{1,i})^2$. 
+Then the best error equals +\begin{equation*} +\min_{\alpha, \beta} \,\,\sum_{i=1}^n \big(v_{2,i} - \alpha - \beta v_{1,i}\big)^2 \,\,=\,\, +(1 - r^2) \,\sum_{i=1}^n \big(v_{2,i} - \bar{v}_2\big)^2 +\end{equation*} +In other words, $1\,{-}\,r^2$ is the ratio of the residual sum of squares to +the total sum of squares. Hence, $r^2$ is an accuracy measure of the linear regression. +\end{Description} + + +\paragraph{Nominal-vs-nominal statistics.} +Sample statistics that describe association between two nominal categorical features. +Both features' value domains are encoded with positive integers in arbitrary order: +nominal features do not order their value domains. +\begin{Description} +%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% +\item[\it\NameStatChi]: +A measure of how much the frequencies of value pairs of two categorical features deviate from +statistical independence. Under independence, the probability of every value pair must equal +the product of probabilities of each value in the pair: +$\Prob[a, b] - \Prob[a]\,\Prob[b] = 0$. But we do not know these (hypothesized) probabilities; +we only know the sample frequency counts. Let $n_{a,b}$ be the frequency count of pair +$(a, b)$, let $n_a$ and $n_b$ be the frequency counts of $a$~alone and of $b$~alone. Under +independence, difference $n_{a,b}{/}n - (n_a{/}n)(n_b{/}n)$ is unlikely to be exactly~0 due +to sample randomness, yet it is unlikely to be too far from~0. For some pairs $(a,b)$ it may +deviate from~0 farther than for other pairs. \NameStatChi{}~is an aggregate measure that +combines squares of these differences across all value pairs: +\begin{equation*} +\chi^2 \,\,=\,\, \sum_{a,\,b} \Big(\frac{n_a n_b}{n}\Big)^{-1} \Big(n_{a,b} - \frac{n_a n_b}{n}\Big)^2 +\,=\,\, \sum_{a,\,b} \frac{(O_{a,b} - E_{a,b})^2}{E_{a,b}} +\end{equation*} +where $O_{a,b} = n_{a,b}$ are the \emph{observed} frequencies and $E_{a,b} = (n_a n_b){/}n$ are +the \emph{expected} frequencies for all pairs~$(a,b)$. 
Under independence (plus other standard
+assumptions) the sample~$\chi^2$ closely follows a well-known distribution, making it a basis for
+statistical tests of independence; see~\emph{\NameStatPChi} for details. Note that \NameStatChi{}
+does \emph{not} measure the strength of dependence: even very weak dependence may result in a
+significant deviation from independence if the counts are large enough. Use~\NameStatV{} instead
+to measure the strength of dependence.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Degrees of freedom]:
+An integer parameter required for the interpretation of the \NameStatChi{} measure. Under independence
+(plus other standard assumptions) the sample~$\chi^2$ statistic is approximately distributed as the
+sum of $d$~squares of independent normal random variables with mean~0 and variance~1, where $d$ is
+this integer parameter. For a pair of categorical features such that the $1^{\textrm{st}}$~feature
+has $k_1$ categories and the $2^{\textrm{nd}}$~feature has $k_2$ categories, the number of degrees
+of freedom is $d = (k_1 - 1)(k_2 - 1)$.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it\NameStatPChi]:
+A measure of how likely we would be to observe the current frequencies of value pairs of two categorical
+features assuming their statistical independence. More precisely, it computes the probability that
+the sum of $d$~squares of independent normal random variables with mean~0 and variance~1
+(called the $\chi^2$~distribution with $d$ degrees of freedom) generates a value at least as large
+as the current sample \NameStatChi. The parameter $d$ is the number of \emph{degrees of freedom}; see above.
+Under independence (plus other standard assumptions) the sample \NameStatChi{} closely follows the
+$\chi^2$~distribution and is unlikely to land very far into its tail.
On the other hand, if the
+two features are dependent, their sample \NameStatChi{} becomes arbitrarily large as $n\to\infty$
+and lands extremely far into the tail of the $\chi^2$~distribution given a large enough data sample.
+\NameStatPChi{} returns the tail ``weight'' on the right-hand side of \NameStatChi:
+\begin{equation*}
+P\,\,=\,\, \Prob\big[r \geq \textrm{\NameStatChi} \,\,\big|\,\, r \sim \textrm{the $\chi^2$ distribution}\big]
+\end{equation*}
+Like any probability, $P$ ranges between 0 and~1. If $P\leq 0.05$, the dependence between the two
+features may be considered statistically significant (i.e.\ their independence is considered
+statistically ruled out). For highly dependent features, it is not unusual to have $P\leq 10^{-20}$,
+in which case our script will simply return $P = 0$. Independent features should have
+their $P\geq 0.05$ in about 95\% of cases.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it\NameStatV]:
+A measure of the strength of association, i.e.\ of statistical dependence, between two categorical
+features, conceptually similar to \NameStatR. It divides the observed~\NameStatChi{} by the maximum
+possible~$\chi^2_{\textrm{max}}$ given $n$ and the numbers $k_1, k_2$~of categories in each feature,
+then takes the square root. Thus, \NameStatV{} ranges from 0 to~1,
+where 0 implies no association and 1 implies the maximum possible association (one-to-one
+correspondence) between the two features.
See \emph{\NameStatChi} for the computation of~$\chi^2$; +its maximum${} = {}$% +$n\cdot\min\{k_1\,{-}\,1, k_2\,{-}\,1\}$ where the $1^{\textrm{st}}$~feature +has $k_1$ categories and the $2^{\textrm{nd}}$~feature has $k_2$ categories~\cite{AcockStavig1979:CramersV}, +so +\begin{equation*} +\textrm{\NameStatV} \,\,=\,\, \sqrt{\frac{\textrm{\NameStatChi}}{n\cdot\min\{k_1\,{-}\,1, k_2\,{-}\,1\}}} +\end{equation*} +As opposed to \NameStatPChi, which goes to~0 (rapidly) as the features' dependence increases, +\NameStatV{} goes towards~1 (slowly) as the dependence increases. Both \NameStatChi{} and +\NameStatPChi{} are very sensitive to~$n$, but in \NameStatV{} this is mitigated by taking the +ratio. +\end{Description} + + +\paragraph{Nominal-vs-scale statistics.} +Sample statistics that describe association between a categorical feature +(order ignored) and a quantitative (scale) feature. +The values of the categorical feature must be coded as positive integers. +\begin{Description} +%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% +\item[\it\NameStatEta]: +A measure for the strength of association (statistical dependence) between a nominal feature +and a scale feature, conceptually similar to \NameStatR. Ranges from 0 to~1, approaching 0 +when there is no association and approaching 1 when there is a strong association. +The nominal feature, treated as the independent variable, is assumed to have relatively few +possible values, all with large frequency counts. The scale feature is treated as the dependent +variable. Denoting the nominal feature by~$x$ and the scale feature by~$y$, we have: +\begin{equation*} +\eta^2 \,=\, 1 - \frac{\sum_{i=1}^{n} \big(y_i - \hat{y}[x_i]\big)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, +\,\,\,\,\textrm{where}\,\,\,\, +\hat{y}[x] = \frac{1}{\mathop{\mathrm{freq}}(x)}\sum_{i=1}^n +\,\left\{\!\!\begin{array}{rl} y_i & \textrm{if $x_i = x$}\\ 0 & \textrm{otherwise}\end{array}\right.\!\!\! 
+\end{equation*}
+and $\bar{y} = (1{/}n)\sum_{i=1}^n y_i$ is the mean. Value $\hat{y}[x]$ is the average
+of~$y_i$ among all records where $x_i = x$; it can also be viewed as the ``predictor''
+of $y$ given~$x$. Then $\sum_{i=1}^{n} (y_i - \hat{y}[x_i])^2$ is the residual error
+sum-of-squares and $\sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum-of-squares for~$y$.
+Hence, $\eta^2$ measures the accuracy of predicting $y$ with~$x$, just like the
+``R-squared'' statistic measures the accuracy of linear regression. Our output $\eta$
+is the square root of~$\eta^2$.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it\NameStatF]:
+A measure of how much the values of the scale feature, denoted here by~$y$,
+deviate from statistical independence of the nominal feature, denoted by~$x$.
+The same measure appears in the one-way analysis of vari\-ance (ANOVA).
+Like \NameStatChi, \NameStatF{} is used to test the hypothesis that
+$y$~is independent of~$x$, given the following assumptions:
+\begin{Itemize}
+\item The scale feature $y$ has an approximately normal distribution whose mean
+may depend only on~$x$ and whose variance is the same for all~$x$;
+\item The nominal feature $x$ has a relatively small value domain with large
+frequency counts; the $x_i$-values are treated as fixed (non-random);
+\item All records are sampled independently of each other.
+\end{Itemize}
+To compute \NameStatF{}, we first compute $\hat{y}[x]$ as the average of~$y_i$
+among all records where $x_i = x$. These $\hat{y}[x]$ can be viewed as
+``predictors'' of $y$ given~$x$; if $y$ is independent of~$x$, they should
+``predict'' only the global mean~$\bar{y}$. Then we form two sums-of-squares:
+\begin{Itemize}
+\item \emph{Residual} sum-of-squares of the ``predictor'' accuracy: $y_i - \hat{y}[x_i]$;
+\item \emph{Explained} sum-of-squares of the ``predictor'' variability: $\hat{y}[x_i] - \bar{y}$.
+\end{Itemize}
+\NameStatF{} is the ratio of the explained sum-of-squares to
+the residual sum-of-squares, each divided by its corresponding degrees
+of freedom:
+\begin{equation*}
+F \,\,=\,\,
+\frac{\sum_{x}\, \mathop{\mathrm{freq}}(x) \, \big(\hat{y}[x] - \bar{y}\big)^2 \,\big/\,\, (k\,{-}\,1)}%
+{\sum_{i=1}^{n} \big(y_i - \hat{y}[x_i]\big)^2 \,\big/\,\, (n\,{-}\,k)} \,\,=\,\,
+\frac{n\,{-}\,k}{k\,{-}\,1} \cdot \frac{\eta^2}{1 - \eta^2}
+\end{equation*}
+Here $k$ is the domain size of the nominal feature~$x$. The $k$ ``predictors''
+lose one degree of freedom due to their linear dependence with~$\bar{y}$; similarly,
+the $n$ values $y_i$ lose $k$ degrees of freedom due to the ``predictors''.
+
+The statistic can be used to test whether the hypothesis that $y$ is independent of $x$ is reasonable;
+more generally (with relaxed normality assumptions) it can test the hypothesis that
+\emph{the mean} of $y$ among records with a given~$x$ is the same for all~$x$.
+Under this hypothesis \NameStatF{} has, or approximates, the $F(k\,{-}\,1, n\,{-}\,k)$-distribution.
+But if the mean of $y$ given $x$ depends on~$x$, \NameStatF{}
+becomes arbitrarily large as $n\to\infty$ (with $k$~fixed) and lands extremely far
+into the tail of the $F(k\,{-}\,1, n\,{-}\,k)$-distribution given a large enough data sample.
+\end{Description}
+
+
+\paragraph{Ordinal-vs-ordinal statistics.}
+Sample statistics that describe association between two ordinal categorical features.
+Both features' value domains are encoded with positive integers, so that the natural
+order of the integers coincides with the order in each value domain.
+\begin{Description}
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it\NameStatRho]:
+A measure of the strength of association (statistical dependence) between
+two ordinal features, conceptually similar to \NameStatR. Specifically, it is \NameStatR{}
+applied to the feature vectors in which all values are replaced by their ranks, i.e.\
+their positions when the vector is sorted.
The ranks of identical (duplicate) values
+are replaced with their average rank. For example, in the vector
+$(15, 11, 26, 15, 8)$ the value ``15'' occurs twice with ranks 3 and~4 per the sorted
+order $(8_1, 11_2, 15_3, 15_4, 26_5)$; so, both values are assigned their average
+rank of $3.5 = (3\,{+}\,4)\,{/}\,2$ and the vector is replaced by~$(3.5,\, 2,\, 5,\, 3.5,\, 1)$.
+
+Our implementation of \NameStatRho{} is geared towards features having small value domains
+and large counts for the values. Given the two input vectors, we form a contingency table $T$
+of pairwise frequency counts, as well as a vector of frequency counts for each feature: $f_1$
+and~$f_2$. Here the indices $i$ and~$j$ in $T_{i,j}$, $f_{1,i}$, and $f_{2,j}$ refer to the
+order-preserving integer encoding of the feature values.
+We use prefix sums over $f_1$ and~$f_2$ to compute the values' average ranks:
+$r_{1,i} = \sum_{j=1}^{i-1} f_{1,j} + (f_{1,i}\,{+}\,1){/}2$, and analogously for~$r_2$.
+Finally, we compute rank variances for $r_1, r_2$ weighted by counts $f_1, f_2$ and their
+covariance weighted by~$T$, before applying the standard formula for \NameStatR:
+\begin{equation*}
+\rho \,\,=\,\, \frac{\Cov_T(r_1, r_2)}{\sqrt{\Var_{f_1}(r_1)\Var_{f_2}(r_2)}}
+\,\,=\,\, \frac{\sum_{i,j} T_{i,j} (r_{1,i} - \bar{r}_1) (r_{2,j} - \bar{r}_2)}%
+{\sqrt{\sum_i f_{1,i} (r_{1,i} - \bar{r}_1)^{2\mathstrut} \cdot \sum_j f_{2,j} (r_{2,j} - \bar{r}_2)^{2\mathstrut}}}
+\end{equation*}
+where $\bar{r}_1 = \sum_i r_{1,i} f_{1,i}{/}n$, and analogously for~$\bar{r}_2$.
+The value of $\rho$ lies between $-1$ and $+1$, with the sign indicating the prevalent direction
+of the association: $\rho > 0$ ($\rho < 0$) means that one feature tends to increase (decrease)
+when the other feature increases. The correlation becomes~${\pm}1$ when the two features are
+monotonically related.
+\end{Description}
+
+
+\smallskip
+\noindent{\bf Returns}
+\smallskip
+
+A collection of (potentially) 4 matrices.
Each matrix contains bivariate statistics that +resulted from a different combination of feature types. There is one matrix for scale-scale +statistics (which includes \NameStatR), one for nominal-nominal statistics (includes \NameStatChi{}), +one for nominal-scale statistics (includes \NameStatF) and one for ordinal-ordinal statistics +(includes \NameStatRho). If any of these matrices is not produced, then no pair of columns required +the corresponding type combination. See Table~\ref{table:bivars} for the matrix naming and +format details. + + +\smallskip +\pagebreak[2] + +\noindent{\bf Examples} +\smallskip + +{\hangindent=\parindent\noindent\tt +\hml -f \BivarScriptName{} -nvargs +X=/user/biadmin/X.mtx +index1=/user/biadmin/S1.mtx +index2=/user/biadmin/S2.mtx +types1=/user/biadmin/K1.mtx +types2=/user/biadmin/K2.mtx +OUTDIR=/user/biadmin/stats.mtx + +} +
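To illustrate the definitions above, the following sketch (plain Python, illustrative only; the actual computation is performed by the DML script) computes \NameStatR{} and \NameStatRho{} directly from their formulas, including the average-rank treatment of ties shown in the example vector $(15, 11, 26, 15, 8)$:

```python
# Illustrative sketch of the formulas in this subsection (plain Python);
# the production computation is done by bivar-stats.dml.
import math

def pearson_r(v1, v2):
    """Pearson's correlation coefficient of two equal-length value lists."""
    n = len(v1)
    m1, m2 = sum(v1) / n, sum(v2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(v1, v2))
    var1 = sum((a - m1) ** 2 for a in v1)
    var2 = sum((b - m2) ** 2 for b in v2)
    return cov / math.sqrt(var1 * var2)

def average_ranks(v):
    """1-based ranks, with tied values replaced by their average rank."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    ranks = [0.0] * len(v)
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
            j += 1  # extend the run of tied values
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def spearman_rho(v1, v2):
    """Spearman's rho: Pearson's r applied to the average ranks."""
    return pearson_r(average_ranks(v1), average_ranks(v2))

print(average_ranks([15, 11, 26, 15, 8]))  # [3.5, 2.0, 5.0, 3.5, 1.0]
```

The printed ranks reproduce the worked example from the ordinal-vs-ordinal section; note that the script itself computes the ranks via prefix sums over frequency counts rather than by sorting, which is equivalent but faster for small value domains.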
http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/alg-ref/DescriptiveStats.tex ---------------------------------------------------------------------- diff --git a/alg-ref/DescriptiveStats.tex b/alg-ref/DescriptiveStats.tex new file mode 100644 index 0000000..5a59ad4 --- /dev/null +++ b/alg-ref/DescriptiveStats.tex @@ -0,0 +1,115 @@ +\begin{comment} + + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. 
+
+\end{comment}
+
+\newcommand{\UnivarScriptName}{\texttt{\tt Univar-Stats.dml}}
+\newcommand{\BivarScriptName}{\texttt{\tt bivar-stats.dml}}
+
+\newcommand{\OutputRowIDMinimum}{1}
+\newcommand{\OutputRowIDMaximum}{2}
+\newcommand{\OutputRowIDRange}{3}
+\newcommand{\OutputRowIDMean}{4}
+\newcommand{\OutputRowIDVariance}{5}
+\newcommand{\OutputRowIDStDeviation}{6}
+\newcommand{\OutputRowIDStErrorMean}{7}
+\newcommand{\OutputRowIDCoeffVar}{8}
+\newcommand{\OutputRowIDQuartiles}{?, 13, ?}
+\newcommand{\OutputRowIDMedian}{13}
+\newcommand{\OutputRowIDIQMean}{14}
+\newcommand{\OutputRowIDSkewness}{9}
+\newcommand{\OutputRowIDKurtosis}{10}
+\newcommand{\OutputRowIDStErrorSkewness}{11}
+\newcommand{\OutputRowIDStErrorCurtosis}{12}
+\newcommand{\OutputRowIDNumCategories}{15}
+\newcommand{\OutputRowIDMode}{16}
+\newcommand{\OutputRowIDNumModes}{17}
+\newcommand{\OutputRowText}[1]{\mbox{(output row~{#1})\hspace{0.5pt}:}}
+
+\newcommand{\NameStatR}{Pearson's correlation coefficient}
+\newcommand{\NameStatChi}{Pearson's~$\chi^2$}
+\newcommand{\NameStatPChi}{$P\textrm{-}$value of Pearson's~$\chi^2$}
+\newcommand{\NameStatV}{Cram\'er's~$V$}
+\newcommand{\NameStatEta}{Eta statistic}
+\newcommand{\NameStatF}{$F$~statistic}
+\newcommand{\NameStatRho}{Spearman's rank correlation coefficient}
+
+Descriptive statistics are used to quantitatively describe the main characteristics of the data.
+They provide meaningful summaries computed over different observations or data records
+collected in a study. These summaries typically form the basis of the initial data exploration
+as part of a more extensive statistical analysis. Such a quantitative analysis assumes that
+every variable (also known as an attribute, feature, or column) in the data has a specific
+\emph{level of measurement}~\cite{Stevens1946:scales}.
+
+The measurement level of a variable, often called the {\bf variable type}, can either be
+\emph{scale} or \emph{categorical}.
A \emph{scale} variable represents data measured on
+an interval or ratio scale. Examples of scale variables include `Height', `Weight',
+`Salary', and `Temperature'. Scale variables are also referred to as \emph{quantitative}
+or \emph{continuous} variables. In contrast, a \emph{categorical} variable has a fixed,
+limited number of distinct values or categories. Examples of categorical variables
+include `Gender', `Region', `Hair color', `Zipcode', and `Level of Satisfaction'.
+Categorical variables can further be classified into two types, \emph{nominal} and
+\emph{ordinal}, depending on whether the categories in the variable can be ordered via an
+intrinsic ranking. For example, there is no meaningful ranking among distinct values in
+the `Hair color' variable, while the categories in `Level of Satisfaction' can be ranked from
+highly dissatisfied to highly satisfied.
+
+The input dataset for descriptive statistics is provided in the form of a matrix, whose
+rows are the records (data points) and whose columns are the features (i.e.~variables).
+Some scripts allow this matrix to be vertically split into two or three matrices. Descriptive
+statistics are computed over the specified features (columns) in the matrix. Which
+statistics are computed depends on the types of the features. It is important to keep
+in mind the following caveats and restrictions:
+\begin{Enumerate}
+\item Given a finite set of data records, i.e.~a \emph{sample}, we take their feature
+values and compute their \emph{sample statistics}. These statistics
+will vary from sample to sample even if the underlying distribution of feature values
+remains the same. Sample statistics are accurate for the given sample only.
+If the goal is to estimate the \emph{distribution statistics} that are parameters of
+the (hypothesized) underlying distribution of the features, the corresponding sample
+statistics may sometimes be used as approximations, but their accuracy will vary.
+\item In particular, the accuracy of the estimated distribution statistics will be low +if the number of values in the sample is small. That is, for small samples, the computed +statistics may depend on the randomness of the individual sample values more than on +the underlying distribution of the features. +\item The accuracy will also be low if the sample records cannot be assumed mutually +independent and identically distributed (i.i.d.), that is, sampled at random from the +same underlying distribution. In practice, feature values in one record often depend +on other features and other records, including unknown ones. +\item Most of the computed statistics will have low estimation accuracy in the presence of +extreme values (outliers) or if the underlying distribution has heavy tails, for example +obeys a power law. However, a few of the computed statistics, such as the median and +\NameStatRho{}, are \emph{robust} to outliers. +\item Some sample statistics are reported with their \emph{sample standard errors} +in an attempt to quantify their accuracy as distribution parameter estimators. But these +sample standard errors, in turn, only estimate the underlying distribution's standard +errors and will have low accuracy for small or \mbox{non-i.i.d.} samples, outliers in samples, +or heavy-tailed distributions. +\item We assume that the quantitative (scale) feature columns do not contain missing +values, infinite values, \texttt{NaN}s, or coded non-numeric values, unless otherwise +specified. We assume that each categorical feature column contains positive integers +from 1 to the number of categories; for ordinal features, the natural order on +the integers should coincide with the order on the categories. 
+\end{Enumerate} + +\input{DescriptiveUnivarStats} + +\input{DescriptiveBivarStats} + +\input{DescriptiveStratStats} http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/alg-ref/DescriptiveStratStats.tex ---------------------------------------------------------------------- diff --git a/alg-ref/DescriptiveStratStats.tex b/alg-ref/DescriptiveStratStats.tex new file mode 100644 index 0000000..be0cffd --- /dev/null +++ b/alg-ref/DescriptiveStratStats.tex @@ -0,0 +1,306 @@ +\begin{comment} + + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. + +\end{comment} + +\subsection{Stratified Bivariate Statistics} + +\noindent{\bf Description} +\smallskip + +The {\tt stratstats.dml} script computes common bivariate statistics, such +as correlation, slope, and their p-value, in parallel for many pairs of input +variables in the presence of a confounding categorical variable. The values +of this confounding variable group the records into strata (subpopulations), +in which all bivariate pairs are assumed free of confounding. The script +uses the same data model as in one-way analysis of covariance (ANCOVA), with +strata representing population samples. It also outputs univariate stratified +and bivariate unstratified statistics. 
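The confounding effect that stratification corrects can be checked numerically; the sketch below (plain Python, using the artificial sales data of Table~\ref{table:stratexample}) reproduces the per-month promotion effects of $+0.1$ and the reversed pooled effect of $-0.5$:

```python
# Numeric check of the stratification example (artificial sales data from
# the table below); illustrative only, not part of stratstats.dml.
months = {  # month -> {promo flag: (customers in millions, avg sales per 1000)}
    "Oct": {0: (0.6, 0.4), 1: (1.4, 0.5)},
    "Nov": {0: (1.4, 0.9), 1: (0.6, 1.0)},
    "Dec": {0: (3.0, 2.5), 1: (1.0, 2.6)},
}

# Within every stratum (month), the promotion effect on avg sales is positive:
for m, d in months.items():
    print(m, round(d[1][1] - d[0][1], 2))  # +0.1 in every month

def pooled_avg(promo):
    """Customer-weighted average sales over the quarter for one promo flag."""
    customers = sum(d[promo][0] for d in months.values())
    sales = sum(d[promo][0] * d[promo][1] for d in months.values())
    return sales / customers

effect = pooled_avg(1) - pooled_avg(0)  # 1.3 - 1.8
print(round(effect, 1))  # -0.5: pooling reverses and amplifies the effect
```
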
+
+\begin{table}[t]\hfil
+\begin{tabular}{|l|ll|ll|ll||ll|}
+\hline
+Month of the year & \multicolumn{2}{l|}{October} & \multicolumn{2}{l|}{November} &
+  \multicolumn{2}{l||}{December} & \multicolumn{2}{c|}{Oct$\,$--$\,$Dec} \\
+Customers, millions & 0.6 & 1.4 & 1.4 & 0.6 & 3.0 & 1.0 & 5.0 & 3.0 \\
+\hline
+Promotion (0 or 1) & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\
+Avg.\ sales per 1000 & 0.4 & 0.5 & 0.9 & 1.0 & 2.5 & 2.6 & 1.8 & 1.3 \\
+\hline
+\end{tabular}\hfil
+\caption{Stratification example: the effect of the promotion on average sales
+becomes reversed and amplified (from $+0.1$ to $-0.5$) if we ignore the months.}
+\label{table:stratexample}
+\end{table}
+
+To see how data stratification mitigates confounding, consider an (artificial)
+example in Table~\ref{table:stratexample}. A highly seasonal retail item
+was marketed with and without a promotion over the final 3~months of the year.
+In each month a sale was more likely with the promotion than without it.
+But during the peak holiday season, when shoppers came in greater numbers and
+bought the item more often, the promotion was less frequently used. As a result,
+if the 4-th quarter data is pooled together, the promotion's effect becomes
+reversed and magnified. Stratifying by month restores the positive correlation.
+
+The script computes its statistics in parallel over all possible pairs from two
+specified sets of covariates. The 1-st covariate is a column in input matrix~$X$
+and the 2-nd covariate is a column in input matrix~$Y$; matrices $X$ and~$Y$ may
+be the same or different. The columns of interest are given by their index numbers
+in special matrices. The stratum column, specified in its own matrix, is the same
+for all covariate pairs.
+
+Both covariates in each pair must be numerical, with the 2-nd covariate normally
+distributed given the 1-st covariate (see~Details). Missing covariate values or
+strata are represented by~``NaN''.
Records with NaN's are selectively omitted +wherever their NaN's are material to the output statistic. + +\smallskip +\pagebreak[3] + +\noindent{\bf Usage} +\smallskip + +{\hangindent=\parindent\noindent\it% +{\tt{}-f }path/\/{\tt{}stratstats.dml} +{\tt{} -nvargs} +{\tt{} X=}path/file +{\tt{} Xcid=}path/file +{\tt{} Y=}path/file +{\tt{} Ycid=}path/file +{\tt{} S=}path/file +{\tt{} Scid=}int +{\tt{} O=}path/file +{\tt{} fmt=}format + +} + + +\smallskip +\noindent{\bf Arguments} +\begin{Description} +\item[{\tt X}:] +Location (on HDFS) to read matrix $X$ whose columns we want to use as +the 1-st covariate (i.e.~as the feature variable) +\item[{\tt Xcid}:] (default:\mbox{ }{\tt " "}) +Location to read the single-row matrix that lists all index numbers +of the $X$-columns used as the 1-st covariate; the default value means +``use all $X$-columns'' +\item[{\tt Y}:] (default:\mbox{ }{\tt " "}) +Location to read matrix $Y$ whose columns we want to use as the 2-nd +covariate (i.e.~as the response variable); the default value means +``use $X$ in place of~$Y$'' +\item[{\tt Ycid}:] (default:\mbox{ }{\tt " "}) +Location to read the single-row matrix that lists all index numbers +of the $Y$-columns used as the 2-nd covariate; the default value means +``use all $Y$-columns'' +\item[{\tt S}:] (default:\mbox{ }{\tt " "}) +Location to read matrix $S$ that has the stratum column. +Note: the stratum column must contain small positive integers; all fractional +values are rounded; stratum IDs of value ${\leq}\,0$ or NaN are treated as +missing. The default value for {\tt S} means ``use $X$ in place of~$S$'' +\item[{\tt Scid}:] (default:\mbox{ }{\tt 1}) +The index number of the stratum column in~$S$ +\item[{\tt O}:] +Location to store the output matrix defined in Table~\ref{table:stratoutput} +\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"}) +Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv}; +see read/write functions in SystemML Language Reference for details. 
+\end{Description} + + +\begin{table}[t]\small\hfil +\begin{tabular}{|rcl|rcl|} +\hline +& Col.\# & Meaning & & Col.\# & Meaning \\ +\hline +\multirow{9}{*}{\begin{sideways}1-st covariate\end{sideways}}\hspace{-1em} +& 01 & $X$-column number & +\multirow{9}{*}{\begin{sideways}2-nd covariate\end{sideways}}\hspace{-1em} +& 11 & $Y$-column number \\ +& 02 & presence count for $x$ & +& 12 & presence count for $y$ \\ +& 03 & global mean $(x)$ & +& 13 & global mean $(y)$ \\ +& 04 & global std.\ dev. $(x)$ & +& 14 & global std.\ dev. $(y)$ \\ +& 05 & stratified std.\ dev. $(x)$ & +& 15 & stratified std.\ dev. $(y)$ \\ +& 06 & $R^2$ for $x \sim {}$strata & +& 16 & $R^2$ for $y \sim {}$strata \\ +& 07 & adjusted $R^2$ for $x \sim {}$strata & +& 17 & adjusted $R^2$ for $y \sim {}$strata \\ +& 08 & p-value, $x \sim {}$strata & +& 18 & p-value, $y \sim {}$strata \\ +& 09--10 & reserved & +& 19--20 & reserved \\ +\hline +\multirow{9}{*}{\begin{sideways}$y\sim x$, NO strata\end{sideways}}\hspace{-1.15em} +& 21 & presence count $(x, y)$ & +\multirow{10}{*}{\begin{sideways}$y\sim x$ AND strata$\!\!\!\!$\end{sideways}}\hspace{-1.15em} +& 31 & presence count $(x, y, s)$ \\ +& 22 & regression slope & +& 32 & regression slope \\ +& 23 & regres.\ slope std.\ dev. & +& 33 & regres.\ slope std.\ dev. \\ +& 24 & correlation${} = \pm\sqrt{R^2}$ & +& 34 & correlation${} = \pm\sqrt{R^2}$ \\ +& 25 & residual std.\ dev. & +& 35 & residual std.\ dev. 
\\
+& 26 & $R^2$ in $y$ due to $x$ &
+& 36 & $R^2$ in $y$ due to $x$ \\
+& 27 & adjusted $R^2$ in $y$ due to $x$ &
+& 37 & adjusted $R^2$ in $y$ due to $x$ \\
+& 28 & p-value for ``slope = 0'' &
+& 38 & p-value for ``slope = 0'' \\
+& 29 & reserved &
+& 39 & \# strata with ${\geq}\,2$ count \\
+& 30 & reserved &
+& 40 & reserved \\
+\hline
+\end{tabular}\hfil
+\caption{The {\tt stratstats.dml} output matrix has one row for each distinct
+pair of 1-st and 2-nd covariates, and 40 columns with the statistics described
+here.}
+\label{table:stratoutput}
+\end{table}
+
+
+
+
+\noindent{\bf Details}
+\smallskip
+
+Suppose we have $n$ records of format $(i, x, y)$, where $i\in\{1,\ldots, k\}$ is
+a stratum number and $(x, y)$ are two numerical covariates. We want to analyze
+the linear relationship between $y$ and $x$ conditioned on the stratum~$i$.
+Note that $x$, but not~$y$, may represent a categorical variable if we assign a
+numerical value to each category, for example 0 and 1 for two categories.
+
+We assume a linear regression model for~$y$:
+\begin{equation}
+y_{i,j} \,=\, \alpha_i + \beta x_{i,j} + \eps_{i,j}\,, \quad\textrm{where}\,\,\,\,
+\eps_{i,j} \sim \Normal(0, \sigma^2)
+\label{eqn:stratlinmodel}
+\end{equation}
+Here $i = 1\ldots k$ is a stratum number and $j = 1\ldots n_i$ is a record number
+in stratum~$i$; by $n_i$ we denote the number of records available in stratum~$i$.
+The noise term~$\eps_{i,j}$ is assumed to have the same variance in all strata.
+When $n_i\,{>}\,0$, we can estimate the means of $x_{i, j}$ and $y_{i, j}$ in +stratum~$i$ as +\begin{equation*} +\bar{x}_i \,= \Big(\sum\nolimits_{j=1}^{n_i} \,x_{i, j}\Big) / n_i\,;\quad +\bar{y}_i \,= \Big(\sum\nolimits_{j=1}^{n_i} \,y_{i, j}\Big) / n_i +\end{equation*} +If $\beta$ is known, the best estimate for $\alpha_i$ is $\bar{y}_i - \beta \bar{x}_i$, +which gives the prediction error sum-of-squares of +\begin{equation} +\sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - \beta x_{i,j} - (\bar{y}_i - \beta \bar{x}_i)\big)^2 +\,\,=\,\, \beta^{2\,}V_x \,-\, 2\beta \,V_{x,y} \,+\, V_y +\label{eqn:stratsumsq} +\end{equation} +where $V_x$, $V_y$, and $V_{x, y}$ are, correspondingly, the ``stratified'' sample +estimates of variance $\Var(x)$ and $\Var(y)$ and covariance $\Cov(x,y)$ computed as +\begin{align*} +V_x \,&=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(x_{i,j} - \bar{x}_i\big)^2; \quad +V_y \,=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - \bar{y}_i\big)^2;\\ +V_{x,y} \,&=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(x_{i,j} - \bar{x}_i\big)\big(y_{i,j} - \bar{y}_i\big) +\end{align*} +They are stratified because we compute the sample (co-)variances in each stratum~$i$ +separately, then combine by summation. The stratified estimates for $\Var(X)$ and $\Var(Y)$ +tend to be smaller than the non-stratified ones (with the global mean instead of $\bar{x}_i$ +and~$\bar{y}_i$) since $\bar{x}_i$ and $\bar{y}_i$ fit closer to $x_{i,j}$ and $y_{i,j}$ +than the global means. The stratified variance estimates the uncertainty in $x_{i,j}$ +and~$y_{i,j}$ given their stratum~$i$. 
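As a sanity check on these definitions, the stratified sums of squares and products can be sketched in Python (a hypothetical helper, not the {\tt stratstats.dml} implementation); it also forms the ratio $V_{x,y}/V_x$, which turns out to be the slope estimate~$\hat{\beta}$:

```python
from collections import defaultdict

def stratified_moments(records):
    """records: iterable of (stratum_id, x, y) triples.
    Returns (Vx, Vy, Vxy, slope), where Vx, Vy, Vxy are the per-stratum
    centered sums of squares/products and slope = Vxy / Vx."""
    groups = defaultdict(list)
    for i, x, y in records:
        groups[i].append((x, y))
    Vx = Vy = Vxy = 0.0
    for points in groups.values():
        n_i = len(points)
        mx = sum(x for x, _ in points) / n_i   # stratum mean of x
        my = sum(y for _, y in points) / n_i   # stratum mean of y
        Vx += sum((x - mx) ** 2 for x, _ in points)
        Vy += sum((y - my) ** 2 for _, y in points)
        Vxy += sum((x - mx) * (y - my) for x, y in points)
    return Vx, Vy, Vxy, Vxy / Vx

# Two strata with very different intercepts alpha_i but the same slope 1:
data = [(1, 0.0, 0.0), (1, 1.0, 1.0), (2, 0.0, 10.0), (2, 1.0, 11.0)]
Vx, Vy, Vxy, slope = stratified_moments(data)   # slope == 1.0
```

Centering within each stratum removes the intercepts $\alpha_i$, so the wide gap between the two strata does not inflate $V_x$, $V_y$, or $V_{x,y}$.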
+ +Minimizing over~$\beta$ the error sum-of-squares~(\ref{eqn:stratsumsq}) +gives us the regression slope estimate \mbox{$\hat{\beta} = V_{x,y} / V_x$}, +with~(\ref{eqn:stratsumsq}) becoming the residual sum-of-squares~(RSS): +\begin{equation*} +\mathrm{RSS} \,\,=\, \, +\sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - +\hat{\beta} x_{i,j} - (\bar{y}_i - \hat{\beta} \bar{x}_i)\big)^2 +\,\,=\,\, V_y \,\big(1 \,-\, V_{x,y}^2 / (V_x V_y)\big) +\end{equation*} +The quantity $\hat{R}^2 = V_{x,y}^2 / (V_x V_y)$, called \emph{$R$-squared}, estimates the fraction +of stratified variance in~$y_{i,j}$ explained by covariate $x_{i, j}$ in the linear +regression model~(\ref{eqn:stratlinmodel}). We define \emph{stratified correlation} as the +square root of~$\hat{R}^2$ taken with the sign of~$V_{x,y}$. We also use RSS to estimate +the residual standard deviation $\sigma$ in~(\ref{eqn:stratlinmodel}) that models the prediction error +of $y_{i,j}$ given $x_{i,j}$ and the stratum: +\begin{equation*} +\hat{\beta}\, =\, \frac{V_{x,y}}{V_x}; \,\,\,\, \hat{R} \,=\, \frac{V_{x,y}}{\sqrt{V_x V_y}}; +\,\,\,\, \hat{R}^2 \,=\, \frac{V_{x,y}^2}{V_x V_y}; +\,\,\,\, \hat{\sigma} \,=\, \sqrt{\frac{\mathrm{RSS}}{n - k - 1}}\,\,\,\, +\Big(n = \sum_{i=1}^k n_i\Big) +\end{equation*} + +The $t$-test and the $F$-test for the null-hypothesis of ``$\beta = 0$'' are +obtained by considering the effect of $\hat{\beta}$ on the residual sum-of-squares, +measured by the decrease from $V_y$ to~RSS. +The $F$-statistic is the ratio of the ``explained'' sum-of-squares +to the residual sum-of-squares, divided by their corresponding degrees of freedom. 
+There are $n\,{-}\,k$ degrees of freedom for~$V_y$, parameter $\beta$ reduces that
+to $n\,{-}\,k\,{-}\,1$ for~RSS, and their difference $V_y - {}$RSS has just 1 degree
+of freedom:
+\begin{equation*}
+F \,=\, \frac{(V_y - \mathrm{RSS})/1}{\mathrm{RSS}/(n\,{-}\,k\,{-}\,1)}
+\,=\, \frac{\hat{R}^2\,(n\,{-}\,k\,{-}\,1)}{1-\hat{R}^2}; \quad
+t \,=\, \hat{R}\, \sqrt{\frac{n\,{-}\,k\,{-}\,1}{1-\hat{R}^2}}.
+\end{equation*}
+The $t$-statistic is simply the square root of the $F$-statistic with the appropriate
+choice of sign. If the null hypothesis and the linear model are both true, the $t$-statistic
+has Student $t$-distribution with $n\,{-}\,k\,{-}\,1$ degrees of freedom. We can
+also compute it if we divide $\hat{\beta}$ by its estimated standard deviation:
+\begin{equation*}
+\stdev(\hat{\beta})_{\mathrm{est}} \,=\, \hat{\sigma}\,/\sqrt{V_x} \quad\Longrightarrow\quad
+t \,=\, \hat{R}\sqrt{V_y} \,/\, \hat{\sigma} \,=\, \hat{\beta} \,/\, \stdev(\hat{\beta})_{\mathrm{est}}
+\end{equation*}
+The standard deviation estimate for~$\hat{\beta}$ is included in the {\tt stratstats.dml} output.
+
+\smallskip
+\noindent{\bf Returns}
+\smallskip
+
+The output matrix format is defined in Table~\ref{table:stratoutput}.
+
+\smallskip
+\noindent{\bf Examples}
+\smallskip
+
+{\hangindent=\parindent\noindent\tt
+\hml -f stratstats.dml -nvargs X=/user/biadmin/X.mtx Xcid=/user/biadmin/Xcid.mtx
+  Y=/user/biadmin/Y.mtx Ycid=/user/biadmin/Ycid.mtx S=/user/biadmin/S.mtx Scid=2
+  O=/user/biadmin/Out.mtx fmt=csv
+
+}
+{\hangindent=\parindent\noindent\tt
+\hml -f stratstats.dml -nvargs X=/user/biadmin/Data.mtx Xcid=/user/biadmin/Xcid.mtx
+  Ycid=/user/biadmin/Ycid.mtx Scid=7 O=/user/biadmin/Out.mtx
+
+}
+
+%\smallskip
+%\noindent{\bf See Also}
+%\smallskip
+%
+%For non-stratified bivariate statistics with a wider variety of input data types
+%and statistical tests, see \ldots. For general linear regression, see
+%{\tt LinearRegDS.dml} and {\tt LinearRegCG.dml}.
%For logistic regression, appropriate
%when the response variable is categorical, see {\tt MultiLogReg.dml}.
+

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/alg-ref/DescriptiveUnivarStats.tex
----------------------------------------------------------------------
diff --git a/alg-ref/DescriptiveUnivarStats.tex b/alg-ref/DescriptiveUnivarStats.tex
new file mode 100644
index 0000000..5838e3e
--- /dev/null
+++ b/alg-ref/DescriptiveUnivarStats.tex
@@ -0,0 +1,603 @@
+
+\subsection{Univariate Statistics}
+
+\noindent{\bf Description}
+\smallskip
+
+\emph{Univariate statistics} are the simplest form of descriptive statistics in data
+analysis. They are used to quantitatively describe the main characteristics of each
+feature in the data. For a given dataset matrix, script \UnivarScriptName{} computes
+certain univariate statistics for each feature column in the
+matrix. The feature type governs the exact set of statistics computed for that feature.
+For example, the statistic \emph{mean} can only be computed on a quantitative (scale)
+feature such as `Height' or `Temperature'.
It does not make sense to compute the mean +of a categorical attribute like `Hair Color'. + + +\smallskip +\noindent{\bf Usage} +\smallskip + +{\hangindent=\parindent\noindent\it%\tolerance=0 +{\tt{}-f } \UnivarScriptName{} +{\tt{} -nvargs} +{\tt{} X=}path/file +{\tt{} TYPES=}path/file +{\tt{} STATS=}path/file +% {\tt{} fmt=}format + +} + + +\medskip +\pagebreak[2] +\noindent{\bf Arguments} +\begin{Description} +\item[{\tt X}:] +Location (on HDFS) to read the data matrix $X$ whose columns we want to +analyze as the features. +\item[{\tt TYPES}:] % (default:\mbox{ }{\tt " "}) +Location (on HDFS) to read the single-row matrix whose $i^{\textrm{th}}$ +column-cell contains the type of the $i^{\textrm{th}}$ feature column +\texttt{X[,$\,i$]} in the data matrix. Feature types must be encoded by +integer numbers: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal. +% The default value means ``treat all $X$-columns as scale.'' +\item[{\tt STATS}:] +Location (on HDFS) where the output matrix of computed statistics +will be stored. The format of the output matrix is defined by +Table~\ref{table:univars}. +% \item[{\tt fmt}:] (default:\mbox{ }{\tt "text"}) +% Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv}; +% see read/write functions in SystemML Language Reference for details. 
+\end{Description}
+
+\begin{table}[t]\hfil
+\begin{tabular}{|rl|c|c|}
+\hline
+\multirow{2}{*}{Row}& \multirow{2}{*}{Name of Statistic} & \multicolumn{2}{c|}{Applies to:} \\
+ & & Scale & Categ.\\
+\hline
+\OutputRowIDMinimum & Minimum & + & \\
+\OutputRowIDMaximum & Maximum & + & \\
+\OutputRowIDRange & Range & + & \\
+\OutputRowIDMean & Mean & + & \\
+\OutputRowIDVariance & Variance & + & \\
+\OutputRowIDStDeviation & Standard deviation & + & \\
+\OutputRowIDStErrorMean & Standard error of mean & + & \\
+\OutputRowIDCoeffVar & Coefficient of variation & + & \\
+\OutputRowIDSkewness & Skewness & + & \\
+\OutputRowIDKurtosis & Kurtosis & + & \\
+\OutputRowIDStErrorSkewness & Standard error of skewness & + & \\
+\OutputRowIDStErrorCurtosis & Standard error of kurtosis & + & \\
+\OutputRowIDMedian & Median & + & \\
+\OutputRowIDIQMean & Interquartile mean & + & \\
+\OutputRowIDNumCategories & Number of categories & & + \\
+\OutputRowIDMode & Mode & & + \\
+\OutputRowIDNumModes & Number of modes & & + \\
+\hline
+\end{tabular}\hfil
+\caption{The output matrix of \UnivarScriptName{} has one row for each
+univariate statistic and one column per input feature. This table lists
+the meaning of each row. Signs ``+'' show applicability to scale and/or
+categorical features.}
+\label{table:univars}
+\end{table}
+
+
+\pagebreak[1]
+
+\smallskip
+\noindent{\bf Details}
+\smallskip
+
+Given an input matrix \texttt{X}, this script computes the set of all
+relevant univariate statistics for each feature column \texttt{X[,$\,i$]}
+in~\texttt{X}. The list of statistics to be computed depends on the
+\emph{type}, or \emph{measurement level}, of each column.
+The \texttt{TYPES} command-line argument points to a vector containing
+the types of all columns. The types must be provided as per the following
+convention: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
+
+Below we list all univariate statistics computed by script \UnivarScriptName.
+The statistics are collected by relevance into several groups, namely: central
+tendency, dispersion, shape, and categorical measures. The first three groups
+contain statistics computed for a quantitative (also known as: numerical, scale,
+or continuous) feature; the last group contains the statistics for a categorical
+(either nominal or ordinal) feature.
+
+Let~$n$ be the number of data records (rows) with feature values.
+In what follows we fix a column index \texttt{idx} and consider
+sample statistics of feature column \texttt{X[}$\,$\texttt{,}$\,$\texttt{idx]}.
+Let $v = (v_1, v_2, \ldots, v_n)$ be the values of \texttt{X[}$\,$\texttt{,}$\,$\texttt{idx]}
+in their original unsorted order: $v_i = \texttt{X[}i\texttt{,}\,\texttt{idx]}$.
+Let $v^s = (v^s_1, v^s_2, \ldots, v^s_n)$ be the same values in the sorted order,
+preserving duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$.
+
+\paragraph{Central tendency measures.}
+Sample statistics that describe the location of the quantitative (scale) feature
+distribution, representing it with a single value.
+\begin{Description}
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Mean]
+\OutputRowText{\OutputRowIDMean}
+The arithmetic average over a sample of a quantitative feature.
+Computed as the ratio between the sum of values and the number of values:
+$\left(\sum_{i=1}^n v_i\right)\!/n$.
+Example: the mean of sample $\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
+equals~5.2.
+
+Note that the mean is significantly affected by extreme values in the sample
+and may be misleading as a central tendency measure if the feature varies on
+an exponential scale. For example, the mean of $\{$0.01, 0.1, 1.0, 10.0, 100.0$\}$
+is 22.222, greater than all the sample values except the~largest.
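Both numbers above can be reproduced in plain Python (a quick sketch, not the \UnivarScriptName{} implementation):

```python
sample = [2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8]
mean = sum(sample) / len(sample)          # arithmetic average: 5.2

# On an exponential-scale sample the mean exceeds every value but the largest:
skewed = [0.01, 0.1, 1.0, 10.0, 100.0]
skewed_mean = sum(skewed) / len(skewed)   # 22.222
```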
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\begin{figure}[t] +\setlength{\unitlength}{10pt} +\begin{picture}(33,12) +\put( 6.2, 0.0){\small 2.2} +\put(10.2, 0.0){\small 3.2} +\put(12.2, 0.0){\small 3.7} +\put(15.0, 0.0){\small 4.4} +\put(18.6, 0.0){\small 5.3} +\put(20.2, 0.0){\small 5.7} +\put(21.75,0.0){\small 6.1} +\put(23.05,0.0){\small 6.4} +\put(26.2, 0.0){\small 7.2} +\put(28.6, 0.0){\small 7.8} +\put( 0.5, 0.7){\small 0.0} +\put( 0.1, 3.2){\small 0.25} +\put( 0.5, 5.7){\small 0.5} +\put( 0.1, 8.2){\small 0.75} +\put( 0.5,10.7){\small 1.0} +\linethickness{1.5pt} +\put( 2.0, 1.0){\line(1,0){4.8}} +\put( 6.8, 1.0){\line(0,1){1.0}} +\put( 6.8, 2.0){\line(1,0){4.0}} +\put(10.8, 2.0){\line(0,1){1.0}} +\put(10.8, 3.0){\line(1,0){2.0}} +\put(12.8, 3.0){\line(0,1){1.0}} +\put(12.8, 4.0){\line(1,0){2.8}} +\put(15.6, 4.0){\line(0,1){1.0}} +\put(15.6, 5.0){\line(1,0){3.6}} +\put(19.2, 5.0){\line(0,1){1.0}} +\put(19.2, 6.0){\line(1,0){1.6}} +\put(20.8, 6.0){\line(0,1){1.0}} +\put(20.8, 7.0){\line(1,0){1.6}} +\put(22.4, 7.0){\line(0,1){1.0}} +\put(22.4, 8.0){\line(1,0){1.2}} +\put(23.6, 8.0){\line(0,1){1.0}} +\put(23.6, 9.0){\line(1,0){3.2}} +\put(26.8, 9.0){\line(0,1){1.0}} +\put(26.8,10.0){\line(1,0){2.4}} +\put(29.2,10.0){\line(0,1){1.0}} +\put(29.2,11.0){\line(1,0){4.8}} +\linethickness{0.3pt} +\put( 6.8, 1.0){\circle*{0.3}} +\put(10.8, 1.0){\circle*{0.3}} +\put(12.8, 1.0){\circle*{0.3}} +\put(15.6, 1.0){\circle*{0.3}} +\put(19.2, 1.0){\circle*{0.3}} +\put(20.8, 1.0){\circle*{0.3}} +\put(22.4, 1.0){\circle*{0.3}} +\put(23.6, 1.0){\circle*{0.3}} +\put(26.8, 1.0){\circle*{0.3}} +\put(29.2, 1.0){\circle*{0.3}} +\put( 6.8, 1.0){\vector(1,0){27.2}} +\put( 2.0, 1.0){\vector(0,1){10.8}} +\put( 2.0, 3.5){\line(1,0){10.8}} +\put( 2.0, 6.0){\line(1,0){17.2}} +\put( 2.0, 8.5){\line(1,0){21.6}} +\put( 2.0,11.0){\line(1,0){27.2}} +\put(12.8, 1.0){\line(0,1){2.0}} +\put(19.2, 1.0){\line(0,1){5.0}} +\put(20.0, 1.0){\line(0,1){5.0}} +\put(23.6, 
1.0){\line(0,1){7.0}}
+\put( 9.0, 4.0){\line(1,0){3.8}}
+\put( 9.2, 2.7){\vector(0,1){0.8}}
+\put( 9.2, 4.8){\vector(0,-1){0.8}}
+\put(19.4, 8.0){\line(1,0){3.0}}
+\put(19.6, 7.2){\vector(0,1){0.8}}
+\put(19.6, 9.3){\vector(0,-1){0.8}}
+\put(13.0, 2.2){\small $q_{25\%}$}
+\put(17.3, 2.2){\small $q_{50\%}$}
+\put(23.8, 2.2){\small $q_{75\%}$}
+\put(20.15,3.5){\small $\mu$}
+\put( 8.0, 3.75){\small $\phi_1$}
+\put(18.35,7.8){\small $\phi_2$}
+\end{picture}
+\caption{The computation of quartiles, median, and interquartile mean from the
+empirical distribution function of the 10-point
+sample $\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$. Each vertical step in
+the graph has height~$1{/}n = 0.1$. Values $q_{25\%}$, $q_{50\%}$, and $q_{75\%}$ denote
+the $1^{\textrm{st}}$, $2^{\textrm{nd}}$, and $3^{\textrm{rd}}$ quartiles respectively;
+value~$\mu$ denotes the median. Values $\phi_1$ and $\phi_2$ show the partial contribution
+of border points (quartiles) $v_3=3.7$ and $v_8=6.4$ to the interquartile mean.}
+\label{fig:example_quartiles}
+\end{figure}
+
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Median]
+\OutputRowText{\OutputRowIDMedian}
+The ``middle'' value that separates the higher half of the sample values
+(in a sorted order) from the lower half.
+To compute the median, we sort the sample in the increasing order, preserving
+duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$.
+If $n$ is odd, the median equals $v^s_i$ where $i = (n\,{+}\,1)\,{/}\,2$,
+the same as the $50^{\textrm{th}}$~percentile of the sample.
+If $n$ is even, there are two ``middle'' values $v^s_{n/2}$ and $v^s_{n/2\,+\,1}$,
+so we compute the median as the mean of these two values.
+(For even~$n$ we compute the $50^{\textrm{th}}$~percentile as~$v^s_{n/2}$,
+not as the median.) Example: the median of sample
+$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
+equals $(5.3\,{+}\,5.7)\,{/}\,2$~${=}$~5.5, see Figure~\ref{fig:example_quartiles}.
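The odd/even rule above, as a minimal Python sketch (not the script's implementation):

```python
def median(values):
    """Middle value for odd n; mean of the two middle values for even n."""
    vs = sorted(values)                     # v^s, duplicates preserved
    n = len(vs)
    if n % 2 == 1:
        return vs[(n + 1) // 2 - 1]         # v^s_{(n+1)/2}, 1-based index
    return (vs[n // 2 - 1] + vs[n // 2]) / 2.0

sample = [2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8]
# median(sample) gives (5.3 + 5.7) / 2 = 5.5, matching the example above
```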
+
+Unlike the mean, the median is not sensitive to extreme values in the sample,
+i.e.\ it is robust to outliers. It works better as a measure of central tendency
+for heavy-tailed distributions and features that vary on an exponential scale.
+However, the median is sensitive to small sample size.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Interquartile mean]
+\OutputRowText{\OutputRowIDIQMean}
+For a sample of a quantitative feature, this is
+the mean of the values greater than or equal to the $1^{\textrm{st}}$ quartile
+and less than or equal to the $3^{\textrm{rd}}$ quartile.
+In other words, it is a ``truncated mean'' where the lowest 25$\%$ and
+the highest 25$\%$ of the sorted values are omitted in its computation.
+The two ``border values'', i.e.\ the $1^{\textrm{st}}$ and the $3^{\textrm{rd}}$
+quartiles themselves, contribute to this mean only partially.
+This measure is occasionally used as the ``robust'' version of the mean
+that is less sensitive to the extreme values.
+
+To compute the measure, we sort the sample in the increasing order,
+preserving duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$.
+We set $j = \lceil n{/}4 \rceil$ for the $1^{\textrm{st}}$ quartile index
+and $k = \lceil 3n{/}4 \rceil$ for the $3^{\textrm{rd}}$ quartile index,
+then compute the following weighted mean:
+\begin{equation*}
+\frac{1}{3{/}4 - 1{/}4} \left[
+\left(\frac{j}{n} - \frac{1}{4}\right) v^s_j \,\,+
+\sum_{j<i<k} \left(\frac{i}{n} - \frac{i\,{-}\,1}{n}\right) v^s_i
+\,\,+\,\, \left(\frac{3}{4} - \frac{k\,{-}\,1}{n}\right) v^s_k\right]
+\end{equation*}
+In other words, all sample values between the $1^{\textrm{st}}$ and the $3^{\textrm{rd}}$
+quartile enter the sum with weights $2{/}n$, times their number of duplicates, while the
+two quartiles themselves enter the sum with reduced weights.
The weights are proportional
+to the vertical steps in the empirical distribution function of the sample, see
+Figure~\ref{fig:example_quartiles} for an illustration.
+Example: the interquartile mean of sample
+$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ equals the sum
+$0.1 (3.7\,{+}\,6.4) + 0.2 (4.4\,{+}\,5.3\,{+}\,5.7\,{+}\,6.1)$,
+which equals~5.31.
+\end{Description}
+
+
+\paragraph{Dispersion measures.}
+Statistics that describe the amount of variation or spread in a quantitative
+(scale) data feature.
+\begin{Description}
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Variance]
+\OutputRowText{\OutputRowIDVariance}
+A measure of dispersion, or spread, of sample values around their mean,
+expressed in units that are the square of those of the feature itself.
+Computed as the sum of squared differences between the values
+in the sample and their mean, divided by one less than the number of
+values: $\sum_{i=1}^n (v_i - \bar{v})^2\,/\,(n\,{-}\,1)$ where
+$\bar{v}=\left(\sum_{i=1}^n v_i\right)\!/n$.
+Example: the variance of sample
+$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ equals~3.24.
+Note that at least two values ($n\geq 2$) are required to avoid division
+by zero. Sample variance is sensitive to outliers, even more than the mean.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Standard deviation]
+\OutputRowText{\OutputRowIDStDeviation}
+A measure of dispersion around the mean, the square root of the variance.
+Computed by taking the square root of the sample variance;
+see \emph{Variance} above on computing the variance.
+Example: the standard deviation of sample
+$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ equals~1.8.
+At least two values are required to avoid division by zero.
+Note that standard deviation is sensitive to outliers.
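Both dispersion formulas can be checked against the running example with a short Python sketch (not the script's implementation):

```python
sample = [2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8]
n = len(sample)
mean = sum(sample) / n

# Sum of squared deviations divided by n - 1 (the sample variance):
variance = sum((v - mean) ** 2 for v in sample) / (n - 1)   # 3.24
std_dev = variance ** 0.5                                   # 1.8
```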
+ +Standard deviation is used in conjunction with the mean to determine +an interval containing a given percentage of the feature values, +assuming the normal distribution. In a large sample from a normal +distribution, around 68\% of the cases fall within one standard +deviation and around 95\% of cases fall within two standard deviations +of the mean. For example, if the mean age is 45 with a standard deviation +of 10, around 95\% of the cases would be between 25 and 65 in a normal +distribution. +%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% +\item[\it Coefficient of variation] +\OutputRowText{\OutputRowIDCoeffVar} +The ratio of the standard deviation to the mean, i.e.\ the +\emph{relative} standard deviation, of a quantitative feature sample. +Computed by dividing the sample \emph{standard deviation} by the +sample \emph{mean}, see above for their computation details. +Example: the coefficient of variation for sample +$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ +equals 1.8$\,{/}\,$5.2~${\approx}$~0.346. + +This metric is used primarily with non-negative features such as +financial or population data. It is sensitive to outliers. +Note: zero mean causes division by zero, returning infinity or \texttt{NaN}. +At least two values (records) are required to compute the standard deviation. +%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% +\item[\it Minimum] +\OutputRowText{\OutputRowIDMinimum} +The smallest value of a quantitative sample, computed as $\min v = v^s_1$. +Example: the minimum of sample +$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ +equals~2.2. +%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% +\item[\it Maximum] +\OutputRowText{\OutputRowIDMaximum} +The largest value of a quantitative sample, computed as $\max v = v^s_n$. +Example: the maximum of sample +$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ +equals~7.8. 
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% +\item[\it Range] +\OutputRowText{\OutputRowIDRange} +The difference between the largest and the smallest value of a quantitative +sample, computed as $\max v - \min v = v^s_n - v^s_1$. +It provides information about the overall spread of the sample values. +Example: the range of sample +$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ +equals 7.8$\,{-}\,$2.2~${=}$~5.6. +%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% +\item[\it Standard error of the mean] +\OutputRowText{\OutputRowIDStErrorMean} +A measure of how much the value of the sample mean may vary from sample +to sample taken from the same (hypothesized) distribution of the feature. +It helps to roughly bound the distribution mean, i.e.\ +the limit of the sample mean as the sample size tends to infinity. +Under certain assumptions (e.g.\ normality and large sample), the difference +between the distribution mean and the sample mean is unlikely to exceed +2~standard errors. + +The measure is computed by dividing the sample standard deviation +by the square root of the number of values~$n$; see \emph{standard deviation} +for its computation details. Ensure $n\,{\geq}\,2$ to avoid division by~0. +Example: for sample +$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ +with the mean of~5.2 the standard error of the mean +equals 1.8$\,{/}\sqrt{10}$~${\approx}$~0.569. + +Note that the standard error itself is subject to sample randomness. +Its accuracy as an error estimator may be low if the sample size is small +or \mbox{non-i.i.d.}, if there are outliers, or if the distribution has +heavy tails. +%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% +% \item[\it Quartiles] +% \OutputRowText{\OutputRowIDQuartiles} +% %%% dsDefn %%%% +% The values of a quantitative feature +% that divide an ordered/sorted set of data records into four equal-size groups. 
+% The $1^{\textrm{st}}$ quartile, or the $25^{\textrm{th}}$ percentile, splits +% the sorted data into the lowest $25\%$ and the highest~$75\%$. In other words, +% it is the middle value between the minimum and the median. The $2^{\textrm{nd}}$ +% quartile is the median itself, the value that separates the higher half of +% the data (in the sorted order) from the lower half. Finally, the $3^{\textrm{rd}}$ +% quartile, or the $75^{\textrm{th}}$ percentile, divides the sorted data into +% lowest $75\%$ and highest~$25\%$.\par +% %%% dsComp %%%% +% To compute the quartiles for a data column \texttt{X[,i]} with $n$ numerical values +% we sort it in the increasing order, preserving duplicates, then return +% \texttt{X}${}^{\textrm{sort}}$\texttt{[}$k$\texttt{,i]} +% where $k = \lceil pn \rceil$ for $p = 0.25$, $0.5$, and~$0.75$. +% When $n$ is even, the $2^{\textrm{nd}}$ quartile (the median) is further adjusted +% to equal the mean of two middle values +% $\texttt{X}^{\textrm{sort}}\texttt{[}n{/}2\texttt{,i]}$ and +% $\texttt{X}^{\textrm{sort}}\texttt{[}n{/}2\,{+}\,1\texttt{,i]}$. +% %%% dsWarn %%%% +% We assume that the feature column does not contain \texttt{NaN}s or coded non-numeric values. +% %%% dsExmpl %%% +% \textbf{Example(s).} +\end{Description} + + +\paragraph{Shape measures.} +Statistics that describe the shape and symmetry of the quantitative (scale) +feature distribution estimated from a sample of its values. +\begin{Description} +%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%% +\item[\it Skewness] +\OutputRowText{\OutputRowIDSkewness} +It measures how symmetrically the values of a feature are spread out +around the mean. A significant positive skewness implies a longer (or fatter) +right tail, i.e. feature values tend to lie farther away from the mean on the +right side. A significant negative skewness implies a longer (or fatter) left +tail. 
The normal distribution is symmetric and has a skewness value of~0;
+however, its sample skewness is likely to be nonzero, though close to zero.
+As a guideline, a skewness value more than twice its standard error is taken
+to indicate a departure from symmetry.
+
+Skewness is computed as the $3^{\textrm{rd}}$~central moment divided by the cube
+of the standard deviation. We estimate the $3^{\textrm{rd}}$~central moment as
+the sum of cubed differences between the values in the feature column and their
+sample mean, divided by the number of values:
+$\sum_{i=1}^n (v_i - \bar{v})^3 / n$
+where $\bar{v}=\left(\sum_{i=1}^n v_i\right)\!/n$.
+The standard deviation is computed
+as described above in \emph{standard deviation}. To avoid division by~0,
+at least two different sample values are required. Example: for sample
+$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
+with the mean of~5.2 and the standard deviation of~1.8,
+skewness is estimated as $-1.0728\,{/}\,1.8^3 \approx -0.184$.
+Note: skewness is sensitive to outliers.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Standard error in skewness]
+\OutputRowText{\OutputRowIDStErrorSkewness}
+A measure of how much the sample skewness may vary from sample to sample,
+assuming that the feature is normally distributed, which makes its
+distribution skewness equal~0.
+Given the number~$n$ of sample values, the standard error is computed as
+\begin{equation*}
+\sqrt{\frac{6n\,(n-1)}{(n-2)(n+1)(n+3)}}
+\end{equation*}
+This measure can tell us, for example:
+\begin{Itemize}
+\item If the sample skewness lands within two standard errors from~0, its
+positive or negative sign is non-significant and may just be accidental.
+\item If the sample skewness lands outside this interval, the feature
+is unlikely to be normally distributed.
+\end{Itemize}
+At least 3~values ($n\geq 3$) are required to avoid arithmetic failure.
+Note that the standard error is inaccurate if the feature distribution is
+far from normal or if the number of samples is small.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Kurtosis]
+\OutputRowText{\OutputRowIDKurtosis}
+As a distribution parameter, kurtosis is a measure of the extent to which
+feature values cluster around a central point. In other words, it quantifies
+``peakedness'' of the distribution: how tall and sharp the central peak is
+relative to a standard bell curve.
+
+Positive kurtosis (\emph{leptokurtic} distribution) indicates that, relative
+to a normal distribution:
+\begin{Itemize}
+\item observations cluster more about the center (peak-shaped),
+\item the tails are thinner at non-extreme values,
+\item the tails are thicker at extreme values.
+\end{Itemize}
+Negative kurtosis (\emph{platykurtic} distribution) indicates that, relative
+to a normal distribution:
+\begin{Itemize}
+\item observations cluster less about the center (box-shaped),
+\item the tails are thicker at non-extreme values,
+\item the tails are thinner at extreme values.
+\end{Itemize}
+Kurtosis of a normal distribution is zero; however, the sample kurtosis
+(computed here) is likely to deviate from zero.
+
+Sample kurtosis is computed as the $4^{\textrm{th}}$~central moment divided
+by the $4^{\textrm{th}}$~power of the standard deviation, minus~3.
+We estimate the $4^{\textrm{th}}$~central moment as the sum of the
+$4^{\textrm{th}}$~powers of differences between the values in the feature column
+and their sample mean, divided by the number of values:
+$\sum_{i=1}^n (v_i - \bar{v})^4 / n$
+where $\bar{v}=\left(\sum_{i=1}^n v_i\right)\!/n$.
+The standard deviation is computed as described above; see \emph{standard deviation}.
+
+Note that kurtosis is sensitive to outliers and requires at least two different
+sample values.
Example: for sample
+$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
+with the mean of~5.2 and the standard deviation of~1.8,
+sample kurtosis equals $16.6962\,{/}\,1.8^4 - 3 \approx -1.41$.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Standard error in kurtosis]
+\OutputRowText{\OutputRowIDStErrorCurtosis}
+A measure of how much the sample kurtosis may vary from sample to sample,
+assuming that the feature is normally distributed, which makes its
+distribution kurtosis equal~0.
+Given the number~$n$ of sample values, the standard error is computed as
+\begin{equation*}
+\sqrt{\frac{24n\,(n-1)^2}{(n-3)(n-2)(n+3)(n+5)}}
+\end{equation*}
+This measure can tell us, for example:
+\begin{Itemize}
+\item If the sample kurtosis lands within two standard errors from~0, its
+positive or negative sign is non-significant and may just be accidental.
+\item If the sample kurtosis lands outside this interval, the feature
+is unlikely to be normally distributed.
+\end{Itemize}
+At least 4~values ($n\geq 4$) are required to avoid arithmetic failure.
+Note that the standard error is inaccurate if the feature distribution is
+far from normal or if the number of samples is small.
+\end{Description}
+
+
+\paragraph{Categorical measures.} Statistics that describe the sample of
+a categorical feature, either nominal or ordinal. We represent all
+categories by integers from~1 to the number of categories; we call
+these integers \emph{category~IDs}.
+\begin{Description}
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Number of categories]
+\OutputRowText{\OutputRowIDNumCategories}
+The maximum category~ID that occurs in the sample. Note that some
+categories with~IDs \emph{smaller} than this maximum~ID may have
+no~occurrences in the sample, without reducing the number of categories.
+However, any categories with~IDs \emph{larger} than the maximum~ID with
+no occurrences in the sample will not be counted.
+Example: in sample $\{$1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8$\}$
+the number of categories is reported as~8. Category~IDs 2 and~6, which have
+zero occurrences, are still counted; but if there is a category with
+ID${}=9$ and zero occurrences, it is not counted.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Mode]
+\OutputRowText{\OutputRowIDMode}
+The most frequently occurring category value.
+If several values share the greatest frequency of occurrence, then each
+of them is a mode; but here we report only the smallest of these modes.
+Example: in sample $\{$1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8$\}$
+the modes are 3 and~7, with 3 reported.
+
+Computed by counting the number of occurrences for each category,
+then taking the smallest category~ID that has the maximum count.
+Note that the sample modes may be different from the distribution modes,
+i.e.\ the categories whose (hypothesized) underlying probability is the
+maximum over all categories.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Number of modes]
+\OutputRowText{\OutputRowIDNumModes}
+The number of category values that each have the largest frequency
+count in the sample.
+Example: in sample $\{$1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8$\}$
+there are two category~IDs (3 and~7), each occurring the maximum count of
+4~times; hence, we return~2.
+
+Computed by counting the number of occurrences for each category,
+then counting how many categories have the maximum count.
+Note that the sample modes may be different from the distribution modes,
+i.e.\ the categories whose (hypothesized) underlying probability is the
+maximum over all categories.
+\end{Description}
+
+
+\smallskip
+\noindent{\bf Returns}
+\smallskip
+
+The output matrix containing all computed statistics is of size $17$~rows and
+as many columns as in the input matrix~\texttt{X}.
Each row corresponds to
+a particular statistic, according to the convention specified in
+Table~\ref{table:univars}. The first $14$~statistics are applicable to
+\emph{scale} columns, and the last $3$~statistics are applicable to categorical,
+i.e.\ nominal and ordinal, columns.
+
+
+\pagebreak[2]
+
+\smallskip
+\noindent{\bf Examples}
+\smallskip
+
+{\hangindent=\parindent\noindent\tt
+\hml -f \UnivarScriptName{} -nvargs X=/user/biadmin/X.mtx
+ TYPES=/user/biadmin/types.mtx
+ STATS=/user/biadmin/stats.mtx
+
+}
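+The worked examples in this section can be cross-checked numerically. The
+following standalone Python sketch is an illustration only (the actual script
+is implemented in DML, and the helper names \texttt{shape\_stats} and
+\texttt{categorical\_stats} are hypothetical); it recomputes the shape and
+categorical measures for the example samples using the conventions stated
+above: central moments divided by~$n$, and the standard deviation computed
+with $n\,{-}\,1$.

```python
import math
from collections import Counter

def shape_stats(v):
    """Shape measures as defined in this section: central moments use 1/n,
    while the standard deviation uses 1/(n-1)."""
    n = len(v)
    mean = sum(v) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in v) / (n - 1))
    m3 = sum((x - mean) ** 3 for x in v) / n   # 3rd central moment
    m4 = sum((x - mean) ** 4 for x in v) / n   # 4th central moment
    skewness = m3 / sd ** 3
    kurtosis = m4 / sd ** 4 - 3
    # Standard errors under the normality assumption, as in the text
    se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = math.sqrt(24 * n * (n - 1) ** 2
                        / ((n - 3) * (n - 2) * (n + 3) * (n + 5)))
    return skewness, se_skew, kurtosis, se_kurt

def categorical_stats(ids):
    """Categorical measures: number of categories (largest observed ID),
    mode (smallest most-frequent ID), and number of modes."""
    counts = Counter(ids)
    num_categories = max(ids)
    max_count = max(counts.values())
    modes = [k for k, c in counts.items() if c == max_count]
    return num_categories, min(modes), len(modes)

sample = [2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8]
skew, se_s, kurt, se_k = shape_stats(sample)
print(round(skew, 3), round(kurt, 2))    # -0.184 -1.41, as in the text

cats = [1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8]
print(categorical_stats(cats))           # (8, 3, 2)
```

+Running the sketch reproduces the values derived by hand above: skewness
+$\approx -0.184$ and kurtosis $\approx -1.41$ for the scale sample, and
+8~categories, mode~3, and 2~modes for the categorical sample.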