http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/LinReg.tex ---------------------------------------------------------------------- diff --git a/Algorithms Reference/LinReg.tex b/Algorithms Reference/LinReg.tex deleted file mode 100644 index 67273c2..0000000 --- a/Algorithms Reference/LinReg.tex +++ /dev/null @@ -1,328 +0,0 @@ -\begin{comment} - - Licensed to the Apache Software Foundation (ASF) under one - or more contributor license agreements. See the NOTICE file - distributed with this work for additional information - regarding copyright ownership. The ASF licenses this file - to you under the Apache License, Version 2.0 (the - "License"); you may not use this file except in compliance - with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, - software distributed under the License is distributed on an - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - KIND, either express or implied. See the License for the - specific language governing permissions and limitations - under the License. - -\end{comment} - -\subsection{Linear Regression} -\label{sec:LinReg} - -\noindent{\bf Description} -\smallskip - -Linear Regression scripts are used to model the relationship between one numerical -response variable and one or more explanatory (feature) variables. -The scripts are given a dataset $(X, Y) = (x_i, y_i)_{i=1}^n$ where $x_i$ is a -numerical vector of feature variables and $y_i$ is a numerical response value for -each training data record. The feature vectors are provided as a matrix $X$ of size -$n\,{\times}\,m$, where $n$ is the number of records and $m$ is the number of features. -The observed response values are provided as a 1-column matrix~$Y$, with a numerical -value $y_i$ for each~$x_i$ in the corresponding row of matrix~$X$. - -In linear regression, we predict the distribution of the response~$y_i$ based on -a fixed linear combination of the features in~$x_i$. We assume that -there exist constant regression coefficients $\beta_0, \beta_1, \ldots, \beta_m$ -and a constant residual variance~$\sigma^2$ such that -\begin{equation} -y_i \sim \Normal(\mu_i, \sigma^2) \,\,\,\,\textrm{where}\,\,\,\, -\mu_i \,=\, \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_m x_{i,m} -\label{eqn:linregdef} -\end{equation} -Distribution $y_i \sim \Normal(\mu_i, \sigma^2)$ models the ``unexplained'' residual -noise and is assumed independent across different records. - -The goal is to estimate the regression coefficients and the residual variance. -Once they are accurately estimated, we can make predictions about $y_i$ given~$x_i$ -in new records. We can also use the $\beta_j$'s to analyze the influence of individual -features on the response value, and assess the quality of this model by comparing -residual variance in the response, left after prediction, with its total variance. - -There are two scripts in our library, both doing the same estimation, but using different -computational methods. Depending on the size and the sparsity of the feature matrix~$X$, -one or the other script may be more efficient. The ``direct solve'' script -{\tt LinearRegDS} is more efficient when the number of features $m$ is relatively small -($m \sim 1000$ or less) and matrix~$X$ is either tall or fairly dense -(has~${\gg}\:m^2$ nonzeros); otherwise, the ``conjugate gradient'' script {\tt LinearRegCG} -is more efficient. If $m > 50000$, use only {\tt LinearRegCG}. 
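-The following minimal sketch (written in Python with NumPy purely for illustration; neither is used by the
-DML scripts themselves, and all variable names are hypothetical) simulates data from model~(\ref{eqn:linregdef})
-and recovers the coefficients with an ordinary least-squares solve, which is conceptually what the ``direct solve''
-script does at scale (the script additionally supports regularization and intercept rescaling):
-\begin{verbatim}
# Sketch only (Python/NumPy assumed): simulate y_i ~ Normal(mu_i, sigma^2)
# with mu_i = beta_0 + x_i . beta, then recover beta by least squares.
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma = 1000, 5, 0.5
X = rng.standard_normal((n, m))
beta_true = rng.standard_normal(m)
beta0_true = 1.5
y = beta0_true + X @ beta_true + sigma * rng.standard_normal(n)

X1 = np.hstack([X, np.ones((n, 1))])        # append intercept column
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta_hat[:-1], beta_hat[-1])          # estimates of beta_{1:m} and beta_0
\end{verbatim}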
- -\smallskip -\noindent{\bf Usage} -\smallskip - -{\hangindent=\parindent\noindent\it% -{\tt{}-f }path/\/{\tt{}LinearRegDS.dml} -{\tt{} -nvargs} -{\tt{} X=}path/file -{\tt{} Y=}path/file -{\tt{} B=}path/file -{\tt{} O=}path/file -{\tt{} icpt=}int -{\tt{} reg=}double -{\tt{} fmt=}format - -}\smallskip -{\hangindent=\parindent\noindent\it% -{\tt{}-f }path/\/{\tt{}LinearRegCG.dml} -{\tt{} -nvargs} -{\tt{} X=}path/file -{\tt{} Y=}path/file -{\tt{} B=}path/file -{\tt{} O=}path/file -{\tt{} Log=}path/file -{\tt{} icpt=}int -{\tt{} reg=}double -{\tt{} tol=}double -{\tt{} maxi=}int -{\tt{} fmt=}format - -} - -\smallskip -\noindent{\bf Arguments} -\begin{Description} -\item[{\tt X}:] -Location (on HDFS) to read the matrix of feature vectors, each row constitutes -one feature vector -\item[{\tt Y}:] -Location to read the 1-column matrix of response values -\item[{\tt B}:] -Location to store the estimated regression parameters (the $\beta_j$'s), with the -intercept parameter~$\beta_0$ at position {\tt B[}$m\,{+}\,1$, {\tt 1]} if available -\item[{\tt O}:] (default:\mbox{ }{\tt " "}) -Location to store the CSV-file of summary statistics defined in -Table~\ref{table:linreg:stats}, the default is to print it to the standard output -\item[{\tt Log}:] (default:\mbox{ }{\tt " "}, {\tt LinearRegCG} only) -Location to store iteration-specific variables for monitoring and debugging purposes, -see Table~\ref{table:linreg:log} for details. -\item[{\tt icpt}:] (default:\mbox{ }{\tt 0}) -Intercept presence and shifting/rescaling the features in~$X$:\\ -{\tt 0} = no intercept (hence no~$\beta_0$), no shifting or rescaling of the features;\\ -{\tt 1} = add intercept, but do not shift/rescale the features in~$X$;\\ -{\tt 2} = add intercept, shift/rescale the features in~$X$ to mean~0, variance~1 -\item[{\tt reg}:] (default:\mbox{ }{\tt 0.000001}) -L2-regularization parameter~\mbox{$\lambda\geq 0$}; set to nonzero for highly dependent, -sparse, or numerous ($m \gtrsim n/10$) features -\item[{\tt tol}:] (default:\mbox{ }{\tt 0.000001}, {\tt LinearRegCG} only) -Tolerance \mbox{$\eps\geq 0$} used in the convergence criterion: we terminate conjugate -gradient iterations when the $\beta$-residual reduces in L2-norm by this factor -\item[{\tt maxi}:] (default:\mbox{ }{\tt 0}, {\tt LinearRegCG} only) -Maximum number of conjugate gradient iterations, or~0 if no maximum -limit provided -\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"}) -Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv}; -see read/write functions in SystemML Language Reference for details. -\end{Description} - - -\begin{table}[t]\small\centerline{% -\begin{tabular}{|ll|} -\hline -Name & Meaning \\ -\hline -{\tt AVG\_TOT\_Y} & Average of the response value $Y$ \\ -{\tt STDEV\_TOT\_Y} & Standard Deviation of the response value $Y$ \\ -{\tt AVG\_RES\_Y} & Average of the residual $Y - \mathop{\mathrm{pred}}(Y|X)$, i.e.\ residual bias \\ -{\tt STDEV\_RES\_Y} & Standard Deviation of the residual $Y - \mathop{\mathrm{pred}}(Y|X)$ \\ -{\tt DISPERSION} & GLM-style dispersion, i.e.\ residual sum of squares / \#deg.\ fr. 
\\ -{\tt PLAIN\_R2} & Plain $R^2$ of residual with bias included vs.\ total average \\ -{\tt ADJUSTED\_R2} & Adjusted $R^2$ of residual with bias included vs.\ total average \\ -{\tt PLAIN\_R2\_NOBIAS} & Plain $R^2$ of residual with bias subtracted vs.\ total average \\ -{\tt ADJUSTED\_R2\_NOBIAS} & Adjusted $R^2$ of residual with bias subtracted vs.\ total average \\ -{\tt PLAIN\_R2\_VS\_0} & ${}^*$Plain $R^2$ of residual with bias included vs.\ zero constant \\ -{\tt ADJUSTED\_R2\_VS\_0} & ${}^*$Adjusted $R^2$ of residual with bias included vs.\ zero constant \\ -\hline -\multicolumn{2}{r}{${}^{*\mathstrut}$ The last two statistics are only printed if there is no intercept ({\tt icpt=0})} \\ -\end{tabular}} -\caption{Besides~$\beta$, linear regression scripts compute a few summary statistics -listed above. The statistics are provided in CSV format, one comma-separated name-value -pair per each line.} -\label{table:linreg:stats} -\end{table} - -\begin{table}[t]\small\centerline{% -\begin{tabular}{|ll|} -\hline -Name & Meaning \\ -\hline -{\tt CG\_RESIDUAL\_NORM} & L2-norm of conjug.\ grad.\ residual, which is $A \pxp \beta - t(X) \pxp y$ \\ - & where $A = t(X) \pxp X + \diag (\lambda)$, or a similar quantity \\ -{\tt CG\_RESIDUAL\_RATIO} & Ratio of current L2-norm of conjug.\ grad.\ residual over the initial \\ -\hline -\end{tabular}} -\caption{ -The {\tt Log} file for {\tt{}LinearRegCG} script contains the above \mbox{per-}iteration -variables in CSV format, each line containing triple (Name, Iteration\#, Value) with -Iteration\# being~0 for initial values.} -\label{table:linreg:log} -\end{table} - - -\noindent{\bf Details} -\smallskip - -To solve a linear regression problem over feature matrix~$X$ and response vector~$Y$, -we can find coefficients $\beta_0, \beta_1, \ldots, \beta_m$ and $\sigma^2$ that maximize -the joint likelihood of all $y_i$ for $i=1\ldots n$, defined by the assumed statistical -model~(\ref{eqn:linregdef}). Since the joint likelihood of the independent -$y_i \sim \Normal(\mu_i, \sigma^2)$ is proportional to the product of -$\exp\big({-}\,(y_i - \mu_i)^2 / (2\sigma^2)\big)$, we can take the logarithm of this -product, then multiply by $-2\sigma^2 < 0$ to obtain a least squares problem: -\begin{equation} -\sum_{i=1}^n \, (y_i - \mu_i)^2 \,\,=\,\, -\sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^m \beta_j x_{i,j}\Big)^2 -\,\,\to\,\,\min -\label{eqn:linregls} -\end{equation} -This may not be enough, however. The minimum may sometimes be attained over infinitely many -$\beta$-vectors, for example if $X$ has an all-0 column, or has linearly dependent columns, -or has fewer rows than columns~\mbox{($n < m$)}. Even if~(\ref{eqn:linregls}) has a unique -solution, other $\beta$-vectors may be just a little suboptimal\footnote{Smaller likelihood -difference between two models suggests less statistical evidence to pick one model over the -other.}, yet give significantly different predictions for new feature vectors. This results -in \emph{overfitting}: prediction error for the training data ($X$ and~$Y$) is much smaller -than for the test data (new records). 
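-As a small numerical illustration of this degeneracy (a sketch assuming Python with NumPy, used here only
-for exposition and not part of the scripts), two linearly dependent columns admit infinitely many least-squares
-solutions with identical residuals:
-\begin{verbatim}
# Illustration of degeneracy (Python/NumPy assumed): with two identical
# columns the least-squares minimum is attained by infinitely many betas;
# only the sum of the two coefficients is determined by the data.
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.standard_normal(n)
X = np.column_stack([x, x])            # linearly dependent columns
y = 3.0 * x + 0.1 * rng.standard_normal(n)

beta_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)   # one minimizer
beta_other = beta_min_norm + np.array([1.0, -1.0])      # another minimizer
print(np.linalg.norm(y - X @ beta_min_norm),
      np.linalg.norm(y - X @ beta_other))               # same residual norm
\end{verbatim}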
- -Overfitting and degeneracy in the data is commonly mitigated by adding a regularization penalty -term to the least squares function: -\begin{equation} -\sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^m \beta_j x_{i,j}\Big)^2 -\,+\,\, \lambda \sum_{j=1}^m \beta_j^2 -\,\,\to\,\,\min -\label{eqn:linreglsreg} -\end{equation} -The choice of $\lambda>0$, the regularization constant, typically involves cross-validation -where the dataset is repeatedly split into a training part (to estimate the~$\beta_j$'s) and -a test part (to evaluate prediction accuracy), with the goal of maximizing the test accuracy. -In our scripts, $\lambda$~is provided as input parameter~{\tt reg}. - -The solution to least squares problem~(\ref{eqn:linreglsreg}), through taking the derivative -and setting it to~0, has the matrix linear equation form -\begin{equation} -A\left[\textstyle\beta_{1:m}\atop\textstyle\beta_0\right] \,=\, \big[X,\,1\big]^T Y,\,\,\, -\textrm{where}\,\,\, -A \,=\, \big[X,\,1\big]^T \big[X,\,1\big]\,+\,\hspace{0.5pt} \diag(\hspace{0.5pt} -\underbrace{\raisebox{0pt}[0pt][0.5pt]{$\lambda,\ldots, \lambda$}}_{\raisebox{2pt}{$\scriptstyle m$}} -\hspace{0.5pt}, 0) -\label{eqn:linregeq} -\end{equation} -where $[X,\,1]$ is $X$~with an extra column of~1s appended on the right, and the -diagonal matrix of $\lambda$'s has a zero to keep the intercept~$\beta_0$ unregularized. -If the intercept is disabled by setting {\tt icpt=0}, the equation is simply -\mbox{$X^T X \beta = X^T Y$}. - -We implemented two scripts for solving equation~(\ref{eqn:linregeq}): one is a ``direct solver'' -that computes $A$ and then solves $A\beta = [X,\,1]^T Y$ by calling an external package, -the other performs linear conjugate gradient~(CG) iterations without ever materializing~$A$. -The CG~algorithm closely follows Algorithm~5.2 in Chapter~5 of~\cite{Nocedal2006:Optimization}. -Each step in the CG~algorithm computes a matrix-vector multiplication $q = Ap$ by first computing -$[X,\,1]\, p$ and then $[X,\,1]^T [X,\,1]\, p$. Usually the number of such multiplications, -one per CG iteration, is much smaller than~$m$. The user can put a hard bound on it with input -parameter~{\tt maxi}, or use the default maximum of~$m+1$ (or~$m$ if no intercept) by -having {\tt maxi=0}. The CG~iterations terminate when the L2-norm of vector -$r = A\beta - [X,\,1]^T Y$ decreases from its initial value (for~$\beta=0$) by the tolerance -factor specified in input parameter~{\tt tol}. - -The CG algorithm is more efficient if computing -$[X,\,1]^T \big([X,\,1]\, p\big)$ is much faster than materializing $A$, -an $(m\,{+}\,1)\times(m\,{+}\,1)$ matrix. The Direct Solver~(DS) is more efficient if -$X$ takes up a lot more memory than $A$ (i.e.\ $X$~has a lot more nonzeros than~$m^2$) -and if $m^2$ is small enough for the external solver ($m \lesssim 50000$). A more precise -determination between CG and~DS is subject to further research. - -In addition to the $\beta$-vector, the scripts estimate the residual standard -deviation~$\sigma$ and the~$R^2$, the ratio of ``explained'' variance to the total -variance of the response variable. These statistics only make sense if the number -of degrees of freedom $n\,{-}\,m\,{-}\,1$ is positive and the regularization constant -$\lambda$ is negligible or zero. 
The formulas for $\sigma$ and $R^2$~are: -\begin{equation*} -R^2_{\textrm{plain}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}},\quad -\sigma \,=\, \sqrt{\frac{\mathrm{RSS}}{n - m - 1}},\quad -R^2_{\textrm{adj.}} = 1 - \frac{\sigma^2 (n-1)}{\mathrm{TSS}} -\end{equation*} -where -\begin{equation*} -\mathrm{RSS} \,=\, \sum_{i=1}^n \Big(y_i - \hat{\mu}_i - -\frac{1}{n} \sum_{i'=1}^n \,(y_{i'} - \hat{\mu}_{i'})\Big)^2; \quad -\mathrm{TSS} \,=\, \sum_{i=1}^n \Big(y_i - \frac{1}{n} \sum_{i'=1}^n y_{i'}\Big)^2 -\end{equation*} -Here $\hat{\mu}_i$ are the predicted means for $y_i$ based on the estimated -regression coefficients and the feature vectors. They may be biased when no -intercept is present, hence the RSS formula subtracts the bias. - -Lastly, note that by choosing the input option {\tt icpt=2} the user can shift -and rescale the columns of~$X$ to have zero average and the variance of~1. -This is particularly important when using regularization over highly disbalanced -features, because regularization tends to penalize small-variance columns (which -need large~$\beta_j$'s) more than large-variance columns (with small~$\beta_j$'s). -At the end, the estimated regression coefficients are shifted and rescaled to -apply to the original features. - -\smallskip -\noindent{\bf Returns} -\smallskip - -The estimated regression coefficients (the $\hat{\beta}_j$'s) are populated into -a matrix and written to an HDFS file whose path/name was provided as the ``{\tt B}'' -input argument. What this matrix contains, and its size, depends on the input -argument {\tt icpt}, which specifies the user's intercept and rescaling choice: -\begin{Description} -\item[{\tt icpt=0}:] No intercept, matrix~$B$ has size $m\,{\times}\,1$, with -$B[j, 1] = \hat{\beta}_j$ for each $j$ from 1 to~$m = {}$ncol$(X)$. -\item[{\tt icpt=1}:] There is intercept, but no shifting/rescaling of~$X$; matrix~$B$ -has size $(m\,{+}\,1) \times 1$, with $B[j, 1] = \hat{\beta}_j$ for $j$ from 1 to~$m$, -and $B[m\,{+}\,1, 1] = \hat{\beta}_0$, the estimated intercept coefficient. -\item[{\tt icpt=2}:] There is intercept, and the features in~$X$ are shifted to -mean${} = 0$ and rescaled to variance${} = 1$; then there are two versions of -the~$\hat{\beta}_j$'s, one for the original features and another for the -shifted/rescaled features. Now matrix~$B$ has size $(m\,{+}\,1) \times 2$, with -$B[\cdot, 1]$ for the original features and $B[\cdot, 2]$ for the shifted/rescaled -features, in the above format. Note that $B[\cdot, 2]$ are iteratively estimated -and $B[\cdot, 1]$ are obtained from $B[\cdot, 2]$ by complementary shifting and -rescaling. -\end{Description} -The estimated summary statistics, including residual standard deviation~$\sigma$ and -the~$R^2$, are printed out or sent into a file (if specified) in CSV format as -defined in Table~\ref{table:linreg:stats}. For conjugate gradient iterations, -a log file with monitoring variables can also be made available, see -Table~\ref{table:linreg:log}. 
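-For readers who prefer to see the conjugate-gradient scheme from the Details section in imperative form,
-the following simplified sketch (Python with NumPy assumed; intercept handling, rescaling, and logging are
-omitted, so it is not the actual LinearRegCG.dml implementation) solves $A\beta = X^T Y$ using only the
-products $Xp$ and $X^T(Xp)$, never materializing~$A$:
-\begin{verbatim}
# Sketch of the CG solver from the Details section (Python/NumPy assumed):
# solve A beta = X^T y with A = X^T X + lambda*I without forming A.
import numpy as np

def linreg_cg(X, y, lam=1e-6, tol=1e-6, maxi=None):
    m = X.shape[1]
    maxi = maxi or m
    beta = np.zeros(m)
    r = X.T @ y                      # residual r = X^T y - A beta, for beta = 0
    p = r.copy()
    norm0 = np.linalg.norm(r)
    for _ in range(maxi):
        q = X.T @ (X @ p) + lam * p  # q = A p, A never materialized
        alpha = (r @ r) / (p @ q)
        beta += alpha * p
        r_new = r - alpha * q
        if np.linalg.norm(r_new) < tol * norm0:
            break
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return beta
\end{verbatim}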
- -\smallskip -\noindent{\bf Examples} -\smallskip - -{\hangindent=\parindent\noindent\tt -\hml -f LinearRegCG.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx - B=/user/biadmin/B.mtx fmt=csv O=/user/biadmin/stats.csv - icpt=2 reg=1.0 tol=0.00000001 maxi=100 Log=/user/biadmin/log.csv - -} -{\hangindent=\parindent\noindent\tt -\hml -f LinearRegDS.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx - B=/user/biadmin/B.mtx fmt=csv O=/user/biadmin/stats.csv - icpt=2 reg=1.0 - -} - -% \smallskip -% \noindent{\bf See Also} -% \smallskip -% -% In case of binary classification problems, please consider using L2-SVM or -% binary logistic regression; for multiclass classification, use multiclass~SVM -% or multinomial logistic regression. For more complex distributions of the -% response variable use the Generalized Linear Models script.
http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/LogReg.tex ---------------------------------------------------------------------- diff --git a/Algorithms Reference/LogReg.tex b/Algorithms Reference/LogReg.tex deleted file mode 100644 index 43d4e15..0000000 --- a/Algorithms Reference/LogReg.tex +++ /dev/null @@ -1,287 +0,0 @@ -\begin{comment} - - Licensed to the Apache Software Foundation (ASF) under one - or more contributor license agreements. See the NOTICE file - distributed with this work for additional information - regarding copyright ownership. The ASF licenses this file - to you under the Apache License, Version 2.0 (the - "License"); you may not use this file except in compliance - with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, - software distributed under the License is distributed on an - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - KIND, either express or implied. See the License for the - specific language governing permissions and limitations - under the License. - -\end{comment} - -\subsection{Multinomial Logistic Regression} - -\noindent{\bf Description} -\smallskip - -Our logistic regression script performs both binomial and multinomial logistic regression. -The script is given a dataset $(X, Y)$ where matrix $X$ has $m$~columns and matrix $Y$ has -one column; both $X$ and~$Y$ have $n$~rows. The rows of $X$ and~$Y$ are viewed as a collection -of records: $(X, Y) = (x_i, y_i)_{i=1}^n$ where $x_i$ is a numerical vector of explanatory -(feature) variables and $y_i$ is a categorical response variable. -Each row~$x_i$ in~$X$ has size~\mbox{$\dim x_i = m$}, while its corresponding $y_i$ is an -integer that represents the observed response value for record~$i$. - -The goal of logistic regression is to learn a linear model over the feature vector -$x_i$ that can be used to predict how likely each categorical label is expected to -be observed as the actual~$y_i$. -Note that logistic regression predicts more than a label: it predicts the probability -for every possible label. The binomial case allows only two possible labels, the -multinomial case has no such restriction. - -Just as linear regression estimates the mean value $\mu_i$ of a numerical response -variable, logistic regression does the same for category label probabilities. -In linear regression, the mean of $y_i$ is estimated as a linear combination of the features: -$\mu_i = \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_m x_{i,m} = \beta_0 + x_i\beta_{1:m}$. -In logistic regression, the -label probability has to lie between 0 and~1, so a link function is applied to connect -it to $\beta_0 + x_i\beta_{1:m}$. If there are just two possible category labels, for example -0~and~1, the logistic link looks as follows: -\begin{equation*} -\Prob[y_i\,{=}\,1\mid x_i; \beta] \,=\, -\frac{e^{\,\beta_0 + x_i\beta_{1:m}}}{1 + e^{\,\beta_0 + x_i\beta_{1:m}}}; -\quad -\Prob[y_i\,{=}\,0\mid x_i; \beta] \,=\, -\frac{1}{1 + e^{\,\beta_0 + x_i\beta_{1:m}}} -\end{equation*} -Here category label~0 serves as the \emph{baseline}, and function -$\exp(\beta_0 + x_i\beta_{1:m})$ -shows how likely we expect to see ``$y_i = 1$'' in comparison to the baseline. -Like in a loaded coin, the predicted odds of seeing 1~versus~0 are -$\exp(\beta_0 + x_i\beta_{1:m})$ to~1, -with each feature $x_{i,j}$ multiplying its own factor $\exp(\beta_j x_{i,j})$ to the odds. 
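-The binary logistic link above can be evaluated directly; the following small sketch (Python with NumPy
-assumed, purely for illustration with hypothetical names) computes $\Prob[y\,{=}\,1\mid x;\beta]$ and the
-corresponding odds for a single feature vector:
-\begin{verbatim}
# Sketch of the binary logistic link (Python/NumPy assumed):
# given beta_0 and beta_{1:m}, compute Prob[y=1 | x] and the odds.
import numpy as np

def prob_y1(x, beta0, beta):
    z = beta0 + x @ beta             # linear term beta_0 + x beta_{1:m}
    return 1.0 / (1.0 + np.exp(-z))  # = exp(z) / (1 + exp(z))

x = np.array([0.5, -1.2, 3.0])
beta0, beta = 0.1, np.array([0.8, 0.0, -0.3])
p1 = prob_y1(x, beta0, beta)
print(p1, p1 / (1.0 - p1))           # probability of label 1 and odds vs. label 0
\end{verbatim}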
-Given a large collection of pairs $(x_i, y_i)$, $i=1\ldots n$, logistic regression seeks -to find the $\beta_j$'s that maximize the product of probabilities -\hbox{$\Prob[y_i\mid x_i; \beta]$} -for actually observed $y_i$-labels (assuming no regularization). - -Multinomial logistic regression~\cite{Agresti2002:CDA} extends this link to $k \geq 3$ possible -categories. Again we identify one category as the baseline, for example the $k$-th category. -Instead of a coin, here we have a loaded multisided die, one side per category. Each non-baseline -category $l = 1\ldots k\,{-}\,1$ has its own vector $(\beta_{0,l}, \beta_{1,l}, \ldots, \beta_{m,l})$ -of regression parameters with the intercept, making up a matrix $B$ of size -$(m\,{+}\,1)\times(k\,{-}\,1)$. The predicted odds of seeing non-baseline category~$l$ versus -the baseline~$k$ are $\exp\big(\beta_{0,l} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l}\big)$ -to~1, and the predicted probabilities are: -\begin{align} -l < k:\quad\Prob[y_i\,{=}\,\makebox[0.5em][c]{$l$}\mid x_i; B] \,\,\,{=}\,\,\,& -\frac{\exp\big(\beta_{0,l} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l}\big)}% -{1 \,+\, \sum_{l'=1}^{k-1}\exp\big(\beta_{0,l'} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l'}\big)}; -\label{eqn:mlogreg:nonbaseprob}\\ -\Prob[y_i\,{=}\,\makebox[0.5em][c]{$k$}\mid x_i; B] \,\,\,{=}\,\,\,& \frac{1}% -{1 \,+\, \sum_{l'=1}^{k-1}\exp\big(\beta_{0,l'} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l'}\big)}. -\label{eqn:mlogreg:baseprob} -\end{align} -The goal of the regression is to estimate the parameter matrix~$B$ from the provided dataset -$(X, Y) = (x_i, y_i)_{i=1}^n$ by maximizing the product of \hbox{$\Prob[y_i\mid x_i; B]$} -over the observed labels~$y_i$. Taking its logarithm, negating, and adding a regularization term -gives us a minimization objective: -\begin{equation} -f(B; X, Y) \,\,=\,\, --\sum_{i=1}^n \,\log \Prob[y_i\mid x_i; B] \,+\, -\frac{\lambda}{2} \sum_{j=1}^m \sum_{l=1}^{k-1} |\beta_{j,l}|^2 -\,\,\to\,\,\min -\label{eqn:mlogreg:loss} -\end{equation} -The optional regularization term is added to mitigate overfitting and degeneracy in the data; -to reduce bias, the intercepts $\beta_{0,l}$ are not regularized. Once the~$\beta_{j,l}$'s -are accurately estimated, we can make predictions about the category label~$y$ for a new -feature vector~$x$ using Eqs.~(\ref{eqn:mlogreg:nonbaseprob}) and~(\ref{eqn:mlogreg:baseprob}). - -\smallskip -\noindent{\bf Usage} -\smallskip - -{\hangindent=\parindent\noindent\it% -{\tt{}-f }path/\/{\tt{}MultiLogReg.dml} -{\tt{} -nvargs} -{\tt{} X=}path/file -{\tt{} Y=}path/file -{\tt{} B=}path/file -{\tt{} Log=}path/file -{\tt{} icpt=}int -{\tt{} reg=}double -{\tt{} tol=}double -{\tt{} moi=}int -{\tt{} mii=}int -{\tt{} fmt=}format - -} - - -\smallskip -\noindent{\bf Arguments} -\begin{Description} -\item[{\tt X}:] -Location (on HDFS) to read the input matrix of feature vectors; each row constitutes -one feature vector. -\item[{\tt Y}:] -Location to read the input one-column matrix of category labels that correspond to -feature vectors in~{\tt X}. Note the following:\\ --- Each non-baseline category label must be a positive integer.\\ --- If all labels are positive, the largest represents the baseline category.\\ --- If non-positive labels such as $-1$ or~$0$ are present, then they represent the (same) -baseline category and are converted to label $\max(\texttt{Y})\,{+}\,1$. 
-\item[{\tt B}:] -Location to store the matrix of estimated regression parameters (the $\beta_{j, l}$'s), -with the intercept parameters~$\beta_{0, l}$ at position {\tt B[}$m\,{+}\,1$, $l${\tt ]} -if available. The size of {\tt B} is $(m\,{+}\,1)\times (k\,{-}\,1)$ with the intercepts -or $m \times (k\,{-}\,1)$ without the intercepts, one column per non-baseline category -and one row per feature. -\item[{\tt Log}:] (default:\mbox{ }{\tt " "}) -Location to store iteration-specific variables for monitoring and debugging purposes, -see Table~\ref{table:mlogreg:log} for details. -\item[{\tt icpt}:] (default:\mbox{ }{\tt 0}) -Intercept and shifting/rescaling of the features in~$X$:\\ -{\tt 0} = no intercept (hence no~$\beta_0$), no shifting/rescaling of the features;\\ -{\tt 1} = add intercept, but do not shift/rescale the features in~$X$;\\ -{\tt 2} = add intercept, shift/rescale the features in~$X$ to mean~0, variance~1 -\item[{\tt reg}:] (default:\mbox{ }{\tt 0.0}) -L2-regularization parameter (lambda) -\item[{\tt tol}:] (default:\mbox{ }{\tt 0.000001}) -Tolerance (epsilon) used in the convergence criterion -\item[{\tt moi}:] (default:\mbox{ }{\tt 100}) -Maximum number of outer (Fisher scoring) iterations -\item[{\tt mii}:] (default:\mbox{ }{\tt 0}) -Maximum number of inner (conjugate gradient) iterations, or~0 if no maximum -limit provided -\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"}) -Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv}; -see read/write functions in SystemML Language Reference for details. -\end{Description} - - -\begin{table}[t]\small\centerline{% -\begin{tabular}{|ll|} -\hline -Name & Meaning \\ -\hline -{\tt LINEAR\_TERM\_MIN} & The minimum value of $X \pxp B$, used to check for overflows \\ -{\tt LINEAR\_TERM\_MAX} & The maximum value of $X \pxp B$, used to check for overflows \\ -{\tt NUM\_CG\_ITERS} & Number of inner (Conj.\ Gradient) iterations in this outer iteration \\ -{\tt IS\_TRUST\_REACHED} & $1 = {}$trust region boundary was reached, $0 = {}$otherwise \\ -{\tt POINT\_STEP\_NORM} & L2-norm of iteration step from old point (matrix $B$) to new point \\ -{\tt OBJECTIVE} & The loss function we minimize (negative regularized log-likelihood) \\ -{\tt OBJ\_DROP\_REAL} & Reduction in the objective during this iteration, actual value \\ -{\tt OBJ\_DROP\_PRED} & Reduction in the objective predicted by a quadratic approximation \\ -{\tt OBJ\_DROP\_RATIO} & Actual-to-predicted reduction ratio, used to update the trust region \\ -{\tt IS\_POINT\_UPDATED} & $1 = {}$new point accepted; $0 = {}$new point rejected, old point restored \\ -{\tt GRADIENT\_NORM} & L2-norm of the loss function gradient (omitted if point is rejected) \\ -{\tt TRUST\_DELTA} & Updated trust region size, the ``delta'' \\ -\hline -\end{tabular}} -\caption{ -The {\tt Log} file for multinomial logistic regression contains the above \mbox{per-}iteration -variables in CSV format, each line containing triple (Name, Iteration\#, Value) with Iteration\# -being~0 for initial values.} -\label{table:mlogreg:log} -\end{table} - - -\noindent{\bf Details} -\smallskip - -We estimate the logistic regression parameters via L2-regularized negative -log-likelihood minimization~(\ref{eqn:mlogreg:loss}). -The optimization method used in the script closely follows the trust region -Newton method for logistic regression described in~\cite{Lin2008:logistic}. 
-For convenience, let us make some changes in notation: -\begin{Itemize} -\item Convert the input vector of observed category labels into an indicator matrix $Y$ -of size $n \times k$ such that $Y_{i, l} = 1$ if the $i$-th category label is~$l$ and -$Y_{i, l} = 0$ otherwise; -\item Append an extra column of all ones, i.e.\ $(1, 1, \ldots, 1)^T$, as the -$m\,{+}\,1$-st column to the feature matrix $X$ to represent the intercept; -\item Append an all-zero column as the $k$-th column to $B$, the matrix of regression -parameters, to represent the baseline category; -\item Convert the regularization constant $\lambda$ into matrix $\Lambda$ of the same -size as $B$, placing 0's into the $m\,{+}\,1$-st row to disable intercept regularization, -and placing $\lambda$'s everywhere else. -\end{Itemize} -Now the ($n\,{\times}\,k$)-matrix of predicted probabilities given -by (\ref{eqn:mlogreg:nonbaseprob}) and~(\ref{eqn:mlogreg:baseprob}) -and the objective function $f$ in~(\ref{eqn:mlogreg:loss}) have the matrix form -\begin{align*} -P \,\,&=\,\, \exp(XB) \,\,/\,\, \big(\exp(XB)\,1_{k\times k}\big)\\ -f \,\,&=\,\, - \,\,{\textstyle\sum} \,\,Y \cdot (X B)\, + \, -{\textstyle\sum}\,\log\big(\exp(XB)\,1_{k\times 1}\big) \,+ \, -(1/2)\,\, {\textstyle\sum} \,\,\Lambda \cdot B \cdot B -\end{align*} -where operations $\cdot\,$, $/$, $\exp$, and $\log$ are applied cellwise, -and $\textstyle\sum$ denotes the sum of all cells in a matrix. -The gradient of~$f$ with respect to~$B$ can be represented as a matrix too: -\begin{equation*} -\nabla f \,\,=\,\, X^T (P - Y) \,+\, \Lambda \cdot B -\end{equation*} -The Hessian $\mathcal{H}$ of~$f$ is a tensor, but, fortunately, the conjugate -gradient inner loop of the trust region algorithm in~\cite{Lin2008:logistic} -does not need to instantiate it. We only need to multiply $\mathcal{H}$ by -ordinary matrices of the same size as $B$ and $\nabla f$, and this can be done -in matrix form: -\begin{equation*} -\mathcal{H}V \,\,=\,\, X^T \big( Q \,-\, P \cdot (Q\,1_{k\times k}) \big) \,+\, -\Lambda \cdot V, \,\,\,\,\textrm{where}\,\,\,\,Q \,=\, P \cdot (XV) -\end{equation*} -At each Newton iteration (the \emph{outer} iteration) the minimization algorithm -approximates the difference $\varDelta f(S; B) = f(B + S; X, Y) \,-\, f(B; X, Y)$ -attained in the objective function after a step $B \mapsto B\,{+}\,S$ by a -second-degree formula -\begin{equation*} -\varDelta f(S; B) \,\,\,\approx\,\,\, (1/2)\,\,{\textstyle\sum}\,\,S \cdot \mathcal{H}S - \,+\, {\textstyle\sum}\,\,S\cdot \nabla f -\end{equation*} -This approximation is then minimized by trust-region conjugate gradient iterations -(the \emph{inner} iterations) subject to the constraint $\|S\|_2 \leq \delta$. -The trust region size $\delta$ is initialized as $0.5\sqrt{m}\,/ \max\nolimits_i \|x_i\|_2$ -and updated as described in~\cite{Lin2008:logistic}. -Users can specify the maximum number of the outer and the inner iterations with -input parameters {\tt moi} and {\tt mii}, respectively. The iterative minimizer -terminates successfully if $\|\nabla f\|_2 < \eps\,\|\nabla f_{B=0}\|_2$, -where $\eps > 0$ is a tolerance supplied by the user via input parameter~{\tt tol}. - -\smallskip -\noindent{\bf Returns} -\smallskip - -The estimated regression parameters (the $\hat{\beta}_{j, l}$) are populated into -a matrix and written to an HDFS file whose path/name was provided as the ``{\tt B}'' -input argument. 
Only the non-baseline categories ($1\leq l \leq k\,{-}\,1$) have -their $\hat{\beta}_{j, l}$ in the output; to add the baseline category, just append -a column of zeros. If {\tt icpt=0} in the input command line, no intercepts are used -and {\tt B} has size $m\times (k\,{-}\,1)$; otherwise {\tt B} has size -$(m\,{+}\,1)\times (k\,{-}\,1)$ -and the intercepts are in the $m\,{+}\,1$-st row. If {\tt icpt=2}, then initially -the feature columns in~$X$ are shifted to mean${} = 0$ and rescaled to variance${} = 1$. -After the iterations converge, the $\hat{\beta}_{j, l}$'s are rescaled and shifted -to work with the original features. - - -\smallskip -\noindent{\bf Examples} -\smallskip - -{\hangindent=\parindent\noindent\tt -\hml -f MultiLogReg.dml -nvargs X=/user/biadmin/X.mtx - Y=/user/biadmin/Y.mtx B=/user/biadmin/B.mtx fmt=csv - icpt=2 reg=1.0 tol=0.0001 moi=100 mii=10 Log=/user/biadmin/log.csv - -} - - -\smallskip -\noindent{\bf References} -\begin{itemize} -\item A.~Agresti. -\newblock {\em Categorical Data Analysis}. -\newblock Wiley Series in Probability and Statistics. Wiley-Interscience, second edition, 2002. -\end{itemize} http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/MultiSVM.tex ---------------------------------------------------------------------- diff --git a/Algorithms Reference/MultiSVM.tex b/Algorithms Reference/MultiSVM.tex deleted file mode 100644 index 87880a9..0000000 --- a/Algorithms Reference/MultiSVM.tex +++ /dev/null @@ -1,174 +0,0 @@ -\begin{comment} - - Licensed to the Apache Software Foundation (ASF) under one - or more contributor license agreements. See the NOTICE file - distributed with this work for additional information - regarding copyright ownership. The ASF licenses this file - to you under the Apache License, Version 2.0 (the - "License"); you may not use this file except in compliance - with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, - software distributed under the License is distributed on an - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - KIND, either express or implied. See the License for the - specific language governing permissions and limitations - under the License. - -\end{comment} - -\subsubsection{Multi-class Support Vector Machines} -\label{msvm} - -\noindent{\bf Description} - -Support Vector Machines are used to model the relationship between a categorical -dependent variable y and one or more explanatory variables denoted X. This -implementation supports dependent variables that have domain size greater or -equal to 2 and hence is not restricted to binary class labels. 
-\\ - -\noindent{\bf Usage} - -\begin{tabbing} -\texttt{-f} \textit{path}/\texttt{m-svm.dml -nvargs} -\=\texttt{X=}\textit{path}/\textit{file} - \texttt{Y=}\textit{path}/\textit{file} - \texttt{icpt=}\textit{int}\\ -\>\texttt{tol=}\textit{double} - \texttt{reg=}\textit{double} - \texttt{maxiter=}\textit{int} - \texttt{model=}\textit{path}/\textit{file}\\ -\>\texttt{Log=}\textit{path}/\textit{file} - \texttt{fmt=}\textit{csv}$\vert$\textit{text} -\end{tabbing} - -\begin{tabbing} -\texttt{-f} \textit{path}/\texttt{m-svm-predict.dml -nvargs} -\=\texttt{X=}\textit{path}/\textit{file} - \texttt{Y=}\textit{path}/\textit{file} - \texttt{icpt=}\textit{int} - \texttt{model=}\textit{path}/\textit{file}\\ -\>\texttt{scores=}\textit{path}/\textit{file} - \texttt{accuracy=}\textit{path}/\textit{file}\\ -\>\texttt{confusion=}\textit{path}/\textit{file} - \texttt{fmt=}\textit{csv}$\vert$\textit{text} -\end{tabbing} - -\noindent{\bf Arguments} - -\begin{itemize}
-\item X: Location (on HDFS) containing the explanatory variables -in a matrix. Each row constitutes an example.
-\item Y: Location (on HDFS) containing a 1-column matrix specifying -the categorical dependent variable (label). Labels are assumed to be -contiguously numbered from 1 $\ldots$ \#classes. Note that this -argument is optional for prediction.
-\item icpt (default: {\tt 0}): If set to 1 then a constant bias column -is added to X.
-\item tol (default: {\tt 0.001}): Procedure terminates early if the reduction -in objective function value is less than tolerance times the initial objective -function value.
-\item reg (default: {\tt 1}): Regularization constant. See details to find -out where lambda appears in the objective function. If one were interested -in drawing an analogy with C-SVM, then C = 2/lambda. Usually, cross validation -is employed to determine the optimum value of lambda.
-\item maxiter (default: {\tt 100}): The maximum number of iterations.
-\item model: Location (on HDFS) that contains the learnt weights.
-\item Log: Location (on HDFS) to collect various metrics (e.g., objective -function value etc.) that depict progress across iterations while training.
-\item fmt (default: {\tt text}): Specifies the output format. Choice of -comma-separated values (csv) or as a sparse-matrix (text).
-\item scores: Location (on HDFS) to store scores for a held-out test set. -Note that this is an optional argument.
-\item accuracy: Location (on HDFS) to store the accuracy computed on a -held-out test set. Note that this is an optional argument.
-\item confusion: Location (on HDFS) to store the confusion matrix -computed using a held-out test set. Note that this is an optional -argument.
-\end{itemize} - -\noindent{\bf Details} - -Support vector machines learn a classification function by solving the -following optimization problem ($L_2$-SVM): -\begin{eqnarray*} -&\textrm{argmin}_w& \frac{\lambda}{2} ||w||_2^2 + \sum_i \xi_i^2\\ -&\textrm{subject to:}& y_i w^{\top} x_i \geq 1 - \xi_i ~ \forall i -\end{eqnarray*} -where $x_i$ is an example from the training set with its label given by $y_i$, -$w$ is the vector of parameters and $\lambda$ is the regularization constant -specified by the user. - -To extend the above formulation (binary class SVM) to the multiclass setting, -one standard approach is to learn one binary class SVM per class that -separates data belonging to that class from the rest of the training data -(one-against-the-rest SVM, see B. Scholkopf, 1995).
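-To make the objective above concrete, the following sketch (Python with NumPy assumed; it is not the
-m-svm.dml implementation, which additionally handles the intercept and chooses step sizes by Newton
-minimization) evaluates the $L_2$-SVM primal objective and its gradient for labels $y_i \in \{-1,+1\}$;
-the one-against-the-rest reduction simply reuses it once per class with relabeled $y$:
-\begin{verbatim}
# Sketch of the L2-SVM primal objective and gradient (Python/NumPy assumed).
import numpy as np

def l2svm_objective(w, X, y, lam):
    margins = 1.0 - y * (X @ w)          # 1 - y_i w^T x_i
    xi = np.maximum(margins, 0.0)        # slack xi_i
    return 0.5 * lam * w @ w + np.sum(xi ** 2)

def l2svm_gradient(w, X, y, lam):
    margins = 1.0 - y * (X @ w)
    active = margins > 0                 # examples violating the margin
    return lam * w - 2.0 * (X[active].T @ (y[active] * margins[active]))
\end{verbatim}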
- -To account for the missing bias term, one may augment the data with a column -of constants which is achieved by setting intercept argument to 1 (C-J Hsieh -et al, 2008). - -This implementation optimizes the primal directly (Chapelle, 2007). It uses -nonlinear conjugate gradient descent to minimize the objective function -coupled with choosing step-sizes by performing one-dimensional Newton -minimization in the direction of the gradient. -\\ - -\noindent{\bf Returns} - -The learnt weights produced by m-svm.dml are populated into a matrix that -has as many columns as there are classes in the training data, and written -to file provided on HDFS (see model in section Arguments). The number of rows -in this matrix is ncol(X) if intercept was set to 0 during invocation and ncol(X) + 1 -otherwise. The bias terms, if used, are placed in the last row. Depending on what -arguments are provided during invocation, m-svm-predict.dml may compute one or more -of scores, accuracy and confusion matrix in the output format specified. -\\ - -%%\noindent{\bf See Also} -%% -%%In case of binary classification problems, please consider using a binary class classifier -%%learning algorithm, e.g., binary class $L_2$-SVM (see Section \ref{l2svm}) or logistic regression -%%(see Section \ref{logreg}). To model the relationship between a scalar dependent variable -%%y and one or more explanatory variables X, consider Linear Regression instead (see Section -%%\ref{linreg-solver} or Section \ref{linreg-iterative}). -%%\\ -%% -\noindent{\bf Examples} -\begin{verbatim} -hadoop jar SystemML.jar -f m-svm.dml -nvargs X=/user/biadmin/X.mtx - Y=/user/biadmin/y.mtx - icpt=0 tol=0.001 - reg=1.0 maxiter=100 fmt=csv - model=/user/biadmin/weights.csv - Log=/user/biadmin/Log.csv -\end{verbatim} - -\begin{verbatim} -hadoop jar SystemML.jar -f m-svm-predict.dml -nvargs X=/user/biadmin/X.mtx - Y=/user/biadmin/y.mtx - icpt=0 fmt=csv - model=/user/biadmin/weights.csv - scores=/user/biadmin/scores.csv - accuracy=/user/biadmin/accuracy.csv - confusion=/user/biadmin/confusion.csv -\end{verbatim} - -\noindent{\bf References} - -\begin{itemize} -\item W. T. Vetterling and B. P. Flannery. \newblock{\em Conjugate Gradient Methods in Multidimensions in -Numerical Recipes in C - The Art in Scientific Computing.} \newblock W. H. Press and S. A. Teukolsky -(eds.), Cambridge University Press, 1992. -\item J. Nocedal and S. J. Wright. \newblock{\em Numerical Optimization.} \newblock Springer-Verlag, 1999. -\item C-J Hsieh, K-W Chang, C-J Lin, S. S. Keerthi and S. Sundararajan. \newblock {\em A Dual Coordinate -Descent Method for Large-scale Linear SVM.} \newblock International Conference of Machine Learning -(ICML), 2008. -\item Olivier Chapelle. \newblock{\em Training a Support Vector Machine in the Primal.} \newblock Neural -Computation, 2007. -\item B. Scholkopf, C. Burges and V. Vapnik. \newblock{\em Extracting Support Data for a Given Task.} \newblock International Conference on Knowledge Discovery and Data Mining (ICDM), 1995. 
-\end{itemize} - http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/NaiveBayes.tex ---------------------------------------------------------------------- diff --git a/Algorithms Reference/NaiveBayes.tex b/Algorithms Reference/NaiveBayes.tex deleted file mode 100644 index b5f721d..0000000 --- a/Algorithms Reference/NaiveBayes.tex +++ /dev/null @@ -1,155 +0,0 @@ -\begin{comment} - - Licensed to the Apache Software Foundation (ASF) under one - or more contributor license agreements. See the NOTICE file - distributed with this work for additional information - regarding copyright ownership. The ASF licenses this file - to you under the Apache License, Version 2.0 (the - "License"); you may not use this file except in compliance - with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, - software distributed under the License is distributed on an - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - KIND, either express or implied. See the License for the - specific language governing permissions and limitations - under the License. - -\end{comment} - -\subsection{Naive Bayes} -\label{naive_bayes} - -\noindent{\bf Description} -
-Naive Bayes is a very simple generative model used for classifying data. -This implementation learns a multinomial naive Bayes classifier, which -is applicable when all features are counts of categorical values. -\\ - -\noindent{\bf Usage} - -\begin{tabbing} -\texttt{-f} \textit{path}/\texttt{naive-bayes.dml -nvargs} -\=\texttt{X=}\textit{path}/\textit{file} - \texttt{Y=}\textit{path}/\textit{file} - \texttt{laplace=}\textit{double}\\ -\>\texttt{prior=}\textit{path}/\textit{file} - \texttt{conditionals=}\textit{path}/\textit{file}\\ -\>\texttt{accuracy=}\textit{path}/\textit{file} - \texttt{fmt=}\textit{csv}$\vert$\textit{text} -\end{tabbing} - -\begin{tabbing} -\texttt{-f} \textit{path}/\texttt{naive-bayes-predict.dml -nvargs} -\=\texttt{X=}\textit{path}/\textit{file} - \texttt{Y=}\textit{path}/\textit{file} - \texttt{prior=}\textit{path}/\textit{file}\\ -\>\texttt{conditionals=}\textit{path}/\textit{file} - \texttt{fmt=}\textit{csv}$\vert$\textit{text}\\ -\>\texttt{accuracy=}\textit{path}/\textit{file} - \texttt{confusion=}\textit{path}/\textit{file}\\ -\>\texttt{probabilities=}\textit{path}/\textit{file} -\end{tabbing} - -\noindent{\bf Arguments} - -\begin{itemize}
-\item X: Location (on HDFS) to read the matrix of feature vectors; -each row constitutes one feature vector.
-\item Y: Location (on HDFS) to read the one-column matrix of (categorical) -labels that correspond to feature vectors in X. Classes are assumed to be -contiguously labeled beginning from 1. Note that this argument is optional -for prediction.
-\item laplace (default: {\tt 1}): Laplace smoothing constant specified by the -user to avoid creation of 0 probabilities.
-\item prior: Location (on HDFS) that contains the class prior probabilities.
-\item conditionals: Location (on HDFS) that contains the class conditional -feature distributions.
-\item fmt (default: {\tt text}): Specifies the output format. Choice of -comma-separated values (csv) or as a sparse-matrix (text).
-\item probabilities: Location (on HDFS) to store class membership probabilities -for a held-out test set. Note that this is an optional argument.
-\item accuracy: Location (on HDFS) to store the training accuracy during -learning and testing accuracy from a held-out test set during prediction. -Note that, this is an optional argument for prediction. -\item confusion: Location (on HDFS) to store the confusion matrix -computed using a held-out test set. Note that, this is an optional -argument. -\end{itemize} - -\noindent{\bf Details} - -Naive Bayes is a very simple generative classification model. It posits that -given the class label, features can be generated independently of each other. -More precisely, the (multinomial) naive Bayes model uses the following -equation to estimate the joint probability of a feature vector $x$ belonging -to class $y$: -\begin{equation*} -\text{Prob}(y, x) = \pi_y \prod_{i \in x} \theta_{iy}^{n(i,x)} -\end{equation*} -where $\pi_y$ denotes the prior probability of class $y$, $i$ denotes a feature -present in $x$ with $n(i,x)$ denoting its count and $\theta_{iy}$ denotes the -class conditional probability of feature $i$ in class $y$. The usual -constraints hold on $\pi$ and $\theta$: -\begin{eqnarray*} -&& \pi_y \geq 0, ~ \sum_{y \in \mathcal{C}} \pi_y = 1\\ -\forall y \in \mathcal{C}: && \theta_{iy} \geq 0, ~ \sum_i \theta_{iy} = 1 -\end{eqnarray*} -where $\mathcal{C}$ is the set of classes. - -Given a fully labeled training dataset, it is possible to learn a naive Bayes -model using simple counting (group-by aggregates). To compute the class conditional -probabilities, it is usually advisable to avoid setting $\theta_{iy}$ to 0. One way to -achieve this is using additive smoothing or Laplace smoothing. Some authors have argued -that this should in fact be add-one smoothing. This implementation uses add-one smoothing -by default but lets the user specify her/his own constant, if required. - -This implementation is sometimes referred to as \emph{multinomial} naive Bayes. Other -flavours of naive Bayes are also popular. -\\ - -\noindent{\bf Returns} - -The learnt model produced by naive-bayes.dml is stored in two separate files. -The first file stores the class prior (a single-column matrix). The second file -stores the class conditional probabilities organized into a matrix with as many -rows as there are class labels and as many columns as there are features. -Depending on what arguments are provided during invocation, naive-bayes-predict.dml -may compute one or more of probabilities, accuracy and confusion matrix in the -output format specified. -\\ - -\noindent{\bf Examples} - -\begin{verbatim} -hadoop jar SystemML.jar -f naive-bayes.dml -nvargs - X=/user/biadmin/X.mtx - Y=/user/biadmin/y.mtx - laplace=1 fmt=csv - prior=/user/biadmin/prior.csv - conditionals=/user/biadmin/conditionals.csv - accuracy=/user/biadmin/accuracy.csv -\end{verbatim} - -\begin{verbatim} -hadoop jar SystemML.jar -f naive-bayes-predict.dml -nvargs - X=/user/biadmin/X.mtx - Y=/user/biadmin/y.mtx - prior=/user/biadmin/prior.csv - conditionals=/user/biadmin/conditionals.csv - fmt=csv - accuracy=/user/biadmin/accuracy.csv - probabilities=/user/biadmin/probabilities.csv - confusion=/user/biadmin/confusion.csv -\end{verbatim} - -\noindent{\bf References} - -\begin{itemize} -\item S. Russell and P. Norvig. \newblock{\em Artificial Intelligence: A Modern Approach.} Prentice Hall, 2009. -\item A. McCallum and K. Nigam. \newblock{\em A comparison of event models for naive bayes text classification.} -\newblock AAAI-98 workshop on learning for text categorization, 1998. 
-\end{itemize} http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/PCA.tex ---------------------------------------------------------------------- diff --git a/Algorithms Reference/PCA.tex b/Algorithms Reference/PCA.tex deleted file mode 100644 index cef750e..0000000 --- a/Algorithms Reference/PCA.tex +++ /dev/null @@ -1,142 +0,0 @@ -\begin{comment} - - Licensed to the Apache Software Foundation (ASF) under one - or more contributor license agreements. See the NOTICE file - distributed with this work for additional information - regarding copyright ownership. The ASF licenses this file - to you under the Apache License, Version 2.0 (the - "License"); you may not use this file except in compliance - with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, - software distributed under the License is distributed on an - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - KIND, either express or implied. See the License for the - specific language governing permissions and limitations - under the License. - -\end{comment} - -\subsection{Principal Component Analysis} -\label{pca} - -\noindent{\bf Description} - -Principal Component Analysis (PCA) is a simple, non-parametric method to transform the given data set with possibly correlated columns into a set of linearly uncorrelated or orthogonal columns, called {\em principal components}. The principal components are ordered in such a way that the first component accounts for the largest possible variance, followed by remaining principal components in the decreasing order of the amount of variance captured from the data. PCA is often used as a dimensionality reduction technique, where the original data is projected or rotated onto a low-dimensional space with basis vectors defined by top-$K$ (for a given value of $K$) principal components. -\\ - -\noindent{\bf Usage} - -\begin{tabbing} -\texttt{-f} \textit{path}/\texttt{PCA.dml -nvargs} -\=\texttt{INPUT=}\textit{path}/\textit{file} - \texttt{K=}\textit{int} \\ -\>\texttt{CENTER=}\textit{0/1} - \texttt{SCALE=}\textit{0/1}\\ -\>\texttt{PROJDATA=}\textit{0/1} - \texttt{OFMT=}\textit{csv}/\textit{text}\\ -\>\texttt{MODEL=}\textit{path}$\vert$\textit{file} - \texttt{OUTPUT=}\textit{path}/\textit{file} -\end{tabbing} - -\noindent{\bf Arguments} - -\begin{itemize} -\item INPUT: Location (on HDFS) to read the input matrix. -\item K: Indicates dimension of the new vector space constructed from $K$ principal components. It must be a value between $1$ and the number of columns in the input data. -\item CENTER (default: {\tt 0}): Indicates whether or not to {\em center} input data prior to the computation of principal components. -\item SCALE (default: {\tt 0}): Indicates whether or not to {\em scale} input data prior to the computation of principal components. -\item PROJDATA: Indicates whether or not the input data must be projected on to new vector space defined over principal components. -\item OFMT (default: {\tt csv}): Specifies the output format. Choice of comma-separated values (csv) or as a sparse-matrix (text). -\item MODEL: Either the location (on HDFS) where the computed model is stored; or the location of an existing model. -\item OUTPUT: Location (on HDFS) to store the data rotated on to the new vector space. 
-\end{itemize} - -\noindent{\bf Details} -
-Principal Component Analysis (PCA) is a non-parametric procedure for orthogonal linear transformation of the input data to a new coordinate system, such that the greatest variance by some projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. In other words, PCA first selects a normalized direction in $m$-dimensional space ($m$ is the number of columns in the input data) along which the variance in the input data is maximized -- this is referred to as the first principal component. It then repeatedly finds other directions (principal components) in which the variance is maximized. At every step, PCA restricts the search to only those directions that are perpendicular to all previously selected directions. By doing so, PCA aims to reduce the redundancy among input variables. To understand the notion of redundancy, consider an extreme scenario with a data set comprising two variables, where the first one denotes some quantity expressed in meters, and the other variable represents the same quantity but in inches. Both these variables evidently capture redundant information, and hence one of them can be removed. In a general scenario, keeping only a few linear combinations of the input variables would both express the data more concisely and reduce the number of variables. This is why PCA is often used as a dimensionality reduction technique. -
-The specific method to compute such a new coordinate system is as follows -- compute a covariance matrix $C$ that measures the strength of correlation among all pairs of variables in the input data; factorize $C$ via eigen decomposition to calculate its eigenvalues and eigenvectors; and finally, order the eigenvectors in decreasing order of their corresponding eigenvalues. The computed eigenvectors (also known as {\em loadings}) define the new coordinate system, and the eigenvalues give the amount of variance in the input data explained by each coordinate or eigenvector (their square roots are the corresponding standard deviations). -\\ -
-%As an example, consider the data in Table~\ref{tab:pca_data}. -\begin{comment} -\begin{table} -\parbox{.35\linewidth}{ -\centering -\begin{tabular}{cc} - \hline - x & y \\ - \hline - 2.5 & 2.4 \\ - 0.5 & 0.7 \\ - 2.2 & 2.9 \\ - 1.9 & 2.2 \\ - 3.1 & 3.0 \\ - 2.3 & 2.7 \\ - 2 & 1.6 \\ - 1 & 1.1 \\ - 1.5 & 1.6 \\ - 1.1 & 0.9 \\ - \hline -\end{tabular} -\caption{Input Data} -\label{tab:pca_data} -} -\hfill -\parbox{.55\linewidth}{ -\centering -\begin{tabular}{cc} - \hline - x & y \\ - \hline - .69 & .49 \\ - -1.31 & -1.21 \\ - .39 & .99 \\ - .09 & .29 \\ - 1.29 & 1.09 \\ - .49 & .79 \\ - .19 & -.31 \\ - -.81 & -.81 \\ - -.31 & -.31 \\ - -.71 & -1.01 \\ - \hline -\end{tabular} -\caption{Data after centering and scaling} -\label{tab:pca_scaled_data} -} -\end{table} -\end{comment} -
-\noindent{\bf Returns} -When MODEL is not provided, the PCA procedure is applied to the INPUT data to generate the MODEL as well as the rotated data OUTPUT (if PROJDATA is set to $1$) in the new coordinate system. -The produced model consists of basis vectors MODEL$/dominant.eigen.vectors$ for the new coordinate system; eigenvalues MODEL$/dominant.eigen.values$; and the standard deviations MODEL$/dominant.eigen.standard.deviations$ of the principal components. -When MODEL is provided, the INPUT data is rotated according to the coordinate system defined by MODEL$/dominant.eigen.vectors$. The resulting data is stored at location OUTPUT.
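-The computation described above can be sketched compactly as follows (Python with NumPy assumed, purely for
-illustration; PCA.dml performs the same steps on distributed matrices and writes the model files listed above):
-\begin{verbatim}
# Sketch of the PCA computation (Python/NumPy assumed): center/scale,
# covariance, eigendecomposition, projection onto the top-K components.
import numpy as np

def pca(A, K, center=True, scale=True):
    if center:
        A = A - A.mean(axis=0)
    if scale:
        A = A / A.std(axis=0, ddof=1)
    C = np.cov(A, rowvar=False)              # covariance matrix
    evals, evecs = np.linalg.eigh(C)         # eigendecomposition (ascending order)
    order = np.argsort(evals)[::-1][:K]      # indices of the top-K eigenvalues
    V = evecs[:, order]                      # dominant eigenvectors (loadings)
    return V, evals[order], A @ V            # model, variances, rotated data
\end{verbatim}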
-\\ - -\noindent{\bf Examples} - -\begin{verbatim} -hadoop jar SystemML.jar -f PCA.dml -nvargs - INPUT=/user/biuser/input.mtx K=10 - CENTER=1 SCALE=1 - OFMT=csv PROJDATA=1 - # location to store model and rotated data - OUTPUT=/user/biuser/pca_output/ -\end{verbatim} - -\begin{verbatim} -hadoop jar SystemML.jar -f PCA.dml -nvargs - INPUT=/user/biuser/test_input.mtx K=10 - CENTER=1 SCALE=1 - OFMT=csv PROJDATA=1 - # location of an existing model - MODEL=/user/biuser/pca_output/ - # location of rotated data - OUTPUT=/user/biuser/test_output.mtx -\end{verbatim} - - - http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/RandomForest.tex ---------------------------------------------------------------------- diff --git a/Algorithms Reference/RandomForest.tex b/Algorithms Reference/RandomForest.tex deleted file mode 100644 index f9b47f3..0000000 --- a/Algorithms Reference/RandomForest.tex +++ /dev/null @@ -1,215 +0,0 @@ -\begin{comment} - - Licensed to the Apache Software Foundation (ASF) under one - or more contributor license agreements. See the NOTICE file - distributed with this work for additional information - regarding copyright ownership. The ASF licenses this file - to you under the Apache License, Version 2.0 (the - "License"); you may not use this file except in compliance - with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, - software distributed under the License is distributed on an - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - KIND, either express or implied. See the License for the - specific language governing permissions and limitations - under the License. - -\end{comment} - -\subsection{Random Forests} -\label{random_forests} - -\noindent{\bf Description} -\smallskip - - -Random forest is one of the most successful machine learning methods for classification and regression. -It is an ensemble learning method that creates a model composed of a set of tree models. -This implementation is well-suited to handle large-scale data and builds a random forest model for classification in parallel.\\ - - -\smallskip -\noindent{\bf Usage} -\smallskip - -{\hangindent=\parindent\noindent\it% - {\tt{}-f }path/\/{\tt{}random-forest.dml} - {\tt{} -nvargs} - {\tt{} X=}path/file - {\tt{} Y=}path/file - {\tt{} R=}path/file - {\tt{} bins=}integer - {\tt{} depth=}integer - {\tt{} num\_leaf=}integer - {\tt{} num\_samples=}integer - {\tt{} num\_trees=}integer - {\tt{} subsamp\_rate=}double - {\tt{} feature\_subset=}double - {\tt{} impurity=}Gini$\mid$entropy - {\tt{} M=}path/file - {\tt{} C=}path/file - {\tt{} S\_map=}path/file - {\tt{} C\_map=}path/file - {\tt{} fmt=}format - -} - - \smallskip - \noindent{\bf Usage: Prediction} - \smallskip - - {\hangindent=\parindent\noindent\it% - {\tt{}-f }path/\/{\tt{}random-forest-predict.dml} - {\tt{} -nvargs} - {\tt{} X=}path/file - {\tt{} Y=}path/file - {\tt{} R=}path/file - {\tt{} M=}path/file - {\tt{} C=}path/file - {\tt{} P=}path/file - {\tt{} A=}path/file - {\tt{} OOB=}path/file - {\tt{} CM=}path/file - {\tt{} fmt=}format - - }\smallskip - - -\noindent{\bf Arguments} -\begin{Description} - \item[{\tt X}:] - Location (on HDFS) to read the matrix of feature vectors; - each row constitutes one feature vector. Note that categorical features in $X$ need to be both recoded and dummy coded. 
- \item[{\tt Y}:] - Location (on HDFS) to read the matrix of (categorical) - labels that correspond to feature vectors in $X$. Note that classes are assumed to be both recoded and dummy coded. - This argument is optional for prediction. - \item[{\tt R}:] (default:\mbox{ }{\tt " "}) - Location (on HDFS) to read matrix $R$ which for each feature in $X$ contains column-ids (first column), start indices (second column), and end indices (third column). - If $R$ is not provided, all features are by default assumed to be continuous-valued. - \item[{\tt bins}:] (default:\mbox{ }{\tt 20}) - Number of thresholds to choose for each continuous-valued feature (determined by equi-height binning). - \item[{\tt depth}:] (default:\mbox{ }{\tt 25}) - Maximum depth of the learned trees in the random forest model - \item[{\tt num\_leaf}:] (default:\mbox{ }{\tt 10}) - Parameter that controls pruning. The tree - is not expanded if a node receives fewer than {\tt num\_leaf} training examples. - \item[{\tt num\_samples}:] (default:\mbox{ }{\tt 3000}) - Parameter that decides when to switch to in-memory building of the subtrees in each tree of the random forest model. - If a node $v$ receives fewer than {\tt num\_samples} - training examples, this implementation switches to an in-memory subtree - building procedure to build the subtree under $v$ in its entirety. - \item[{\tt num\_trees}:] (default:\mbox{ }{\tt 10}) - Number of trees to be learned in the random forest model - \item[{\tt subsamp\_rate}:] (default:\mbox{ }{\tt 1.0}) - Parameter controlling the size of the sample used to build each tree in the random forest model; per-record sample counts are drawn from a Poisson distribution with parameter {\tt subsamp\_rate}. - \item[{\tt feature\_subset}:] (default:\mbox{ }{\tt 0.5}) - Parameter that controls the number of features used as candidates for splitting at each tree node, expressed as a power of the number of features in the data, i.e., assuming the training set has $D$ features, $D^{\tt feature\_subset}$ features are used as split candidates at each tree node. - \item[{\tt impurity}:] (default:\mbox{ }{\tt "Gini"}) - Impurity measure used at internal nodes of the trees in the random forest model for selecting which features to split on. Possible values are entropy or Gini. - \item[{\tt M}:] - Location (on HDFS) to write matrix $M$ containing the learned random forest (see Section~\ref{sec:decision_trees} and below for the schema) - \item[{\tt C}:] (default:\mbox{ }{\tt " "}) - Location (on HDFS) to store the sample counts (generated according to a Poisson distribution with parameter {\tt subsamp\_rate}) for each feature vector. Note that this argument is optional. If the Out-Of-Bag (OOB) error estimate needs to be computed, this parameter is passed as input to {\tt random-forest-predict.dml}. - \item[{\tt A}:] (default:\mbox{ }{\tt " "}) - Location (on HDFS) to store the testing accuracy (\%) from a - held-out test set during prediction. Note that this argument is optional. - \item[{\tt OOB}:] (default:\mbox{ }{\tt " "}) - Location (on HDFS) to store the Out-Of-Bag (OOB) error estimate of the training set. Note that the matrix of sample counts (stored at {\tt C}) needs to be provided for computing the OOB error estimate. Note that this argument is optional. - \item[{\tt P}:] - Location (on HDFS) to store predictions for a held-out test set - \item[{\tt CM}:] (default:\mbox{ }{\tt " "}) - Location (on HDFS) to store the confusion matrix computed using a held-out test set. Note that this argument is optional.
- \item[{\tt S\_map}:] (default:\mbox{ }{\tt " "}) - Location (on HDFS) to write the mappings from the continuous-valued feature-ids to the global feature-ids in $X$ (see below for details). Note that this argument is optional. - \item[{\tt C\_map}:] (default:\mbox{ }{\tt " "}) - Location (on HDFS) to write the mappings from the categorical feature-ids to the global feature-ids in $X$ (see below for details). Note that this argument is optional. - \item[{\tt fmt}:] (default:\mbox{ }{\tt "text"}) - Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv}; - see read/write functions in SystemML Language Reference for details. -\end{Description} - - - \noindent{\bf Details} - \smallskip - -Random forests~\cite{Breiman01:rforest} are learning algorithms for ensembles of decision trees. -The main idea is to build a number of decision trees on bootstrapped training samples, i.e., by repeatedly taking samples from a (single) training set. -Moreover, instead of considering all the features when building the trees, only a random subset of the features---typically $\approx \sqrt{D}$, where $D$ is the number of features---is chosen each time a split test at a tree node is performed. -This procedure {\it decorrelates} the trees and makes the resulting ensemble less prone to overfitting. -To build decision trees we utilize the techniques discussed in Section~\ref{sec:decision_trees} and proposed in~\cite{PandaHBB09:dtree}; -the implementation details are similar to those of the decision trees script. -Below we review some features of our implementation which differ from {\tt decision-tree.dml}. - - -\textbf{Bootstrapped sampling.} -Each decision tree is fitted to a bootstrapped training set sampled with replacement (WR). -To improve efficiency, we generate $N$ sample counts according to a Poisson distribution with parameter {\tt subsamp\_rate}, -where $N$ denotes the total number of training points. -These sample counts approximate WR sampling when $N$ is large enough and are generated upfront for each decision tree. - - -\textbf{Bagging.} -Decision trees suffer from {\it high variance}, resulting in different models whenever they are trained on random subsets of the data points. -{\it Bagging} is a general-purpose method to reduce the variance of a statistical learning method like decision trees. -In the context of decision trees (for classification), for a given test feature vector -the prediction is computed by taking a {\it majority vote}: the overall prediction is the most commonly occurring class among all the tree predictions. - - -\textbf{Out-Of-Bag error estimation.} -Note that each bagged tree in a random forest model is trained on a subset (around $\frac{2}{3}$) of the observations (i.e., feature vectors). -The remaining ($\frac{1}{3}$ of the) observations not used for training are called the {\it Out-Of-Bag} (OOB) observations. -This gives us a straightforward way to estimate the test error: to predict the class label of each training observation $i$ we use only the trees in which $i$ was OOB. -Our {\tt random-forest-predict.dml} script provides the OOB error estimate for a given training set if requested. - - -\textbf{Description of the model.} -Similar to decision trees, the learned random forest model is presented in a matrix $M$ with at least 7 rows. -The information stored in the model is similar to that of decision trees with the difference that the tree-ids are stored -in the second row and rows $2,3,\ldots$ from the decision tree model are shifted by one. See Section~\ref{sec:decision_trees} for a description of the model.
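To make the bootstrapped sampling and Out-Of-Bag estimation above concrete, here is a small Python sketch that uses scikit-learn's {\tt DecisionTreeClassifier} as a stand-in for the tree-building procedure of Section~\ref{sec:decision_trees}; the names and the sample-weight trick are illustrative assumptions, not the {\tt random-forest.dml} implementation. Each tree is fitted with Poisson-distributed sample counts as instance weights, and a record is predicted only by the trees for which its count was zero.

\begin{verbatim}
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def rf_oob_sketch(X, y, num_trees=10, subsamp_rate=1.0,
                  feature_subset=0.5, seed=0):
    """Illustrative bagging with Poisson sample counts and an OOB error estimate."""
    rng = np.random.default_rng(seed)
    n, D = X.shape
    max_feat = max(1, int(round(D ** feature_subset)))  # D^feature_subset candidates per split
    classes = np.unique(y)
    votes = np.zeros((n, classes.size))
    trees = []
    for _ in range(num_trees):
        counts = rng.poisson(subsamp_rate, size=n)      # approximates sampling with replacement
        tree = DecisionTreeClassifier(max_features=max_feat,
                                      random_state=int(rng.integers(1 << 31)))
        tree.fit(X, y, sample_weight=counts)            # weight 0 => record is OOB for this tree
        trees.append(tree)
        oob = counts == 0
        if oob.any():                                   # vote only with trees that did not see a record
            pred = tree.predict(X[oob])
            for idx, c in enumerate(classes):
                votes[oob, idx] += (pred == c)
    covered = votes.sum(axis=1) > 0                     # records that were OOB at least once
    oob_error = np.mean(classes[votes.argmax(axis=1)][covered] != y[covered])
    return trees, oob_error
\end{verbatim}

Averaging the misclassification rate only over records that were OOB for at least one tree mirrors the OOB error estimate that {\tt random-forest-predict.dml} reports when the sample counts (argument {\tt C}) are provided.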
- - -\smallskip -\noindent{\bf Returns} -\smallskip - - -The matrix corresponding to the learned model is written to a file in the format specified. See Section~\ref{sec:decision_trees} for details about the structure of the model matrix. -Similar to {\tt decision-tree.dml}, $X$ is split into $X_\text{cont}$ and $X_\text{cat}$. -If requested, the mappings of the continuous feature-ids in $X_\text{cont}$ (stored at {\tt S\_map}) as well as the categorical feature-ids in $X_\text{cat}$ (stored at {\tt C\_map}) to the global feature-ids in $X$ will be provided. -The {\tt random-forest-predict.dml} script may compute one or more of -predictions, accuracy, the confusion matrix, and the OOB error estimate in the requested output format, depending on the input arguments used. - - - -\smallskip -\noindent{\bf Examples} -\smallskip - -{\hangindent=\parindent\noindent\tt - \hml -f random-forest.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx - R=/user/biadmin/R.csv M=/user/biadmin/model.csv - bins=20 depth=25 num\_leaf=10 num\_samples=3000 num\_trees=10 impurity=Gini fmt=csv - -}\smallskip - - -\noindent To compute predictions: - -{\hangindent=\parindent\noindent\tt - \hml -f random-forest-predict.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx R=/user/biadmin/R.csv - M=/user/biadmin/model.csv P=/user/biadmin/predictions.csv - A=/user/biadmin/accuracy.csv CM=/user/biadmin/confusion.csv fmt=csv - -}\smallskip - - -%\noindent{\bf References} -% -%\begin{itemize} -%\item B. Panda, J. Herbach, S. Basu, and R. Bayardo. \newblock{PLANET: massively parallel learning of tree ensembles with MapReduce}. In Proceedings of the VLDB Endowment, 2009. -%\item L. Breiman. \newblock{Random Forests}. Machine Learning, 45(1), 5--32, 2001. -%\end{itemize} http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/StepGLM.tex ---------------------------------------------------------------------- diff --git a/Algorithms Reference/StepGLM.tex b/Algorithms Reference/StepGLM.tex deleted file mode 100644 index 3869990..0000000 --- a/Algorithms Reference/StepGLM.tex +++ /dev/null @@ -1,132 +0,0 @@ -\begin{comment} - - Licensed to the Apache Software Foundation (ASF) under one - or more contributor license agreements. See the NOTICE file - distributed with this work for additional information - regarding copyright ownership. The ASF licenses this file - to you under the Apache License, Version 2.0 (the - "License"); you may not use this file except in compliance - with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, - software distributed under the License is distributed on an - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - KIND, either express or implied. See the License for the - specific language governing permissions and limitations - under the License. - -\end{comment} - -\subsection{Stepwise Generalized Linear Regression} - -\noindent{\bf Description} -\smallskip - -Our stepwise generalized linear regression script selects a model based on the Akaike information criterion (AIC): the model that gives rise to the lowest AIC is provided. Note that currently only the Bernoulli distribution family is supported (see below for details).
\\ - -\smallskip -\noindent{\bf Usage} -\smallskip - -{\hangindent=\parindent\noindent\it% -{\tt{}-f }path/\/{\tt{}StepGLM.dml} -{\tt{} -nvargs} -{\tt{} X=}path/file -{\tt{} Y=}path/file -{\tt{} B=}path/file -{\tt{} S=}path/file -{\tt{} O=}path/file -{\tt{} link=}int -{\tt{} yneg=}double -{\tt{} icpt=}int -{\tt{} tol=}double -{\tt{} disp=}double -{\tt{} moi=}int -{\tt{} mii=}int -{\tt{} thr=}double -{\tt{} fmt=}format - -} - - -\smallskip -\noindent{\bf Arguments} -\begin{Description} - \item[{\tt X}:] - Location (on HDFS) to read the matrix of feature vectors; each row is - an example. - \item[{\tt Y}:] - Location (on HDFS) to read the response matrix, which may have 1 or 2 columns - \item[{\tt B}:] - Location (on HDFS) to store the estimated regression parameters (the $\beta_j$'s), with the - intercept parameter~$\beta_0$ at position {\tt B[}$m\,{+}\,1$, {\tt 1]} if available - \item[{\tt S}:] (default:\mbox{ }{\tt " "}) - Location (on HDFS) to store the selected feature-ids in the order as computed by the algorithm, - by default it is standard output. - \item[{\tt O}:] (default:\mbox{ }{\tt " "}) - Location (on HDFS) to write certain summary statistics described in Table~\ref{table:GLM:stats}, - by default it is standard output. - \item[{\tt link}:] (default:\mbox{ }{\tt 2}) - Link function code to determine the link function~$\eta = g(\mu)$, see Table~\ref{table:commonGLMs}; currently the following link functions are supported: \\ - {\tt 1} = log, - {\tt 2} = logit, - {\tt 3} = probit, - {\tt 4} = cloglog. - \item[{\tt yneg}:] (default:\mbox{ }{\tt 0.0}) - Response value for Bernoulli ``No'' label, usually 0.0 or -1.0 - \item[{\tt icpt}:] (default:\mbox{ }{\tt 0}) - Intercept and shifting/rescaling of the features in~$X$:\\ - {\tt 0} = no intercept (hence no~$\beta_0$), no shifting/rescaling of the features;\\ - {\tt 1} = add intercept, but do not shift/rescale the features in~$X$;\\ - {\tt 2} = add intercept, shift/rescale the features in~$X$ to mean~0, variance~1 - \item[{\tt tol}:] (default:\mbox{ }{\tt 0.000001}) - Tolerance (epsilon) used in the convergence criterion: we terminate the outer iterations - when the deviance changes by less than this factor; see below for details. - \item[{\tt disp}:] (default:\mbox{ }{\tt 0.0}) - Dispersion parameter, or {\tt 0.0} to estimate it from data - \item[{\tt moi}:] (default:\mbox{ }{\tt 200}) - Maximum number of outer (Fisher scoring) iterations - \item[{\tt mii}:] (default:\mbox{ }{\tt 0}) - Maximum number of inner (conjugate gradient) iterations, or~0 if no maximum - limit provided - \item[{\tt thr}:] (default:\mbox{ }{\tt 0.01}) - Threshold to stop the algorithm: if the decrease in the value of the AIC falls below {\tt thr} - no further features are being checked and the algorithm stops. - \item[{\tt fmt}:] (default:\mbox{ }{\tt "text"}) - Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv}; - see read/write functions in SystemML Language Reference for details. -\end{Description} - - -\noindent{\bf Details} -\smallskip - -Similar to {\tt StepLinearRegDS.dml} our stepwise GLM script builds a model by iteratively selecting predictive variables -using a forward selection strategy based on the AIC (\ref{eq:AIC}). -Note that currently only the Bernoulli distribution family ({\tt fam=2} in Table~\ref{table:commonGLMs}) together with the following link functions are supported: log, logit, probit, and cloglog ({\tt link $\in\{1,2,3,4\}$ } in Table~\ref{table:commonGLMs}). 
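For intuition about how the AIC drives the search, the following Python sketch performs the same kind of greedy forward selection for a Bernoulli response, using {\tt statsmodels} with the default logit link (i.e., {\tt link=2}) as a stand-in solver; the function name and stopping rule are illustrative assumptions, not the Fisher-scoring implementation in {\tt StepGLM.dml}.

\begin{verbatim}
import numpy as np
import statsmodels.api as sm

def stepwise_bernoulli_aic(X, y, thr=0.01):
    """Greedy forward selection by AIC for a Bernoulli GLM (illustrative sketch)."""
    n, d = X.shape
    selected, remaining = [], list(range(d))
    # AIC of the intercept-only model
    best_aic = sm.GLM(y, np.ones((n, 1)), family=sm.families.Binomial()).fit().aic
    while remaining:
        candidates = []
        for j in remaining:
            cols = np.column_stack([np.ones(n)] + [X[:, k] for k in selected + [j]])
            candidates.append((sm.GLM(y, cols, family=sm.families.Binomial()).fit().aic, j))
        aic, j = min(candidates)          # candidate feature with the lowest AIC
        if best_aic - aic < thr:          # stop once the improvement falls below thr
            break
        best_aic, selected = aic, selected + [j]
        remaining.remove(j)
    return selected, best_aic
\end{verbatim}

The list {\tt selected} plays the role of matrix $S$ below: it records the chosen feature indices in the order in which they lowered the AIC.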
- - -\smallskip -\noindent{\bf Returns} -\smallskip - -Similar to the outputs from {\tt GLM.dml} the stepwise GLM script computes the estimated regression coefficients and stores them in matrix $B$ on HDFS; matrix $B$ follows the same format as the one produced by {\tt GLM.dml} (see Section~\ref{sec:GLM}). -Additionally, {\tt StepGLM.dml} outputs the variable indices (stored in the 1-column matrix $S$) in the order they have been selected by the algorithm, i.e., $i$th entry in matrix $S$ stores the variable which improves the AIC the most in $i$th iteration. -If the model with the lowest AIC includes no variables matrix $S$ will be empty. -Moreover, the estimated summary statistics as defined in Table~\ref{table:GLM:stats} -are printed out or stored in a file on HDFS (if requested); -these statistics will be provided only if the selected model is nonempty, i.e., contains at least one variable. - - -\smallskip -\noindent{\bf Examples} -\smallskip - -{\hangindent=\parindent\noindent\tt - \hml -f StepGLM.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx B=/user/biadmin/B.mtx S=/user/biadmin/selected.csv O=/user/biadmin/stats.csv link=2 yneg=-1.0 icpt=2 tol=0.000001 moi=100 mii=10 thr=0.05 fmt=csv - -} - - http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/StepLinRegDS.tex ---------------------------------------------------------------------- diff --git a/Algorithms Reference/StepLinRegDS.tex b/Algorithms Reference/StepLinRegDS.tex deleted file mode 100644 index 8c29fb1..0000000 --- a/Algorithms Reference/StepLinRegDS.tex +++ /dev/null @@ -1,122 +0,0 @@ -\begin{comment} - - Licensed to the Apache Software Foundation (ASF) under one - or more contributor license agreements. See the NOTICE file - distributed with this work for additional information - regarding copyright ownership. The ASF licenses this file - to you under the Apache License, Version 2.0 (the - "License"); you may not use this file except in compliance - with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, - software distributed under the License is distributed on an - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - KIND, either express or implied. See the License for the - specific language governing permissions and limitations - under the License. - -\end{comment} - -\subsection{Stepwise Linear Regression} - -\noindent{\bf Description} -\smallskip - -Our stepwise linear regression script selects a linear model based on the Akaike information criterion (AIC): -the model that gives rise to the lowest AIC is computed. \\ - -\smallskip -\noindent{\bf Usage} -\smallskip - -{\hangindent=\parindent\noindent\it% -{\tt{}-f }path/\/{\tt{}StepLinearRegDS.dml} -{\tt{} -nvargs} -{\tt{} X=}path/file -{\tt{} Y=}path/file -{\tt{} B=}path/file -{\tt{} S=}path/file -{\tt{} O=}path/file -{\tt{} icpt=}int -{\tt{} thr=}double -{\tt{} fmt=}format - -} - -\smallskip -\noindent{\bf Arguments} -\begin{Description} -\item[{\tt X}:] -Location (on HDFS) to read the matrix of feature vectors, each row contains -one feature vector. 
-\item[{\tt Y}:] -Location (on HDFS) to read the 1-column matrix of response values -\item[{\tt B}:] -Location (on HDFS) to store the estimated regression parameters (the $\beta_j$'s), with the -intercept parameter~$\beta_0$ at position {\tt B[}$m\,{+}\,1$, {\tt 1]} if available -\item[{\tt S}:] (default:\mbox{ }{\tt " "}) -Location (on HDFS) to store the selected feature-ids in the order computed by the algorithm; -by default the selected feature-ids are forwarded to the standard output. -\item[{\tt O}:] (default:\mbox{ }{\tt " "}) -Location (on HDFS) to store the CSV-file of summary statistics defined in -Table~\ref{table:linreg:stats}; by default the summary statistics are forwarded to the standard output. -\item[{\tt icpt}:] (default:\mbox{ }{\tt 0}) -Intercept presence and shifting/rescaling the features in~$X$:\\ -{\tt 0} = no intercept (hence no~$\beta_0$), no shifting or rescaling of the features;\\ -{\tt 1} = add intercept, but do not shift/rescale the features in~$X$;\\ -{\tt 2} = add intercept, shift/rescale the features in~$X$ to mean~0, variance~1 -\item[{\tt thr}:] (default:\mbox{ }{\tt 0.01}) -Threshold to stop the algorithm: if the decrease in the value of the AIC falls below {\tt thr}, -no further features are checked and the algorithm stops. -\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"}) -Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv}; -see read/write functions in SystemML Language Reference for details. -\end{Description} - - -\noindent{\bf Details} -\smallskip - -Stepwise linear regression iteratively selects predictive variables in an automated procedure. -Currently, our implementation supports forward selection: starting from an empty model (without any variables), -the algorithm examines the addition of each variable based on the AIC as a model comparison criterion. The AIC is defined as -\begin{equation} -AIC = -2 \log{L} + 2 edf,\label{eq:AIC} -\end{equation} -where $L$ denotes the likelihood of the fitted model and $edf$ is the equivalent degrees of freedom, i.e., the number of estimated parameters. -This procedure is repeated until no additional variable improves the model by at least the threshold -specified in the input parameter {\tt thr}. - -For fitting a model in each iteration we use the ``direct solve'' method as in the script {\tt LinearRegDS.dml} discussed in Section~\ref{sec:LinReg}. - - -\smallskip -\noindent{\bf Returns} -\smallskip - -Similar to the outputs from {\tt LinearRegDS.dml}, the stepwise linear regression script computes -the estimated regression coefficients and stores them in matrix $B$ on HDFS. -The format of matrix $B$ is identical to the one produced by the scripts for linear regression (see Section~\ref{sec:LinReg}). -Additionally, {\tt StepLinearRegDS.dml} outputs the variable indices (stored in the 1-column matrix $S$) -in the order they have been selected by the algorithm, i.e., the $i$th entry in matrix $S$ corresponds to -the variable which improves the AIC the most in the $i$th iteration. -If the model with the lowest AIC includes no variables, matrix $S$ will be empty (i.e., it contains a single 0). -Moreover, the estimated summary statistics as defined in Table~\ref{table:linreg:stats} -are printed out or stored in a file (if requested). -In the case where an empty model achieves the best AIC, these statistics will not be produced.
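As a back-of-the-envelope illustration of how the AIC in (\ref{eq:AIC}) is evaluated for each candidate model, the following Python sketch fits a Gaussian linear model by the ``direct solve'' method (normal equations with a small ridge term, as in {\tt LinearRegDS.dml}) and computes the AIC up to an additive constant that is identical for every candidate model and therefore cancels in comparisons; the helper names are hypothetical and this is not the {\tt StepLinearRegDS.dml} implementation.

\begin{verbatim}
import numpy as np

def direct_solve(X, y, reg=1e-6):
    """Fit a linear model with intercept via the normal equations (small ridge term)."""
    Xi = np.column_stack([np.ones(len(y)), X])   # prepend an intercept column
    A = Xi.T @ Xi + reg * np.eye(Xi.shape[1])
    beta = np.linalg.solve(A, Xi.T @ y)
    return beta, Xi

def aic_gaussian(X, y, reg=1e-6):
    """AIC = -2 log L + 2 edf for a Gaussian linear model, dropping constant
    terms shared by all candidate models (comparisons are unaffected)."""
    beta, Xi = direct_solve(X, y, reg)
    n = len(y)
    rss = float(np.sum((y - Xi @ beta) ** 2))    # residual sum of squares
    edf = Xi.shape[1]                            # number of estimated coefficients
    return n * np.log(rss / n) + 2 * edf
\end{verbatim}

A forward-selection loop would call {\tt aic\_gaussian} on each candidate column set, keep the feature giving the largest AIC decrease, and stop once that decrease falls below {\tt thr}, as described in the Details above.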
- - -\smallskip -\noindent{\bf Examples} -\smallskip - -{\hangindent=\parindent\noindent\tt - \hml -f StepLinearRegDS.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx - B=/user/biadmin/B.mtx S=/user/biadmin/selected.csv O=/user/biadmin/stats.csv - icpt=2 thr=0.05 fmt=csv - -} - - http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/SystemML_Algorithms_Reference.bib ---------------------------------------------------------------------- diff --git a/Algorithms Reference/SystemML_Algorithms_Reference.bib b/Algorithms Reference/SystemML_Algorithms_Reference.bib deleted file mode 100644 index 878e1dc..0000000 --- a/Algorithms Reference/SystemML_Algorithms_Reference.bib +++ /dev/null @@ -1,215 +0,0 @@ - -@article {Lin2008:logistic, - author = {Chih-Jen Lin and Ruby C.\ Weng and S.\ Sathiya Keerthi}, - title = {Trust Region {N}ewton Method for Large-Scale Logistic Regression}, - journal = {Journal of Machine Learning Research}, - month = {April}, - year = {2008}, - volume = {9}, - pages = {627--650} -} - -@book {Agresti2002:CDA, - author = {Alan Agresti}, - title = {Categorical Data Analysis}, - edition = {Second}, - series = {Wiley Series in Probability and Statistics}, - publisher = {Wiley-Interscience}, - year = {2002}, - pages = {710} -} - -@article {Nelder1972:GLM, - author = {John Ashworth Nelder and Robert William Maclagan Wedderburn}, - title = {Generalized Linear Models}, - journal = {Journal of the Royal Statistical Society, Series~A (General)}, - year = {1972}, - volume = {135}, - number = {3}, - pages = {370--384} -} - -@book {McCullagh1989:GLM, - author = {Peter McCullagh and John Ashworth Nelder}, - title = {Generalized Linear Models}, - edition = {Second}, - series = {Monographs on Statistics and Applied Probability}, - number = {37}, - year = {1989}, - publisher = {Chapman~\&~Hall/CRC}, - pages = {532} -} - -@book {Gill2000:GLM, - author = {Jeff Gill}, - title = {Generalized Linear Models: A Unified Approach}, - series = {Sage University Papers Series on Quantitative Applications in the Social Sciences}, - number = {07-134}, - year = {2000}, - publisher = {Sage Publications}, - pages = {101} -} - -@inproceedings {AgrawalKSX2002:hippocratic, - author = {Rakesh Agrawal and Jerry Kiernan and Ramakrishnan Srikant and Yirong Xu}, - title = {Hippocratic Databases}, - booktitle = {Proceedings of the 28-th International Conference on Very Large Data Bases ({VLDB} 2002)}, - address = {Hong Kong, China}, - month = {August 20--23}, - year = {2002}, - pages = {143--154} -} - -@book {Nocedal2006:Optimization, - title = {Numerical Optimization}, - author = {Jorge Nocedal and Stephen Wright}, - series = {Springer Series in Operations Research and Financial Engineering}, - pages = {664}, - edition = {Second}, - publisher = {Springer}, - year = {2006} -} - -@book {Hartigan1975:clustering, - author = {John A.\ Hartigan}, - title = {Clustering Algorithms}, - publisher = {John Wiley~\&~Sons Inc.}, - series = {Probability and Mathematical Statistics}, - month = {April}, - year = {1975}, - pages = {365} -} - -@inproceedings {ArthurVassilvitskii2007:kmeans, - title = {{\tt k-means++}: The Advantages of Careful Seeding}, - author = {David Arthur and Sergei Vassilvitskii}, - booktitle = {Proceedings of the 18th Annual {ACM-SIAM} Symposium on Discrete Algorithms ({SODA}~2007)}, - month = {January 7--9}, - year = {2007}, - address = {New Orleans~{LA}, {USA}}, - pages = {1027--1035} -} - -@article {AloiseDHP2009:kmeans, - author = {Daniel Aloise and Amit 
Deshpande and Pierre Hansen and Preyas Popat}, - title = {{NP}-hardness of {E}uclidean Sum-of-squares Clustering}, - journal = {Machine Learning}, - publisher = {Kluwer Academic Publishers}, - volume = {75}, - number = {2}, - month = {May}, - year = {2009}, - pages = {245--248} -} - -@article {Cochran1954:chisq, - author = {William G.\ Cochran}, - title = {Some Methods for Strengthening the Common $\chi^2$ Tests}, - journal = {Biometrics}, - volume = {10}, - number = {4}, - month = {December}, - year = {1954}, - pages = {417--451} -} - -@article {AcockStavig1979:CramersV, - author = {Alan C.\ Acock and Gordon R.\ Stavig}, - title = {A Measure of Association for Nonparametric Statistics}, - journal = {Social Forces}, - publisher = {Oxford University Press}, - volume = {57}, - number = {4}, - month = {June}, - year = {1979}, - pages = {1381--1386} -} - -@article {Stevens1946:scales, - author = {Stanley Smith Stevens}, - title = {On the Theory of Scales of Measurement}, - journal = {Science}, - month = {June 7}, - year = {1946}, - volume = {103}, - number = {2684}, - pages = {677--680} -} - -@book{collett2003:kaplanmeier, - title={Modelling Survival Data in Medical Research, Second Edition}, - author={Collett, D.}, - isbn={9781584883258}, - lccn={2003040945}, - series={Chapman \& Hall/CRC Texts in Statistical Science}, - year={2003}, - publisher={Taylor \& Francis} -} - -@article{PetoPABCHMMPS1979:kaplanmeier, - title = {{Design and analysis of randomized clinical trials requiring prolonged observation of each patient. II. analysis and examples.}}, - author = {Peto, R. and Pike, M. C. and Armitage, P. and Breslow, N. E. and Cox, D. R. and Howard, S. V. and Mantel, N. and McPherson, K. and Peto, J. and Smith, P. G.}, - journal = {British journal of cancer}, - number = {1}, - pages = {1--39}, - volume = {35}, - year = {1977} -} - -@inproceedings{ZhouWSP08:als, - author = {Yunhong Zhou and - Dennis M. Wilkinson and - Robert Schreiber and - Rong Pan}, - title = {Large-Scale Parallel Collaborative Filtering for the Netflix Prize}, - booktitle = {Algorithmic Aspects in Information and Management, 4th International - Conference, {AAIM} 2008, Shanghai, China, June 23-25, 2008. Proceedings}, - pages = {337--348}, - year = {2008} -} - -@book{BreimanFOS84:dtree, - author = {Leo Breiman and - J. H. Friedman and - R. A. Olshen and - C. J. Stone}, - title = {Classification and Regression Trees}, - publisher = {Wadsworth}, - year = {1984}, - isbn = {0-534-98053-8}, - timestamp = {Thu, 03 Jan 2002 11:51:52 +0100}, - biburl = {http://dblp.uni-trier.de/rec/bib/books/wa/BreimanFOS84}, - bibsource = {dblp computer science bibliography, http://dblp.org} -} - -@article{PandaHBB09:dtree, - author = {Biswanath Panda and - Joshua Herbach and - Sugato Basu and - Roberto J. 
Bayardo}, - title = {{PLANET:} Massively Parallel Learning of Tree Ensembles with MapReduce}, - journal = {{PVLDB}}, - volume = {2}, - number = {2}, - pages = {1426--1437}, - year = {2009}, - url = {http://www.vldb.org/pvldb/2/vldb09-537.pdf}, - timestamp = {Wed, 02 Sep 2009 09:21:18 +0200}, - biburl = {http://dblp.uni-trier.de/rec/bib/journals/pvldb/PandaHBB09}, - bibsource = {dblp computer science bibliography, http://dblp.org} -} - -@article{Breiman01:rforest, - author = {Leo Breiman}, - title = {Random Forests}, - journal = {Machine Learning}, - volume = {45}, - number = {1}, - pages = {5--32}, - year = {2001}, - url = {http://dx.doi.org/10.1023/A:1010933404324}, - doi = {10.1023/A:1010933404324}, - timestamp = {Thu, 26 May 2011 15:25:18 +0200}, - biburl = {http://dblp.uni-trier.de/rec/bib/journals/ml/Breiman01}, - bibsource = {dblp computer science bibliography, http://dblp.org} -} \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/SystemML_Algorithms_Reference.pdf ---------------------------------------------------------------------- diff --git a/Algorithms Reference/SystemML_Algorithms_Reference.pdf b/Algorithms Reference/SystemML_Algorithms_Reference.pdf deleted file mode 100644 index 4087ba5..0000000 Binary files a/Algorithms Reference/SystemML_Algorithms_Reference.pdf and /dev/null differ