As you point out, there are two ways to encode a categorical variable with n possible values. Contrast coding uses (n-1) binary indicator variables; direct (1-of-n) coding uses n.
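To make the two encodings concrete, here is a small NumPy sketch (the level names and data are made up for illustration). It shows that once you add an intercept, the 1-of-n design matrix is rank-deficient, while the contrast design is full rank:

```python
import numpy as np

# Toy categorical variable with n = 3 levels (names are illustrative).
levels = ["red", "green", "blue"]
data = ["red", "green", "blue", "green", "red"]

# Direct (1-of-n) encoding: one indicator column per level.
one_hot = np.array([[1.0 if x == lvl else 0.0 for lvl in levels] for x in data])

# Contrast (1-of-(n-1)) encoding: drop the first level as the reference.
contrast = one_hot[:, 1:]

# With an intercept column, the 1-of-n design is rank-deficient:
# the n indicator columns sum exactly to the intercept column.
intercept = np.ones((len(data), 1))
X_full = np.hstack([intercept, one_hot])       # n + 1 columns, rank n
X_contrast = np.hstack([intercept, contrast])  # n columns, full column rank

print(np.linalg.matrix_rank(X_full))      # 3, not 4
print(np.linalg.matrix_rank(X_contrast))  # 3
```

The rank deficiency of X_full is exactly the source of the singular normal equations discussed below.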
The singularity problem that you mention definitely occurs. It comes from the fact that we now have (n+1) variables (the n indicators plus the intercept) with, effectively, only n constraints. With no other information, the problem is under-determined, which leads to singularity in the numerical solution if you use a second-order method, or to unbounded wandering of the redundant coefficients if you are using stochastic gradient descent.

In large problems, however, having lots of variables for only limited amounts of data is pretty ubiquitous. In fact, it is common to have more variables than observations, possibly vastly more, and many of these variables are often essentially restatements of other variables. This means several things:

1 - using direct (1-of-n) encoding versus contrast (1-of-(n-1)) encoding makes no difference to the under-determined nature of the problem

2 - you have to use some method for dealing with under-determined systems to handle the too-many-variables, too-little-data problem, and variable selection isn't going to work

3 - you have to build in a solution for collinearity as well

The answer here is to use some kind of regularization. For logistic regression, I strongly recommend that you try out L1 (the lasso technique) or a combination of L1 and L2 (elastic net) regularization. In R, the best library I have found for this is glmnet; one particular benefit of glmnet is that it handles sparse matrices well. The SGD implementation in Mahout also supports L1 or L1+L2 regularization quite easily. I wouldn't call that implementation state of the art, but it may do the job for you. If your problem fits into glmnet, that is a great option. If it is too large for R, consider H2O's solvers.

On Mon, Sep 22, 2014 at 11:43 AM, Aymen J <a...@hotmail.fr> wrote:

> Hi List,
>
> I'm using Mahout Logistic Regression for a prediction task. As a test, I
> try the classification task with one single feature, a categorical one with
> 26 levels.
> When I run the Logistic regression on R or Python, I expect 25
> coefficients (corresponding to 25 out of the 26 levels, due to the
> "contrast coding") + the intercept. However, when I run it on Mahout, I
> have 26 coefficients + the intercept. Is there any way to force the
> contrast coding on Mahout (i.e. consider one of the levels as the default
> level)? Isn't there a risk of matrix singularity by considering the 26
> levels in the logistic regression?
>
> Let me know if it's not clear. Thanks in advance for your answers,
>
> Aymen
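P.S. To make the regularization suggestion concrete, here is a minimal sketch in Python, using scikit-learn as a stand-in for glmnet or Mahout's SGD (the synthetic data and parameter choices are illustrative only). With an L1 penalty, the fit is well-posed even with the full 26-column 1-of-n encoding plus an intercept:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data: one categorical feature with 26 levels, direct 1-of-n encoding.
n_levels, n_samples = 26, 2000
cat = rng.integers(0, n_levels, size=n_samples)
X = np.eye(n_levels)[cat]                      # 26 indicator columns
true_w = rng.normal(size=n_levels)             # a made-up effect per level
y = (true_w[cat] + rng.normal(scale=0.5, size=n_samples) > 0).astype(int)

# L1 (lasso) regularization: the intercept plus all 26 indicators would be
# singular for an unpenalized fit, but the penalty makes the problem
# well-determined and shrinks redundant coefficients toward zero.
model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
model.fit(X, y)

print(model.coef_.shape)   # (1, 26): one coefficient per level, plus intercept
```

glmnet's `alpha` parameter plays the analogous role in R, interpolating between L2 (alpha=0) and L1 (alpha=1), with intermediate values giving the elastic net.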