Frank, I just munched on your code and sent a pull request.

In doing this I made a bunch of changes; hope you like them. They include a big
simplification of the reading and vectorization code. That wasn't strictly
necessary, but it seemed like a good idea. More important are the changes to the
vectorization itself: for the continuous values I added log transforms, and I
encoded the categorical values as they are. I also increased the feature vector
size to 100 to avoid excessive hash collisions. In the learning code itself, I
got rid of the index arrays in favor of shuffling the training data directly,
and I tuned the learning parameters quite a bit. The resulting AUC is just a
tiny bit under 0.9, which is pretty close to what I got in R.
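In code, the new encoding looks roughly like this. This is a sketch rather than
the exact code from the pull request; the class and column names
(TelephoneCallEncoder, "job", "balance", "pdays") are only illustrative:

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    // Sketch of the reworked vectorization; not the PR code verbatim.
    public class TelephoneCallEncoder {
        private static final int FEATURES = 100;  // was 11; fewer hash collisions

        private final ConstantValueEncoder biasEncoder =
            new ConstantValueEncoder("intercept");
        private final StaticWordValueEncoder encoder =
            new StaticWordValueEncoder("feature");

        public Vector encode(String job, String marital, double balance, double pdays) {
            Vector v = new RandomAccessSparseVector(FEATURES);
            biasEncoder.addToVector("1", v);              // intercept term
            // categorical values are encoded as they are
            encoder.addToVector("job:" + job, v);
            encoder.addToVector("marital:" + marital, v);
            // continuous values get a log transform and go in as weights
            // (clamped at 0 because balance can be negative and pdays is -1
            // for customers never contacted before)
            encoder.addToVector("balance", Math.log1p(Math.max(0, balance)), v);
            encoder.addToVector("pdays", Math.log1p(Math.max(0, pdays)), v);
            return v;
        }
    }

The index-array change amounts to nothing more than a
Collections.shuffle(examples, random) on the list of (target, vector) pairs
before each training pass.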
For everybody else, see
https://github.com/tdunning/mahout-sgd-bank-marketing for my version and
https://github.com/tdunning/mahout-sgd-bank-marketing/compare/frankscholten:master...master
for my pull request.

On Mon, Feb 3, 2014 at 3:57 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Johannes,
>
> Very good comments.
>
> Frank,
>
> As a benchmark, I just spent a few minutes building a logistic regression
> model using R. For this model, AUC on 10% held-out data is about 0.9.
>
> Here is a gist summarizing the results:
>
> https://gist.github.com/tdunning/8794734
>
>
> On Mon, Feb 3, 2014 at 2:41 PM, Johannes Schulte <
> johannes.schu...@gmail.com> wrote:
>
>> Hi Frank,
>>
>> You are using the feature vector encoders, which hash a combination of
>> feature name and feature value to 2 (default) locations in the vector.
>> The vector size you configured is 11, which is imo very small for the
>> possible combinations of values in your data (education, marital,
>> campaign). You can do no harm by using a much bigger cardinality (try
>> 1000).
>>
>> Second, you are using a continuous value encoder and passing in the
>> weight as a string (e.g. the variable "pDays"). I am not quite sure
>> about the reasons in the Mahout code right now, but the way it is
>> implemented, every unique value should end up in a different location
>> because the continuous value is part of the hashing. Try adding the
>> weight directly using a static word value encoder:
>> addToVector("pDays", pDays, v).
>>
>> Third, you are also putting in the variable "campaign" as a continuous
>> variable when it should probably be a categorical variable, so just add
>> it with a StaticWordValueEncoder.
>>
>> And finally, probably most important after looking at your target
>> variable: you are using a Dictionary for mapping "yes" or "no" to 0 or
>> 1. This is bad. Depending on what comes first in the data set, either a
>> positive or a negative example might be 0 or 1, totally at random. Make
>> a hard mapping from the possible values (yes/no?) to zero and one, with
>> yes as 1 and no as 0.
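Pulling Johannes' four suggestions together, they amount to roughly the
following. This is a sketch against the Mahout encoder API, not code from
either repo; "outcome" stands for the target column of the CSV:

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class EncodingFixes {
        private static final StaticWordValueEncoder ENCODER =
            new StaticWordValueEncoder("feature");

        static Vector encode(String campaign, double pDays) {
            // 1. a much bigger cardinality than 11
            Vector v = new RandomAccessSparseVector(1000);
            // 2. pass the continuous value directly as a weight; passed as
            //    a string it becomes part of the hash, so every distinct
            //    value lands in its own location
            ENCODER.addToVector("pDays", pDays, v);
            // 3. treat campaign as categorical, not continuous
            ENCODER.addToVector("campaign:" + campaign, v);
            return v;
        }

        // 4. hard mapping for the target instead of a first-come dictionary
        static int target(String outcome) {
            return "yes".equals(outcome) ? 1 : 0;
        }
    }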
>> On Mon, Feb 3, 2014 at 9:33 PM, Frank Scholten <fr...@frankscholten.nl>
>> wrote:
>>
>> > Hi all,
>> >
>> > I am exploring Mahout's SGD classifier and would like some feedback,
>> > because I think I didn't configure things properly.
>> >
>> > I created an example app that trains an SGD classifier on the 'bank
>> > marketing' dataset from UCI:
>> > http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
>> >
>> > My app is at: https://github.com/frankscholten/mahout-sgd-bank-marketing
>> >
>> > The app reads a CSV file of telephone calls, encodes the features into
>> > a vector, and tries to predict whether a customer answers yes to a
>> > business proposal.
>> >
>> > I do a few runs and measure accuracy, but I don't trust the results.
>> > When I only use an intercept term as a feature I get around 88%
>> > accuracy, and when I add all the features it drops to around 85%. Is
>> > this perhaps because the dataset is highly unbalanced? Most customers
>> > answer no. Or is the classifier biased to predict 0 as the target
>> > category when it doesn't have any data to go on?
>> >
>> > Any other comments about my code or improvements I can make in the app
>> > are welcome! :)
>> >
>> > Cheers,
>> >
>> > Frank
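A footnote on those accuracy numbers, since they are what prompted the thread:
about 88% of the customers in this dataset answer no, so the intercept-only
model is essentially the majority-class baseline. A back-of-the-envelope check
(the 11.7% yes rate comes from the UCI dataset description):

    // Always predicting "no" on data that is ~11.7% "yes" scores about
    // 88% accuracy, which matches the intercept-only run above.
    double yesRate = 0.117;                   // from the UCI page
    double baselineAccuracy = 1.0 - yesRate;  // ~0.883

This is why the later replies in the thread measure AUC rather than raw
accuracy.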