Yes.
On Tue, Feb 4, 2014 at 1:31 AM, Sebastian Schelter <s...@apache.org> wrote:

> Would be great to add this as an example to Mahout's codebase.
>
>
> On 02/04/2014 10:27 AM, Ted Dunning wrote:
>
>> Frank,
>>
>> I just munched on your code and sent a pull request.
>>
>> In doing this, I made a bunch of changes. Hope you like them.
>>
>> These include a massive simplification of the reading and vectorization.
>> This wasn't strictly necessary, but it seemed like a good idea.
>>
>> More important was the way that I changed the vectorization. For the
>> continuous values, I added log transforms. For the categorical values, I
>> encoded them as they are. I also increased the feature vector size to 100
>> to avoid excessive collisions.
>>
>> In the learning code itself, I got rid of the use of index arrays in
>> favor of shuffling the training data itself. I also tuned the learning
>> parameters a lot.
>>
>> The resulting AUC is just a tiny bit less than 0.9, which is pretty
>> close to what I got in R.
>>
>> For everybody else, see
>> https://github.com/tdunning/mahout-sgd-bank-marketing for my version and
>> https://github.com/tdunning/mahout-sgd-bank-marketing/compare/frankscholten:master...master
>> for my pull request.
>>
>>
>> On Mon, Feb 3, 2014 at 3:57 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>
>>> Johannes,
>>>
>>> Very good comments.
>>>
>>> Frank,
>>>
>>> As a benchmark, I just spent a few minutes building a logistic
>>> regression model using R. For this model, AUC on 10% held-out data is
>>> about 0.9.
>>>
>>> Here is a gist summarizing the results:
>>>
>>> https://gist.github.com/tdunning/8794734
>>>
>>>
>>> On Mon, Feb 3, 2014 at 2:41 PM, Johannes Schulte <
>>> johannes.schu...@gmail.com> wrote:
>>>
>>>> Hi Frank,
>>>>
>>>> You are using the feature vector encoders, which hash a combination of
>>>> feature name and feature value to 2 (default) locations in the vector.
>>>> The vector size you configured is 11, which is IMO very small relative
>>>> to the possible combinations of values in your data (education,
>>>> marital, campaign). You can do no harm by using a much bigger
>>>> cardinality (try 1000).
>>>>
>>>> Second, you are using a continuous value encoder and passing in the
>>>> weight as a string (e.g. the variable "pDays"). I am not quite sure
>>>> about the reasons in the Mahout code right now, but the way it is
>>>> implemented, every unique value should end up in a different location,
>>>> because the continuous value is part of the hashing. Try adding the
>>>> weight directly using a static word value encoder:
>>>> addToVector("pDays", v, pDays).
>>>>
>>>> Last, you are also putting in the variable "campaign" as a continuous
>>>> variable when it should probably be a categorical variable, so just
>>>> add it with a StaticWordValueEncoder.
>>>>
>>>> And finally, and probably most important after looking at your target
>>>> variable: you are using a Dictionary for mapping either yes or no to
>>>> 0 or 1. This is bad. Depending on what comes first in the data set,
>>>> either a positive or a negative example might be 0 or 1, totally at
>>>> random. Make a hard mapping from the possible values (yes/no?) to zero
>>>> and one, with yes as 1 and no as 0.
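To make these suggestions concrete, here is a minimal sketch of the
encoding, assuming Mahout 0.9's encoder classes (StaticWordValueEncoder,
ConstantValueEncoder, RandomAccessSparseVector). The field names and the
cardinality of 1000 are just the values discussed above, so treat this as
an illustration rather than the actual code in either repository:

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class BankCallEncoder {

  // Encode one telephone-call record. Field names follow the thread
  // (education, marital, campaign, pDays); everything else is illustrative.
  public static Vector encode(String education, String marital,
                              String campaign, double pDays) {
    // Much bigger cardinality than 11, per the advice above.
    Vector v = new RandomAccessSparseVector(1000);

    // Intercept term: ConstantValueEncoder hashes the feature name only.
    new ConstantValueEncoder("intercept").addToVector("1", v);

    // Categorical variables: name + value hashed to 2 probe locations each.
    new StaticWordValueEncoder("education").addToVector(education, v);
    new StaticWordValueEncoder("marital").addToVector(marital, v);
    // "campaign" treated as categorical, not continuous.
    new StaticWordValueEncoder("campaign").addToVector(campaign, v);

    // Continuous variable: pass the value as the weight, so the hashed
    // locations depend only on the feature name, never on the value itself.
    new StaticWordValueEncoder("pDays").addToVector("pDays", pDays, v);

    return v;
  }

  // Hard target mapping: yes -> 1, anything else -> 0.
  public static int target(String answer) {
    return "yes".equals(answer) ? 1 : 0;
  }
}

Note that in Mahout 0.9 the argument order is
addToVector(originalForm, weight, vector), so the continuous value goes in
as the weight argument.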
>>>>
>>>> On Mon, Feb 3, 2014 at 9:33 PM, Frank Scholten <fr...@frankscholten.nl>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am exploring Mahout's SGD classifier and would like some feedback,
>>>>> because I think I didn't properly configure things.
>>>>>
>>>>> I created an example app that trains an SGD classifier on the 'bank
>>>>> marketing' dataset from UCI:
>>>>> http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
>>>>>
>>>>> My app is at:
>>>>> https://github.com/frankscholten/mahout-sgd-bank-marketing
>>>>>
>>>>> The app reads a CSV file of telephone calls, encodes the features
>>>>> into a vector, and tries to predict whether a customer answers yes to
>>>>> a business proposal.
>>>>>
>>>>> I do a few runs and measure accuracy, but I don't trust the results.
>>>>> When I only use an intercept term as a feature I get around 88%
>>>>> accuracy, and when I add all features it drops to around 85%. Is this
>>>>> perhaps because the dataset is highly unbalanced? Most customers
>>>>> answer no. Or is the classifier biased to predict 0 as the target
>>>>> code when it doesn't have any data to go on?
>>>>>
>>>>> Any other comments about my code or improvements I can make in the
>>>>> app are welcome! :)
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Frank
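For anyone reproducing the AUC numbers discussed in this thread, here is a
minimal sketch of a shuffle/train/evaluate loop in the spirit of the changes
described above, assuming Mahout 0.9's OnlineLogisticRegression and Auc
classes. The Example holder class, the 90/10 split, and the learningRate and
lambda values are illustrative stand-ins, not the tuned parameters from the
pull request:

import java.util.Collections;
import java.util.List;

import org.apache.mahout.classifier.evaluation.Auc;
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.Vector;

public class TrainAndEvaluate {

  // Hypothetical holder pairing a 0/1 target with its encoded features.
  public static final class Example {
    final int target;
    final Vector features;
    public Example(int target, Vector features) {
      this.target = target;
      this.features = features;
    }
  }

  public static double run(List<Example> examples) {
    // Shuffle the training data itself instead of keeping an index array.
    Collections.shuffle(examples);

    // 2 categories, cardinality 1000 to match the encoder sketch above;
    // learningRate and lambda are placeholders that need real tuning.
    OnlineLogisticRegression model =
        new OnlineLogisticRegression(2, 1000, new L1())
            .learningRate(0.5)
            .lambda(1.0e-4);

    // Train on the first 90%, hold out the last 10%.
    int cut = (int) (0.9 * examples.size());
    for (Example e : examples.subList(0, cut)) {
      model.train(e.target, e.features);
    }

    // Score the held-out examples and accumulate AUC.
    Auc auc = new Auc();
    for (Example e : examples.subList(cut, examples.size())) {
      auc.add(e.target, model.classifyScalar(e.features));
    }
    return auc.auc();
  }
}

Shuffling the list itself keeps the training loop simple, and evaluating on
held-out data with Auc rather than raw accuracy sidesteps the problem Frank
ran into: on an unbalanced dataset, always predicting "no" already scores
high accuracy, while AUC is insensitive to the class imbalance.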