Frank,

I just munched on your code and sent a pull request.

In doing this, I made a bunch of changes.  Hope you like them.

These include a massive simplification of the reading and vectorization
code. This wasn't strictly necessary, but it seemed like a good idea.

More important was how I changed the vectorization.  For the continuous
values, I added log transforms.  The categorical values I encoded as-is.
I also increased the feature vector size to 100 to avoid excessive
collisions.
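
In outline, the new encoding looks something like this (just a sketch with
illustrative field names, not the literal code from the pull request):

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class EncodingSketch {
      private static final ConstantValueEncoder BIAS =
          new ConstantValueEncoder("intercept");
      private static final StaticWordValueEncoder FEATURES =
          new StaticWordValueEncoder("features");

      static Vector encode(double age, double duration,
                           String marital, String education) {
        Vector v = new RandomAccessSparseVector(100);  // was 11
        BIAS.addToVector("", 1, v);                    // intercept term
        // continuous values get a log transform before hashing
        FEATURES.addToVector("age", Math.log1p(age), v);
        FEATURES.addToVector("duration", Math.log1p(duration), v);
        // categorical values are hashed as name=value with unit weight
        FEATURES.addToVector("marital=" + marital, 1, v);
        FEATURES.addToVector("education=" + education, 1, v);
        return v;
      }
    }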

In the learning code itself, I got rid of the index arrays in favor of
shuffling the training data itself.  I also tuned the learning parameters
quite a bit.
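
The training loop now amounts to something like this (again a sketch; the
parameter values shown are illustrative, the ones I settled on are in the
PR):

    import java.util.Collections;
    import java.util.List;
    import java.util.Random;
    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.Vector;

    public class TrainingSketch {
      static class Example {
        final int target;       // 1 = yes, 0 = no
        final Vector features;
        Example(int target, Vector features) {
          this.target = target;
          this.features = features;
        }
      }

      static OnlineLogisticRegression train(List<Example> examples) {
        OnlineLogisticRegression lr =
            new OnlineLogisticRegression(2, 100, new L1())
                .learningRate(0.5)
                .lambda(1e-5);
        Random rand = new Random(42);
        for (int pass = 0; pass < 20; pass++) {
          // shuffle the training data itself rather than an index array
          Collections.shuffle(examples, rand);
          for (Example ex : examples) {
            lr.train(ex.target, ex.features);
          }
        }
        return lr;
      }
    }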

The resulting AUC is just a tiny bit less than 0.9, which is pretty close
to what I got in R.

For everybody else, see
https://github.com/tdunning/mahout-sgd-bank-marketing for my version and
https://github.com/tdunning/mahout-sgd-bank-marketing/compare/frankscholten:master...master
for my pull request.



On Mon, Feb 3, 2014 at 3:57 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

>
> Johannes,
>
> Very good comments.
>
> Frank,
>
> As a benchmark, I just spent a few minutes building a logistic regression
> model using R.  For this model, AUC on 10% held-out data is about 0.9.
>
> Here is a gist summarizing the results:
>
> https://gist.github.com/tdunning/8794734
>
>
>
>
> On Mon, Feb 3, 2014 at 2:41 PM, Johannes Schulte <johannes.schu...@gmail.com> wrote:
>
>> Hi Frank,
>>
>> You are using the feature vector encoders, which hash a combination of
>> feature name and feature value to 2 (default) locations in the vector.
>> The vector size you configured is 11, which is IMO very small relative to
>> the possible combinations of values in your data (education, marital,
>> campaign). You can do no harm by using a much bigger cardinality (try
>> 1000).
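>>
>> A quick sketch of what I mean (only the vector size really changes; the
>> classes are from org.apache.mahout.math and
>> org.apache.mahout.vectorizer.encoders):
>>
>>     Vector v = new RandomAccessSparseVector(1000);      // was 11
>>     StaticWordValueEncoder enc = new StaticWordValueEncoder("marital");
>>     enc.setProbes(2);                  // the default: 2 locations per value
>>     enc.addToVector("married", 1, v);  // name + value hashed into 1000 slots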
>>
>> Second, you are using a continuous value encoder and passing in the
>> weight as a string (e.g. the variable "pDays"). I am not quite sure about
>> the reasons in the Mahout code right now, but the way it is implemented,
>> every unique value ends up in a different location because the continuous
>> value itself is part of the hashing. Try adding the weight directly using
>> a static word value encoder: addToVector("pDays", pDays, v).
>>
>> Third, you are also putting in the variable "campaign" as a continuous
>> variable, although it should probably be a categorical variable, so just
>> add it with a StaticWordValueEncoder, as in the sketch below.
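>>
>> In code, something like this ("v" is your feature vector):
>>
>>     StaticWordValueEncoder pdaysEnc = new StaticWordValueEncoder("pDays");
>>     pdaysEnc.addToVector("pDays", pDays, v);    // numeric value as weight
>>
>>     StaticWordValueEncoder campaignEnc = new StaticWordValueEncoder("campaign");
>>     campaignEnc.addToVector(Integer.toString(campaign), 1, v);  // categorical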
>>
>> And finally, probably most important after looking at your target
>> variable: you are using a Dictionary for mapping either yes or no to 0 or
>> 1. This is bad. Depending on what comes first in the data set, either a
>> positive or a negative example might become 0 or 1, totally at random.
>> Make a hard mapping from the possible values (y/n?) to zero and one, with
>> yes mapping to 1 and no to 0.
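>>
>> For example (assuming the labels really are "yes"/"no"):
>>
>>     // hard mapping, independent of the order examples appear in
>>     int target = "yes".equals(label) ? 1 : 0;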
>>
>>
>>
>>
>>
>> On Mon, Feb 3, 2014 at 9:33 PM, Frank Scholten <fr...@frankscholten.nl> wrote:
>>
>> > Hi all,
>> >
>> > I am exploring Mahout's SGD classifier and would like some feedback,
>> > because I think I didn't configure things properly.
>> >
>> > I created an example app that trains an SGD classifier on the 'bank
>> > marketing' dataset from UCI:
>> > http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
>> >
>> > My app is at:
>> > https://github.com/frankscholten/mahout-sgd-bank-marketing
>> >
>> > The app reads a CSV file of telephone calls, encodes the features into a
>> > vector and tries to predict whether a customer answers yes to a business
>> > proposal.
>> >
>> > I do a few runs and measure accuracy, but I don't trust the results.
>> > When I only use an intercept term as a feature I get around 88% accuracy,
>> > and when I add all features it drops to around 85%. Is this perhaps
>> > because the dataset is highly unbalanced? Most customers answer no. Or is
>> > the classifier biased to predict 0 as the target code when it doesn't
>> > have any data to go on?
>> >
>> > Any other comments about my code or improvements I can make in the app
>> > are welcome! :)
>> >
>> > Cheers,
>> >
>> > Frank
>> >
>>
>
>
