Would be great to add this as an example to Mahout's codebase.

On 02/04/2014 10:27 AM, Ted Dunning wrote:
Frank,

I just munched on your code and sent a pull request.

In doing this, I made a bunch of changes.  Hope you like them.

These include a massive simplification of the reading and vectorization.
This wasn't strictly necessary, but it seemed like a good idea.

More important was the way that I changed the vectorization.  For the
continuous values, I added log transforms.  For the categorical values, I
encoded them as they are.  I also increased the feature vector size to 100
to avoid excessive collisions.

In the learning code itself, I got rid of the use of index arrays in favor
of shuffling the training data itself.  I also tuned the learning
parameters a lot.
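
The shuffling change can be sketched roughly as follows. This is a plain
Python illustration and not the code from the pull request; `run_epochs`,
`train_one`, `passes` and `seed` are hypothetical names chosen for the
example.

```python
import random

def run_epochs(examples, train_one, passes=3, seed=42):
    """Shuffle the training data itself before each pass, then train on it.

    `examples` is a list of (features, label) pairs. `train_one` is a
    hypothetical callback that performs one SGD update.
    """
    rng = random.Random(seed)
    for _ in range(passes):
        rng.shuffle(examples)  # reorder the data in place, no index array
        for features, label in examples:
            train_one(features, label)

# Record the order in which examples are presented over two passes.
seen = []
data = [([float(i)], i % 2) for i in range(10)]
run_epochs(data, lambda x, y: seen.append((x[0], y)), passes=2)
```

Every pass still visits each example exactly once; only the order changes
between passes.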

The resulting AUC is just a tiny bit less than 0.9, which is pretty close
to what I got in R.

For everybody else, see
https://github.com/tdunning/mahout-sgd-bank-marketing for my version and
https://github.com/tdunning/mahout-sgd-bank-marketing/compare/frankscholten:master...master
for my pull request.



On Mon, Feb 3, 2014 at 3:57 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:


Johannes,

Very good comments.

Frank,

As a benchmark, I just spent a few minutes building a logistic regression
model using R.  For this model, AUC on 10% held-out data is about 0.9.
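
For readers unfamiliar with the metric: AUC is the probability that a
randomly chosen positive example gets a higher score than a randomly chosen
negative one, with ties counted as one half. A minimal sketch of that
definition (not the R code in the gist; the function name `auc` is chosen
for the example):

```python
def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison.

    O(n^2), which is fine for illustration on small held-out sets.
    Labels are 1 for positive and 0 for negative.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))
```

Unlike accuracy, this number does not depend on a decision threshold, which
makes it more informative when one class dominates.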

Here is a gist summarizing the results:

https://gist.github.com/tdunning/8794734




On Mon, Feb 3, 2014 at 2:41 PM, Johannes Schulte <
johannes.schu...@gmail.com> wrote:

Hi Frank,

you are using the feature vector encoders, which hash a combination of
feature name and feature value to 2 (default) locations in the vector. The
vector size you configured is 11, which is imo very small compared to the
number of possible value combinations in your data (education, marital,
campaign). You can do no harm by using a much bigger cardinality (try
1000).
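
The collision argument can be illustrated with a small sketch. It uses MD5
as a stand-in hash and made-up category values, so it does not reproduce
Mahout's encoders or the real dataset; `probe_locations` and
`occupied_slots` are hypothetical helpers.

```python
import hashlib

def probe_locations(name, value, size, probes=2):
    """Hash a (feature name, value) pair to `probes` slots in a vector.

    Illustrative only: Mahout's encoders use their own hash function.
    """
    slots = []
    for i in range(probes):
        digest = hashlib.md5(f"{name}:{value}:{i}".encode("utf-8")).digest()
        slots.append(int.from_bytes(digest[:4], "big") % size)
    return slots

def occupied_slots(features, size):
    """Return the set of slots touched by all (name, value) features."""
    used = set()
    for name, value in features:
        used.update(probe_locations(name, value, size))
    return used

# Hypothetical category values, loosely modeled on the columns named above.
features = (
    [("education", v) for v in ("primary", "secondary", "tertiary", "unknown")]
    + [("marital", v) for v in ("single", "married", "divorced")]
    + [("campaign", str(v)) for v in range(1, 31)]
)

for size in (11, 1000):
    used = occupied_slots(features, size)
    print(f"size={size}: {2 * len(features)} probes land in {len(used)} slots")
```

With 37 distinct values and 2 probes each, a vector of size 11 must share
slots between unrelated features, while a size of 1000 leaves most probes
in a slot of their own.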

Second, you are using a continuous value encoder and passing in the weight
as a string (e.g. variable "pDays"). I am not quite sure about the reasons
in the Mahout code right now, but the way it is implemented now, every
unique value should end up in a different location because the continuous
value is part of the hashing. Try adding the weight directly using a
static word value encoder: addToVector("pDays",v,pDays)

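
The difference between the two encodings can be sketched like this. It is a
toy model of the behaviour described above, not Mahout's implementation;
the helper names and the MD5 hash are assumptions made for the example.

```python
import hashlib

def slot(key, size):
    """Map a string key to a slot index (stand-in hash, not Mahout's)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % size

def encode_value_in_hash(vector, name, value):
    """Problematic variant: the value takes part in the hash, weight is 1."""
    vector[slot(f"{name}:{value}", len(vector))] += 1.0

def encode_value_as_weight(vector, name, value):
    """Suggested variant: only the name is hashed, the value is the weight."""
    vector[slot(name, len(vector))] += float(value)

size = 1000
in_hash_slots = set()
as_weight_slots = set()
for p_days in (3, 7, 40, 180, 999):  # hypothetical pDays values
    v1 = [0.0] * size
    encode_value_in_hash(v1, "pDays", p_days)
    in_hash_slots.add(v1.index(1.0))

    v2 = [0.0] * size
    encode_value_as_weight(v2, "pDays", p_days)
    as_weight_slots.add(v2.index(float(p_days)))
```

In the first variant the values are scattered over several slots, so the
model cannot learn a single coefficient for the variable. In the second,
all values share one slot and differ only in magnitude.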
Last, you are also putting in the variable "campaign" as a continuous
variable, which should probably be a categorical variable, so just add it
with a StaticWordValueEncoder.

And finally, and probably most important after looking at your target
variable: you are using a Dictionary for mapping either yes or no to 0 or
1. This is bad. Depending on what comes first in the data set, either a
positive or a negative example might be 0 or 1, totally at random. Make a
hard mapping from the possible values (y/n?) to zero and one, with yes as
1 and no as 0.
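
A minimal sketch of the problem and the fix, in plain Python rather than
with Mahout's Dictionary class. `dictionary_label` mimics a first-seen
dictionary and `fixed_label` is the hard mapping; both names are
hypothetical.

```python
def dictionary_label(dictionary, raw):
    """Order-dependent mapping: the first label seen gets code 0."""
    if raw not in dictionary:
        dictionary[raw] = len(dictionary)
    return dictionary[raw]

LABELS = {"no": 0, "yes": 1}

def fixed_label(raw):
    """Hard mapping: 'yes' is always 1 and 'no' is always 0."""
    return LABELS[raw.strip().lower()]

# The same labels in a different order get different codes.
d1, d2 = {}, {}
codes_no_first = [dictionary_label(d1, y) for y in ("no", "yes", "no")]
codes_yes_first = [dictionary_label(d2, y) for y in ("yes", "no", "no")]
```

With the dictionary, "yes" is 1 in the first run and 0 in the second, so
the meaning of the predicted class flips with the row order. The fixed
mapping also fails loudly with a KeyError on an unexpected label.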





On Mon, Feb 3, 2014 at 9:33 PM, Frank Scholten <fr...@frankscholten.nl>
wrote:

Hi all,

I am exploring Mahout's SGD classifier and would like some feedback
because I think I didn't properly configure things.

I created an example app that trains an SGD classifier on the 'bank
marketing' dataset from UCI:
http://archive.ics.uci.edu/ml/datasets/Bank+Marketing

My app is at:
https://github.com/frankscholten/mahout-sgd-bank-marketing

The app reads a CSV file of telephone calls, encodes the features into a
vector and tries to predict whether a customer answers yes to a business
proposal.

I do a few runs and measure accuracy, but I don't trust the results.
When I only use an intercept term as a feature I get around 88% accuracy,
and when I add all features it drops to around 85%. Is this perhaps
because the dataset is highly unbalanced? Most customers answer no. Or is
the classifier biased to predict 0 as the target code when it doesn't have
any data to go with?
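
One way to check the imbalance hypothesis is to compare against the
accuracy of always predicting the most frequent label. A minimal sketch,
using a made-up label distribution chosen to match the 88% figure rather
than the real dataset counts:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Hypothetical distribution: 88 "no" (0) and 12 "yes" (1) answers.
labels = [0] * 88 + [1] * 12
baseline = majority_baseline_accuracy(labels)
print(f"majority-class baseline accuracy: {baseline:.2f}")
```

If the intercept-only model scores about the same as this baseline, it is
most likely just predicting the majority class every time.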

Any other comments about my code or improvements I can make in the app
are welcome! :)

Cheers,

Frank





