Thanks to you too, Johannes, for your comments!

On Tue, Feb 4, 2014 at 7:39 PM, Frank Scholten <fr...@frankscholten.nl> wrote:

> Thanks Ted!
>
> Would indeed be a nice example to add.
>
>
> On Tue, Feb 4, 2014 at 10:40 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
>> Yes.
>>
>>
>> On Tue, Feb 4, 2014 at 1:31 AM, Sebastian Schelter <s...@apache.org> wrote:
>>
>> > Would be great to add this as an example to Mahout's codebase.
>> >
>> >
>> > On 02/04/2014 10:27 AM, Ted Dunning wrote:
>> >
>> >> Frank,
>> >>
>> >> I just munched on your code and sent a pull request.
>> >>
>> >> In doing this, I made a bunch of changes.  Hope you liked them.
>> >>
>> >> These include a massive simplification of the reading and
>> >> vectorization. This wasn't strictly necessary, but it seemed like a
>> >> good idea.
>> >>
>> >> More important was the way that I changed the vectorization. For the
>> >> continuous values, I added log transforms. The categorical values I
>> >> encoded as they are. I also increased the feature vector size to 100
>> >> to avoid excessive collisions.
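>> >>
>> >> In code, the encoding now looks roughly like this (a simplified
>> >> sketch; the field names are just illustrative, the real thing is in
>> >> the repo linked below):
>> >>
>> >>   import org.apache.mahout.math.RandomAccessSparseVector;
>> >>   import org.apache.mahout.math.Vector;
>> >>   import org.apache.mahout.vectorizer.encoders.ContinuousValueEncoder;
>> >>   import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;
>> >>
>> >>   // sketch: one continuous and one categorical feature
>> >>   Vector encode(double age, String job) {
>> >>     Vector v = new RandomAccessSparseVector(100);  // 100 slots instead of 11
>> >>
>> >>     // continuous values: log-transformed; the hash depends on the name only
>> >>     new ContinuousValueEncoder("age")
>> >>         .addToVector(Double.toString(Math.log1p(age)), v);
>> >>
>> >>     // categorical values: encoded as they are
>> >>     new StaticWordValueEncoder("job").addToVector(job, v);
>> >>     return v;
>> >>   }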
>> >>
>> >> In the learning code itself, I got rid of the use of index arrays in
>> >> favor of shuffling the training data itself. I also tuned the learning
>> >> parameters a lot.
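>> >>
>> >> In outline (again a sketch; Example here is a stand-in for whatever
>> >> record type holds the target and the feature vector):
>> >>
>> >>   import java.util.Collections;
>> >>   import java.util.List;
>> >>   import java.util.Random;
>> >>   import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
>> >>
>> >>   // one training pass: shuffle the examples themselves, no index array
>> >>   void trainPass(OnlineLogisticRegression learner, List<Example> examples) {
>> >>     Collections.shuffle(examples, new Random(42));  // reproducible shuffle
>> >>     for (Example e : examples) {
>> >>       learner.train(e.target(), e.vector());        // target is 0 or 1
>> >>     }
>> >>   }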
>> >>
>> >> The result is that the AUC is just a tiny bit less than 0.9, which is
>> >> pretty close to what I got in R.
>> >>
>> >> For everybody else, see
>> >> https://github.com/tdunning/mahout-sgd-bank-marketing for my version and
>> >> https://github.com/tdunning/mahout-sgd-bank-marketing/compare/frankscholten:master...master
>> >> for my pull request.
>> >>
>> >>
>> >>
>> >> On Mon, Feb 3, 2014 at 3:57 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>> >>
>> >>
>> >>> Johannes,
>> >>>
>> >>> Very good comments.
>> >>>
>> >>> Frank,
>> >>>
>> >>> As a benchmark, I just spent a few minutes building a logistic
>> >>> regression model using R.  For this model, AUC on 10% held-out data
>> >>> is about 0.9.
>> >>>
>> >>> Here is a gist summarizing the results:
>> >>>
>> >>> https://gist.github.com/tdunning/8794734
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Mon, Feb 3, 2014 at 2:41 PM, Johannes Schulte <johannes.schu...@gmail.com> wrote:
>> >>>
>> >>>> Hi Frank,
>> >>>>
>> >>>> you are using the feature vector encoders, which hash a combination
>> >>>> of feature name and feature value to 2 (default) locations in the
>> >>>> vector. The vector size you configured is 11, and this is imo very
>> >>>> small compared to the possible combinations of values you have in
>> >>>> your data (education, marital, campaign). You can do no harm by
>> >>>> using a much bigger cardinality (try 1000).
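>> >>>>
>> >>>> For example (just a sketch):
>> >>>>
>> >>>>   import org.apache.mahout.math.RandomAccessSparseVector;
>> >>>>   import org.apache.mahout.math.Vector;
>> >>>>
>> >>>>   // each encoder probes 2 locations per value by default, so a much
>> >>>>   // larger cardinality leaves room to avoid collisions
>> >>>>   Vector v = new RandomAccessSparseVector(1000);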
>> >>>>
>> >>>> Second, you are using a continuous value encoder and passing in the
>> >>>> weight as a string (e.g. the variable "pDays"). I am not quite sure
>> >>>> about the reasons in the Mahout code right now, but the way it is
>> >>>> implemented, every unique value should end up in a different
>> >>>> location because the continuous value is part of the hashing. Try
>> >>>> adding the weight directly using a StaticWordValueEncoder:
>> >>>> addToVector("pDays", pDays, v).
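>> >>>>
>> >>>> Spelled out (a sketch, assuming pDays has already been parsed to a
>> >>>> double and v is your feature vector):
>> >>>>
>> >>>>   import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;
>> >>>>
>> >>>>   StaticWordValueEncoder pDaysEncoder = new StaticWordValueEncoder("pDays");
>> >>>>   // hash the fixed word "pDays" and add the numeric value as the
>> >>>>   // weight, so every record updates the same locations
>> >>>>   pDaysEncoder.addToVector("pDays", pDays, v);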
>> >>>>
>> >>>> Last, you are also putting in the variable "campaign" as a
>> >>>> continuous variable, but it should probably be a categorical
>> >>>> variable, so just add it with a StaticWordValueEncoder.
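>> >>>>
>> >>>> For example (sketch):
>> >>>>
>> >>>>   StaticWordValueEncoder campaignEncoder = new StaticWordValueEncoder("campaign");
>> >>>>   // the raw value is treated as a category; the weight defaults to 1
>> >>>>   campaignEncoder.addToVector(String.valueOf(campaign), v);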
>> >>>>
>> >>>> And finally, and probably most important after looking at your
>> >>>> target variable: you are using a Dictionary for mapping either yes
>> >>>> or no to 0 or 1. This is bad. Depending on what comes first in the
>> >>>> data set, either a positive or a negative example might become 0 or
>> >>>> 1, totally at random. Make a hard mapping from the possible values
>> >>>> (y/n?) to zero and one, with yes as 1 and no as 0.
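>> >>>>
>> >>>> Something as simple as this would do (a sketch; I am guessing the
>> >>>> raw label strings):
>> >>>>
>> >>>>   // deterministic target mapping, independent of input order
>> >>>>   int target = "yes".equals(rawLabel) ? 1 : 0;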
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Mon, Feb 3, 2014 at 9:33 PM, Frank Scholten <fr...@frankscholten.nl>
>> >>>> wrote:
>> >>>>
>> >>>>> Hi all,
>> >>>>>
>> >>>>> I am exploring Mahout's SGD classifier and would like some feedback
>> >>>>> because I think I didn't configure things properly.
>> >>>>>
>> >>>>> I created an example app that trains an SGD classifier on the 'bank
>> >>>>> marketing' dataset from UCI:
>> >>>>> http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
>> >>>>>
>> >>>>> My app is at:
>> >>>>> https://github.com/frankscholten/mahout-sgd-bank-marketing
>> >>>>>
>> >>>>>
>> >>>>> The app reads a CSV file of telephone calls, encodes the features
>> >>>>> into a vector, and tries to predict whether a customer answers yes
>> >>>>> to a business proposal.
>> >>>>>
>> >>>>> I do a few runs and measure accuracy, but I don't trust the
>> >>>>> results. When I only use an intercept term as a feature I get
>> >>>>> around 88% accuracy, and when I add all features it drops to around
>> >>>>> 85%. Is this perhaps because the dataset is highly unbalanced? Most
>> >>>>> customers answer no, so a model that always predicts no would
>> >>>>> already score around 88%. Or is the classifier biased to predict 0
>> >>>>> as the target code when it doesn't have any data to go with?
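>> >>>>>
>> >>>>> Maybe accuracy is simply the wrong metric for such skewed classes;
>> >>>>> I could measure AUC instead, e.g. with Mahout's Auc class (a sketch
>> >>>>> of what I have in mind):
>> >>>>>
>> >>>>>   import org.apache.mahout.classifier.evaluation.Auc;
>> >>>>>
>> >>>>>   Auc auc = new Auc();
>> >>>>>   // for each held-out example: the true target (0 or 1) plus the
>> >>>>>   // model's score for class 1
>> >>>>>   auc.add(target, learner.classifyScalar(v));
>> >>>>>   double area = auc.auc();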
>> >>>>>
>> >>>>> Any other comments about my code or improvements I can make in the
>> >>>>> app are welcome! :)
>> >>>>>
>> >>>>> Cheers,
>> >>>>>
>> >>>>> Frank
>> >>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>>
>> >>
>> >
>>
>
>
