Christian,

All of what you say makes reasonable sense, but I think that you put too
much weight on the current uses of the API which are warped by the initial
logistic regression implementation.

The heart is classifyFull.  It returns scores which, by convention, are
large for the more likely categories in 1-of-n problems.  Whether these
scores represent a discrete distribution is not specified.

The rest of the classify* methods are convenience functions which may
allocate less memory or require less code on the part of the caller.  For
instance, in binary logistic regression, returning a single score is
sufficient.  Similarly, getting back n-1 scores may be slightly cheaper
than returning all n (for logistic regression).
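
To make the convention concrete, here is a sketch of how a default
classifyFull can be built on top of an n-1 score classify() under the
sum-to-1 assumption.  This is roughly what AbstractVectorClassifier does
today, though the exact code in trunk may differ:

  import org.apache.mahout.classifier.AbstractVectorClassifier;
  import org.apache.mahout.math.Vector;

  // Sketch only (hypothetical class name, not the exact trunk code).
  public abstract class SketchClassifier extends AbstractVectorClassifier {
    // Derive the full n scores from the n-1 scores that classify()
    // returns: categories 1..n-1 get their scores directly, and
    // category 0 gets whatever is left over -- which is only
    // meaningful when the scores are assumed to sum to 1.
    @Override
    public Vector classifyFull(Vector r, Vector instance) {
      r.viewPart(1, numCategories() - 1).assign(classify(instance));
      r.setQuick(0, 1.0 - r.zSum());
      return r;
    }
  }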

With MLP, the classifyFull call makes lots of sense.  Whether you normalize
and return a distribution is your business.  It is nice to be as flexible
as you say.  It is also nice to have the convenience method that picks the
largest score.  If you are providing scores that are probabilities, then it
makes folks' lives a bit more familiar if you support the methods that
return n-1 scores, but throwing UnsupportedOperationException is probably
just fine as well.
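
For instance, something like this in a hypothetical MLP subclass (in the
same style as the sketch above, using the mutuallyExclusiveClasses flag
you proposed) would be perfectly reasonable:

  // Sketch: refuse the n-1 score convention when the outputs are
  // independent, since the missing score cannot be reconstructed
  // without the sum-to-1 assumption.
  @Override
  public Vector classify(Vector instance) {
    if (!mutuallyExclusiveClasses) {
      throw new UnsupportedOperationException(
          "n-1 scores assume sum_i p_i = 1; use classifyFull() instead");
    }
    return classifyFull(instance).viewPart(1, numCategories() - 1);
  }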

I really think that you are worrying too much here.


On Sun, Feb 12, 2012 at 8:11 AM, Herta, Christian <
christian.he...@htw-berlin.de> wrote:

> Hello Ted,
>
> thanks for the fast reply.
> Maybe I didn't express myself clearly. In the first case (n mutually
> exclusive classes) classify and the current implementation of
> classifyFull in AbstractVectorClassifier make sense. The implementation
> uses the assumption sum_i p_i = 1, which is valid here.
>
> But in the second case (n independent decisions) only classifyFull(..)
> can be applied, because sum_i p_i = 1 (where p_i is the probability of
> class i) doesn't hold. That's what I wanted to express by "makes no
> sense".
> Simultaneously solving this kind of problem with one classifier is
> specific to MLP neural networks. For example, to solve this kind of
> problem with support vector machines (SVMs) you would train n SVMs
> independently.
> This kind of problem is interesting for autoencoders or feature learning.
> The activation pattern of the last layer of hidden units can be
> interpreted as a feature representation of the input pattern. With this
> feature representation the n independent problems can be solved with a
> linear model. The mapping from this feature representation (the last
> hidden layer) to the output units is the same as logistic regression.
> Here the activation function of the output is the sigmoid (logistic
> function). In the first case (n mutually exclusive classes) softmax is
> used as the activation function. In the softmax formula there is a
> normalization factor which guarantees that the sum over all outputs is 1
> (sum_i p_i = 1).
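> To make the difference concrete, here is a plain-Java sketch of the two
> output activations (illustration only, not code from a patch):
>
>   // Independent case: each output is squashed on its own, so the
>   // outputs are individual probabilities and need not sum to 1.
>   static double[] sigmoid(double[] z) {
>     double[] y = new double[z.length];
>     for (int i = 0; i < z.length; i++) {
>       y[i] = 1.0 / (1.0 + Math.exp(-z[i]));
>     }
>     return y;
>   }
>
>   // Mutually exclusive case: the normalization factor couples the
>   // outputs and guarantees sum_i y_i = 1.
>   static double[] softmax(double[] z) {
>     double max = Double.NEGATIVE_INFINITY;
>     for (double v : z) {
>       max = Math.max(max, v);     // shift for numerical stability
>     }
>     double[] y = new double[z.length];
>     double norm = 0;
>     for (int i = 0; i < z.length; i++) {
>       y[i] = Math.exp(z[i] - max);
>       norm += y[i];               // the normalization factor
>     }
>     for (int i = 0; i < z.length; i++) {
>       y[i] /= norm;
>     }
>     return y;
>   }
>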
> For each of the two cases there is a cost function that should be used.
>
> The boolean flag "mutuallyExclusiveClasses" is just a switch between the
> two classification cases, so the user doesn't need to know which
> activation function and which cost function to use for his problem.
> Depending on the problem, they are chosen automatically:
> Cost function and corresponding activation function of the output units
> (conjugate link functions), sketched in code below:
>  - classification (n independent classes): activation function: sigmoid
>    (logistic); cost function: cross entropy:
>    E = - sum_i [ t_i ln y_i + (1 - t_i) ln(1 - y_i) ]
>  - classification (n mutually exclusive classes): activation function:
>    softmax; cost function: cross entropy: E = - sum_i t_i ln y_i
>  - (for completeness) the regression case: activation function: identity
>    (no squashing); cost function: sum of squared errors
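>
> As a sketch, the three cost functions look like this in plain Java
> (t = target, y = output; the minus sign makes them costs to be
> minimized):
>
>   // Pairs with sigmoid outputs (n independent classes).
>   static double crossEntropyIndependent(double[] t, double[] y) {
>     double e = 0;
>     for (int i = 0; i < t.length; i++) {
>       e -= t[i] * Math.log(y[i]) + (1 - t[i]) * Math.log(1 - y[i]);
>     }
>     return e;
>   }
>
>   // Pairs with softmax outputs (n mutually exclusive classes).
>   static double crossEntropyExclusive(double[] t, double[] y) {
>     double e = 0;
>     for (int i = 0; i < t.length; i++) {
>       e -= t[i] * Math.log(y[i]);
>     }
>     return e;
>   }
>
>   // Pairs with identity outputs (regression); the 1/2 makes the
>   // output deltas come out as exactly y - t.
>   static double sumOfSquaredErrors(double[] t, double[] y) {
>     double e = 0;
>     for (int i = 0; i < t.length; i++) {
>       e += 0.5 * (y[i] - t[i]) * (y[i] - t[i]);
>     }
>     return e;
>   }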
>
> I hope I have expressed myself more clearly now.
>
> Cheers
>  Christian
>
>
>
> Ted Dunning <ted.dunn...@gmail.com> wrote on 12 February 2012 at 15:56:
>
> > On Sun, Feb 12, 2012 at 5:14 AM, Christian Herta (Commented) (JIRA) <
> > j...@apache.org> wrote:
> >
> > > ....
> > > The implementation of public Vector classifyFull(Vector r, Vector
> > > instance) in AbstractVectorClassifier assumes that the probabilities
> > > of the n elements of the output vector sum to 1. This is only valid
> > > if there are n mutually exclusive classes, e.g. for target vectors
> > > like (0 0 1 0), (0 0 0 1), (1 0 0 0), ....
> > >
> >
> > Fine.  That assumption is based on the fact that we only really had
> > classifiers that had this property.  Override it and comment that the
> > assumption doesn't hold.
> >
> >
> > > The other possibility is that there are n (here 4) independent
> > > targets like: (1 0 0 1), (0 0 0 0), (0 1 1 1), (1 1 1 1).
> > > Here the method "Vector classify(..)" and the implementation "public
> > > Vector classifyFull(Vector r, Vector instance)" of
> > > AbstractVectorClassifier make no sense. Therefore using "Vector
> > > classify(..)" should throw an exception and "Vector classifyFull"
> > > must be overridden.
> > >
> >
> > The method classify makes a lot of sense.  classifyFull becomes the
> > primitive and classify() just adds a maxIndex to find the largest
> > value.  It is true that finding the largest value doesn't make sense
> > for some problems, but you can say the same thing of addition.  The
> > classify() method definitely does make sense for some problems.
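> >
> > Roughly, a convenience method of that sort (hypothetical here, but
> > Vector already has maxValueIndex()) is just:
> >
> >   int best = classifyFull(instance).maxValueIndex();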
> >
> >
> > > P.S.: Depending on the "flag" the cost function and the activation
> > > function for the output units will be set, to get probabilities as
> > > outputs; e.g. C. Bishop: "Pattern Recognition and Machine Learning",
> > > chapter 5.2.
> >
> >
> > I am on the road and I don't have my copy of Bishop handy and others
> > haven't read it.
> >
> > Do you mean you will offset the activation function to avoid negative
> > values and L_1 normalize the result?
> >
> >
> > > Also, this simplifies the implementation because the natural pairing
> > > between cost and activation function yields "y - t" for the output
> > > deltas.
> > >
> >
> > This sounds like an implementation detail.  Implementation details
> > should not be exposed to users, even indirectly.  If there is a user
> > expectation of a certain behavior, then it is fine to expose some
> > behavior.  But if the user expectation conflicts with the simple
> > implementation then you really need to do the translation internally
> > so that the user has an easier time of it.
