Hello Ted,

thanks for the fast reply. Maybe I didn't express myself clearly. In the first case (n mutually exclusive classes) classify() and the current implementation of classifyFull() in AbstractVectorClassifier make sense. The implementation uses the assumption sum_i p_i = 1, which is valid here. But in the second case (n independent decisions) only classifyFull(..) can be applied, because sum_i p_i = 1 (with p_i the probability of class i) does not hold. That is what I wanted to express by "makes no sense".

Solving this kind of problem simultaneously with one classifier is specific to MLP neural networks. To solve it with support vector machines (SVMs), for example, you would train n SVMs independently. This kind of problem is interesting for autoencoders or feature learning: the activation pattern of the last hidden units can be interpreted as a feature representation of the input pattern. With this feature representation the n independent problems can be solved with a linear model. The mapping from this feature representation (last hidden units) to the output units is the same as logistic regression, so the activation function of the outputs is the sigmoid (logistic function). In the first case (n mutually exclusive classes) softmax is used as the activation function. The softmax formula contains a normalization factor which guarantees that the sum over all outputs is 1 (sum_i p_i = 1).

For each of the cases there is a cost function which should be used. The boolean flag "mutuallyExclusiveClasses" is just a switch between the two classification cases, so the user doesn't need to know which activation function and which cost function to use for his problem. Depending on the problem, the cost function and the corresponding activation function of the output units (conjugate link functions) are chosen automatically:

- classification (n independent classes): activation function: sigmoid (logistic); cost function: cross entropy: - sum_i [ t_i ln y_i + (1 - t_i) ln(1 - y_i) ]
- classification (n mutually exclusive classes): activation function: softmax; cost function: cross entropy: - sum_i t_i ln y_i
- (to be complete) regression: activation function: identity (no squashing); cost function: sum of squared errors

I hope I expressed myself more clearly now.

Cheers
Christian
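P.S.: To make the switch concrete, here is a rough Java sketch of what I have in mind. It is purely illustrative -- the class name OutputLayerConfig and all method names are placeholders of mine, not existing Mahout API:

// Illustrative sketch only -- OutputLayerConfig and these method names are
// placeholders, not existing Mahout classes.
public class OutputLayerConfig {

  private final boolean mutuallyExclusiveClasses;

  public OutputLayerConfig(boolean mutuallyExclusiveClasses) {
    this.mutuallyExclusiveClasses = mutuallyExclusiveClasses;
  }

  // Output activation: softmax for n mutually exclusive classes,
  // element-wise sigmoid for n independent decisions.
  public double[] activate(double[] netInput) {
    double[] y = new double[netInput.length];
    if (mutuallyExclusiveClasses) {
      double max = Double.NEGATIVE_INFINITY;
      for (double v : netInput) {
        max = Math.max(max, v);
      }
      double sum = 0;
      for (int i = 0; i < netInput.length; i++) {
        y[i] = Math.exp(netInput[i] - max);  // subtract max for numerical stability
        sum += y[i];
      }
      for (int i = 0; i < y.length; i++) {
        y[i] /= sum;                         // normalization => sum_i y_i = 1
      }
    } else {
      for (int i = 0; i < netInput.length; i++) {
        y[i] = 1.0 / (1.0 + Math.exp(-netInput[i]));  // independent probabilities
      }
    }
    return y;
  }

  // Matching (conjugate) cross-entropy cost for the chosen activation.
  public double cost(double[] y, double[] t) {
    double c = 0;
    for (int i = 0; i < y.length; i++) {
      if (mutuallyExclusiveClasses) {
        c -= t[i] * Math.log(y[i]);
      } else {
        c -= t[i] * Math.log(y[i]) + (1 - t[i]) * Math.log(1 - y[i]);
      }
    }
    return c;
  }

  // Because cost and activation are conjugate, the delta at the output
  // units is simply y - t in both cases.
  public double[] outputDelta(double[] y, double[] t) {
    double[] delta = new double[y.length];
    for (int i = 0; i < y.length; i++) {
      delta[i] = y[i] - t[i];
    }
    return delta;
  }
}

With this pairing the backpropagated error at the output units is y - t in both cases, so the training code does not have to distinguish between them.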
Ted Dunning <ted.dunn...@gmail.com> wrote on 12 February 2012 at 15:56:

> On Sun, Feb 12, 2012 at 5:14 AM, Christian Herta (Commented) (JIRA) <
> j...@apache.org> wrote:
>
> > ....
> > The implementation of public Vector classifyFull(Vector r, Vector
> > instance) in AbstractVectorClassifier assumes that the probabilities of
> > the n elements of the output vector sum to 1. This is only valid if there
> > are n mutually exclusive classes. e.g. for the target vectors like (0 0 1
> > 0), (0 0 0 1), (1 0 0 0), ....
> >
>
> Fine. That assumption is based on the fact that we only really had
> classifiers that had this property. Over-ride it and comment that the
> assumption doesn't hold.
>
> > The other posibility is, that there are n (here 4)independent targets
> > like: (1 0 0 1), (0 0 0 0), (0 1 1 1), (1 1 1 1)
> > Here the method "Vector classify(..)" and the implementation "public
> > Vector classifyFull(Vector r, Vector instance)" of AbstractVectorClassfier
> > makes no sense. Therefore using "Vector classify(..)" should throw an
> > exception and "Vector classifyFull" must be overwritten.
> >
>
> The method classify makes a lot of sense. ClassifyFull becomes the
> primitive and classify() just adds a maxIndex to find the largest value. It
> is true that finding the largest value doesn't make sense for some
> problems, but you can say the same thing of addition. The classify()
> method definitely does make sense for some problems.
>
> > P.S.: Depending on the "flag" the cost function and the activation
> > function for the output units will be set, to get probabilities as outputs
> > e.g. C. Bishop: "Pattern Recognition and Machine Learning", chapter 5.2.
>
> I am on the road and I don't have my copy of Bishop handy and others
> haven't read it.
>
> Do you mean you will offset the activation function to avoid negative
> values and L_1 normalize the result?
>
> > Also, this simplifies the implementation because the natural pairing
> > between cost and activation function yields for the output deltas "y - t".
> >
>
> This sounds like an implementation detail. Implementation details should
> not be exposed to users, even indirectly. If there is a user expectation
> of a certain behavior, then it is fine to expose some behavior. But if the
> user expectation conflicts with the simple implementation then you really
> need to do the translation internally so that the user has the easier time
> of it.