2013/7/6 Issam <[email protected]>:
> Hi Lars, How are you?
Please send this kind of stuff to the ML so that people who are more
knowledgeable than me can read and comment on your concerns. Comments
below.
> I'm Issam. Thanks a lot for your great feedback on the MLP code, and sorry
> about my mistakes; I learned new things reading your comments. But I have a
> few concerns about the backpropagation procedure, and it would be great to
> get some help with them.
>
> Let's take the multi-class classification example, I believe that the
> forward pass is assembled as follows,
>
> a_hidden[:] = self.activation(safe_sparse_dot(X, self.coef_hidden_) +
>                               self.intercept_hidden_)
>
> a_output[:] = self.output_func(safe_sparse_dot(a_hidden, self.coef_output_) +
>                                self.intercept_output_)
>
> Where `self.activation` is the hyperbolic tangent, and `self.output_func`
> is the softmax function.
>
> Now for the backpropagation part, I believe that regardless of the loss
> function the non-regularized gradient is computed as follows,
>
> diff = Y - a_output
> delta_output[:] = -diff
> delta_hidden[:] = np.dot(delta_output, self.coef_output_.T) *
>                   self.derivative(a_hidden)
(Side remark: np.dot allocates memory for its result, which you then
copy into preallocated memory. Just delta_hidden = would be fine
here.)
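To make the side remark concrete, here's a tiny sketch (shapes made up) of the
three variants; note that np.dot also takes an `out` argument if you really
want to fill preallocated memory:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(4, 3)
W = rng.rand(3, 2)
buf = np.empty((4, 2))

# 1. buf[:] = np.dot(X, W): np.dot allocates a fresh result array,
#    which is then copied element-wise into buf (an extra pass).
buf[:] = np.dot(X, W)

# 2. Plain rebinding: no copy; the old buffer is simply dropped.
result = np.dot(X, W)

# 3. np.dot can also write straight into preallocated memory,
#    avoiding the temporary altogether.
np.dot(X, W, out=buf)
```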
> W1grad = safe_sparse_dot(X.T, delta_hidden) / n_samples
> W2grad = safe_sparse_dot(a_hidden.T, delta_output) / n_samples
> b1grad = np.mean(delta_hidden, 0)
> b2grad = np.mean(delta_output, 0)
>
> However, the cost, i.e. the loss computed as the squared error, suitable
> for regression, is
>
> cost = np.sum(np.einsum('ij,ji->i', diff, diff.T)) / (2 * n_samples)
I find einsum hard to read. It's also not in NumPy 1.3, which
scikit-learn targets. Use np.dot. Apart from that, I guess this is
right.
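For comparison, a small sketch (random data, made-up shapes) showing that the
einsum expression is just the sum of squared entries of diff, which np.dot on
the flattened array computes more readably:

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples = 5
diff = rng.rand(n_samples, 3)

# einsum version from the original code: sum_j diff[i, j]**2 per row,
# then summed over rows
cost_einsum = np.sum(np.einsum('ij,ji->i', diff, diff.T)) / (2 * n_samples)

# equivalent and clearer: total sum of squared entries
cost_plain = np.sum(diff * diff) / (2 * n_samples)

# or as a flattened dot product, as suggested
cost_dot = np.dot(diff.ravel(), diff.ravel()) / (2 * n_samples)
```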
> And for `cross-entropy` the cost would be the log loss, suitable for
> classification.
>
> What is confusing me is the fact that the weights get updated via the
> gradient only (without the cost), that is,
>
> self.coef_hidden_ -= (self.learning_rate * W1grad)
> self.coef_output_ -= (self.learning_rate * W2grad)
> self.intercept_hidden_ -= (self.learning_rate * b1grad)
> self.intercept_output_ -= (self.learning_rate * b2grad)
>
> Does that mean that, regardless of the loss/cost function, SGD always
> follows the same procedure, since it updates the weights using only the
> gradient, which is computed without considering the chosen loss function?
The loss function is only computed to detect convergence. For
backprop, what matters is the derivative of the loss function, which
is your diff above.
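You can check numerically that for a softmax output with cross-entropy loss,
the gradient w.r.t. the pre-softmax activations is exactly a_output - Y, i.e.
your -diff. A quick finite-difference sketch (toy data, not the actual MLP
code):

```python
import numpy as np

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy(Y, Z):
    return -np.sum(Y * np.log(softmax(Z)))

rng = np.random.RandomState(0)
Z = rng.randn(4, 3)                     # pre-softmax activations
Y = np.eye(3)[rng.randint(3, size=4)]   # one-hot targets

# analytic gradient of the loss w.r.t. Z
analytic = softmax(Z) - Y

# central finite differences, entry by entry
eps = 1e-6
numeric = np.zeros_like(Z)
for i in range(Z.shape[0]):
    for j in range(Z.shape[1]):
        Zp = Z.copy(); Zp[i, j] += eps
        Zm = Z.copy(); Zm[i, j] -= eps
        numeric[i, j] = (cross_entropy(Y, Zp) - cross_entropy(Y, Zm)) / (2 * eps)
```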
> It would work for l-bfgs, since that accounts for the cost, but SGD updates
> the weights without considering the cost/loss attained. The weight update
> I'm employing is the one described in Bishop's 2007 Pattern Recognition
> book, page 240.
The gradient computation should be exactly the same for both
algorithms, since they're optimizing the same function. Ideally, the
code would be shared.
Many textbooks unfortunately explain backprop and SGD without
explaining that they're really two orthogonal concepts. SGD is an
optimizer, backprop is an algorithm for computing gradients that works
with various optimizers.
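To illustrate the point, here's a toy sketch (ordinary least squares, not the
MLP itself) where one loss/gradient routine feeds both a plain gradient-descent
loop, standing in for SGD, and scipy's fmin_l_bfgs_b:

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

# Toy least-squares problem with a known solution.
rng = np.random.RandomState(0)
X = rng.randn(50, 3)
w_true = np.array([1.0, -2.0, 0.5])
y = np.dot(X, w_true)

def loss_grad(w):
    """One routine computing both the loss and its gradient."""
    diff = np.dot(X, w) - y
    loss = 0.5 * np.dot(diff, diff) / len(y)
    grad = np.dot(X.T, diff) / len(y)
    return loss, grad

# Plain full-batch gradient descent: uses only the gradient.
w_gd = np.zeros(3)
for _ in range(500):
    _, g = loss_grad(w_gd)
    w_gd -= 0.1 * g

# L-BFGS on the very same function: uses loss and gradient.
w_lbfgs, _, _ = fmin_l_bfgs_b(loss_grad, np.zeros(3))
```

Both optimizers recover w_true; only the optimizer changes, never the
gradient computation.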
> Thank you very much in advance!
>
> Best regards,
> --Issam
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general