szhengac commented on issue #9881: Inconsistent weight decay logics in multiple optimizers
URL: https://github.com/apache/incubator-mxnet/issues/9881#issuecomment-372439433

Unless explicitly specified otherwise, most optimizers as implemented in packages such as TensorFlow and torch merge the weight decay (wd) term into the gradient before gradient clipping. When the proximal operator is used, the wd term is not merged into the gradient.
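
To make the ordering concrete, here is a minimal sketch in plain Python of a single SGD step under the two orderings. The function names (`clip`, `sgd_step_wd_before_clip`, `sgd_step_wd_after_clip`) and the numbers are hypothetical and chosen only for illustration; this is not the MXNet, TensorFlow, or torch implementation, and the proximal case is not covered.

```python
def clip(values, bound):
    """Clip each element to the interval [-bound, bound]."""
    return [max(-bound, min(bound, v)) for v in values]


def sgd_step_wd_before_clip(weight, grad, lr, wd, clip_gradient):
    """Merge wd * weight into the gradient first, then clip the merged value."""
    merged = [g + wd * w for g, w in zip(grad, weight)]
    clipped = clip(merged, clip_gradient)
    return [w - lr * c for w, c in zip(weight, clipped)]


def sgd_step_wd_after_clip(weight, grad, lr, wd, clip_gradient):
    """Clip the raw gradient first, then add wd * weight without clipping it."""
    clipped = clip(grad, clip_gradient)
    return [w - lr * (c + wd * w) for w, c in zip(weight, clipped)]


# Illustrative values only: the first weight is large, so its wd term
# pushes the merged gradient past the clipping bound.
weight = [10.0, -0.5]
grad = [0.9, 0.1]

before = sgd_step_wd_before_clip(weight, grad, lr=0.1, wd=0.1, clip_gradient=1.0)
after = sgd_step_wd_after_clip(weight, grad, lr=0.1, wd=0.1, clip_gradient=1.0)

print("wd merged before clipping:", before)
print("wd applied after clipping:", after)
```

The two orderings agree whenever the merged gradient stays inside the clipping bound (the second weight here) and diverge when the wd term pushes it past the bound (the first weight), which is why the order matters for consistency across optimizers.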