TristonC commented on issue #18751: URL: https://github.com/apache/incubator-mxnet/issues/18751#issuecomment-663168367
I am not sure that putting the running mean and running var into the backward pass is the right solution; the same effect can already be achieved with `autograd.record(train_mode=False)`. The real problem here (the NaN) is how the running var is computed, whether unbiased (divided by m - 1, as in the BN paper) or biased (divided by m). The corner case is m = 1, which happens when the batch size is one and BN comes after a dense (or flatten) layer. My question is whether keeping 9408 or 32 running means and running vars is really what the network is trying to do. BTW, the NaN is frustrating for end users, so we need to find a way to solve it.
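To make the corner case concrete, here is a minimal NumPy sketch (not MXNet's actual BatchNorm implementation) showing why the unbiased variance formula produces NaN when the per-channel sample count m is 1: the divisor m - 1 is zero, so 0/0 yields NaN, while the biased estimate stays finite.

```python
import numpy as np

# One sample per channel: m = 1, as when batch size is 1
# and BN follows a dense/flatten layer.
x = np.array([[0.5]])
m = x.shape[0]

mean = x.mean(axis=0)
sq_dev = ((x - mean) ** 2).sum(axis=0)

biased_var = sq_dev / m  # divided by m -> 0.0, finite

with np.errstate(invalid="ignore"):  # suppress the 0/0 warning
    unbiased_var = sq_dev / (m - 1)  # divided by m - 1 = 0 -> nan

print(biased_var[0])    # 0.0
print(unbiased_var[0])  # nan
```

The same NaN then propagates into the running var via the moving-average update, which is why it surfaces to end users.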