atiyo opened a new issue #7637: Strange Validation and Training Losses at epoch change
URL: https://github.com/apache/incubator-mxnet/issues/7637

I have struggled to train some MXNet models to a good accuracy, so I took a closer look at the training and validation losses of a toy model. I noticed some strange spikes between epochs, which surprised me. I have tried several optimisers, with learning rates spanning several orders of magnitude, and the spikes persist. Being new to MXNet, I expect I am doing something wrong, but I can't see what.

The graphic below illustrates the phenomenon, followed by code to reproduce the figure:

![adam_loss](https://user-images.githubusercontent.com/12828061/29753519-35b2e6e2-8b6b-11e7-8c08-14b8730efceb.png)

```
import mxnet as mx
import numpy as np

optimizer_choice = 'adam'
learning_rate = 0.01
batch_size = 500

# 10,000 random points on [0, 2*pi) with sin(x) as the regression target
inputs = np.expand_dims(np.random.uniform(low=0., high=2*np.pi, size=10000), axis=1)
labels = np.sin(inputs)
eval_inputs = np.expand_dims(np.random.uniform(low=0., high=2*np.pi, size=10000), axis=1)
eval_labels = np.sin(eval_inputs)

data_iter = mx.io.NDArrayIter(data={'data': inputs}, label={'label': labels},
                              batch_size=batch_size, shuffle=True)
eval_data_iter = mx.io.NDArrayIter(data={'data': eval_inputs}, label={'label': eval_labels},
                                   batch_size=batch_size, shuffle=True)

# Small fully connected network with a tanh output, since sin(x) lies in [-1, 1]
data = mx.sym.Variable('data')
label = mx.sym.Variable('label')
fc1 = mx.sym.FullyConnected(data=data, num_hidden=128)
ac1 = mx.sym.Activation(data=fc1, act_type='relu')
fc2 = mx.sym.FullyConnected(data=ac1, num_hidden=64)
ac2 = mx.sym.Activation(data=fc2, act_type='relu')
fc3 = mx.sym.FullyConnected(data=ac2, num_hidden=16)
ac3 = mx.sym.Activation(data=fc3, act_type='relu')
fc4 = mx.sym.FullyConnected(data=ac3, num_hidden=1)
ac4 = mx.sym.Activation(data=fc4, act_type='tanh')
loss = mx.symbol.LinearRegressionOutput(data=ac4, label=label)

net = mx.module.Module(symbol=loss, data_names=['data'], label_names=['label'])
train_error = []
eval_error = []

# Append the current metric value to `log` every `period` batches
def log_error(period, log):
    def _callback(param):
        if param.nbatch % period == 0:
            name, value = param.eval_metric.get()
            log.append(value)
    return _callback

optimizer_params = {'learning_rate': learning_rate}
net.fit(data_iter,
        optimizer=optimizer_choice,
        optimizer_params=optimizer_params,
        eval_data=eval_data_iter,
        eval_metric='mse',
        num_epoch=5,
        epoch_end_callback=mx.callback.do_checkpoint('test_net'),
        eval_batch_end_callback=log_error(1, eval_error),
        batch_end_callback=log_error(1, train_error))

train_error = np.array(train_error)
eval_error = np.array(eval_error)

import matplotlib.pyplot as plt
plt.plot(np.arange(train_error.size), train_error, label='Training Error')
plt.plot(np.arange(eval_error.size), eval_error, label='Validation Error')
plt.legend(loc='upper right')
plt.xlabel('Batch Number')
plt.ylabel('Error')
plt.title('Optimizer: {}. Learning Rate: {}'.format(optimizer_choice, learning_rate))
plt.gca().set_ylim(bottom=0)
plt.show()
```

## Environment info
Operating System: macOS
MXNet version: 0.11.0
Python version and distribution: Python 2.7.13
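One thing I have wondered about: if I read the `Module.fit` internals correctly, `param.eval_metric.get()` inside a batch-end callback returns a running average of the metric since it was last reset, and the metric is reset at the start of each epoch. If that is right, the logged curve could jump at epoch boundaries even when the underlying per-batch loss is perfectly smooth. A minimal pure-NumPy sketch of that effect (synthetic loss values, no MXNet involved):

```python
import numpy as np

# Synthetic demonstration: a smoothly decreasing per-batch loss, logged as a
# cumulative average that resets at the start of every epoch (mimicking what a
# reset-per-epoch eval metric would report batch by batch).
batches_per_epoch = 20
num_epochs = 3

# Hypothetical smooth per-batch losses, decaying monotonically over training
per_batch = 1.0 / (1.0 + np.arange(batches_per_epoch * num_epochs))

running_avg = []
for epoch in range(num_epochs):
    total, count = 0.0, 0  # the "metric" is reset here, once per epoch
    for b in range(batches_per_epoch):
        total += per_batch[epoch * batches_per_epoch + b]
        count += 1
        running_avg.append(total / count)
running_avg = np.array(running_avg)

# Right after a reset the reported value collapses to the single latest batch
# value, so the curve is discontinuous at each epoch boundary even though
# per_batch itself has no spikes.
```

If that is indeed the cause, calling `param.eval_metric.reset()` in the callback after reading the value should log genuine per-batch errors instead, though I have not verified that this removes the spikes.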