On Thu, Jun 4, 2015 at 1:26 PM, Till Rohrmann <[email protected]> wrote:
> Maybe also the default learning rate of 0.1 is set too high.

Could be. But grid search on the learning rate is pretty standard practice.
Running multiple learning engines at the same time with different learning
rates is pretty plausible. Also, using something like adagrad will knock
down high learning rates very quickly if you get a nearly divergent step.
This can make initially high learning rates quite reasonable.
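
To make the adagrad point concrete, here is a rough sketch (plain Scala,
not Flink ML's optimizer API; the gradients and numbers are made up for
illustration) of how the accumulated squared gradient knocks down the
effective step right after one huge gradient:

    object AdaGradSketch {
      def main(args: Array[String]): Unit = {
        val eta = 0.1                 // aggressive initial learning rate
        val w   = Array(0.0, 0.0)     // weights
        val g2  = Array(0.0, 0.0)     // accumulated squared gradients
        val eps = 1e-8

        // first gradient is nearly divergent on coordinate 0
        val gradients = Seq(
          Array(50.0, 0.5),
          Array(1.0, 0.5),
          Array(1.0, 0.5)
        )

        for (g <- gradients) {
          for (i <- w.indices) {
            g2(i) += g(i) * g(i)
            // adagrad: effective step = eta / sqrt(sum of squared gradients)
            val step = eta / math.sqrt(g2(i) + eps)
            w(i) -= step * g(i)
            println(f"coord $i: effective step = $step%.5f")
          }
        }
      }
    }

After the first near-divergent gradient, the effective step on that
coordinate drops from 0.1 to about 0.002 and stays small, which is why
an aggressive initial rate does little lasting damage.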
