[ https://issues.apache.org/jira/browse/SYSTEMML-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16090453#comment-16090453 ]
Mike Dusenberry edited comment on SYSTEMML-1159 at 7/17/17 8:02 PM:
--------------------------------------------------------------------

[~return_01] Thanks -- adding HogWild asynchronous SGD would be quite interesting. However, this particular JIRA issue refers to *hyperparameters* rather than the model parameters; HogWild applies to the latter. If you are interested in pursuing support for HogWild, could you please create a new JIRA issue for it and link it to SYSTEMML-540? SYSTEMML-1563 may also be of interest -- I added a distributed synchronous SGD algorithm a while back, currently implemented in the [distributed MNIST LeNet|https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet_distrib_sgd.dml] example. We are currently working to improve its engine performance in SYSTEMML-1760.
> Enable Remote Hyperparameter Tuning
> -----------------------------------
>
>                  Key: SYSTEMML-1159
>                  URL: https://issues.apache.org/jira/browse/SYSTEMML-1159
>              Project: SystemML
>           Issue Type: Improvement
>     Affects Versions: SystemML 1.0
>             Reporter: Mike Dusenberry
>             Priority: Blocker
>
> Training a parameterized machine learning model (such as a large neural net in deep learning) requires learning a set of ideal model parameters from the data, as well as determining appropriate hyperparameters (or "settings") for the training process itself. In the latter case, the hyperparameters (i.e. learning rate, regularization strength, dropout percentage, model architecture, etc.) cannot be learned from the data, and instead are determined via a search across a space for each hyperparameter. For large numbers of hyperparameters (such as in deep learning models), the current literature points to performing staged, randomized grid searches over the space to produce distributions of performance, narrowing the space after each search [1]. Thus, for efficient hyperparameter optimization, it is desirable to train several models in parallel, with each model trained over the full dataset. For deep learning models, a mini-batch training approach is currently state-of-the-art, and thus separate models with different hyperparameters could, conceivably, be easily trained on each of the nodes in a cluster.
>
> In order to allow for the training of deep learning models, SystemML needs to determine a solution to enable this scenario with the Spark backend. Specifically, if the user has a {{train}} function that takes a set of hyperparameters and trains a model with a mini-batch approach (and thus is only making use of single-node instructions within the function), the user should be able to wrap this function with, for example, a remote {{parfor}} construct that samples hyperparameters and calls the {{train}} function on each machine in parallel.
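The desired usage could be sketched in DML roughly as follows. This is a hypothetical illustration only: {{train}} and {{eval}} stand in for user-defined functions taking the full dataset plus sampled hyperparameters, and the {{mode=REMOTE_SPARK}} parfor option denotes the remote Spark execution this issue asks to enable.

```
# Hypothetical sketch of remote hyperparameter tuning via parfor.
# X, Y, X_val, Y_val are assumed to be preloaded; train() and eval()
# are assumed user-defined, mini-batch, single-node functions.
num_trials = 16
results = matrix(0, rows=num_trials, cols=3)

parfor (i in 1:num_trials, mode=REMOTE_SPARK) {
  # Sample hyperparameters log-uniformly, per the randomized-search literature [1].
  lr  = 10 ^ -as.scalar(rand(rows=1, cols=1, min=1, max=4))  # learning rate
  reg = 10 ^ -as.scalar(rand(rows=1, cols=1, min=2, max=6))  # regularization strength

  # Each trial trains an independent model over the entire dataset.
  [W, b] = train(X, Y, lr, reg)
  acc = eval(X_val, Y_val, W, b)

  results[i, 1] = lr
  results[i, 2] = reg
  results[i, 3] = acc
}
```

Each parfor iteration would run on a separate node, so the results matrix collects one (lr, reg, accuracy) row per trial, from which the search space could be narrowed for a subsequent stage.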
> To be clear, each model would need access to the entire dataset, and each model would be trained independently.
>
> [1]: http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)