[ https://issues.apache.org/jira/browse/SYSTEMML-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Janardhan updated SYSTEMML-1159:
--------------------------------
    Comment: was deleted

(was: Hi [~dusenberrymw], I think your idea of globally sharing the whole dataset across all the models (trained in parallel) could be implemented by a blackboard system. Please see the brainstorming on parameter servers at SYSTEMML-1695.)

> Enable Remote Hyperparameter Tuning
> -----------------------------------
>
>                 Key: SYSTEMML-1159
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1159
>             Project: SystemML
>          Issue Type: Improvement
>    Affects Versions: SystemML 1.0
>            Reporter: Mike Dusenberry
>            Priority: Blocker
>
> Training a parameterized machine learning model (such as a large neural net in deep learning) requires learning a set of ideal model parameters from the data, as well as determining appropriate hyperparameters (or "settings") for the training process itself. In the latter case, the hyperparameters (e.g., learning rate, regularization strength, dropout percentage, model architecture) cannot be learned from the data and are instead determined via a search over a space for each hyperparameter. For large numbers of hyperparameters (such as in deep learning models), the current literature points to performing staged, randomized grid searches over the space to produce distributions of performance, narrowing the space after each search [1]. Thus, for efficient hyperparameter optimization, it is desirable to train several models in parallel, with each model trained over the full dataset. For deep learning models, a mini-batch training approach is currently state-of-the-art, and thus separate models with different hyperparameters could, conceivably, be trained easily on each of the nodes in a cluster.
>
> In order to allow for the training of deep learning models, SystemML needs to determine a solution that enables this scenario with the Spark backend. Specifically, if the user has a {{train}} function that takes a set of hyperparameters and trains a model with a mini-batch approach (and thus only makes use of single-node instructions within the function), the user should be able to wrap this function with, for example, a remote {{parfor}} construct that samples hyperparameters and calls the {{train}} function on each machine in parallel.
>
> To be clear, each model would need access to the entire dataset, and each model would be trained independently.
>
> [1]: http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
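For illustration, here is a minimal DML sketch of the pattern described above. The toy {{train}} body (a few steps of batch gradient descent for linear regression), the file names, the sample count, and the hyperparameter ranges are all assumptions for the sketch, not part of this issue:

{code}
# Hypothetical toy train() function: a few steps of gradient descent for
# linear regression, standing in for a real mini-batch training routine.
train = function(matrix[double] X, matrix[double] y, double lr, double reg)
    return (double loss) {
  w = matrix(0, rows=ncol(X), cols=1);
  for (e in 1:5) {
    g = t(X) %*% (X %*% w - y) / nrow(X) + reg * w;
    w = w - lr * g;
  }
  loss = sum((X %*% w - y) ^ 2) / nrow(X);
}

X = read("X.csv", format="csv");   # full dataset, shared by every model
y = read("y.csv", format="csv");
N = 100;                           # number of hyperparameter samples
results = matrix(0, rows=N, cols=3);

# Each iteration samples one hyperparameter configuration and trains an
# independent model on the full dataset; ideally the optimizer would run
# this parfor remotely on the cluster, which is the capability requested here.
parfor (i in 1:N) {
  e1 = as.scalar(rand(rows=1, cols=1, min=1, max=5));
  e2 = as.scalar(rand(rows=1, cols=1, min=1, max=5));
  lr  = 10 ^ (-e1);                # log-uniform sample in [1e-5, 1e-1]
  reg = 10 ^ (-e2);
  loss = train(X, y, lr, reg);
  results[i, 1] = lr;
  results[i, 2] = reg;
  results[i, 3] = loss;
}

write(results, "hyperparam_results.csv", format="csv");
{code}

Whether the outer {{parfor}} can be executed as a remote (Spark) parfor while {{train}} itself uses only single-node instructions is exactly what this issue asks to enable.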