orhankislal commented on a change in pull request #406: user docs for madlib-keras deep learning
URL: https://github.com/apache/madlib/pull/406#discussion_r291615954
 
 

 ##########
 File path: src/ports/postgres/modules/deep_learning/madlib_keras.sql_in
 ##########
 @@ -28,6 +28,1355 @@
 
 m4_include(`SQLCommon.m4')
 
+/**
+@addtogroup grp_keras
+
+<div class="toc"><b>Contents</b><ul>
+<li class="level1"><a href="#keras_fit">Fit</a></li>
+<li class="level1"><a href="#keras_evaluate">Evaluate</a></li>
+<li class="level1"><a href="#keras_predict">Predict</a></li>
+<li class="level1"><a href="#example">Examples</a></li>
+<li class="level1"><a href="#notes">Notes</a></li>
+<li class="level1"><a href="#background">Technical Background</a></li>
+<li class="level1"><a href="#literature">Literature</a></li>
+<li class="level1"><a href="#related">Related Topics</a></li>
+</ul></div>
+
+\warning <em> This MADlib method is still in early stage development.
+Interface and implementation are subject to change. </em>
+
+This module allows you to use SQL to call deep learning
+models designed in Keras [1], which is a high-level neural
+network API written in Python.
+Keras was developed for fast experimentation.  It can run
+on top of different backends; the one currently
+supported by MADlib is TensorFlow [2].  The implementation
+in MADlib is distributed and designed to train
+a single large model across multiple segments (workers)
+in a Greenplum database.
+
+The main use case is image classification
+using sequential models, which are made up of a
+linear stack of layers.  This includes multilayer perceptrons (MLPs)
+and convolutional neural networks (CNNs).  Regression is not
+currently supported.
+
+Before using Keras in MADlib you will need to mini-batch
+your training and evaluation datasets by calling the
+<a href="group__grp__input__preprocessor__dl.html">Preprocessor
+for Images</a> which is a utility that prepares image data for
+use by models that support mini-batch as an optimization option.
+This is a one-time operation and you would only
+need to re-run the preprocessor if your input data has changed.
+The advantage of using mini-batching is that it
+can perform better than stochastic gradient descent
+because it uses more than one training example at a time,
+typically resulting in faster and smoother convergence [3].
+
+@brief Solves image classification problems by calling
+the Keras API
+
+@anchor keras_fit
+@par Fit
+The fit (training) function has the following format:
+
+<pre class="syntax">
+madlib_keras_fit(
+    source_table,
+    model,
+    model_arch_table,
+    model_arch_id,
+    compile_params,
+    fit_params,
+    num_iterations,
+    gpus_per_host,
+    validation_table,
+    metrics_compute_frequency,
+    warm_start,
+    name,
+    description
+    )
+</pre>
+
+\b Arguments
+<dl class="arglist">
+  <dt>source_table</dt>
+  <dd>TEXT. Name of the table containing the training data.
+  This is the name of the output
+  table from the image preprocessor.  Independent
+  and dependent variables are specified in the preprocessor
+  step, which is why you do not need to explicitly state
+  them here as part of the fit function.</dd>
+
+  <dt>model</dt>
+  <dd>TEXT. Name of the output table containing the model.
+  Details of the output table are shown below.
+  </dd>
+
+  <dt>model_arch_table</dt>
+  <dd>TEXT. Name of the table containing the model
+  architecture and (optionally) initial weights to use for
+  training.
+  </dd>
+
+  <dt>model_arch_id</dt>
+  <dd>INTEGER. The id of the model in 'model_arch_table'
+  containing the model architecture and (optionally)
+  initial weights to use for training.
+  </dd>
+
+  <DT>compile_params</DT>
+  <DD>TEXT.
+    Parameters passed to the compile method of the Keras
+    model class [4].  These parameters will be passed through as is
+    so they must conform to the Keras API definition.
+    As an example, you might use something like: <em>loss='categorical_crossentropy', optimizer='adam', metrics=['acc']</em>.
+    The mandatory parameters that must be specified are 'optimizer'
+    and 'loss'.  Others are optional and will use the default
+    values as per Keras if not specified here.  Also, when
+    specifying 'loss' and 'metrics' do <em>not</em> include the
+    module and submodule prefixes
+    like <em>loss='losses.categorical_crossentropy'</em>
+    or <em>optimizer='keras.optimizers.adam'</em>.
+
+    @note
+    The following loss function is
+    not supported: <em>sparse_categorical_crossentropy</em>.
+    The following metrics are not
+    supported: <em>sparse_categorical_accuracy, top_k_categorical_accuracy, sparse_top_k_categorical_accuracy</em> and custom metrics.
+
+  </DD>
+
+  <DT>fit_params </DT>
+  <DD>TEXT.  Parameters passed to the fit method of the Keras
+    model class [4].  These will be passed through as is
+    so must conform to the Keras API definition.
+    As an example, you might use something like:
+    <em>batch_size=128, epochs=4</em>.
+    There are no mandatory parameters so
+    if you specify NULL, it will use all default
+    values as per Keras.
+  </DD>
+
+  <DT>num_iterations</DT>
+  <DD>INTEGER.  Number of iterations to train.
+  </DD>
+
+  <DT>gpus_per_host (optional)</DT>
+  <DD>INTEGER, default: 0 (i.e., CPU).
+    Number of GPUs per segment host to be used
+    for training the neural network.
+    For example, if you specify 4 for this parameter
+    and your database cluster is set up to have 4
+    segments per segment host, it means that each
+    segment will have a dedicated GPU.
+    A value of 0 means that CPUs, not GPUs, will
+    be used for training.
+
+    @note
+    We have seen some memory-related issues when segments
+    share GPU resources.
+    For example, if you specify 1 for this parameter
+    and your database cluster is set up to have 4
+    segments per segment host, it means that all 4
+    segments on a segment host will share the same
+    GPU. The current recommended
+    configuration is 1 GPU per segment.
+  </DD>
+
+  <dt>validation_table (optional)</dt>
+  <dd>TEXT, default: none. Name of the table containing
+  the validation dataset.
+  Note that the validation dataset must be preprocessed
+  in the same way as the training dataset, so this
+  is the name of the output
+  table from running the image preprocessor on the validation dataset.
+  Using a validation dataset can mean a
+  longer training time, depending on its size.
+  This can be controlled using the 'metrics_compute_frequency'
+  parameter described below.</dd>
+
+  <DT>metrics_compute_frequency (optional)</DT>
+  <DD>INTEGER, default: once at the end of training
+  after 'num_iterations'.  Frequency to compute per-iteration
+  metrics for the training dataset and validation dataset
+  (if specified).  There can be considerable cost to
+  computing metrics every iteration, especially if the
+  training dataset is large.  This parameter is a way of
+  controlling the frequency of those computations.
+  For example, if you specify 5, then metrics will be computed
+  every 5 iterations as well as at the end of training
+  after 'num_iterations'.  If you use the default,
+  metrics will be computed only
+  once after 'num_iterations' have completed.
+  </DD>
+
+  <DT>warm_start (optional)</DT>
+  <DD>BOOLEAN, default: FALSE.
+    Initialize weights with the coefficients
+    from the last call of the fit
+    function. If set to TRUE, weights will be
+    initialized from the model table
+    generated by the previous training run.
+
+    @note
+    The warm start feature works based on the name of the
+    model output table from a previous training run.
+    When using warm start, do not drop the model output table
+    or the model output summary table
+    before calling the fit function, since these are needed to obtain the
+    weights from the previous run.
+    If you are not using warm start, the model output table
+    and the model output table summary must be dropped in
+    the usual way before calling the training function.
+  </DD>
+
+  <DT>name (optional)</DT>
+  <DD>TEXT, default: NULL.
+    Free text string to identify a name, if desired.
+  </DD>
+
+  <DT>description (optional)</DT>
+  <DD>TEXT, default: NULL.
+    Free text string to provide a description, if desired.
+  </DD>
+</dl>
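+
+A sketch of a training call follows. The table names, architecture id,
+and parameter values are hypothetical; 'iris_train_packed' is assumed
+to be the output table from the image preprocessor:
+<pre class="example">
+SELECT madlib.madlib_keras_fit(
+    'iris_train_packed',   -- source table from the image preprocessor
+    'iris_model',          -- model output table
+    'model_arch_library',  -- table containing the model architecture
+    1,                     -- id of the architecture in that table
+    $$ loss='categorical_crossentropy', optimizer='adam', metrics=['acc'] $$,
+    $$ batch_size=50, epochs=1 $$,
+    10                     -- number of iterations
+);
+</pre>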
+
+<b>Output tables</b>
+<br>
+    The model table produced by fit contains the following columns:
+    <table class="output">
+      <tr>
+        <th>model_data</th>
+        <td>BYTEA. Byte array containing the weights of the neural net.</td>
+      </tr>
+      <tr>
+        <th>model_arch</th>
+        <td>TEXT. A JSON representation of the model architecture
+        used in training.</td>
+      </tr>
+    </table>
+
+A summary table named \<model\>_summary is also created, which has the following columns:
+    <table class="output">
+    <tr>
+        <th>source_table</th>
+        <td>Source table used for training.</td>
+    </tr>
+    <tr>
+        <th>model</th>
+        <td>Model output table produced by training.</td>
+    </tr>
+    <tr>
+        <th>independent_varname</th>
+        <td>Independent variables column from the original
+        source table in the image preprocessing step.</td>
+    </tr>
+    <tr>
+        <th>dependent_varname</th>
+        <td>Dependent variable column from the original
+        source table in the image preprocessing step.</td>
+    </tr>
+    <tr>
+        <th>model_arch_table</th>
+        <td>Name of the table containing
+        the model architecture and (optionally) the
+        initial model weights.</td>
+    </tr>
+    <tr>
+        <th>model_arch_table_id</th>
+        <td>The id of the model in
+        the model architecture table used for training.</td>
+    </tr>
+    <tr>
+        <th>compile_params</th>
+        <td>Compile parameters passed to Keras.</td>
+    </tr>
+    <tr>
+        <th>fit_params</th>
+        <td>Fit parameters passed to Keras.</td>
+    </tr>
+    <tr>
+        <th>num_iterations</th>
+        <td>Number of iterations of training completed.</td>
+    </tr>
+    <tr>
+        <th>validation_table</th>
+        <td>Name of the table containing
+        the validation dataset (if specified).</td>
+    </tr>
+    <tr>
+        <th>metrics_compute_frequency</th>
+        <td>Frequency that per-iteration metrics are computed
+        for the training dataset and validation
+        dataset.</td>
+    </tr>
+    <tr>
+        <th>name</th>
+        <td>Name of the training run (free text).</td>
+    </tr>
+    <tr>
+        <th>description</th>
+        <td>Description of the training run (free text).</td>
+    </tr>
+    <tr>
+        <th>model_type</th>
+        <td>General identifier for type of model trained.
+        Currently says 'madlib_keras'.</td>
+    </tr>
+    <tr>
+        <th>model_size</th>
+        <td>Size of the model in KB.  Models are stored in
+        'bytea' data format which is used for binary strings
+        in PostgreSQL type databases.</td>
+    </tr>
+    <tr>
+        <th>start_training_time</th>
+        <td>Timestamp for start of training.</td>
+    </tr>
+    <tr>
+        <th>end_training_time</th>
+        <td>Timestamp for end of training.</td>
+    </tr>
+    <tr>
+        <th>metrics_elapsed_time</th>
+        <td> Array of elapsed time for metric computations as
+        per the 'metrics_compute_frequency' parameter.
+        Useful for drawing a curve showing loss, accuracy or
+        other metrics as a function of time.
+        For example, if 'metrics_compute_frequency=5'
+        this would be an array of elapsed time for every 5th
+        iteration, plus the last iteration.</td>
+    </tr>
+    <tr>
+        <th>madlib_version</th>
+        <td>Version of MADlib used.</td>
+    </tr>
+    <tr>
+        <th>num_classes</th>
+        <td>Count of distinct class values used.</td>
+    </tr>
+    <tr>
+        <th>class_values</th>
+        <td>Array of actual class values used.</td>
+    </tr>
+    <tr>
+        <th>dependent_vartype</th>
+        <td>Data type of the dependent variable.</td>
+    </tr>
+    <tr>
+        <th>normalizing_constant</th>
+        <td>Normalizing constant used from the
+        image preprocessing step.</td>
+    </tr>
+    <tr>
+        <th>metrics_type</th>
+        <td>Metric specified in the 'compile_params'.</td>
+    </tr>
+    <tr>
+        <th>training_metrics_final</th>
+        <td>Final value of the training
+        metric after all iterations have completed.
+        The metric reported is the one
+        specified in the 'metrics_type' parameter.</td>
+    </tr>
+    <tr>
+        <th>training_loss_final</th>
+        <td>Final value of the training loss after all
+        iterations have completed.</td>
+    </tr>
+    <tr>
+        <th>training_metrics</th>
+        <td>Array of training metrics as
+        per the 'metrics_compute_frequency' parameter.
+        For example, if 'metrics_compute_frequency=5'
+        this would be an array of metrics for every 5th
+        iteration, plus the last iteration.</td>
+    </tr>
+    <tr>
+        <th>training_loss</th>
+        <td>Array of training losses as
+        per the 'metrics_compute_frequency' parameter.
+        For example, if 'metrics_compute_frequency=5'
+        this would be an array of losses for every 5th
+        iteration, plus the last iteration.</td>
+    </tr>
+    <tr>
+        <th>validation_metrics_final</th>
+        <td>Final value of the validation
+        metric after all iterations have completed.
+        The metric reported is the one
+        specified in the 'metrics_type' parameter.</td>
+    </tr>
+    <tr>
+        <th>validation_loss_final</th>
+        <td>Final value of the validation loss after all
+        iterations have completed.</td>
+    </tr>
+    <tr>
+        <th>validation_metrics</th>
+        <td>Array of validation metrics as
+        per the 'metrics_compute_frequency' parameter.
+        For example, if 'metrics_compute_frequency=5'
+        this would be an array of metrics for every 5th
+        iteration, plus the last iteration.</td>
+    </tr>
+    <tr>
+        <th>validation_loss</th>
+        <td>Array of validation losses as
+        per the 'metrics_compute_frequency' parameter.
+        For example, if 'metrics_compute_frequency=5'
+        this would be an array of losses for every 5th
+        iteration, plus the last iteration.</td>
+    </tr>
+    <tr>
+        <th>metrics_iters</th>
+        <td>Array indicating the iterations for which
+        metrics are calculated, as derived from the
+        parameters 'num_iterations' and 'metrics_compute_frequency'.
+        For example, if 'num_iterations=5'
+        and 'metrics_compute_frequency=2', then 'metrics_iters' value
+        would be {2,4,5} indicating that metrics were computed
+        at iterations 2, 4 and 5 (at the end).
+        If 'num_iterations=5'
+        and 'metrics_compute_frequency=1', then 'metrics_iters' value
+        would be {1,2,3,4,5} indicating that metrics were computed
+        at every iteration.</td>
+    </tr>
+   </table>
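+
+After training, the summary table can be queried like any other table.
+A minimal sketch, assuming the model output table was named
+'iris_model' so that the summary table is 'iris_model_summary':
+<pre class="example">
+SELECT training_metrics_final, training_loss_final, metrics_iters
+FROM iris_model_summary;
+</pre>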
+
+@anchor keras_evaluate
+@par Evaluate
+The evaluation function has the following format:
+
+<pre class="syntax">
+madlib_keras_evaluate(
+    model_table,
+    test_table,
+    output_table,
+    gpus_per_host
+    )
+</pre>
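+
+As a sketch, an evaluation call might look like the following
+(the table names are hypothetical, and 'iris_test_packed' is assumed
+to be the preprocessed test dataset):
+<pre class="example">
+SELECT madlib.madlib_keras_evaluate(
+    'iris_model',        -- model table from a previous fit
+    'iris_test_packed',  -- preprocessed test dataset
+    'iris_eval_out',     -- output table for loss and metric
+    0                    -- gpus_per_host: use CPUs
+);
+</pre>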
+
+\b Arguments
+<dl class="arglist">
+
+  <DT>model_table</DT>
+  <DD>TEXT. Name of the table containing the model
+  to use for evaluation.
+  </DD>
+
+  <DT>test_table</DT>
+  <dd>TEXT. Name of the table containing the evaluation dataset.
+  Note that validation data must be preprocessed in the same
 
 Review comment:
   validation data -> test data

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
