Nandish Jayaram created MADLIB-1314:
---------------------------------------

             Summary: Add optional num_classes param for minibatch preprocessor 
for DL
                 Key: MADLIB-1314
                 URL: https://issues.apache.org/jira/browse/MADLIB-1314
             Project: Apache MADlib
          Issue Type: New Feature
          Components: Deep Learning, Module: Utilities
            Reporter: Nandish Jayaram
             Fix For: v1.16


The current `minibatch_preprocessor_dl` module scans the input table to find 
the number of distinct categories (class values) of the dependent variable 
and uses that number as the size of the one-hot-encoded array. This can cause 
the madlib_keras fit function to fail if the `num_classes` defined in the 
model architecture differs from (typically, is greater than) the size of the 
one-hot-encoded array.
This could be a fairly common scenario. For example, say the original data set 
is places 350, but we decide to train on a sampled subset. That subset may not 
contain all 350 classes (assume it has only 10), but the model we have already 
defined is for places 350, so `num_classes` there is 350 and the final layer 
has that many units. Without this feature, which creates a one-hot-encoded 
vector of size 350 despite finding only 10 class values in the input dataset, 
we would have to change the model architecture to work with the sampled 
dataset.
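
The mismatch described above can be illustrated with a small standalone 
sketch. This is not MADlib code: the helper `one_hot_width_from_data` and the 
sample labels are hypothetical, and the only assumption is that the encoding 
width is derived from the distinct labels present in the data.

```python
def one_hot_width_from_data(labels):
    """Width of the one-hot array when it is derived from the data alone."""
    return len(set(labels))


# Hypothetical numbers taken from the example: the model expects 350 classes,
# but the sampled subset happens to contain only 10 of them.
model_num_classes = 350
sampled_labels = [i % 10 for i in range(1000)]

encoded_width = one_hot_width_from_data(sampled_labels)

# The final layer has 350 units, but each encoded label has only 10 entries,
# so the shapes do not line up.
shapes_match = (encoded_width == model_num_classes)
```

Here `encoded_width` is 10 and `shapes_match` is false, which is the 
condition under which fitting the 350-class model would be expected to fail.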

Acceptance:
1. Add optional `num_classes` param of type integer.
1. The one-hot-encoded array must be of size `num_classes` if specified; 
otherwise use the number of distinct class values.
1. Fail if `num_classes` is less than the number of distinct class values 
found in the dataset.
1. `class_values` column in summary table must have `NULL` as the entry for 
class values that do not exist in the input table.
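
The acceptance criteria could be sketched in plain Python roughly as follows. 
This is only an illustration, not the MADlib implementation: the function 
names `build_class_values` and `one_hot_encode` are hypothetical, and sorting 
the class values and placing the `NULL` (here `None`) padding at the end are 
assumptions that the issue does not specify.

```python
def build_class_values(labels, num_classes=None):
    """Return the class_values list, padded with None up to num_classes.

    Assumes labels are non-null and mutually comparable (for sorting).
    """
    distinct = sorted(set(labels))
    if num_classes is None:
        # Parameter omitted: fall back to the distinct values in the data.
        return distinct
    if num_classes < len(distinct):
        # Fail when num_classes is smaller than what the data contains.
        raise ValueError(
            "num_classes (%d) is less than the number of distinct class "
            "values found in the dataset (%d)" % (num_classes, len(distinct))
        )
    # Assumption: missing classes are represented by None placed at the end.
    return distinct + [None] * (num_classes - len(distinct))


def one_hot_encode(label, class_values):
    """Encode one non-null label as a vector of length len(class_values)."""
    vector = [0] * len(class_values)
    vector[class_values.index(label)] = 1
    return vector


class_values = build_class_values(["cat", "dog", "cat"], num_classes=4)
encoded_dog = one_hot_encode("dog", class_values)
```

With these inputs, `class_values` has four entries, two of them `None`, and 
every encoded label has length 4 regardless of how many classes the sampled 
data actually contains.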



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
