GitHub user iyerr3 commented on a diff in the pull request:
https://github.com/apache/madlib/pull/246#discussion_r176661267
--- Diff: src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in ---
@@ -97,327 +264,220 @@ forest_train(training_table_name,
<tr>
<th>is_classification</th>
- <td>boolean. True if it is a classification model.</td>
+ <td>BOOLEAN. True if it is a classification model, false
+ if for regression.</td>
</tr>
<tr>
<th>source_table</th>
- <td>text. Data source table name.</td>
+ <td>TEXT. Data source table name.</td>
</tr>
<tr>
<th>model_table</th>
- <td>text. Model table name.</td>
+ <td>TEXT. Model table name.</td>
</tr>
<tr>
<th>id_col_name</th>
- <td>text. The ID column name.</td>
+ <td>TEXT. The ID column name.</td>
</tr>
<tr>
<th>dependent_varname</th>
- <td>text. Dependent variable.</td>
+ <td>TEXT. Dependent variable.</td>
</tr>
<tr>
- <th>independent_varname</th>
- <td>text. Independent variables</td>
+ <th>independent_varnames</th>
+ <td>TEXT. Independent variables</td>
</tr>
<tr>
<th>cat_features</th>
- <td>text. Categorical feature names.</td>
+ <td>TEXT. List of categorical features
+ as a comma-separated string.</td>
</tr>
<tr>
<th>con_features</th>
- <td>text. Continuous feature names.</td>
+ <td>TEXT. List of continuous feature
+ as a comma-separated string.</td>
</tr>
<tr>
- <th>grouping_col</th>
- <td>int. Names of grouping columns.</td>
+ <th>grouping_cols</th>
+ <td>INTEGER. Names of grouping columns.</td>
</tr>
<tr>
<th>num_trees</th>
- <td>int. Number of trees grown by the model.</td>
+ <td>INTEGER. Number of trees grown by the model.</td>
</tr>
<tr>
<th>num_random_features</th>
- <td>int. Number of features randomly selected for each split.</td>
+ <td>INTEGER. Number of features randomly selected for each split.</td>
</tr>
<tr>
<th>max_tree_depth</th>
- <td>int. Maximum depth of any tree in the random forest model_table.</td>
+ <td>INTEGER. Maximum depth of any tree in the random forest model_table.</td>
</tr>
<tr>
<th>min_split</th>
- <td>int. Minimum number of observations in a node for it to be split.</td>
+ <td>INTEGER. Minimum number of observations in a node for it to be split.</td>
</tr>
<tr>
<th>min_bucket</th>
- <td>int. Minimum number of observations in any terminal node.</td>
+ <td>INTEGER. Minimum number of observations in any terminal node.</td>
</tr>
<tr>
<th>num_splits</th>
- <td>int. Number of buckets for continuous variables.</td>
+ <td>INTEGER. Number of buckets for continuous variables.</td>
</tr>
<tr>
<th>verbose</th>
- <td>boolean. Whether or not to display debug info.</td>
+ <td>BOOLEAN. Whether or not to display debug info.</td>
</tr>
<tr>
<th>importance</th>
- <td>boolean. Whether or not to calculate variable importance.</td>
+ <td>BOOLEAN. Whether or not to calculate variable importance.</td>
</tr>
<tr>
<th>num_permutations</th>
- <td>int. Number of times feature values are permuted while calculating
- variable importance. The default value is 1.</td>
+ <td>INTEGER. Number of times feature values are permuted while calculating
+ variable importance.</td>
</tr>
<tr>
<th>num_all_groups</th>
- <td>int. Number of groups during forest training.</td>
+ <td>INTEGER. Number of groups during forest training.</td>
</tr>
<tr>
<th>num_failed_groups</th>
- <td>int. Number of failed groups during forest training.</td>
+ <td>INTEGER. Number of failed groups during forest training.</td>
</tr>
<tr>
<th>total_rows_processed</th>
- <td>bigint. Total numbers of rows processed in all groups.</td>
+ <td>BIG INTEGER. Total numbers of rows processed in all groups.</td>
--- End diff --
This should be `BIGINT`. Postgres doesn't expand the `INT` in this type name,
so `BIG INTEGER` is not a valid type. The same comment applies to the next
item as well.
---