[GitHub] madlib issue #251: MLP: Simplify initialization of model coefficients
Github user asfgit commented on the issue: https://github.com/apache/madlib/pull/251 Refer to this link for build results (access rights to CI server needed): https://builds.apache.org/job/madlib-pr-build/411/ ---
[GitHub] madlib issue #251: MLP: Simplify initialization of model coefficients
Github user asfgit commented on the issue: https://github.com/apache/madlib/pull/251 Refer to this link for build results (access rights to CI server needed): https://builds.apache.org/job/madlib-pr-build/410/ ---
[GitHub] madlib issue #251: MLP: Simplify initialization of model coefficients
Github user asfgit commented on the issue: https://github.com/apache/madlib/pull/251 Refer to this link for build results (access rights to CI server needed): https://builds.apache.org/job/madlib-pr-build/409/ ---
[GitHub] madlib pull request #251: MLP: Simplify initialization of model coefficients
GitHub user iyerr3 opened a pull request: https://github.com/apache/madlib/pull/251 MLP: Simplify initialization of model coefficients Changes: 1. Model initialization now happens in the C++ code instead of being passed via Python. 2. Warm start coefficients are still passed from Python. These coefficients are copied into the model. The vectorized form of the copy raises an Eigen assertion when used with a MappedMatrix, possibly due to the actual block size not being available via the Map. 3. The distinction between a "TaskState" and an "AlgoState" within a State type has been removed in the MLPMiniBatchState. This demarcation seemed to be confusing and didn't provide much abstraction. Closes #251 You can merge this pull request into a Git repository by running: $ git pull https://github.com/iyerr3/incubator-madlib bugfix/minibatch_init_debug Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/251.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #251 ---
[GitHub] madlib pull request #248: DT: Ensure proper quoting in grouping coalesce
Github user asfgit closed the pull request at: https://github.com/apache/madlib/pull/248 ---
[GitHub] madlib pull request #249: RF: Use NULL::integer[] when no continuous feature...
Github user asfgit closed the pull request at: https://github.com/apache/madlib/pull/249 ---
[GitHub] madlib pull request #246: DT and RF user doc updates
Github user asfgit closed the pull request at: https://github.com/apache/madlib/pull/246 ---
[GitHub] madlib issue #249: RF: Use NULL::integer[] when no continuous features
Github user iyerr3 commented on the issue: https://github.com/apache/madlib/pull/249 @kaknikhil Thanks for the suggestion. It makes sense to use that function with integer types excluded (since DT does not treat integer as continuous). I'll add that as a separate commit since it's unrelated to this PR. ---
[GitHub] madlib issue #249: RF: Use NULL::integer[] when no continuous features
Github user kaknikhil commented on the issue: https://github.com/apache/madlib/pull/249 The changes look good. +1 for adding an install check test. I noticed that `con_features` is populated by `decision_tree.py:_classify_features` which does something like `con_types = ['real', 'float8', 'double precision', 'numeric']`. Is it better to call the utilities function `is_psql_numeric_type` ? ---
[GitHub] madlib pull request #248: DT: Ensure proper quoting in grouping coalesce
Github user iyerr3 commented on a diff in the pull request: https://github.com/apache/madlib/pull/248#discussion_r177503935 --- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.py_in --- @@ -970,16 +970,35 @@ def _get_bins_grps( "one value are dropped from the tree model.") cat_features = [feature for feature in cat_features if feature in use_cat_features] -grp_to_col_to_row = dict((grp_key, dict( -(row['colname'], row['levels']) for row in items)) -for grp_key, items in groupby(all_levels, key=itemgetter('grp_key'))) - +# grp_col_to_levels is a list of tuples (pairs) with +# first value = group value, +# second value = a dict mapping a categorical column to its levels in data +# (these levels are specific to the group and can be different +# for different groups) +# The list of tuples can be converted to a dict, but the ordering +# will be lost. +# eg. grp_col_to_levels = +# [ +# ('3', {'vs': [0, 1], 'cyl': [4,6,8]}), +# ('4', {'vs': [0, 1], 'cyl': [4,6]}), +# ('5', {'vs': [0, 1], 'cyl': [4,6,8]}) +# ] +grp_to_col_to_levels = [ +(grp_key, dict((row['colname'], row['levels']) for row in items)) +for grp_key, items in groupby(all_levels, key=itemgetter('grp_key'))] if cat_features: -cat_items_list = [rows[col] for col in cat_features - for grp_key, rows in grp_to_col_to_row.items() if col in rows] +# Below statements collect the grp_to_col_to_levels into multiple variables +# From above eg. +# cat_items_list = [[0,1], [4,6,8], [0,1], [4,6], [0,1], [4,6,8]] +# cat_n = [2, 3, 2, 2, 2, 3] +# cat_n = [0, 1, 4, 6, 8, 0, 1, 4, 6, 0, 1, 4, 6, 8] --- End diff -- Yes, thanks for the catch. I'll update before merging. ---
[GitHub] madlib pull request #248: DT: Ensure proper quoting in grouping coalesce
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/248#discussion_r177495864 --- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.py_in --- @@ -970,16 +970,35 @@ def _get_bins_grps( "one value are dropped from the tree model.") cat_features = [feature for feature in cat_features if feature in use_cat_features] -grp_to_col_to_row = dict((grp_key, dict( -(row['colname'], row['levels']) for row in items)) -for grp_key, items in groupby(all_levels, key=itemgetter('grp_key'))) - +# grp_col_to_levels is a list of tuples (pairs) with +# first value = group value, +# second value = a dict mapping a categorical column to its levels in data +# (these levels are specific to the group and can be different +# for different groups) +# The list of tuples can be converted to a dict, but the ordering +# will be lost. +# eg. grp_col_to_levels = +# [ +# ('3', {'vs': [0, 1], 'cyl': [4,6,8]}), +# ('4', {'vs': [0, 1], 'cyl': [4,6]}), +# ('5', {'vs': [0, 1], 'cyl': [4,6,8]}) +# ] +grp_to_col_to_levels = [ +(grp_key, dict((row['colname'], row['levels']) for row in items)) +for grp_key, items in groupby(all_levels, key=itemgetter('grp_key'))] if cat_features: -cat_items_list = [rows[col] for col in cat_features - for grp_key, rows in grp_to_col_to_row.items() if col in rows] +# Below statements collect the grp_to_col_to_levels into multiple variables +# From above eg. +# cat_items_list = [[0,1], [4,6,8], [0,1], [4,6], [0,1], [4,6,8]] +# cat_n = [2, 3, 2, 2, 2, 3] +# cat_n = [0, 1, 4, 6, 8, 0, 1, 4, 6, 0, 1, 4, 6, 8] +# grp_key_cat = ['3', '4', '5'] --- End diff -- +1 for the examples, very helpful. ---
[GitHub] madlib pull request #248: DT: Ensure proper quoting in grouping coalesce
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/248#discussion_r177496541 --- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.py_in --- @@ -970,16 +970,35 @@ def _get_bins_grps( "one value are dropped from the tree model.") cat_features = [feature for feature in cat_features if feature in use_cat_features] -grp_to_col_to_row = dict((grp_key, dict( -(row['colname'], row['levels']) for row in items)) -for grp_key, items in groupby(all_levels, key=itemgetter('grp_key'))) - +# grp_col_to_levels is a list of tuples (pairs) with +# first value = group value, +# second value = a dict mapping a categorical column to its levels in data +# (these levels are specific to the group and can be different +# for different groups) +# The list of tuples can be converted to a dict, but the ordering +# will be lost. +# eg. grp_col_to_levels = +# [ +# ('3', {'vs': [0, 1], 'cyl': [4,6,8]}), +# ('4', {'vs': [0, 1], 'cyl': [4,6]}), +# ('5', {'vs': [0, 1], 'cyl': [4,6,8]}) +# ] +grp_to_col_to_levels = [ +(grp_key, dict((row['colname'], row['levels']) for row in items)) +for grp_key, items in groupby(all_levels, key=itemgetter('grp_key'))] if cat_features: -cat_items_list = [rows[col] for col in cat_features - for grp_key, rows in grp_to_col_to_row.items() if col in rows] +# Below statements collect the grp_to_col_to_levels into multiple variables +# From above eg. +# cat_items_list = [[0,1], [4,6,8], [0,1], [4,6], [0,1], [4,6,8]] +# cat_n = [2, 3, 2, 2, 2, 3] +# cat_n = [0, 1, 4, 6, 8, 0, 1, 4, 6, 0, 1, 4, 6, 8] --- End diff -- Should this be `cat_origin` instead? ---