[GitHub] madlib pull request #344: Add kd-tree option to knn.
GitHub user orhankislal opened a pull request: https://github.com/apache/madlib/pull/344 Add kd-tree option to knn. This commit adds a partial kd-tree implementation to be used for knn operations. This function is designed to work independently in case some future modules require its functionality. You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib feature/kd-tree Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/344.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #344 commit 05ed2e172070f0d49baf8b04aed5a3ba42c1f418 Author: Orhan Kislal Date: 2018-12-06T08:08:33Z Add kd-tree option to knn. This commit adds a partial kd-tree implementation to be used for knn operations. This function is designed to work independently in case some future modules require its functionality. ---
[GitHub] madlib pull request #343: Linear Regression: Support for JSON and special ch...
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/343#discussion_r245023325 --- Diff: src/ports/postgres/modules/regress/linear.py_in --- @@ -185,10 +221,12 @@ def _validate_args(schema_madlib, source_table, out_table, dependent_varname, if grouping_cols is not None: _assert(grouping_cols != '', "Linregr error: Invalid grouping columns name!") +# grouping columns can be a valid expression as well, for eg. +# a json expression (data->>'id'), so commenting this part. grouping_list = _string_to_array_with_quotes(grouping_cols) -_assert(columns_exist_in_table( -source_table, grouping_list, schema_madlib), -"Linregr error: Grouping column does not exist!") +#_assert(columns_exist_in_table( --- End diff -- We should clean up these comments before the merge. ---
[GitHub] madlib pull request #343: Linear Regression: Support for JSON and special ch...
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/343#discussion_r245022983 --- Diff: src/ports/postgres/modules/regress/linear.py_in --- @@ -134,11 +170,11 @@ def linregr_train(schema_madlib, source_table, out_table, 'linregr'::varchar as method , '{source_table}'::varchar as source_table , '{out_table}'::varchar as out_table -, '{dependent_varname}'::varchar as dependent_varname -, '{independent_varname}'::varcharas independent_varname +, $${dependent_varname}$$::varchar as dependent_varname +, $${independent_varname}$$::varcharas independent_varname , {num_rows_processed}::integer as num_rows_processed , {num_rows_skipped}::integer as num_missing_rows_skipped -, {grouping_col}::textas grouping_col +, $${grouping_col}$$::textas grouping_col --- End diff -- These additional quotes around the grouping columns break the PMML tests. ---
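The switch from `'{...}'` to `$${...}$$` in the diff above can be illustrated with a small sketch. It only shows how Python's `str.format` renders the two quoting styles; the helper names and the sample expression are made up for illustration and are not part of `linear.py_in`.

```python
def quote_single(value):
    """Embed value in a single-quoted SQL literal without escaping (old form)."""
    return "'{0}'::varchar".format(value)


def quote_dollar(value):
    """Embed value in a dollar-quoted SQL literal (form used in the diff)."""
    return "$${0}$$::varchar".format(value)


# Hypothetical column expression that itself contains single quotes.
expr = "data->>'id'"

broken = quote_single(expr)
safe = quote_dollar(expr)

print(broken)  # 'data->>'id''::varchar  (inner quotes end the literal early)
print(safe)    # $$data->>'id'$$::varchar
```

If the substituted value already carries its own quotes, or is the keyword `NULL`, the dollar-quoted form stores those characters literally. Whether that is what broke the PMML tests is not stated in the thread.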
[GitHub] madlib pull request #339: Build: Add PG11 Support
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/339#discussion_r237604025 --- Diff: src/ports/postgres/modules/kmeans/kmeans.sql_in --- @@ -766,15 +766,30 @@ BEGIN proc_fn_dist := fn_dist || '(DOUBLE PRECISION[], DOUBLE PRECISION[])'; -IF (SELECT prorettype != 'DOUBLE PRECISION'::regtype OR proisagg = TRUE -FROM pg_proc WHERE oid = proc_fn_dist) THEN -RAISE EXCEPTION 'Kmeans error: Distance function has wrong signature or is not a simple function.'; -END IF; -proc_agg_centroid := agg_centroid || '(DOUBLE PRECISION[])'; -IF (SELECT prorettype != 'DOUBLE PRECISION[]'::regtype OR proisagg = FALSE -FROM pg_proc WHERE oid = proc_agg_centroid) THEN -RAISE EXCEPTION 'Kmeans error: Mean aggregate has wrong signature or is not an aggregate.'; + +-- Handle PG11 pg_proc table changes --- End diff -- I tried this method but it requires casting `regprocedure` to `varchar`. This is allowed on PG versions after 8.3. On earlier versions, we have to use `textin` function. This means we will need another if check for GPDB4.3. ---
[GitHub] madlib pull request #339: Build: Add PG11 Support
GitHub user orhankislal opened a pull request: https://github.com/apache/madlib/pull/339 Build: Add PG11 Support JIRA: MADLIB-1283 PG11 support required a number of minor changes in the code. - Change TRUE/FALSE to true/false - Use TupleDescAttr function instead of direct access. - Use prokind column instead of proisagg. We also added a function to check if the PG version is earlier than 11 as well as the necessary cmake files. You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib build/pg-11-support Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/339.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #339 commit b63303c5bbcebeb82ab03694e4b3dade7d1827ab Author: Orhan Kislal Date: 2018-11-19T16:02:53Z Build: Add PG11 Support JIRA: MADLIB-1283 PG11 support required a number of minor changes in the code. - Change TRUE/FALSE to true/false - Use TupleDescAttr function instead of direct access. - Use prokind column instead of proisagg. We also added a function to check if the PG version is earlier than 11 as well as the necessary cmake files. ---
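The `prokind` versus `proisagg` change could be handled along these lines. This is only a sketch under stated assumptions: `aggregate_check_sql` is a hypothetical helper, not the function added in the PR, and it assumes the caller already knows the server's numeric version (for example 110000 for PostgreSQL 11.0). Greenplum reports different version numbers and would need its own branch.

```python
def aggregate_check_sql(server_version_num, signature):
    """Return a query that reports whether `signature` names an aggregate.

    Hypothetical helper: server_version_num uses PostgreSQL's numeric form,
    e.g. 110000 for 11.0 and 90600 for 9.6.
    """
    if server_version_num >= 110000:
        # PG11 replaced the boolean proisagg column with the char column
        # prokind, where 'a' marks an aggregate.
        predicate = "prokind = 'a'"
    else:
        predicate = "proisagg"
    return ("SELECT {pred} AS is_agg FROM pg_proc "
            "WHERE oid = '{sig}'::regprocedure").format(pred=predicate,
                                                        sig=signature)


print(aggregate_check_sql(110000, "madlib.avg(double precision[])"))
print(aggregate_check_sql(90600, "madlib.avg(double precision[])"))
```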
[GitHub] madlib pull request #337: Madpack: Add UDO and UDOC automation
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/337#discussion_r232282360 --- Diff: src/madpack/diff_udo.sql --- @@ -0,0 +1,81 @@ +-- +-- Licensed to the Apache Software Foundation (ASF) under one +-- or more contributor license agreements. See the NOTICE file +-- distributed with this work for additional information +-- regarding copyright ownership. The ASF licenses this file +-- to you under the Apache License, Version 2.0 (the +-- "License"); you may not use this file except in compliance +-- with the License. You may obtain a copy of the License at + +-- http://www.apache.org/licenses/LICENSE-2.0 + +-- Unless required by applicable law or agreed to in writing, +-- software distributed under the License is distributed on an +-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +-- KIND, either express or implied. See the License for the +-- specific language governing permissions and limitations +-- under the License. +-- + +SET client_min_messages to ERROR; +\x on + +CREATE OR REPLACE FUNCTION filter_schema(argstr text, schema_name text) +RETURNS text AS $$ +if argstr is None: +return "NULL" +return argstr.replace(schema_name + ".", '') +$$ LANGUAGE plpythonu; + +CREATE OR REPLACE FUNCTION alter_schema(argstr text, schema_name text) +RETURNS text AS $$ +if argstr is None: +return "NULL" +return argstr.replace(schema_name + ".", 'schema_madlib.') +$$ LANGUAGE plpythonu; + + +CREATE OR REPLACE FUNCTION get_udos(table_name text, schema_name text, + type_filter text) +RETURNS VOID AS +$$ +import plpy + +plpy.execute(""" +create table {table_name} AS +SELECT * +FROM ( +SELECT n.nspname AS "Schema", + o.oprname AS name, + filter_schema(o.oprcode::text, '{schema_name}') AS oprcode, + alter_schema(pg_catalog.format_type(o.oprleft, NULL), '{schema_name}') AS oprleft, + alter_schema(pg_catalog.format_type(o.oprright, NULL), '{schema_name}') AS oprright, + alter_schema(pg_catalog.format_type(o.oprresult, NULL), 
'{schema_name}') AS rettype +FROM pg_catalog.pg_operator o +LEFT JOIN pg_catalog.pg_namespace n ON n.oid = o.oprnamespace +WHERE n.nspname OPERATOR(pg_catalog.~) '^({schema_name})$' --- End diff -- I used the `\do madlib.*` command of `psql` as a basis. The corresponding query (you can see it if you start `psql` with `-E`) uses this particular clause to get all of the operators of a particular schema. Basically, this regex looks at the schema name (`n.nspname`) and filters out any schema whose name does not start (`^`) and end (`$`) with the madlib schema name, so only an exact match is kept. ---
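The anchored pattern and the `filter_schema` helper from the diff can be tried outside the database with a short Python sketch. `schema_matches` and the sample names are illustrative assumptions. Python's `re` is used here in place of PostgreSQL's `~` operator, which behaves the same way for this simple pattern.

```python
import re


def schema_matches(nspname, schema_name):
    """Mimic `nspname ~ '^(schema_name)$'`: keep only an exact name match."""
    pattern = "^({0})$".format(schema_name)
    return re.search(pattern, nspname) is not None


def filter_schema(argstr, schema_name):
    """Same logic as the filter_schema UDF in the diff: strip the schema prefix."""
    if argstr is None:
        return "NULL"
    return argstr.replace(schema_name + ".", "")


# Hypothetical schema names; only the exact match survives the filter.
candidates = ["madlib", "madlib_old_vers", "my_madlib", "public"]
kept = [s for s in candidates if schema_matches(s, "madlib")]

print(kept)                                            # ['madlib']
print(filter_schema("madlib.some_operator_func", "madlib"))  # some_operator_func
```

Note that the schema name is interpolated into the pattern unescaped, so a name containing regex metacharacters would need extra care.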
[GitHub] madlib pull request #337: Madpack: Add UDO and UDOC automation
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/337#discussion_r232279216 --- Diff: src/madpack/create_changelist.py --- @@ -237,6 +325,13 @@ print "Something went wrong! The changelist might be wrong/corrupted." raise finally: -os.system("rm -f /tmp/madlib_tmp_nm.txt /tmp/madlib_tmp_udf.txt " - "/tmp/madlib_tmp_udt.txt /tmp/madlib_tmp_cl.yaml " --- End diff -- Nice catch, it should still be removed. ---
[GitHub] madlib pull request #337: Madpack: Add UDO and UDOC automation
GitHub user orhankislal opened a pull request: https://github.com/apache/madlib/pull/337 Madpack: Add UDO and UDOC automation JIRA: MADLIB-1281 - Add scripts for detecting changed/dropped UDOs and UDOCs. - Expand the create_changelist.py file to consume these scripts and create changelists with these fields filled if necessary. - Fix the update_util.py to use the correct dictionary key. - Add drop operator class command to the svac.sql_in to make sure the old class is removed before creating the updated one. You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib madpack/complete-changelist Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/337.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #337 commit 09c3bb2e55417309a45f0729f370920273be40b4 Author: Orhan Kislal Date: 2018-10-24T12:55:34Z Madpack: Add UDO and UDOC automation JIRA: MADLIB-1281 - Add scripts for detecting changed/dropped UDOs and UDOCs. - Expand the create_changelist.py file to consume these scripts and create changelists with these fields filled if necessary. - Fix the update_util.py to use the correct dictionary key. - Add drop operator class command to the svac.sql_in to make sure the old class is removed before creating the updated one. ---
[GitHub] madlib pull request #333: Update version numbers to 1.16-dev
Github user orhankislal closed the pull request at: https://github.com/apache/madlib/pull/333 ---
[GitHub] madlib pull request #333: Update version numbers to 1.16-dev
GitHub user orhankislal opened a pull request: https://github.com/apache/madlib/pull/333 Update version numbers to 1.16-dev You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib release/new-version Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/333.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #333 commit 5f4fdce8bf976914d7b929817ca5fbff0f1029ec Author: Orhan Kislal Date: 2018-10-19T17:23:42Z Update version numbers to 1.16-dev ---
[GitHub] madlib issue #332: Update Dockerfile to use ubuntu 16.04
Github user orhankislal commented on the issue: https://github.com/apache/madlib/pull/332 Please do not merge this PR until we change the version to 1.16-dev. ---
[GitHub] madlib issue #331: Build: Include preflight and postflight scripts for mac
Github user orhankislal commented on the issue: https://github.com/apache/madlib/pull/331 Good catch +1 ---
[GitHub] madlib issue #329: Release/prep 1.15.1
Github user orhankislal commented on the issue: https://github.com/apache/madlib/pull/329 Thanks for the comments. @fmcquillan99 Regarding MADLIB-1171: the following commit about AO tables references this JIRA even though they are not related: https://github.com/madlib/madlib/commit/3db98babe3326fb5e2cd16d0639a2bef264f4b04. It is very strange, because the JIRA activity does not show that commit, yet it has no trouble catching the mention in your comment. ---
[GitHub] madlib pull request #325: Madpack/ic func schema
Github user orhankislal closed the pull request at: https://github.com/apache/madlib/pull/325 ---
[GitHub] madlib pull request #330: Margins: Copy summary table instead of renaming
GitHub user orhankislal opened a pull request: https://github.com/apache/madlib/pull/330 Margins: Copy summary table instead of renaming JIRA: MADLIB-1274 Margins summary table gets dropped since its schema remains pg_temp. This commit fixed the issue by copying the contents instead of renaming. You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib bugfix/margins-summary Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/330.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #330 commit 67bf28d9c196b969a925837eab0edda0de814193 Author: Orhan Kislal Date: 2018-09-24T14:34:33Z Margins: Copy summary table instead of renaming JIRA: MADLIB-1274 Margins summary table gets dropped since its schema remains pg_temp. This commit fixed the issue by copying the contents instead of renaming. ---
[GitHub] madlib pull request #329: Release/prep 1.15.1
GitHub user orhankislal opened a pull request: https://github.com/apache/madlib/pull/329 Release/prep 1.15.1 You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib release/prep-1.15.1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/329.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #329 commit d12a18bea42e914e0f7e4d550317537ce58daca3 Author: Orhan Kislal Date: 2018-09-28T07:21:40Z Build: Change version to 1.15.1 commit 8fb4f162a409e0ecdbd4b80b8ce3ff1bd050b90c Author: Orhan Kislal Date: 2018-09-28T10:30:30Z Update RELEASE_NOTES commit 6a8a3395761cae401b5b4b5bfc36259cc14db648 Author: Orhan Kislal Date: 2018-09-28T13:33:37Z Add 1.15.1 changelist and fix upgrade util. Upgrade was failing when functions without any arguments were added to the changelist. This commit fixes the issue by setting the argument list to empty string. ---
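The upgrade fix described in the last commit above (treating a missing argument list as an empty string) might look roughly like the following. This is a sketch only: `drop_function_sql` and the dictionary layout are assumptions based on the `rettype` and `argument` fields that appear in the changelist output, not the actual upgrade utility code.

```python
def drop_function_sql(schema, name, entry):
    """Build a DROP FUNCTION statement from one (hypothetical) changelist entry."""
    # A YAML key with no value loads as None; treat it as "no arguments"
    # so the statement gets an empty pair of parentheses.
    args = entry.get("argument") or ""
    return "DROP FUNCTION IF EXISTS {0}.{1}({2});".format(schema, name, args)


no_args = drop_function_sql("madlib", "f_no_args",
                            {"rettype": "text", "argument": None})
with_args = drop_function_sql("madlib", "f_two_args",
                              {"rettype": "text", "argument": "text, integer"})

print(no_args)    # DROP FUNCTION IF EXISTS madlib.f_no_args();
print(with_args)  # DROP FUNCTION IF EXISTS madlib.f_two_args(text, integer);
```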
[GitHub] madlib issue #325: Madpack/ic func schema
Github user orhankislal commented on the issue: https://github.com/apache/madlib/pull/325 Another alternative is to integrate the madlib deployment into the IC/DC. What I mean is similar to how PostGIS runs its unit tests. IC/DC creates a temporary database/schema, deploys the MADlib over there, runs the tests as usual and then removes the temporary database/schema. This will inevitably increase the running time of IC/DC but I believe it will be more stable. Since we assume that the user might be using the madlib deployment schema, it is also possible that they drop and/or recreate UDTs, UDFs and UDAs. Our IC/DC does not account for a case like that and will probably fail. It will also mean that a user can run IC/DC before they deploy it to their target database. I would assume most users are already using a similar workflow (temp database -> deploy MADlib -> run IC -> deploy MADlib on actual target) ---
[GitHub] madlib issue #325: Madpack/ic func schema
Github user orhankislal commented on the issue: https://github.com/apache/madlib/pull/325 > Also note: madpack is supposed to drop the file-specific schema after executing each file (see function _execute_per_module_install_dev_check_algo). Hence, common table names in independent tests within same module are not supposed to conflict with each other (if you've seen this happen then it requires investigation). Oh, I see. The naming convention is somewhat strange. Under the modules folder, we have a bunch of folders (graph, etc.) but they are not actually modules, the individual sql files are. The madpack code uses the variable `module` for the folder name which further muddles the naming. This means `madlib_installcheck_graph` schema will be created and dropped for each module in graph. We might want to change it to reflect the actual module name. I think the reused name issue might be more widespread than we think. I am pretty sure `abalone` and `houses` datasets are used in multiple modules. I think removing the `DROP TABLE` statements might work as @iyerr3 suggested. I'll keep the PR open for now to keep the conversation open and start working on a different branch. ---
[GitHub] madlib issue #325: Madpack/ic func schema
Github user orhankislal commented on the issue: https://github.com/apache/madlib/pull/325 I agree this is not a great solution. Casting the operators makes it especially awkward to use. However, we have to consider the following case. If a module has multiple test files like `graph` and if they re-use the same table names like `vertex`, then we have to drop them before re-creating. ---
[GitHub] madlib issue #325: Madpack/ic func schema
Github user orhankislal commented on the issue: https://github.com/apache/madlib/pull/325 @iyerr3 @jingyimei I'll put the following in the commit message before merging.
This commit fixes the following potential issue.
1. User deploys MADlib on the schema `madlib1`.
2. User creates a table named `vertex` in the `madlib1` schema.
3. User runs install-check.
4. The install check creates a new role and a new schema for each module in the database.
5. The install check sets the `search_path` to `madlib_installcheck_, madlib1`.
6. The graph IC calls `DROP TABLE IF EXISTS vertex` and fails because the vertex table does exist but it is not owned by the install-check role.
This commit removes the madlib installation schema from the search path so that the install check only uses its own schema. This means every madlib function, type and operator has to be referenced explicitly with the madlib schema name.
One alternative solution is eliminating the `drop table` commands from the tests, but that would require very complicated refactoring, since most of the tests are written to reuse the same output table names.
Another alternative is changing the `drop table` and `create table` commands to use the newly created test schema. However, this is very tricky to test; if a developer forgets to put the schema name, the test will still work unless she also creates a table of the same name in the madlib deployment schema. ---
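The failure scenario in the steps above comes down to how an unqualified table name is resolved along the `search_path`. The toy model below is an illustration only, not PostgreSQL's implementation. The `catalog` dictionary stands in for the database, and the test schema name `madlib_installcheck_graph` is borrowed from elsewhere in the thread.

```python
def resolve(table, search_path, catalog):
    """Return the first schema on search_path that contains table, else None."""
    for schema in search_path:
        if table in catalog.get(schema, set()):
            return schema
    return None


# Toy catalog: the user created `vertex` in the MADlib schema, and the
# install-check schema is still empty.
catalog = {
    "madlib_installcheck_graph": set(),
    "madlib1": {"vertex"},
}

# With madlib1 on the path, an unqualified `vertex` resolves to the user's table.
before = resolve("vertex", ["madlib_installcheck_graph", "madlib1"], catalog)
# With madlib1 removed, `DROP TABLE IF EXISTS vertex` finds nothing to drop.
after = resolve("vertex", ["madlib_installcheck_graph"], catalog)

print(before)  # madlib1
print(after)   # None
```

The trade-off is the one described in the message: once `madlib1` is off the path, every MADlib function, type and operator must be schema-qualified in the tests.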
[GitHub] madlib pull request #325: Madpack/ic func schema
GitHub user orhankislal opened a pull request: https://github.com/apache/madlib/pull/325 Madpack/ic func schema IC/DC was prone to failure if the user were creating tables in the madlib schema. This commit fixes the potential issue by removing the madlib from the search path and adding the madlib_schema keyword for every function, type and operator that is created by madlib. You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib madpack/ic-func-schema Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/325.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #325 commit c762817926abc36191cd27d77c4a2ed7b2ec8151 Author: Orhan Kislal Date: 2018-09-27T13:30:41Z IC/DC: Remove madlib schema IC/DC IC/DC was prone to failure if the user were creating tables in the madlib schema. This commit fixes the potential issue by removing the madlib from the search path and adding the madlib_schema keyword for every function, type and operator that is created by madlib. commit dd1389639232dce64e359cef923941103e37f3a6 Author: Orhan Kislal Date: 2018-09-27T13:56:36Z Fix double schema errors commit 325d70c1f9d24b7abb390270d2d2986e86cabba4 Author: Orhan Kislal Date: 2018-09-27T16:18:18Z Revert documentation changes ---
[GitHub] madlib pull request #324: Madpack/ic func schema
Github user orhankislal closed the pull request at: https://github.com/apache/madlib/pull/324 ---
[GitHub] madlib pull request #324: Madpack/ic func schema
GitHub user orhankislal opened a pull request: https://github.com/apache/madlib/pull/324 Madpack/ic func schema IC/DC was prone to failure if the user were creating tables in the madlib schema. This commit fixes the potential issue by removing the madlib from the search path and adding the madlib_schema keyword for every function, type and operator that is created by madlib. You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib madpack/ic-func-schema Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/324.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #324 commit c762817926abc36191cd27d77c4a2ed7b2ec8151 Author: Orhan Kislal Date: 2018-09-27T13:30:41Z IC/DC: Remove madlib schema IC/DC IC/DC was prone to failure if the user were creating tables in the madlib schema. This commit fixes the potential issue by removing the madlib from the search path and adding the madlib_schema keyword for every function, type and operator that is created by madlib. commit dd1389639232dce64e359cef923941103e37f3a6 Author: Orhan Kislal Date: 2018-09-27T13:56:36Z Fix double schema errors ---
[GitHub] madlib pull request #322: Madpack devcheck schema
Github user orhankislal closed the pull request at: https://github.com/apache/madlib/pull/322 ---
[GitHub] madlib pull request #322: Madpack devcheck schema
GitHub user orhankislal opened a pull request: https://github.com/apache/madlib/pull/322 Madpack devcheck schema You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib madpack-devcheck-schema Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/322.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #322 commit 33e4634b5d11b9ae29de3dece5b2ba96c7257aed Author: Orhan Kislal Date: 2018-09-24T14:33:05Z IC: Add schema for test cases IC/DC was prone to failure if the user were creating tables in the madlib schema. This commit fixes the potential issue by adding the madlib_test_schema in the madpack and test cases. commit 22c3e31560aa50601ea185bdc3da613efd6d40b7 Author: Orhan Kislal Date: 2018-09-24T14:34:33Z Margins: Copy summary table instead of renaming JIRA: MADLIB-1274 Margins summary table gets dropped since its schema remains pg_temp. This commit fixed the issue by copying the contents instead of renaming. ---
[GitHub] madlib issue #321: RF: Increase the dataset size of dev-check test
Github user orhankislal commented on the issue: https://github.com/apache/madlib/pull/321 I hope so. I tried over 3000 runs with this fix in and did not get a single error. ---
[GitHub] madlib pull request #321: RF: Increase the dataset size of dev-check test
GitHub user orhankislal opened a pull request: https://github.com/apache/madlib/pull/321 RF: Increase the dataset size of dev-check test You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib rf-devc-fix Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/321.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #321 commit d7922ef9ba06fe2b25868f261c365307b9d97141 Author: Orhan Kislal Date: 2018-09-21T13:02:19Z RF: Increase the dataset size of dev-check test ---
[GitHub] madlib pull request #318: Madpack: Add a script for automating changelist cr...
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/318#discussion_r217953051 --- Diff: src/madpack/create_changelist.py --- @@ -0,0 +1,239 @@ +#!/usr/bin/python +# -- +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+# -- + +# Create changelist for any two branches/tags + +# Prequisites: +# The old version has to be installed in the "madlib_old_vers" schema +# The new version has to be installed in the "madlib" (default) schema +# Two branches/tags must exist locally (run 'git fetch' to ensure you have the latest version) +# The current branch does not matter + +# Usage (must be executed in the src/madpack directory): +# python create_changelist.py +# If you are using the master branch, plase make sure to edit the branch/tag in the output file + +# Example (should be equivalent to changelist_1.13_1.14.yaml): +# python create_changelist.py madlib rel/v1.13 rel/v1.14 chtest1.yaml + +import sys +import os + +database = sys.argv[1] +old_vers = sys.argv[2] +new_vers = sys.argv[3] +ch_filename = sys.argv[4] + +if os.path.exists(ch_filename): +print "{0} already exists".format(ch_filename) +raise SystemExit + +err1 = os.system("""psql {0} -l > /dev/null""".format(database)) +if err1 != 0: +print "Database {0} does not exist".format(database) +raise SystemExit + +err1 = os.system("""psql {0} -c "select madlib_old_vers.version()" > /dev/null + """.format(database)) +if err1 != 0: +print "MADlib is not installed in the madlib_old_vers schema. Please refer to the Prequisites." +raise SystemExit + +err1 = os.system("""psql {0} -c "select madlib.version()" > /dev/null + """.format(database)) +if err1 != 0: +print "MADlib is not installed in the madlib schema. Please refer to the Prequisites." +raise SystemExit + +print "Creating changelist {0}".format(ch_filename) +os.system("rm -f /tmp/madlib_tmp_nm.txt /tmp/madlib_tmp_udf.txt /tmp/madlib_tmp_udt.txt") +try: +# Find the new modules using the git diff +err1 = os.system("git diff {old_vers} {new_vers} --name-only --diff-filter=A > /tmp/madlib_tmp_nm.txt".format(**locals())) +if err1 != 0: +print "Git diff failed. Please ensure that branches/tags are fetched." 
+raise SystemExit + +f = open("/tmp/madlib_tmp_cl.yaml", "w") +f.write( +"""# -- +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# -- +""") + +f.write(
[GitHub] madlib issue #318: Madpack: Add a script for automating changelist creation
Github user orhankislal commented on the issue: https://github.com/apache/madlib/pull/318 @kaknikhil I checked the 1.11 -> 1.12 scenario. The tool is missing the `tree_train` and `forest_train` entries. The change seems to be the removal of `surrogate_params` and the addition of `null_handling_params`. Since both of them are of type `text`, the change does not get picked up by the `diff_udf.sql` script. ---
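The blind spot described above can be reproduced with a small sketch: a comparison based only on argument types cannot see a parameter being renamed when the type stays `text`. The shortened argument lists and helper names below are illustrative assumptions, not the real `tree_train` signature or the logic of `diff_udf.sql`.

```python
def type_signature(args):
    """Reduce a list of (name, type) pairs to the tuple of types only."""
    return tuple(arg_type for _, arg_type in args)


def changed_by_types(old_args, new_args):
    """Type-only comparison, similar in spirit to a signature diff."""
    return type_signature(old_args) != type_signature(new_args)


def changed_by_names(old_args, new_args):
    """Comparison that also looks at the parameter names."""
    return [n for n, _ in old_args] != [n for n, _ in new_args]


# Hypothetical, shortened argument lists for the old and new versions.
old = [("training_table", "text"), ("surrogate_params", "text")]
new = [("training_table", "text"), ("null_handling_params", "text")]

print(changed_by_types(old, new))  # False: the rename is invisible
print(changed_by_names(old, new))  # True
```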
[GitHub] madlib issue #318: Madpack: Add a script for automating changelist creation
Github user orhankislal commented on the issue: https://github.com/apache/madlib/pull/318 Thanks for the review @kaknikhil. 1. We had an issue with the knn help messages during one of the releases because they didn't show up in the `diff_udf.sql` output. IIRC, the problem was that the same function was converted from a pure SQL function to a plpython function. I don't have a quick solution to identify such cases. 2. This indentation should be OK. 3. We can check if the `new_vers` tag/branch exists and use `master` in its place. Alternatively, we can change the comment to avoid using the new version. It does not have any functional value. ---
[GitHub] madlib pull request #318: Madpack: Add a script for automating changelist cr...
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/318#discussion_r217361584 --- Diff: src/madpack/create_changelist.py --- @@ -0,0 +1,229 @@ +#!/usr/bin/python +# -- +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+# -- + +# Create changelist for any two branches/tags + +# Prequisites: +# The old version has to be installed in the "madlib_old_vers" schema +# The new version has to be installed in the "madlib" (default) schema +# Two branches/tags must exist locally (run 'git fetch' to ensure you have the latest version) +# The current branch does not matter + +# Usage (must be executed in the src/madpack directory): +# python create_changelist.py + +# Example (should be equivalent to changelist_1.13_1.14.yaml): +# python create_changelist.py madlib rel/v1.13 rel/v1.14 chtest1.yaml + +import sys +import os + +database = sys.argv[1] +old_vers = sys.argv[2] +new_vers = sys.argv[3] +ch_filename = sys.argv[4] + +if os.path.exists(ch_filename): +print "{0} already exists".format(ch_filename) +raise SystemExit + +err1 = os.system("""psql {0} -l > /dev/null""".format(database)) +if err1 != 0: +print "Database {0} does not exist".format(old_vers) +raise SystemExit + +err1 = os.system("""psql {0} -c "select madlib_old_vers.version()" > /dev/null + """.format(database)) +if err1 != 0: +print "Schema {0} does not exist".format(old_vers) +raise SystemExit + +err1 = os.system("""psql {0} -c "select madlib.version()" > /dev/null + """.format(database)) +if err1 != 0: +print "Schema {0} does not exist".format(new_vers) +raise SystemExit + --- End diff -- That would be tricky for branches. `madlib.version()` gives a on a branch (and just a tag on tags). We could use `git rev-parse --short HEAD` to get the commit hash, but that seems to complicate the code just to hand-hold the developer who uses it. ---
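For reference, reading the abbreviated commit hash mentioned in the comment above takes only a few lines. This is a sketch, not code from the PR: `short_head` is a hypothetical helper, and it returns `None` when `git` is unavailable or the directory is not a repository, so callers can decide how strict to be.

```python
import subprocess


def short_head(repo_dir="."):
    """Return the abbreviated hash of HEAD in repo_dir, or None on failure."""
    try:
        out = subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"],
            cwd=repo_dir, stderr=subprocess.DEVNULL)
    except (subprocess.CalledProcessError, OSError):
        # Not a git repository, or git is not installed.
        return None
    return out.decode("ascii").strip()


head = short_head()
print(head if head is not None else "not inside a git repository")
```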
[GitHub] madlib pull request #318: Madpack: Add a script for automating changelist cr...
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/318#discussion_r217358090 --- Diff: src/madpack/diff_udf.sql --- @@ -11,71 +11,13 @@ RETURNS text AS $$ $$ LANGUAGE plpythonu; -CREATE OR REPLACE FUNCTION get_functions(schema_name text) +CREATE OR REPLACE FUNCTION get_functions(table_name text, schema_name text, + type_filter text) RETURNS VOID AS $$ import plpy plpy.execute(""" -CREATE TABLE functions_madlib_new_version AS -SELECT -"schema", "name", filter_schema("retype", 'madlib') retype, -filter_schema("argtypes", 'madlib') argtypes, "type" -FROM -( - -SELECT n.nspname as "schema", --- End diff -- It was duplicated code. I just added another parameter to the function and reused it. ---
[GitHub] madlib issue #318: Madpack: Add a script for automating changelist creation
Github user orhankislal commented on the issue: https://github.com/apache/madlib/pull/318 Thanks for the comments @kaknikhil. I addressed them and added support for return-type-based dependencies. It would be great if you could take another look at the latest version. ---
[GitHub] madlib pull request #318: Madpack: Add a script for automating changelist cr...
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/318#discussion_r216567704 --- Diff: src/madpack/diff_udf.sql --- @@ -142,9 +142,12 @@ DROP TABLE IF EXISTS functions_madlib_new_version; SELECT get_functions('madlib_old_vers'); SELECT +type, --'\t-' || name || ':' || '\n\t\t-rettype: ' || retype || '\n\t\t-argument: ' || argtypes -'- ' || name || ':' || '\nrettype: ' || retype || '\n argument: ' || argtypes AS "Dropped UDFs" -, type +'- ' || name || ':' AS "Dropped UDF part1", --- End diff -- `rettype` was not removed, I just changed the column names to `Dropped UDF part1` format so that I can easily parse them. The newline characters were just complicating the output for no reason. ---
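A rough sketch of why separate columns are easier to work with: once the name, return type and argument types arrive as plain fields, the changelist entry can be assembled on the Python side instead of with embedded `\n` characters in SQL. The function name `format_udf_entry` and the indentation width are assumptions made for illustration; only the `rettype` and `argument` keys come from the diff above.

```python
def format_udf_entry(name, rettype, argtypes, indent="    "):
    """Assemble one changelist entry from already-separated fields.

    Hypothetical helper: the real script may lay out its output differently.
    """
    lines = [
        "{0}- {1}:".format(indent, name),
        "{0}rettype: {1}".format(indent * 2, rettype),
        "{0}argument: {1}".format(indent * 2, argtypes),
    ]
    return "\n".join(lines)
```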
[GitHub] madlib pull request #318: Madpack: Add a script for automating changelist cr...
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/318#discussion_r216566555 --- Diff: src/madpack/create_changelist.py --- @@ -0,0 +1,132 @@ +#!/usr/bin/python +# -- +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# -- + +# Create changelist for any two branches + +# Prequisites: +# The old version has to be installed in the "madlib_old_vers" schema +# The new version has to be installed in the "madlib" (default) schema +# Two branches must exist locally (run 'git fetch' to ensure you have the latest version) --- End diff -- They can be branches as well as tags. ---
[GitHub] madlib pull request #318: Madpack: Add a script for automating changelist cr...
GitHub user orhankislal opened a pull request: https://github.com/apache/madlib/pull/318 Madpack: Add a script for automating changelist creation You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib madpack/auto-changelist Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/318.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #318 commit 28528b21ecb9b53d03de7683ff7e8db2bb409675 Author: Orhan Kislal Date: 2018-09-04T12:50:50Z Madpack: Add a script for automating changelist creation ---
[GitHub] madlib pull request #311: Vector to columns: added support for splitting arr...
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/311#discussion_r210687181 --- Diff: src/ports/postgres/modules/utilities/test/unit_tests/test_transform_vec_cols.py_in --- @@ -125,23 +125,24 @@ class Vec2ColsTestSuite(unittest.TestCase): def test_get_names_for_split_output_cols_feature_names_none(self): self.plpy_mock_execute.return_value = [{"n_x": 3}] -new_cols = self.subject.get_names_for_split_output_cols(self.default_source_table, 'foobar', None) +new_cols = self.subject.get_names_for_split_output_cols(self.default_source_table, 'foobar') self.assertEqual(['f1', 'f2', 'f3'], new_cols) -def test_get_names_for_split_output_cols_feature_names_not_none(self): -self.plpy_mock_execute.return_value = [{"n_x": 3}] -new_cols = self.subject.get_names_for_split_output_cols(self.default_source_table, 'foobar', ['a', 'b', 'c']) -self.assertEqual(['a', 'b', 'c'], new_cols) +# def test_get_names_for_split_output_cols_feature_names_not_none(self): --- End diff -- We should remove these commented lines. ---
[GitHub] madlib pull request #302: Remove online examples
Github user orhankislal closed the pull request at: https://github.com/apache/madlib/pull/302 ---
[GitHub] madlib pull request #306: Ubuntu support
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/306#discussion_r208064138 --- Diff: deploy/DEB/postinst --- @@ -0,0 +1,46 @@ +#!/bin/sh +# +# coding=utf-8 +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# Source debconf library. +. /usr/share/debconf/confmodule + +MADLIB_VERSION="1.15-dev" +MADLIB_INSTALL_PATH="InstallPathNotFound" + +# Fetching configuration from debconf +db_get madlib/installpath +MADLIB_INSTALL_PATH=$RET + +# Remove existing soft links --- End diff -- This is not as trivial as it seems. CPackDeb accepts a few extra control files (`postinst`, `postrm`, etc.) but not arbitrary files. We would have to put the common file into a different folder such as `/usr/share/madlib`, since we have to access this file after madlib is removed. ---
[GitHub] madlib issue #306: Ubuntu support
Github user orhankislal commented on the issue: https://github.com/apache/madlib/pull/306 Thanks for the comments, Rahul. I am currently testing the new changes, no need to test these temporary commits. ---
[GitHub] madlib pull request #306: Ubuntu support
GitHub user orhankislal opened a pull request: https://github.com/apache/madlib/pull/306 Ubuntu support You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib ubuntu-support Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/306.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #306 commit be33d4df22e13a80f5d4824d44d0b5d45bd3e892 Author: Nandish Jayaram Date: 2018-07-25T00:27:59Z Ubuntu support for MADlib JIRA: MADLIB-1256 Adds support for compiling on Ubuntu as well as creating a deb package. Co-authored-by: Domino Valdano Co-authored-by: Jingyi Mei Co-authored-by: Orhan Kislal commit 1b9deaf88c82e98a3549adb5eac3ce93bdc485fd Author: Orhan Kislal Date: 2018-08-01T23:47:57Z rename cmakelists commit c6045c63c155fca153f2d245654c1bf2dbb66676 Author: Orhan Kislal Date: 2018-08-02T17:36:43Z Fix licenses ---
[GitHub] madlib pull request #305: Change the version to 1.15 and add changelist
GitHub user orhankislal opened a pull request: https://github.com/apache/madlib/pull/305 Change the version to 1.15 and add changelist Co-authored-by: Nandish Jayaram You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib 1-15-changelist Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/305.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #305 commit 588b1268ed61f6f06181c58f6de9902e0d95d291 Author: Orhan Kislal Date: 2018-08-01T21:49:46Z Change the version to 1.15 and add changelist Co-authored-by: Nandish Jayaram ---
[GitHub] madlib pull request #304: MLP: Add test for iterations_per_step
Github user orhankislal closed the pull request at: https://github.com/apache/madlib/pull/304 ---
[GitHub] madlib pull request #304: MLP: Add test for iterations_per_step
GitHub user orhankislal opened a pull request: https://github.com/apache/madlib/pull/304 MLP: Add test for iterations_per_step You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib bugfix/mlp-grouping-ans Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/304.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #304 commit b87199d540084da3d72e62e410a34e67b8159b45 Author: Orhan Kislal Date: 2018-07-30T22:14:22Z MLP: Add test for iterations_per_step ---
[GitHub] madlib pull request #303: Add test for iterations_per_step
GitHub user orhankislal opened a pull request: https://github.com/apache/madlib/pull/303 Add test for iterations_per_step You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib bugfix/mlp-grouping-ans Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/303.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #303 commit 27fa25ae6ca4c8cc167c475c2f368c53fafabd20 Author: Orhan Kislal Date: 2018-07-30T22:14:22Z Add test for iterations_per_step ---
[GitHub] madlib pull request #303: Add test for iterations_per_step
Github user orhankislal closed the pull request at: https://github.com/apache/madlib/pull/303 ---
[GitHub] madlib pull request #302: Remove online examples
GitHub user orhankislal opened a pull request: https://github.com/apache/madlib/pull/302 Remove online examples JIRA: MADLIB-1260 You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib remove-online-examples Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/302.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #302 commit 7a857db9afbd32bbb3df8747742a754bf8998ab8 Author: Nandish Jayaram Date: 2018-07-27T23:18:33Z Remove examples from online docs for subset of modules commit 5caf22418e3ac273c0466597a5c8359d6f5ceaec Author: Orhan Kislal Date: 2018-07-28T00:08:04Z Remove on-line examples ---
[GitHub] madlib pull request #291: Feature: Vector-Column Transformations
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/291#discussion_r205914673 --- Diff: src/ports/postgres/modules/utilities/transform_vec_cols.py_in --- @@ -0,0 +1,495 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ +import plpy +from control import MinWarning +from internal.db_utils import is_col_1d_array +from internal.db_utils import quote_literal +from utilities import _assert +from utilities import add_postfix +from utilities import ANY_ARRAY +from utilities import is_valid_psql_type +from utilities import py_list_to_sql_string +from utilities import split_quoted_delimited_str +from validate_args import is_var_valid +from validate_args import explicit_bool_to_text +from validate_args import get_cols +from validate_args import get_cols_and_types +from validate_args import get_expr_type +from validate_args import input_tbl_valid +from validate_args import output_tbl_valid +from validate_args import table_exists + +class vec_cols_helper: +def __init__(self): +self.all_cols = None + +def get_cols_as_list(self, cols_to_process, source_table=None, exclude_cols=None): +""" +Get a list of columns based on the value of cols_to_process +Args: +@param cols_to_process: str, Either a * or a comma-separated list of col names +@param source_table: str, optional. Source table name +@param exclude_cols: str, optional. Comma-separated list of the col(s) to exclude + from the source table, only used if cols_to_process is * +Returns: +A list of column names (or an empty list) +""" +# If cols_to_process is empty/None, return empty list +if not cols_to_process: +return [] +if cols_to_process.strip() != "*": +# If cols_to_process is a comma separated list of names, return list +# of column names in cols_to_process. 
+return [col for col in split_quoted_delimited_str(cols_to_process) +if col not in split_quoted_delimited_str(exclude_cols)] +if source_table: +if not self.all_cols: +self.all_cols = get_cols(source_table) +return [col for col in self.all_cols +if col not in split_quoted_delimited_str(exclude_cols)] +return [] + +class vec2cols: +def __init__(self): +self.get_cols_helper = vec_cols_helper() +self.module_name = self.__class__.__name__ + +def validate_args(self, source_table, output_table, vector_col, feature_names, + cols_to_output): +""" +Validate args for vec2cols +""" +input_tbl_valid(source_table, self.module_name) +output_tbl_valid(output_table, self.module_name) +is_var_valid(source_table, cols_to_output) +is_var_valid(source_table, vector_col) +_assert(is_valid_psql_type(get_expr_type(vector_col, source_table), ANY_ARRAY), +"{0}: vector_col should refer to an array.".format(self.module_name)) +_assert(is_col_1d_array(source_table, vector_col), +"{0}: vector_col must be a 1-dimensional array.".format(self.module_name)) + +def get_names_for_split_output_cols(self, source_table, vector_col, feature_names): +""" +Get list of names for the newly-split columns to include in the +output table. +Args: +@param: source_table, str. Source table +@param: vector_col, str. Column name containing the array input +@param: feature_names, list. Python list of the feat
[GitHub] madlib pull request #291: Feature: Vector-Column Transformations
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/291#discussion_r205914522 --- Diff: src/ports/postgres/modules/utilities/validate_args.py_in --- @@ -513,11 +513,12 @@ def array_col_has_same_dimension(tbl, col): # -def explicit_bool_to_text(tbl, cols, schema_madlib): +def explicit_bool_to_text(tbl, cols, schema_madlib, is_forced=False): --- End diff -- w/ @ArvindSridhar On platforms that have bool-to-text casting (gpdb5, pg 9.6, pg 10), we still need this to make sure we can create an array of bool and text types. ---
[GitHub] madlib pull request #:
Github user orhankislal commented on the pull request: https://github.com/apache/madlib/commit/1fe308c70d2c91fef508d29d81ed0e93da429eb6#commitcomment-29832298 In src/madpack/madpack.py on line 973: Let's say the user has 1.13 installed on a schema, uses rpm to get 1.14, but doesn't run the `madpack upgrade` command and then tries the install-check. That case is caught here, and `Versions do not match` is accurate. Checking for a schema that does not exist needs a separate `if` check. I'll make a new commit to catch it. ---
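The distinction between the two failure cases could look roughly like the sketch below. The function name, its arguments and the wording of the first message are hypothetical; only the `Versions do not match` text is taken from the comment, and the actual check in `madpack.py` may be structured differently.

```python
def precheck_install_check(installed_version, package_version, schema="madlib"):
    """Return an error message, or None when install-check may proceed.

    installed_version is assumed to be None when nothing is installed in
    the schema (or the schema itself is missing).
    """
    if installed_version is None:
        # Separate check: nothing to compare against at all.
        return "MADlib is not installed in schema '{0}'".format(schema)
    if installed_version != package_version:
        # For example 1.13 in the database but 1.14 from the rpm, no upgrade run.
        return "Versions do not match ({0} vs {1})".format(
            installed_version, package_version)
    return None
```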
[GitHub] madlib pull request #297: Madpack: Fix various schema related bugs
GitHub user orhankislal opened a pull request: https://github.com/apache/madlib/pull/297 Madpack: Fix various schema related bugs You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib bugfix/madpack-extra Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/297.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #297 commit 31bfd8d3913576472dcb65feca5cb2dda01fa458 Author: Orhan Kislal Date: 2018-07-20T22:36:11Z Madpack: Fix various schema related bugs ---
[GitHub] madlib pull request #289: RF: Add impurity variable importance
GitHub user orhankislal opened a pull request: https://github.com/apache/madlib/pull/289 RF: Add impurity variable importance JIRA: MADLIB-1205 This commit makes the following changes: - Add impurity variable importance for random forests. - Rename current cat_var_importance and con_var_importance measurements to oob_cat_var_importance and oob_con_var_importance. New impurity measurement is provided as impurity_var_importance, and supports grouping. It combines the importance values for both categorical and continuous features into a single array. Co-authored-by: Rahul Iyer Co-authored-by: Jingyi Mei Co-authored-by: Arvind Sridhar Co-authored-by: Nandish Jayaram You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib rf_gini_importance Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/289.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #289 commit 622d46a85f4264fdc94bd41dc66a23f1aa2c3ed6 Author: Rahul Iyer Date: 2018-07-10T00:34:33Z RF: Add impurity variable importance JIRA: MADLIB-1205 This commit makes the following changes: - Add impurity variable importance for random forests. - Rename current cat_var_importance and con_var_importance measurements to oob_cat_var_importance and oob_con_var_importance. New impurity measurement is provided as impurity_var_importance, and supports grouping. It combines the importance values for both categorical and continuous features into a single array. Co-authored-by: Rahul Iyer Co-authored-by: Jingyi Mei Co-authored-by: Arvind Sridhar Co-authored-by: Nandish Jayaram ---
[GitHub] madlib pull request #286: Build: Remove symlinks during rpm uninstall
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/286#discussion_r200793992 --- Diff: pom.xml --- @@ -66,6 +66,7 @@ deploy/preflight.sh deploy/RPM/CMakeLists.txt deploy/rpm_post.sh + deploy/rpm_post_uninstall.sh --- End diff -- Since this is a new file, we should add the apache license instead of adding it to the exclude list. ---
[GitHub] madlib pull request #276: Feature/dev check
Github user orhankislal closed the pull request at: https://github.com/apache/madlib/pull/276 ---
[GitHub] madlib pull request #276: Feature/dev check
GitHub user orhankislal opened a pull request: https://github.com/apache/madlib/pull/276 Feature/dev check You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib feature/dev-check Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/276.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #276 commit 50a35b2e5799a5ddf6f4d23a7bca4566dbd1d05b Author: Nandish Jayaram Date: 2018-06-06T23:42:56Z Madpack: Add dev-check and a compact install-check. - The current install-check is expensive since it runs various hyper-param permutations for all MADlib modules. This commit moves all of those tests to dev-check, which developers can use to iterate faster. We have now created a watered-down install-check for each module, which runs just one hyper-param combination for each MADlib function and does not do any asserts. - This commit also includes changes in madpack to add a new madpack option for dev-check. TODO: - complete trimming install-check for all modules. - update documentation for developer consumption. Co-authored-by: Arvind Sridhar commit d6c7834f73d00d0bd3ddf84af524ba08725a6244 Author: Arvind Sridhar Date: 2018-06-07T21:51:47Z Install-Check: Add new IC files for the lightweight testing Co-authored-by: Orhan Kislal ---
[GitHub] madlib pull request #271: Madpack: Make install, reinstall and upgrade atomi...
GitHub user orhankislal opened a pull request: https://github.com/apache/madlib/pull/271 Madpack: Make install, reinstall and upgrade atomic We now write all the necessary sql into one file, and run it once in a single session. The database's rollback will be useful to bring it back to original state in case of a failure. Co-authored-by: Rahul Iyer Co-authored-by: Orhan Kislal You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib feature/atomic_upgrade Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/271.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #271 commit 693f6ae59dbb65d56fe4ff95bac8fde198f3d04f Author: Nandish Jayaram Date: 2018-05-18T23:13:28Z Madpack: Make install, reinstall and upgrade atomic We now write all the necessary sql into one file, and run it once in a single session. The database's rollback will be useful to bring it back to original state in case of a failure. Co-authored-by: Rahul Iyer Co-authored-by: Orhan Kislal ---
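As a minimal sketch of the "one file, one session" idea described above (not madpack's actual code): the SQL pieces are concatenated into a single script, and `psql` is asked to run it in one transaction and stop at the first error so the database rolls back. The helper names are hypothetical; `--single-transaction`, `-v ON_ERROR_STOP=1`, `-d` and `-f` are standard `psql` options.

```python
import os
import tempfile


def write_combined_script(sql_chunks, directory=None):
    """Write all SQL chunks, in order, into one temporary .sql file."""
    fd, path = tempfile.mkstemp(suffix=".sql", dir=directory)
    with os.fdopen(fd, "w") as handle:
        for chunk in sql_chunks:
            handle.write(chunk.rstrip() + "\n")
    return path


def psql_command(database, script_path):
    """Build a psql invocation that runs the script atomically."""
    return [
        "psql", "-d", database,
        "-v", "ON_ERROR_STOP=1",   # abort on the first failing statement
        "--single-transaction",    # wrap the whole file in BEGIN/COMMIT
        "-f", script_path,
    ]
```

The resulting list can be handed to `subprocess.call`; a non-zero exit status then means nothing from the script was committed.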
[GitHub] madlib pull request #269: Statistics: Add grouping support for correlation f...
Github user orhankislal closed the pull request at: https://github.com/apache/madlib/pull/269 ---
[GitHub] madlib issue #269: Statistics: Add grouping support for correlation function...
Github user orhankislal commented on the issue: https://github.com/apache/madlib/pull/269 Thanks for your comments Frank. Regarding the accuracy, I have tested the code with multiple groups from multiple grouping columns with hand-calculated values (as seen in the install-check). I have also tried a larger dataset (600 columns). I have duplicated the data to create multiple columns and checked to see if the results match across groups. Please let us know if you have any other suggestions. ---
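The "duplicate the data and compare across groups" check described above can be illustrated with a small pure-Python sketch. It is independent of MADlib and uses hypothetical helper names; it only shows the idea that identical data placed in different groups must yield identical per-group correlations.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def grouped_pearson(rows):
    """rows is an iterable of (group, x, y); returns {group: correlation}."""
    groups = defaultdict(lambda: ([], []))
    for group, x, y in rows:
        groups[group][0].append(x)
        groups[group][1].append(y)
    return dict((g, pearson(xs, ys)) for g, (xs, ys) in groups.items())
```

For example, copying the same `(x, y)` pairs into groups `"a"` and `"b"` should give the same coefficient for both keys.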
[GitHub] madlib pull request #269: Statistics: Add grouping support for correlation f...
GitHub user orhankislal opened a pull request: https://github.com/apache/madlib/pull/269 Statistics: Add grouping support for correlation functions JIRA: MADLIB-1128 This commit adds grouping support to correlation and covariance functions in MADlib stats. Changes include relevant queries to do the same. This commit also has refactor changes to a helper function in utilities.py_in. Co-authored-by: Jingyi Mei Co-authored-by: Nikhil Kak Co-authored-by: Nandish Jayaram You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib feature/correlation-grouping Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/269.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #269 commit e9dd9ae88d9d2acaea68c093396f6d600148ede4 Author: Orhan Kislal Date: 2018-05-07T23:53:51Z Statistics: Add grouping support for correlation functions JIRA: MADLIB-1128 This commit adds grouping support to correlation and covariance functions in MADlib stats. Changes include relevant queries to do the same. This commit also has refactor changes to a helper function in utilities.py_in. Co-authored-by: Jingyi Mei Co-authored-by: Nikhil Kak Co-authored-by: Nandish Jayaram ---
[GitHub] madlib pull request #266: Release 1.14: Version numbering and upgrade relate...
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/266#discussion_r183492733 --- Diff: src/madpack/changelist_1.13_1.14.yaml --- @@ -0,0 +1,97 @@ +# -- +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# -- + +# Changelist for MADlib version 1.13 to 1.14 + +# This file contains all changes that were introduced in a new version of +# MADlib. This changelist is used by the upgrade script to detect what objects +# should be upgraded (while retaining all other objects from the previous version) + +# New modules (actually .sql_in files) added in upgrade version +# For these files the sql_in code is retained as is with the functions in the +# file installed on the upgrade version. All other files (that don't have +# updates), are cleaned up to remove object replacements +new module: +# - Changes from 1.13 to 1.14 --- End diff -- Added the missing modules. ---
[GitHub] madlib issue #266: Release 1.14: Version numbering and upgrade related chang...
Github user orhankislal commented on the issue: https://github.com/apache/madlib/pull/266 Thanks for the comments Rahul. ---
[GitHub] madlib pull request #266: Release 1.14: Version numbering and upgrade relate...
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/266#discussion_r183180223 --- Diff: src/madpack/upgrade_util.py --- @@ -82,16 +82,21 @@ def _get_function_info(self, oid): proname, textin(regtypeout(prorettype::regtype)) AS rettype, CASE array_upper(proargtypes,1) WHEN -1 THEN '' -ELSE textin(regtypeout(unnest(proargtypes)::regtype)) +ELSE textin(regtypeout(foo)) END AS argtype, CASE WHEN proargnames IS NULL THEN '' -ELSE unnest(proargnames) +ELSE bar END AS argname, CASE array_upper(proargtypes,1) WHEN -1 THEN 1 -ELSE generate_series(0, array_upper(proargtypes, 1)) +ELSE zee END AS i FROM -pg_proc AS p +(SELECT *, oid, +unnest(proargtypes)::regtype AS foo, --- End diff -- We changed the query because PG 10 no longer allows set-returning functions inside a `CASE` expression. ---
[GitHub] madlib pull request #266: Release 1.14: Version numbering and upgrade relate...
GitHub user orhankislal opened a pull request: https://github.com/apache/madlib/pull/266 Release 1.14: Version numbering and upgrade related changes Updates the version number to 1.14 for the release candidate. Updates the changelists and other related files for upgrade. Note that upgrade is not supported from versions prior to 1.11. Co-authored-by: Nikhil Kak You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib rel/upgrade_v114 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/266.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #266 commit c65351f29d7944b82e189489dad74128f8afe69f Author: Orhan Kislal Date: 2018-04-19T17:23:47Z Release 1.14: Version numbering and upgrade related changes Updates the version number to 1.14 for the release candidate. Updates the changelists and other related files for upgrade. Note that upgrade is not supported from versions prior to 1.11. Co-authored-by: Nikhil Kak ---
[GitHub] madlib pull request #261: MLP: Check for 1-hot encoding of dependent variabl...
GitHub user orhankislal opened a pull request: https://github.com/apache/madlib/pull/261 MLP: Check for 1-hot encoding of dependent variable for minibatch This commit adds a check to make sure that the dependent variable for mlp minibatch is one hot encoded. This only validates that the dependent variable array has more than 1 value. Co-authored-by: Orhan Kislal You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib feature/mlp-encoded-dep-minibatch Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/261.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #261 commit b101914cea0688f2969dd6bb2823cd340b02b243 Author: Nikhil Kak Date: 2018-04-10T23:40:49Z MLP: Check for 1-hot encoding of dependent variable for minibatch This commit adds a check to make sure that the dependent variable for mlp minibatch is one hot encoded. This only validates that the dependent variable array has more than 1 value. Co-authored-by: Orhan Kislal ---
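To make the scope of that validation concrete, here is a hedged sketch contrasting the length-only check the description mentions with a stricter test. Both function names are hypothetical and neither is taken from the MLP module; the stricter variant is shown only to highlight what the length check does not verify.

```python
def has_more_than_one_value(dependent_value):
    """Length-only check: the array must hold more than one value."""
    return isinstance(dependent_value, (list, tuple)) and len(dependent_value) > 1


def is_strict_one_hot(dependent_value):
    """Stricter check (not what the commit does): exactly one 1, rest 0."""
    return (has_more_than_one_value(dependent_value)
            and all(v in (0, 1) for v in dependent_value)
            and sum(dependent_value) == 1)
```

An array such as `[0.3, 0.7]` passes the first check but not the second.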
[GitHub] madlib pull request #229: SVM: Add minibatch as a new solver
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/229#discussion_r163690557 --- Diff: src/modules/convex/linear_svm_igd.cpp --- @@ -120,6 +124,98 @@ linear_svm_igd_transition::run(AnyType &args) { return state; } +/** + * @brief Perform the linear support vector machine transition step + * + * Called for each tuple. + */ +AnyType +linear_svm_igd_minibatch_transition::run(AnyType &args) { +// The real state. +// For the first tuple: args[0] is nothing more than a marker that +// indicates that we should do some initial operations. +// For other tuples: args[0] holds the computation state until last tuple +SVMMinibatchState > state = args[0]; + +// initialize the state if first tuple +if (state.algo.numRows == 0) { + +LinearSVM::epsilon = args[9].getAs();; +LinearSVM::is_svc = args[10].getAs();; +if (!args[3].isNull()) { +SVMMinibatchState > previousState = args[3]; +state.allocate(*this, previousState.task.nFeatures); +state = previousState; +} else { +// configuration parameters +uint32_t dimension = args[4].getAs(); +state.allocate(*this, dimension); // with zeros +} +// resetting in either case +// state.reset(); +state.task.stepsize = args[5].getAs(); +const double lambda = args[6].getAs(); +const bool isL2 = args[7].getAs(); +const int nTuples = args[8].getAs(); + +// The regularization operations called below (scaling and clipping) +// need these class variables to be set. 
+L1::n_tuples = nTuples; +L2::n_tuples = nTuples; +if (isL2) +L2::lambda = lambda; +else +L1::lambda = lambda; +} + +state.algo.nEpochs = args[12].getAs(); +state.algo.batchSize = args[13].getAs(); + +// Skip the current record if args[1] (features) contains NULL values, +// or args[2] is NULL +try { +args[1].getAs(); +} catch (const ArrayWithNullException &e) { +return args[0]; +} +if (args[2].isNull()) +return args[0]; + +// tuple +using madlib::dbal::eigen_integration::MappedColumnVector; + +MappedMatrix x(NULL); +MappedColumnVector y(NULL); +try { +new (&x) MappedMatrix(args[1].getAs()); +new (&y) MappedColumnVector(args[2].getAs()); +} catch (const ArrayWithNullException &e) { +return args[0]; +} +SVMMiniBatchTuple tuple; +tuple.indVar = trans(x); +tuple.depVar = y; + +// each tuple can be weighted - this can be combination of the sample weight +// and the class weight. Calling function is responsible for combining the two +// into a single tuple weight. The default value for this parameter is 1, set +// into the definition of "tuple". +// The weight is used to increase the value of a particular tuple for the online +// learning. The weight is not used for the loss computation. +tuple.weight = args[11].getAs(); + + +// Now do the transition step +// apply Minibatching with regularization +L2::scaling(state.task.model, state.task.stepsize); +LinearSVMIGDAlgoMiniBatch::transitionInMiniBatch(state, tuple); +L1::clipping(state.task.model, state.task.stepsize); + --- End diff -- Should we leave a comment on why the mini-batching transition step does not call the loss and gradient algorithms like the regular one? On the other hand, I am not sure if we want to explain the lack of something in the comments. Maybe we can mention this implementation detail in the design docs? ---
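For readers unfamiliar with the technique, a generic mini-batch subgradient step for an L2-regularized linear SVM (hinge loss) can be sketched as below. This is a textbook formulation with hypothetical names, not MADlib's `transitionInMiniBatch`, and it folds the L2 shrinkage into the update rather than calling separate scaling and clipping routines.

```python
def minibatch_svm_step(weights, batch_x, batch_y, stepsize, lam):
    """One subgradient step on lam/2 * ||w||^2 + mean hinge loss over a batch.

    batch_x is a list of feature lists, batch_y a list of labels in {-1, +1}.
    """
    n = float(len(batch_x))
    grad = [0.0] * len(weights)
    for x, y in zip(batch_x, batch_y):
        margin = y * sum(w * xi for w, xi in zip(weights, x))
        if margin < 1.0:
            # Only rows inside the margin contribute to the hinge subgradient.
            for j, xj in enumerate(x):
                grad[j] -= y * xj / n
    # L2 shrinkage and the averaged subgradient applied in a single update.
    return [w * (1.0 - stepsize * lam) - stepsize * g
            for w, g in zip(weights, grad)]
```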
[GitHub] madlib pull request #229: SVM: Add minibatch as a new solver
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/229#discussion_r163120094 --- Diff: src/modules/convex/linear_svm_igd.cpp --- @@ -120,6 +124,100 @@ linear_svm_igd_transition::run(AnyType &args) { return state; } +/** + * @brief Perform the linear support vector machine transition step + * + * Called for each tuple. + */ +AnyType +linear_svm_igd_minibatch_transition::run(AnyType &args) { +// The real state. +// For the first tuple: args[0] is nothing more than a marker that +// indicates that we should do some initial operations. +// For other tuples: args[0] holds the computation state until last tuple +SVMMinibatchState > state = args[0]; + +// initialize the state if first tuple +if (state.algo.numRows == 0) { + +LinearSVM::epsilon = args[9].getAs();; +LinearSVM::is_svc = args[10].getAs();; +if (!args[3].isNull()) { +SVMMinibatchState > previousState = args[3]; +state.allocate(*this, previousState.task.nFeatures); +state = previousState; +} else { +// configuration parameters +uint32_t dimension = args[4].getAs(); +state.allocate(*this, dimension); // with zeros +} +// resetting in either case +// state.reset(); --- End diff -- We should remove these lines if we don't need them. ---
[GitHub] madlib pull request #229: SVM: Add minibatch as a new solver
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/229#discussion_r163689232 --- Diff: src/ports/postgres/modules/svm/svm.py_in --- @@ -89,9 +113,9 @@ def _verify_table(source_table, model_table, dependent_varname, "('{dependent_varname}') for source_table " "({source_table})!".format(dependent_varname=dependent_varname, source_table=source_table)) -dep_type = get_expr_type(dependent_varname, source_table) -if '[]' in dep_type: -plpy.error("SVM error: dependent_varname cannot be of array type!") +# dep_type = get_expr_type(dependent_varname, source_table) --- End diff -- We should remove these lines if we don't need them. ---
[GitHub] madlib issue #228: Add centos 7 postgres 9.6/10 docker files for automated t...
Github user orhankislal commented on the issue: https://github.com/apache/madlib/pull/228 @njayaram2 Please review at your earliest convenience. ---
[GitHub] madlib issue #227: Add docker file for postgres 9.6 and 10
Github user orhankislal commented on the issue: https://github.com/apache/madlib/pull/227 Created a new pull request (#228). Closing this one. ---
[GitHub] madlib pull request #227: Add docker file for postgres 9.6 and 10
Github user orhankislal closed the pull request at: https://github.com/apache/madlib/pull/227 ---
[GitHub] madlib pull request #228: Add centos 7 postgres 9.6/10 docker files for auto...
GitHub user orhankislal opened a pull request: https://github.com/apache/madlib/pull/228 Add centos 7 postgres 9.6/10 docker files for automated testing. Additional Author : Nikhil Kak Also added a readme to describe all the docker files. You can merge this pull request into a Git repository by running: $ git pull https://github.com/orhankislal/madlib docker-images Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/228.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #228 commit df69f2626e1c77e0f65e2bec76f9704dfc54e2bf Author: Nikhil Kak and Orhan Kislal Date: 2018-01-18T22:42:22Z Add centos 7 postgres 9.6/10 docker files for automated testing. Additional Author : Nikhil Kak Also added a readme to describe all the docker files. ---
[GitHub] madlib pull request #227: Add docker file for postgres 9.6 and 10
GitHub user orhankislal opened a pull request: https://github.com/apache/madlib/pull/227 Add docker file for postgres 9.6 and 10 - install plpython support for postgres - add dockerfile for postgres 10 centos 7 - add postgres bin dir to $PATH for both 9.6 and 10 - remove unnecessary files You can merge this pull request into a Git repository by running: $ git pull https://github.com/orhankislal/madlib centos_postgres_docker Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/227.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #227 commit 81a634ad16f53f1de2afe2f768bcb059493c8313 Author: Nikhil Kak Date: 2017-11-09T22:54:01Z Add docker file for postgres 9.6 and 10 - install plpython support for postgres - add dockerfile for postgres 10 centos 7 - add postgres bin dir to $PATH for both 9.6 and 10 - remove unnecessary files ---
[GitHub] madlib pull request #223: Balance datasets : re-sampling technique
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/223#discussion_r161864354 --- Diff: src/ports/postgres/modules/sample/balance_sample.py_in --- @@ -0,0 +1,994 @@ +# coding=utf-8 +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file EXCEPT in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +import math +import plpy +import re +from collections import defaultdict +from fractions import Fraction +from utilities.control import MinWarning +from utilities.utilities import _assert +from utilities.utilities import unique_string +from utilities.validate_args import table_exists +from utilities.validate_args import columns_exist_in_table +from utilities.validate_args import table_is_empty +from utilities.validate_args import get_cols +from utilities.utilities import py_list_to_sql_string + + +m4_changequote(`') + +def balance_sample(schema_madlib, source_table, output_table, class_col, +class_sizes, output_table_size, grouping_cols, with_replacement, **kwargs): + +""" +Balance sampling function +Args: +@param source_table Input table name. +@param output_table Output table name. +@param class_col Name of the column containing the class to be + balanced. +@param class_size Parameter to define the size of the different + class values. 
+@param output_table_size Desired size of the output data set. +@param grouping_cols The columns columns that defines the grouping. +@param with_replacement The sampling method. + +""" +with MinWarning("warning"): + +class_counts = unique_string(desp='class_counts') +desired_sample_per_class = unique_string(desp='desired_sample_per_class') +desired_counts = unique_string(desp='desired_counts') + +if not class_sizes or class_sizes.strip().lower() in ('null', ''): +class_sizes = 'uniform' + +_validate_strs(source_table, output_table, class_col, class_sizes, +output_table_size, grouping_cols, with_replacement) + +source_table_columns = ','.join(get_cols(source_table)) +grp_by = "GROUP BY {0}".format(class_col) + +_create_frequency_distribution(class_counts, source_table, class_col) +temp_views = [class_counts] + +if class_sizes.lower() == 'undersample' and not with_replacement: +""" +Random undersample without replacement. +Randomly order the rows and give a unique (per class) +identifier to each one. +Select rows that have identifiers under the target limit. +""" +_undersampling_with_no_replacement(source_table, output_table, class_col, +class_sizes, output_table_size, grouping_cols, with_replacement, +class_counts, source_table_columns) + +_delete_temp_views(temp_views) +return + +""" +Create views for true and desired sample sizes of classes +""" +""" +include_unsampled_classes tracks is unsampled classes are desired or not. +include_unsampled_classes is always true in output_table_size Null cases but changes given values of desired sample class sizes in comma-delimited classsize paramter. +""" +include_unsampled_classes = True +sampling_with_comma_delimited_class_sizes = class_sizes.find(':') > 0 + +if sampling_with_comma_delimited_class_s
[GitHub] madlib pull request #223: Balance datasets : re-sampling technique
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/223#discussion_r161865238 --- Diff: src/ports/postgres/modules/sample/balance_sample.py_in --- @@ -0,0 +1,994 @@ [same quoted diff as in the first review message for this pull request above, truncated at the same point] ---
[GitHub] madlib pull request #223: Balance datasets : re-sampling technique
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/223#discussion_r161863965 --- Diff: src/ports/postgres/modules/sample/balance_sample.py_in --- @@ -0,0 +1,994 @@ [same quoted diff as in the first review message for this pull request above, truncated at the same point] ---
[GitHub] madlib pull request #223: Balance datasets : re-sampling technique
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/223#discussion_r161850906 --- Diff: src/ports/postgres/modules/sample/balance_sample.py_in --- @@ -0,0 +1,994 @@ [same quoted diff as in the first review message for this pull request above, truncated at the same point] ---
[GitHub] madlib pull request #223: Balance datasets : re-sampling technique
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/223#discussion_r161297926 --- Diff: src/ports/postgres/modules/sample/balance_sample.py_in --- @@ -0,0 +1,994 @@ [same quoted diff as in the first review message for this pull request above, truncated at the same point] ---
[GitHub] madlib pull request #223: Balance datasets : re-sampling technique
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/223#discussion_r161299042 --- Diff: src/ports/postgres/modules/sample/balance_sample.py_in --- @@ -0,0 +1,994 @@ [same quoted diff as in the first review message for this pull request above, truncated at the same point] ---
[GitHub] madlib pull request #223: Balance datasets : re-sampling technique
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/223#discussion_r161296957 --- Diff: src/ports/postgres/modules/sample/balance_sample.py_in --- @@ -0,0 +1,994 @@ +# coding=utf-8 +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file EXCEPT in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +import math +import plpy +import re +from collections import defaultdict +from fractions import Fraction +from utilities.control import MinWarning +from utilities.utilities import _assert +from utilities.utilities import unique_string +from utilities.validate_args import table_exists +from utilities.validate_args import columns_exist_in_table +from utilities.validate_args import table_is_empty +from utilities.validate_args import get_cols +from utilities.utilities import py_list_to_sql_string + + +m4_changequote(`') + +def balance_sample(schema_madlib, source_table, output_table, class_col, +class_sizes, output_table_size, grouping_cols, with_replacement, **kwargs): + +""" +Balance sampling function +Args: +@param source_table Input table name. +@param output_table Output table name. +@param class_col Name of the column containing the class to be + balanced. +@param class_size Parameter to define the size of the different + class values. 
+@param output_table_size Desired size of the output data set. +@param grouping_cols The columns columns that defines the grouping. +@param with_replacement The sampling method. + +""" +with MinWarning("warning"): + +class_counts = unique_string(desp='class_counts') +desired_sample_per_class = unique_string(desp='desired_sample_per_class') +desired_counts = unique_string(desp='desired_counts') + +if not class_sizes or class_sizes.strip().lower() in ('null', ''): +class_sizes = 'uniform' + +_validate_strs(source_table, output_table, class_col, class_sizes, +output_table_size, grouping_cols, with_replacement) + +source_table_columns = ','.join(get_cols(source_table)) +grp_by = "GROUP BY {0}".format(class_col) + +_create_frequency_distribution(class_counts, source_table, class_col) +temp_views = [class_counts] + +if class_sizes.lower() == 'undersample' and not with_replacement: +""" +Random undersample without replacement. +Randomly order the rows and give a unique (per class) +identifier to each one. +Select rows that have identifiers under the target limit. +""" +_undersampling_with_no_replacement(source_table, output_table, class_col, +class_sizes, output_table_size, grouping_cols, with_replacement, +class_counts, source_table_columns) + +_delete_temp_views(temp_views) +return + +""" +Create views for true and desired sample sizes of classes +""" +""" +include_unsampled_classes tracks is unsampled classes are desired or not. +include_unsampled_classes is always true in output_table_size Null cases but changes given values of desired sample class sizes in comma-delimited classsize paramter. --- End diff -- is -> if ? ---
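The docstring in the quoted diff describes random undersampling without replacement: order the rows randomly, give each row a per-class identifier, and keep the rows whose identifier falls under the target limit. A minimal in-memory Python sketch of that idea follows. It is not the SQL that the module generates; the function name and the `class_of` callback are hypothetical, and taking the smallest class size as the target limit is an assumption.

```python
import random
from collections import defaultdict


def undersample_without_replacement(rows, class_of, seed=None):
    """Keep the same number of rows from every class, never reusing a row.

    rows     -- iterable of records
    class_of -- function mapping a record to its class value (hypothetical)
    seed     -- optional seed for reproducible sampling
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[class_of(row)].append(row)
    if not by_class:
        return []
    # Assumed target limit: the size of the smallest class.
    target = min(len(members) for members in by_class.values())
    sample = []
    for members in by_class.values():
        # After shuffling, a row's position acts as its per-class identifier.
        rng.shuffle(members)
        # Keep only the rows whose identifier is under the target limit.
        sample.extend(members[:target])
    return sample


if __name__ == "__main__":
    data = [("a", i) for i in range(5)] + [("b", i) for i in range(2)]
    print(undersample_without_replacement(data, lambda r: r[0], seed=0))
```

Because each class is shuffled once and then sliced, no row can be selected twice, which is what distinguishes this branch from sampling with replacement.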
[GitHub] madlib pull request #223: Balance datasets : re-sampling technique
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/223#discussion_r161297074 --- Diff: src/ports/postgres/modules/sample/balance_sample.py_in --- @@ -0,0 +1,994 @@ +# coding=utf-8 +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file EXCEPT in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +import math +import plpy +import re +from collections import defaultdict +from fractions import Fraction +from utilities.control import MinWarning +from utilities.utilities import _assert +from utilities.utilities import unique_string +from utilities.validate_args import table_exists +from utilities.validate_args import columns_exist_in_table +from utilities.validate_args import table_is_empty +from utilities.validate_args import get_cols +from utilities.utilities import py_list_to_sql_string + + +m4_changequote(`') + +def balance_sample(schema_madlib, source_table, output_table, class_col, +class_sizes, output_table_size, grouping_cols, with_replacement, **kwargs): + +""" +Balance sampling function +Args: +@param source_table Input table name. +@param output_table Output table name. +@param class_col Name of the column containing the class to be + balanced. +@param class_size Parameter to define the size of the different + class values. 
+@param output_table_size Desired size of the output data set. +@param grouping_cols The columns columns that defines the grouping. +@param with_replacement The sampling method. + +""" +with MinWarning("warning"): + +class_counts = unique_string(desp='class_counts') +desired_sample_per_class = unique_string(desp='desired_sample_per_class') +desired_counts = unique_string(desp='desired_counts') + +if not class_sizes or class_sizes.strip().lower() in ('null', ''): +class_sizes = 'uniform' + +_validate_strs(source_table, output_table, class_col, class_sizes, +output_table_size, grouping_cols, with_replacement) + +source_table_columns = ','.join(get_cols(source_table)) +grp_by = "GROUP BY {0}".format(class_col) + +_create_frequency_distribution(class_counts, source_table, class_col) +temp_views = [class_counts] + +if class_sizes.lower() == 'undersample' and not with_replacement: +""" +Random undersample without replacement. +Randomly order the rows and give a unique (per class) +identifier to each one. +Select rows that have identifiers under the target limit. +""" +_undersampling_with_no_replacement(source_table, output_table, class_col, +class_sizes, output_table_size, grouping_cols, with_replacement, +class_counts, source_table_columns) + +_delete_temp_views(temp_views) +return + +""" +Create views for true and desired sample sizes of classes +""" +""" +include_unsampled_classes tracks is unsampled classes are desired or not. +include_unsampled_classes is always true in output_table_size Null cases but changes given values of desired sample class sizes in comma-delimited classsize paramter. +""" +include_unsampled_classes = True +sampling_with_comma_delimited_class_sizes = class_sizes.find(':') > 0 + +if sampling_with_comma_delimited_class_sizes: +""" +Compute sample sizes based on +comman-delimited list of class_sizes --- End diff -- comman -> comma ---
[GitHub] madlib pull request #223: Balance datasets : re-sampling technique
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/223#discussion_r161300298 --- Diff: src/ports/postgres/modules/sample/balance_sample.py_in --- @@ -0,0 +1,994 @@ +# coding=utf-8 +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file EXCEPT in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +import math +import plpy +import re +from collections import defaultdict +from fractions import Fraction +from utilities.control import MinWarning +from utilities.utilities import _assert +from utilities.utilities import unique_string +from utilities.validate_args import table_exists +from utilities.validate_args import columns_exist_in_table +from utilities.validate_args import table_is_empty +from utilities.validate_args import get_cols +from utilities.utilities import py_list_to_sql_string + + +m4_changequote(`') + +def balance_sample(schema_madlib, source_table, output_table, class_col, +class_sizes, output_table_size, grouping_cols, with_replacement, **kwargs): + +""" +Balance sampling function +Args: +@param source_table Input table name. +@param output_table Output table name. +@param class_col Name of the column containing the class to be + balanced. +@param class_size Parameter to define the size of the different + class values. 
+@param output_table_size Desired size of the output data set. +@param grouping_cols The columns columns that defines the grouping. +@param with_replacement The sampling method. + +""" +with MinWarning("warning"): + +class_counts = unique_string(desp='class_counts') +desired_sample_per_class = unique_string(desp='desired_sample_per_class') +desired_counts = unique_string(desp='desired_counts') + +if not class_sizes or class_sizes.strip().lower() in ('null', ''): +class_sizes = 'uniform' + +_validate_strs(source_table, output_table, class_col, class_sizes, +output_table_size, grouping_cols, with_replacement) + +source_table_columns = ','.join(get_cols(source_table)) +grp_by = "GROUP BY {0}".format(class_col) + +_create_frequency_distribution(class_counts, source_table, class_col) +temp_views = [class_counts] + +if class_sizes.lower() == 'undersample' and not with_replacement: +""" +Random undersample without replacement. +Randomly order the rows and give a unique (per class) +identifier to each one. +Select rows that have identifiers under the target limit. +""" +_undersampling_with_no_replacement(source_table, output_table, class_col, +class_sizes, output_table_size, grouping_cols, with_replacement, +class_counts, source_table_columns) + +_delete_temp_views(temp_views) +return + +""" +Create views for true and desired sample sizes of classes +""" +""" +include_unsampled_classes tracks is unsampled classes are desired or not. +include_unsampled_classes is always true in output_table_size Null cases but changes given values of desired sample class sizes in comma-delimited classsize paramter. +""" +include_unsampled_classes = True +sampling_with_comma_delimited_class_sizes = class_sizes.find(':') > 0 + +if sampling_with_comma_delimited_class_s
[GitHub] madlib pull request #223: Balance datasets : re-sampling technique
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/223#discussion_r161845440 --- Diff: src/ports/postgres/modules/sample/balance_sample.py_in --- @@ -0,0 +1,994 @@ +# coding=utf-8 +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file EXCEPT in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +import math +import plpy +import re +from collections import defaultdict +from fractions import Fraction +from utilities.control import MinWarning +from utilities.utilities import _assert +from utilities.utilities import unique_string +from utilities.validate_args import table_exists +from utilities.validate_args import columns_exist_in_table +from utilities.validate_args import table_is_empty +from utilities.validate_args import get_cols +from utilities.utilities import py_list_to_sql_string + + +m4_changequote(`') + +def balance_sample(schema_madlib, source_table, output_table, class_col, +class_sizes, output_table_size, grouping_cols, with_replacement, **kwargs): + +""" +Balance sampling function +Args: +@param source_table Input table name. +@param output_table Output table name. +@param class_col Name of the column containing the class to be + balanced. +@param class_size Parameter to define the size of the different + class values. 
+@param output_table_size Desired size of the output data set. +@param grouping_cols The columns columns that defines the grouping. +@param with_replacement The sampling method. + +""" +with MinWarning("warning"): + +class_counts = unique_string(desp='class_counts') +desired_sample_per_class = unique_string(desp='desired_sample_per_class') +desired_counts = unique_string(desp='desired_counts') + +if not class_sizes or class_sizes.strip().lower() in ('null', ''): +class_sizes = 'uniform' + +_validate_strs(source_table, output_table, class_col, class_sizes, +output_table_size, grouping_cols, with_replacement) + +source_table_columns = ','.join(get_cols(source_table)) +grp_by = "GROUP BY {0}".format(class_col) + +_create_frequency_distribution(class_counts, source_table, class_col) +temp_views = [class_counts] + +if class_sizes.lower() == 'undersample' and not with_replacement: +""" +Random undersample without replacement. +Randomly order the rows and give a unique (per class) +identifier to each one. +Select rows that have identifiers under the target limit. +""" +_undersampling_with_no_replacement(source_table, output_table, class_col, +class_sizes, output_table_size, grouping_cols, with_replacement, +class_counts, source_table_columns) + +_delete_temp_views(temp_views) +return + +""" +Create views for true and desired sample sizes of classes +""" +""" +include_unsampled_classes tracks is unsampled classes are desired or not. +include_unsampled_classes is always true in output_table_size Null cases but changes given values of desired sample class sizes in comma-delimited classsize paramter. +""" +include_unsampled_classes = True +sampling_with_comma_delimited_class_sizes = class_sizes.find(':') > 0 + +if sampling_with_comma_delimited_class_s
[GitHub] madlib pull request #224: 1.13 Upgrade and MLP IC fix
GitHub user orhankislal opened a pull request: https://github.com/apache/madlib/pull/224 1.13 Upgrade and MLP IC fix JIRA: MADLIB-1197 Additional Author: Nandish Jayaram - 1.13 Upgrade does not drop the kNN help functions even though their return types are changed. This commit adds the missing functions to the changelist and alters the upgrade_util.py_in so that functions without arguments can be dropped. - Some assert thresholds are too strict for MLP in IC. This commit relaxes those thresholds. Closes #224 You can merge this pull request into a Git repository by running: $ git pull https://github.com/orhankislal/madlib bugfix/mlp_and_upgrade Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/224.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #224 commit 0fdf136b4e0ad0fd2c54bab2144045b11ba5884b Author: Orhan Kislal Date: 2018-01-12T01:22:30Z 1.13 Upgrade and MLP IC fix JIRA: MADLIB-1197 Additional Author: Nandish Jayaram - 1.13 Upgrade does not drop the kNN help functions even though their return types are changed. This commit adds the missing functions to the changelist and alters the upgrade_util.py_in so that functions without arguments can be dropped. - Some assert thresholds are too strict for MLP in IC. This commit relaxes those thresholds. Closes #224 ---
[GitHub] madlib issue #219: Multiple: Hard-wire values for construct_array calls
Github user orhankislal commented on the issue: https://github.com/apache/madlib/pull/219 LGTM +1 ---
[GitHub] madlib pull request #220: Add more stats to summary function
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/220#discussion_r158568809 --- Diff: src/ports/postgres/modules/summary/Summarizer.py_in --- @@ -199,6 +200,22 @@ class Summarizer: args['max_columns'] = ','.join([minmax_type('max', c) for c in cols]) args['ntile_columns'] = "array_to_string(array[NULL], ',')" + +args['positive_columns'] = ','.join(["sum(case when {0} > 0 \ + then 1 else 0 end)".format(c['attname']) + if c['typname'] in numeric_types + else 'NULL' for c in cols]) + +args["negative_columns"] = ','.join(["sum(case when {0} < 0 \ + then 1 else 0 end)".format(c['attname']) + if c['typname'] in numeric_types + else 'NULL' for c in cols]) + +args["zero_columns"] = ','.join(["sum(case when {0} = 0 \ --- End diff -- In graph algorithms such as SSSP and APSP, we used `EPSILON = 0.01` for float comparisons. ---
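The epsilon suggestion above can be sketched as follows. This is a minimal illustration only, not the code that was merged: the helper name `sign_count_exprs` and its `eps` parameter are hypothetical, and the tolerance of 0.01 is simply the value the comment mentions for the graph modules.

```python
EPSILON = 0.01  # assumed tolerance, taken from the review comment


def sign_count_exprs(col, eps=EPSILON):
    """Build SQL aggregates counting positive, negative and near-zero
    values of a numeric column (hypothetical helper for illustration)."""
    template = "sum(case when {cond} then 1 else 0 end)"
    return {
        # Values within [-eps, eps] are treated as zero, so the strict
        # sign tests start just outside that band.
        "positive": template.format(cond="{0} > {1}".format(col, eps)),
        "negative": template.format(cond="{0} < -{1}".format(col, eps)),
        "zero": template.format(cond="abs({0}) <= {1}".format(col, eps)),
    }


exprs = sign_count_exprs("price")
print(exprs["zero"])
```

Whether a fixed tolerance is appropriate depends on the scale of the column; a value that suits edge weights in a graph may be too coarse for other data.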
[GitHub] madlib issue #216: Release: Upgrade to v1.13
Github user orhankislal commented on the issue: https://github.com/apache/madlib/pull/216 Tested src and binary upgrades in both success and failure scenarios. LGTM +1 ---
[GitHub] madlib pull request #213: KNN: Move online help to python layer
Github user orhankislal closed the pull request at: https://github.com/apache/madlib/pull/213 ---
[GitHub] madlib pull request #213: KNN: Move online help to python layer
GitHub user orhankislal opened a pull request: https://github.com/apache/madlib/pull/213 KNN: Move online help to python layer Additional Author: Nikhil Kak - Remove the dependency on the client message level for knn online help. You can merge this pull request into a Git repository by running: $ git pull https://github.com/orhankislal/madlib knn_help Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/213.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #213 commit a0b1e0a78ffc993f2e2efad8df9a2c49cfc0fcbb Author: Orhan Kislal Date: 2017-12-11T23:27:09Z KNN: Move online help to python layer Additional Author: Nikhil Kak - Remove the dependency on the client message level for knn online help. ---
[GitHub] madlib issue #206: Feature: Allow NULL in rows for computing correlations an...
Github user orhankislal commented on the issue: https://github.com/apache/madlib/pull/206 Thank you @iyerr3 and @fmcquillan99 for your comments. ---
[GitHub] madlib pull request #194: Logregr: Add input validation for dep/indep variab...
Github user orhankislal closed the pull request at: https://github.com/apache/madlib/pull/194 ---
[GitHub] madlib issue #194: Logregr: Add input validation for dep/indep variables
Github user orhankislal commented on the issue: https://github.com/apache/madlib/pull/194 I have decided to close this pull request since the default error given by the database is more descriptive. ---
[GitHub] madlib issue #200: Madpack: Move unit tests + refactor minor code
Github user orhankislal commented on the issue: https://github.com/apache/madlib/pull/200 Tested reinstall as well as successful and unsuccessful upgrades on postgres. LGTM +1 ---
[GitHub] madlib pull request #194: Logregr: Add input validation for dep/indep variab...
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/194#discussion_r150943944 --- Diff: src/ports/postgres/modules/regress/logistic.py_in --- @@ -158,12 +159,14 @@ def __logregr_validate_args(schema_madlib, tbl_source, tbl_output, dep_col, if not dep_col or dep_col.strip().lower() in ('null', ''): plpy.error("Logregr error: Invalid dependent column name!") -# if not columns_exist_in_table(tbl_source, [dep_col]): -# plpy.error("Logregr error: Dependent column does not exist!") +if not is_var_valid(tbl_source, dep_col): +plpy.error("Logregr error: Dependent variable is not valid!") --- End diff -- Since the variable can be an expression and not a column, I wanted to avoid printing the whole expression and making the error message long and confusing. We can easily add it if you feel that would be more useful. ---
[GitHub] madlib issue #197: Fix madlib version parsing for upgrade
Github user orhankislal commented on the issue: https://github.com/apache/madlib/pull/197 jenkins ok to test ---
[GitHub] madlib issue #199: Bugfix: Hard coded schema name in WCC install check
Github user orhankislal commented on the issue: https://github.com/apache/madlib/pull/199 LGTM ---
[GitHub] madlib pull request #197: Fix madlib version parsing for upgrade
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/197#discussion_r150687658 --- Diff: src/madpack/upgrade_util.py --- @@ -142,11 +142,11 @@ def _load(self): """ # _mad_dbrev = 1.9.1 -if self._mad_dbrev.split('.') < '1.10.0'.split('.'): +if map(int,self._mad_dbrev.split('.')) < map(int,'1.10.0'.split('.')): --- End diff -- I was thinking of the first option as you suggested. I am not sure about your second suggestion. I tried all 4 of the combinations and couldn't get it working. ---
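The bug this diff addresses is that lists of strings compare lexicographically, so '9' sorts after '10'. A minimal sketch of the numeric comparison follows; the helper names `parse_version` and `is_older` are hypothetical and not part of madpack, and the sketch assumes purely numeric dotted versions (a suffix such as "-dev" would need extra handling). It uses tuples rather than bare `map` objects because `map` returns an iterator in Python 3, where `<` between two iterators raises a TypeError.

```python
def parse_version(rev):
    """Turn a dotted version such as '1.9.1' into a tuple of ints."""
    return tuple(int(part) for part in rev.split('.'))


def is_older(rev, threshold):
    """True when rev is numerically older than threshold."""
    return parse_version(rev) < parse_version(threshold)


# Comparing the raw string lists gives the wrong answer for 1.9.1 vs 1.10.0.
string_result = '1.9.1'.split('.') < '1.10.0'.split('.')
numeric_result = is_older('1.9.1', '1.10.0')
print(string_result, numeric_result)
```

Tuples of different lengths still compare element by element, so '1.10' is treated as older than '1.10.0' under this scheme.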
[GitHub] madlib pull request #197: Fix madlib version parsing for upgrade
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/197#discussion_r150681944 --- Diff: src/madpack/upgrade_util.py --- @@ -142,11 +142,11 @@ def _load(self): """ # _mad_dbrev = 1.9.1 -if self._mad_dbrev.split('.') < '1.10.0'.split('.'): +if map(int,self._mad_dbrev.split('.')) < map(int,'1.10.0'.split('.')): --- End diff -- Importing those files from madpack.py creates a circular dependency. ---
[GitHub] madlib pull request #198: PMML: Update the pyxb version number to 1.2.6
GitHub user orhankislal opened a pull request: https://github.com/apache/madlib/pull/198 PMML: Update the pyxb version number to 1.2.6 You can merge this pull request into a Git repository by running: $ git pull https://github.com/orhankislal/madlib pyxb_version Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/198.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #198 ---
[GitHub] madlib pull request #197: Fix madlib version parsing for upgrade
GitHub user orhankislal opened a pull request: https://github.com/apache/madlib/pull/197 Fix madlib version parsing for upgrade You can merge this pull request into a Git repository by running: $ git pull https://github.com/orhankislal/madlib upgrade Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/197.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #197 commit 7cd15c79dab8479c1a1cace18506b02f3f1ddf43 Author: Orhan Kislal Date: 2017-11-03T00:08:49Z Fix madlib version parsing for upgrade ---
[GitHub] madlib pull request #195: Feature: Add grouping support to HITS
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/195#discussion_r150629090 --- Diff: src/ports/postgres/modules/graph/graph_utils.py_in --- @@ -109,6 +110,85 @@ def validate_graph_coding(vertex_table, vertex_id, edge_table, edge_params, return None +def validate_params_for_centrality_measures(schema_madlib, func_name, --- End diff -- This function name is a bit confusing since it isn't used by the centrality measures functions from `measures.py_in` ---